Counting Values

Counting unique elements in a sequence in various ways (absolute and cumulative, count and percentage).

source

value_counts_plus

 value_counts_plus (data, dropna=False, show_top=10, sort_others=False,
                    style=True, size=10, thousands=',', decimal='.',
                    name='data', background_gradient='cividis')

Provide a few ways of showing counts of values of items in series.

Type Default Details
data
dropna bool False Whether or not to drop missing values.
show_top int 10 How many of the top rows to display.
sort_others bool False Whether or not to place “Others” in the bottom (default) or in its
sorted order position.
style bool True Whether or not to style values for easier reading. If set to True
the result would not be a DataFrame, and cannot be further manipulated.
Set the value to False to get aDataFrame as the return value.
size int 10 The size in points of the font of the table. This results in the whole
table being resized.
thousands str , The character to use to separate thousands if style=True. Defaults to
, but you can change to . or space, or any oher character you want.
decimal str . The character to use to display decimal number if style=True. Defaults to
. but you can change to ,or any oher character you want.
name str data The name of the column that you want displayed in the final table. It
appears in the caption and defaults to “data”.
background_gradient str cividis
Returns pandas.io.formats.style.Styler A DataFrame showing counts based on the provided arguments

Counting a list of random colors - default

import pandas as pd
import plotly.express as px
import random
import numpy as np
import matplotlib as mpl
import plotly.express as px
colors = list(mpl.colors.cnames.keys())
colors = random.choices(colors, weights=[0.9, 0.04, 0.05, 0.09]*37, k=10_000)
colors += [np.nan for i in range(240)]
colors[:20]
['goldenrod',
 'darkmagenta',
 'cyan',
 'olivedrab',
 'navajowhite',
 'papayawhip',
 'salmon',
 'mediumslateblue',
 'azure',
 'plum',
 'indigo',
 'peru',
 'goldenrod',
 'red',
 'brown',
 'lightcoral',
 'palegoldenrod',
 'dodgerblue',
 'dodgerblue',
 'cyan']
value_counts_plus(colors)

Counts of data

  data count cum. count % cum. %
1 lightseagreen 258 258 2.5% 2.5%
2 burlywood 255 513 2.5% 5.0%
3 cyan 252 765 2.5% 7.5%
4 slategray 252 1,017 2.5% 9.9%
5 lightsteelblue 251 1,268 2.5% 12.4%
6 aliceblue 245 1,513 2.4% 14.8%
7 steelblue 243 1,756 2.4% 17.1%
8 azure 242 1,998 2.4% 19.5%
9 sienna 242 2,240 2.4% 21.9%
10 indigo 241 2,481 2.4% 24.2%
11 Others: 7,759 10,240 75.8% 100.0%

Changing the number of displayed rows with show_top

value_counts_plus(colors, show_top=15)

Counts of data

  data count cum. count % cum. %
1 lightseagreen 258 258 2.5% 2.5%
2 burlywood 255 513 2.5% 5.0%
3 cyan 252 765 2.5% 7.5%
4 slategray 252 1,017 2.5% 9.9%
5 lightsteelblue 251 1,268 2.5% 12.4%
6 aliceblue 245 1,513 2.4% 14.8%
7 steelblue 243 1,756 2.4% 17.1%
8 azure 242 1,998 2.4% 19.5%
9 sienna 242 2,240 2.4% 21.9%
10 indigo 241 2,481 2.4% 24.2%
11 nan 240 2,721 2.3% 26.6%
12 goldenrod 237 2,958 2.3% 28.9%
13 dodgerblue 232 3,190 2.3% 31.2%
14 mediumblue 231 3,421 2.3% 33.4%
15 lightgreen 231 3,652 2.3% 35.7%
16 Others: 6,588 10,240 64.3% 100.0%

Sorting “Others:”

value_counts_plus(colors, sort_others=True)

Counts of data

  data count cum. count % cum. %
1 Others: 7,759 7,759 75.8% 75.8%
2 lightseagreen 258 8,017 2.5% 78.3%
3 burlywood 255 8,272 2.5% 80.8%
4 cyan 252 8,524 2.5% 83.2%
5 slategray 252 8,776 2.5% 85.7%
6 lightsteelblue 251 9,027 2.5% 88.2%
7 aliceblue 245 9,272 2.4% 90.5%
8 steelblue 243 9,515 2.4% 92.9%
9 azure 242 9,757 2.4% 95.3%
10 sienna 242 9,999 2.4% 97.6%
11 indigo 241 10,240 2.4% 100.0%

Changing the name of the data and caption title with name

value_counts_plus(
    colors,
    name='Status codes')

Counts of Status codes

  Status codes count cum. count % cum. %
1 lightseagreen 258 258 2.5% 2.5%
2 burlywood 255 513 2.5% 5.0%
3 cyan 252 765 2.5% 7.5%
4 slategray 252 1,017 2.5% 9.9%
5 lightsteelblue 251 1,268 2.5% 12.4%
6 aliceblue 245 1,513 2.4% 14.8%
7 steelblue 243 1,756 2.4% 17.1%
8 azure 242 1,998 2.4% 19.5%
9 sienna 242 2,240 2.4% 21.9%
10 indigo 241 2,481 2.4% 24.2%
11 Others: 7,759 10,240 75.8% 100.0%

Use the function on DataFrames to get multi-level counts

Explore the URL structure of apple.com (~300k URLs):

urldf = adv.url_to_df(pd.read_csv('data/apple_url_list.csv')['url'])
urldf.filter(regex='netloc|dir_\d+')
netloc dir_1 dir_2 dir_3 dir_4 dir_5 dir_6 dir_7 dir_8
0 www.apple.com ae shop accessories all NaN NaN NaN NaN
1 www.apple.com ae shop accessories all accessibility NaN NaN NaN
2 www.apple.com ae shop accessories all airtag NaN NaN NaN
3 www.apple.com ae shop accessories all beats NaN NaN NaN
4 www.apple.com ae shop accessories all beats-featured NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
296005 www.apple.com today event skills-getting-started-with-iphone 7015126954609578141 NaN NaN NaN NaN
296006 www.apple.com today event skills-getting-started-with-iphone 7015489343309862057 NaN NaN NaN NaN
296007 www.apple.com today event skills-getting-started-with-iphone 7016938895023916237 NaN NaN NaN NaN
296008 www.apple.com today event skills-getting-started-with-iphone 7017663669723312297 NaN NaN NaN NaN
296009 www.apple.com today event skills-getting-started-with-ipad 7018026058515849077 NaN NaN NaN NaN

296010 rows × 9 columns

value_counts_plus(urldf[['netloc']], name='apple.com URL directories')

Counts of apple.com URL directories

  netloc count cum. count % cum. %
1 www.apple.com 289,093 289,093 97.7% 97.7%
2 www.apple.com.cn 6,917 296,010 2.3% 100.0%
value_counts_plus(urldf[['netloc', 'dir_1']], name='apple.com URL directories <code>domain/dir_1')

Counts of apple.com URL directories domain/dir_1

  netloc dir_1 count cum. count % cum. %
1 www.apple.com today 13,229 13,229 4.5% 4.5%
2 www.apple.com ca 8,432 21,661 2.8% 7.3%
3 www.apple.com shop 8,424 30,085 2.8% 10.2%
4 www.apple.com de 7,677 37,762 2.6% 12.8%
5 www.apple.com uk 7,589 45,351 2.6% 15.3%
6 www.apple.com xf 7,540 52,891 2.5% 17.9%
7 www.apple.com au 7,410 60,301 2.5% 20.4%
8 www.apple.com hk 7,378 67,679 2.5% 22.9%
9 www.apple.com it 7,355 75,034 2.5% 25.3%
10 www.apple.com es 7,257 82,291 2.5% 27.8%
11 Others: 213,719 296,010 72.2% 100.0%

Check counts of the first two directories’ combinations

value_counts_plus(urldf[['dir_1', 'dir_2']], name='apple.com URL directories<code>dir_1/dir_2</code>')

Counts of apple.com URL directoriesdir_1/dir_2

  dir_1 dir_2 count cum. count % cum. %
1 today event 12,118 12,118 4.1% 4.1%
2 xf shop 7,539 19,657 2.5% 6.6%
3 ca shop 7,450 27,107 2.5% 9.2%
4 de shop 7,201 34,308 2.4% 11.6%
5 uk shop 7,080 41,388 2.4% 14.0%
6 au shop 6,915 48,303 2.3% 16.3%
7 it shop 6,875 55,178 2.3% 18.6%
8 es shop 6,789 61,967 2.3% 20.9%
9 fr shop 6,716 68,683 2.3% 23.2%
10 sg shop 6,714 75,397 2.3% 25.5%
11 Others: 220,613 296,010 74.5% 100.0%

Filter for a country (de) and show counts of combinations of dir_2 and dir_3

value_counts_plus(urldf[urldf['dir_1'].eq('de')][['dir_2', 'dir_3']], name='apple.com/de URL directories', show_top=20)

Counts of apple.com/de URL directories

  dir_2 dir_3 count cum. count % cum. %
1 shop accessories 1,964 1,964 25.6% 25.6%
2 shop product 1,866 3,830 24.3% 49.9%
3 shop buy-watch 865 4,695 11.3% 61.2%
4 shop refurbished 464 5,159 6.0% 67.2%
5 shop mac 461 5,620 6.0% 73.2%
6 shop ipad 424 6,044 5.5% 78.7%
7 shop iphone 419 6,463 5.5% 84.2%
8 shop watch 299 6,762 3.9% 88.1%
9 shop buy-iphone 131 6,893 1.7% 89.8%
10 tv-pr originals 106 6,999 1.4% 91.2%
11 shop buy-ipad 106 7,105 1.4% 92.5%
12 shop buy-mac 91 7,196 1.2% 93.7%
13 shop smart-home 56 7,252 0.7% 94.5%
14 business success-stories 26 7,278 0.3% 94.8%
15 legal sales-support 23 7,301 0.3% 95.1%
16 shop help 16 7,317 0.2% 95.3%
17 legal internet-services 10 7,327 0.1% 95.4%
18 legal intellectual-property 10 7,337 0.1% 95.6%
19 support products 9 7,346 0.1% 95.7%
20 education k12 7 7,353 0.1% 95.8%
21 Others: 324 7,677 4.2% 100.0%

Change the size of table: size

value_counts_plus(colors, size=5)

Counts of data

  data count cum. count % cum. %
1 lightseagreen 258 258 2.5% 2.5%
2 burlywood 255 513 2.5% 5.0%
3 cyan 252 765 2.5% 7.5%
4 slategray 252 1,017 2.5% 9.9%
5 lightsteelblue 251 1,268 2.5% 12.4%
6 aliceblue 245 1,513 2.4% 14.8%
7 steelblue 243 1,756 2.4% 17.1%
8 azure 242 1,998 2.4% 19.5%
9 sienna 242 2,240 2.4% 21.9%
10 indigo 241 2,481 2.4% 24.2%
11 Others: 7,759 10,240 75.8% 100.0%
value_counts_plus(colors, size=20)

Counts of data

  data count cum. count % cum. %
1 lightseagreen 258 258 2.5% 2.5%
2 burlywood 255 513 2.5% 5.0%
3 cyan 252 765 2.5% 7.5%
4 slategray 252 1,017 2.5% 9.9%
5 lightsteelblue 251 1,268 2.5% 12.4%
6 aliceblue 245 1,513 2.4% 14.8%
7 steelblue 243 1,756 2.4% 17.1%
8 azure 242 1,998 2.4% 19.5%
9 sienna 242 2,240 2.4% 21.9%
10 indigo 241 2,481 2.4% 24.2%
11 Others: 7,759 10,240 75.8% 100.0%

Completely change the caption using an HTML string with set_caption

caption = '<h4>Status codes</h4>Top 5 values <a href="https://example.com">raw data</a>'
value_counts_plus(
    colors,
    name='Statuses',
    show_top=5).set_caption(caption)

Status codes

Top 5 values raw data
  Statuses count cum. count % cum. %
1 lightseagreen 258 258 2.5% 2.5%
2 burlywood 255 513 2.5% 5.0%
3 cyan 252 765 2.5% 7.5%
4 slategray 252 1,017 2.5% 9.9%
5 lightsteelblue 251 1,268 2.5% 12.4%
6 Others: 8,972 10,240 87.6% 100.0%

Removing table styling if you want a pure DataFrame:

  • Counting in non-styled DataFrames is 0-based in case you want to further process it
  • Columns are displayed in a slightly different manner
value_counts_plus(colors, style=False)
data count cum_count perc cum_perc
0 lightseagreen 258 258 0.025195 0.025195
1 burlywood 255 513 0.024902 0.050098
2 cyan 252 765 0.024609 0.074707
3 slategray 252 1017 0.024609 0.099316
4 lightsteelblue 251 1268 0.024512 0.123828
5 aliceblue 245 1513 0.023926 0.147754
6 steelblue 243 1756 0.023730 0.171484
7 azure 242 1998 0.023633 0.195117
8 sienna 242 2240 0.023633 0.218750
9 indigo 241 2481 0.023535 0.242285
0 Others: 7759 10240 0.757715 1.000000

Counting continuous data

We can easily make continuous data discrete by binning them. One easy way to do that is the pandas.cut function.

For example, if we want to count countries’ populations, we can first bin them under 50M intervals, and run value_counts_plus.

gm = px.data.gapminder().query('year == 2007')
gm['pop_interval'] = pd.cut(gm['pop'], range(0, 1_500_000_000, 50_000_000))
gm.sort_values('pop', ascending=False).head(15)
country continent year lifeExp pop gdpPercap iso_alpha iso_num pop_interval
299 China Asia 2007 72.961 1318683096 4959.114854 CHN 156 (1300000000, 1350000000]
707 India Asia 2007 64.698 1110396331 2452.210407 IND 356 (1100000000, 1150000000]
1619 United States Americas 2007 78.242 301139947 42951.653090 USA 840 (300000000, 350000000]
719 Indonesia Asia 2007 70.650 223547000 3540.651564 IDN 360 (200000000, 250000000]
179 Brazil Americas 2007 72.390 190010647 9065.800825 BRA 76 (150000000, 200000000]
1175 Pakistan Asia 2007 65.483 169270617 2605.947580 PAK 586 (150000000, 200000000]
107 Bangladesh Asia 2007 64.062 150448339 1391.253792 BGD 50 (150000000, 200000000]
1139 Nigeria Africa 2007 46.859 135031164 2013.977305 NGA 566 (100000000, 150000000]
803 Japan Asia 2007 82.603 127467972 31656.068060 JPN 392 (100000000, 150000000]
995 Mexico Americas 2007 76.195 108700891 11977.574960 MEX 484 (100000000, 150000000]
1223 Philippines Asia 2007 71.688 91077287 3190.481016 PHL 608 (50000000, 100000000]
1655 Vietnam Asia 2007 74.249 85262356 2441.576404 VNM 704 (50000000, 100000000]
575 Germany Europe 2007 79.406 82400996 32170.374420 DEU 276 (50000000, 100000000]
467 Egypt Africa 2007 71.338 80264543 5581.180998 EGY 818 (50000000, 100000000]
515 Ethiopia Africa 2007 52.947 76511887 690.805576 ETH 231 (50000000, 100000000]
value_counts_plus(
    gm['pop_interval'],
    name='Population intervals - 2007',
    background_gradient='RdBu')

Counts of Population intervals - 2007

  pop_interval count cum. count % cum. %
1 (0, 50000000] 120 120 84.5% 84.5%
2 (50000000, 100000000] 12 132 8.5% 93.0%
3 (100000000, 150000000] 3 135 2.1% 95.1%
4 (150000000, 200000000] 3 138 2.1% 97.2%
5 (1100000000, 1150000000] 1 139 0.7% 97.9%
6 (200000000, 250000000] 1 140 0.7% 98.6%
7 (300000000, 350000000] 1 141 0.7% 99.3%
8 (1300000000, 1350000000] 1 142 0.7% 100.0%
9 (950000000, 1000000000] 0 142 0.0% 100.0%
10 (1000000000, 1050000000] 0 142 0.0% 100.0%
11 Others: 0 142 0.0% 100.0%

Now we can see that there are 120 countries with a population less than 50M. You can play around with the bin interval to get an appropriate value for your data.

Changing the theme used with theme

value_counts_plus(colors, background_gradient='Greens')

Counts of data

  data count cum. count % cum. %
1 lightseagreen 258 258 2.5% 2.5%
2 burlywood 255 513 2.5% 5.0%
3 cyan 252 765 2.5% 7.5%
4 slategray 252 1,017 2.5% 9.9%
5 lightsteelblue 251 1,268 2.5% 12.4%
6 aliceblue 245 1,513 2.4% 14.8%
7 steelblue 243 1,756 2.4% 17.1%
8 azure 242 1,998 2.4% 19.5%
9 sienna 242 2,240 2.4% 21.9%
10 indigo 241 2,481 2.4% 24.2%
11 Others: 7,759 10,240 75.8% 100.0%
value_counts_plus(colors, background_gradient='cool')

Counts of data

  data count cum. count % cum. %
1 lightseagreen 258 258 2.5% 2.5%
2 burlywood 255 513 2.5% 5.0%
3 cyan 252 765 2.5% 7.5%
4 slategray 252 1,017 2.5% 9.9%
5 lightsteelblue 251 1,268 2.5% 12.4%
6 aliceblue 245 1,513 2.4% 14.8%
7 steelblue 243 1,756 2.4% 17.1%
8 azure 242 1,998 2.4% 19.5%
9 sienna 242 2,240 2.4% 21.9%
10 indigo 241 2,481 2.4% 24.2%
11 Others: 7,759 10,240 75.8% 100.0%

Convert the table to raw HTML for embedding in a blog or web page

print(value_counts_plus(colors).to_html()[:700])
<style type="text/css">
#T_c9396_row0_col1, #T_c9396_row0_col2, #T_c9396_row0_col3, #T_c9396_row0_col4, #T_c9396_row1_col1, #T_c9396_row1_col3, #T_c9396_row2_col1, #T_c9396_row2_col3, #T_c9396_row3_col1, #T_c9396_row3_col3, #T_c9396_row4_col1, #T_c9396_row4_col3, #T_c9396_row5_col1, #T_c9396_row5_col3, #T_c9396_row6_col1, #T_c9396_row6_col3, #T_c9396_row7_col1, #T_c9396_row7_col3, #T_c9396_row8_col1, #T_c9396_row8_col3, #T_c9396_row9_col1, #T_c9396_row9_col3 {
  background-color: #00224e;
  color: #f1f1f1;
}
#T_c9396_row1_col2, #T_c9396_row1_col4 {
  background-color: #002758;
  color: #f1f1f1;
}
#T_c9396_row2_col2, #T_c9396_row2_col4 {
  background-color: #002c64;
  color: #f1f1f1;
}
#T_c93