2. Classifying by One Variable¶

Data scientists often need to classify individuals into groups according to shared features, and then identify some characteristics of the groups. For example, in the example using Galton’s data on heights, we saw that it was useful to classify families according to the parents’ midparent heights, and then find the average height of the children in each group.

This section is about classifying individuals into categories that are not numerical. We begin by recalling the basic use of group.

2.1. Counting the Number in Each Category¶

The group method with a single argument counts the number of rows for each category in a column. The result contains one row per unique value in the grouped column.

Here is a small table of data on ice cream cones. The group method can be used to list the distinct flavors and provide the counts of each flavor.

cones = pd.DataFrame({
    'Flavor':np.array(['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate']),
    'Price':np.array([3.55, 4.75, 6.55, 5.25, 5.25])}
)
cones

	Flavor	Price
0	strawberry	3.55
1	chocolate	4.75
2	chocolate	6.55
3	strawberry	5.25
4	chocolate	5.25

df_grouped = cones.groupby(["Flavor"]).agg(
    count=pd.NamedAgg(column="Flavor", aggfunc="count")
)

df_grouped

	count
Flavor
chocolate	3
strawberry	2

There are two distinct categories, chocolate and strawberry. When we call groupby we must state what we want to do with the group data e.g. count(). Applying the count() method will create a column of counts whcih takes the names of the first column in the df by default, and contains the number of rows in each category. To make this easier to read we could change the count column to ‘count’.

Notice that this can all be worked out from just the Flavor column. Only the Price column name has been used, the data has not been used.

But what if we wanted the total price of the cones of each different flavor? In this case we can apply a different method e.g. sum(), to groupby.

2.1.1. Finding a Characteristic of Each Category¶

The optional second argument of group names the function that will be used to aggregate values in other columns for all of those rows. For instance, sum will sum up the prices in all rows that match each category. This result also contains one row per unique value in the grouped column, but it has the same number of columns as the original table.

To find the total price of each flavor, we call group again, with Flavor as its first argument as before. But this time there is a second argument: the function name sum.

df_grouped_by_price = cones.groupby(["Flavor"]).agg(
    Price_sum=pd.NamedAgg(column="Price", aggfunc="sum")
)

df_grouped_by_price

	Price_sum
Flavor
chocolate	16.55
strawberry	8.80

To create this new table, groupby has calculated the sum of the Price entries in all the rows corresponding to each distinct flavor. The prices in the three chocolate rows add up to 16.55 (in whatever currency). The prices in the two strawberry rows have a total of 8.80.

Using pandas groupby aggregation we can compute a summary statistic (or statistics). The label of the newly created column is Price sum, which is created by taking the label of the column being summed, and appending the word sum in the aggregation pipeline.

In this insatnce there are only two columns so when group finds the sum of all columns other than the one with the categories, there is no need to specify that it has to sum the prices. Using the Pandas NamedAgg function we can name the column contining the results of the aggregation.

To see in more detail what group is doing, notice that you could have figured out the total prices yourself, not only by mental arithmetic but also using code. For example, to find the total price of all the chocolate cones, you could start by creating a new table consisting of only the chocolate cones, and then accessing the column of prices:

cones[cones['Flavor'] == 'chocolate']['Price']

  4.75
  6.55
  5.25
Name: Price, dtype: float64

cones[cones['Flavor'] == 'chocolate'][['Price']]

	Price
1	4.75
2	6.55
4	5.25

np.sum(cones[cones['Flavor'] == 'chocolate'][['Price']])

Price    16.55
dtype: float64

np.sum([cones[cones['Flavor'] == 'chocolate'][['Price']]])

16.55

This is what groupby is doing for each distinct value in Flavor.

# For each distinct value in `Flavor, access all the rows
# and create an array of `Price`

cones_choc = cones[cones['Flavor'] == 'chocolate']['Price']

cones_strawb = cones[cones['Flavor'] =='strawberry']['Price']

# Display the arrays in a table

cones_choc = np.array(cones_choc)
cones_strawb = np.array(cones_strawb)

grouped_cones = pd.DataFrame({
    'Flavor':np.array(['chocolate', 'strawberry']),
    'Array of All the Prices':[cones_choc, cones_strawb]}
)

#priceTotals

# Append a column with the sum of the `Price` values in each array

price_totals = grouped_cones

price_totals['Sum of the Array'] = np.array([sum(cones_choc), sum(cones_strawb)])

price_totals

	Flavor	Array of All the Prices	Sum of the Array
0	chocolate	[4.75, 6.55, 5.25]	16.55
1	strawberry	[3.55, 5.25]	8.80

You can replace sum by any other functions that work on arrays. For example, you could use max to find the largest price in each category:

cones.groupby('Flavor').max()

	Price
Flavor
chocolate	6.55
strawberry	5.25

2.1.2. Or¶

price_max = cones.groupby(["Flavor"]).agg(
    Price_Max=pd.NamedAgg(column="Price", aggfunc="max")
)

price_max

	Price_Max
Flavor
chocolate	6.55
strawberry	5.25

Once again, groupby creates arrays of the prices in each Flavor category. But now it finds the max of each array:

price_max = grouped_cones.copy()

price_max['Max of the Array'] = np.array([max(cones_choc), max(cones_strawb)])

price_max

	Flavor	Array of All the Prices	Sum of the Array	Max of the Array
0	chocolate	[4.75, 6.55, 5.25]	16.55	6.55
1	strawberry	[3.55, 5.25]	8.80	5.25

Indeed, the original call to group with just one argument has the same effect as using len as the function and then cleaning up the table.

array_length = grouped_cones.copy()

array_length['Length of the Array'] = np.array([len(cones_choc), len(cones_strawb)])

array_length

	Flavor	Array of All the Prices	Sum of the Array	Length of the Array
0	chocolate	[4.75, 6.55, 5.25]	16.55	3
1	strawberry	[3.55, 5.25]	8.80	2

2.1.3. Example: NBA Salaries¶

The table nba contains data on the 2015-2016 players in the National Basketball Association. We have examined these data earlier. Recall that salaries are measured in millions of dollars.

nba1 = pd.read_csv(path_data + 'nba_salaries.csv')

nba = nba1.rename(columns={"'15-'16 SALARY": 'SALARY'})

nba

	PLAYER	POSITION	TEAM	SALARY
0	Paul Millsap	PF	Atlanta Hawks	18.671659
1	Al Horford	C	Atlanta Hawks	12.000000
2	Tiago Splitter	C	Atlanta Hawks	9.756250
3	Jeff Teague	PG	Atlanta Hawks	8.000000
4	Kyle Korver	SG	Atlanta Hawks	5.746479
...	...	...	...	...
412	Gary Neal	PG	Washington Wizards	2.139000
413	DeJuan Blair	C	Washington Wizards	2.000000
414	Kelly Oubre Jr.	SF	Washington Wizards	1.920240
415	Garrett Temple	SG	Washington Wizards	1.100602
416	Jarell Eddie	SG	Washington Wizards	0.561716

417 rows × 4 columns

1. How much money did each team pay for its players’ salaries?

The only columns involved are TEAM and SALARY. We have to group the rows by TEAM and then sum the salaries of the groups.

teams_and_money = nba[['TEAM', 'SALARY']]

teams_and_money.groupby('TEAM').sum()

	SALARY
TEAM
Atlanta Hawks	69.573103
Boston Celtics	50.285499
Brooklyn Nets	57.306976
Charlotte Hornets	84.102397
Chicago Bulls	78.820890
Cleveland Cavaliers	102.312412
Dallas Mavericks	65.762559
Denver Nuggets	62.429404
Detroit Pistons	42.211760
Golden State Warriors	94.085137
Houston Rockets	85.285837
Indiana Pacers	62.695023
Los Angeles Clippers	66.074113
Los Angeles Lakers	68.607944
Memphis Grizzlies	93.796439
Miami Heat	81.528667
Milwaukee Bucks	52.258355
Minnesota Timberwolves	65.847421
New Orleans Pelicans	80.514606
New York Knicks	69.404994
Oklahoma City Thunder	96.832165
Orlando Magic	77.623940
Philadelphia 76ers	42.481345
Phoenix Suns	50.520815
Portland Trail Blazers	45.446878
Sacramento Kings	68.384890
San Antonio Spurs	84.652074
Toronto Raptors	74.672620
Utah Jazz	52.631878
Washington Wizards	90.047498

2. How many NBA players were there in each of the five positions?

We have to classify by POSITION, and count. This can be achieved by applying the count() method to a groupby or and the aggregation method with aggfunc="count".

#nba.groupby('POSITION').count()

# -- or

position_count = nba.groupby(["POSITION"]).agg(
    count=pd.NamedAgg(column="PLAYER", aggfunc="count")
)

position_count

	count
POSITION
C	69
PF	85
PG	85
SF	82
SG	96

3. What was the average salary of the players at each of the five positions?

This time, we have to group by POSITION and take the mean of the salaries. For clarity, we will work with a table of just the positions and the salaries.

positions_and_money = nba[['POSITION', 'SALARY']]

positions_and_money.groupby('POSITION').mean()

	SALARY
POSITION
C	6.082913
PF	4.951344
PG	5.165487
SF	5.532675
SG	3.988195

Center was the most highly paid position, at an average of over 6 million dollars.

If we had not selected the two columns as our first step, group would not attempt to “average” the categorical columns in nba. (It is impossible to average two strings like “Atlanta Hawks” and “Boston Celtics”.) It performs arithmetic only on numerical columns and leaves the rest blank.

nba_mean = nba.groupby('POSITION').mean()

nba_mean

nba_mean = nba.groupby(["POSITION"]).agg(
    SALARY_mean=pd.NamedAgg(column="SALARY", aggfunc="mean")
)

nba_mean

	SALARY_mean
POSITION
C	6.082913
PF	4.951344
PG	5.165487
SF	5.532675
SG	3.988195

Fundamentals of Data Science

2. Classifying by One Variable¶

2.1. Counting the Number in Each Category¶

2.1.1. Finding a Characteristic of Each Category¶

2.1.2. Or¶

2.1.3. Example: NBA Salaries¶