2. Selecting Rows

Often, we would like to extract just those rows that correspond to entries with a particular feature. For example, we might want only the rows corresponding to the Warriors, or to players who earned more than \(\$10\) million. Or we might just want the top five earners.

2.1. Specified Rows

The fact that a DataFrame creates an index by default startts to become very useful here as we can specify which rows (by default) we wish to inspect by stating an index or an index range. The argument used a row index or array of indices, and it creates a new DataFrame consisting of only those rows.

For example, if we wanted just the first row of nba, we could use df.iloc[] as follows.

nba
PLAYER POSITION TEAM SALARY
0 Paul Millsap PF Atlanta Hawks 18.671659
1 Al Horford C Atlanta Hawks 12.000000
2 Tiago Splitter C Atlanta Hawks 9.756250
3 Jeff Teague PG Atlanta Hawks 8.000000
4 Kyle Korver SG Atlanta Hawks 5.746479
... ... ... ... ...
412 Gary Neal PG Washington Wizards 2.139000
413 DeJuan Blair C Washington Wizards 2.000000
414 Kelly Oubre Jr. SF Washington Wizards 1.920240
415 Garrett Temple SG Washington Wizards 1.100602
416 Jarell Eddie SG Washington Wizards 0.561716

417 rows × 4 columns

nba.iloc[0]
PLAYER       Paul Millsap
POSITION               PF
TEAM        Atlanta Hawks
SALARY          18.671659
Name: 0, dtype: object
nba.iloc[[0]]
PLAYER POSITION TEAM SALARY
0 Paul Millsap PF Atlanta Hawks 18.671659

This is a new table with just the single row that we specified.

We could also get the fourth, fifth, and sixth rows by specifying a range of indices as the argument.

nba.iloc[np.arange(3, 6)]
PLAYER POSITION TEAM SALARY
3 Jeff Teague PG Atlanta Hawks 8.000000
4 Kyle Korver SG Atlanta Hawks 5.746479
5 Thabo Sefolosha SF Atlanta Hawks 4.000000

If we want a table of the top 5 highest paid players, we can first sort the list by salary and then df.iloc[] the first five rows:

nba.sort_values('SALARY', ascending=False).iloc[(np.arange(5))]
PLAYER POSITION TEAM SALARY
169 Kobe Bryant SF Los Angeles Lakers 25.000000
29 Joe Johnson SF Brooklyn Nets 24.894863
72 LeBron James SF Cleveland Cavaliers 22.970500
255 Carmelo Anthony SF New York Knicks 22.875000
131 Dwight Howard C Houston Rockets 22.359364

2.1.1. Rows Corresponding to a Specified Feature

More often, we will want to access data in a set of rows that have a certain feature, but whose indices we don’t know ahead of time. For example, we might want data on all the players who made more than \(\$10\) million, but we don’t want to spend time counting rows in the sorted table.

Array version - if we wish to work with an array we can use np.where(df['column'] criteria).

np.where()

DataFrame version - to implement a selection criteria the df is called with selection criteria being applied to the df.col i.e. df[df['column_name']criteria].

In the first example, we extract the data for all those who earned more than \(\$10\) million.

np.where(nba['SALARY'] > 10)

# or - this is an example of alternatives being available to select,
# this may depend upon preference, the task at hand, the impact of processing time or export requirements

#nba[nba['SALARY'] >10]
(array([  0,   1,  29,  30,  42,  43,  44,  60,  61,  62,  72,  73,  74,
         75,  76,  82,  83,  93,  94,  95, 107, 117, 118, 119, 120, 121,
        131, 132, 133, 143, 144, 156, 157, 169, 170, 180, 201, 202, 203,
        204, 213, 226, 227, 239, 240, 241, 255, 256, 268, 269, 270, 271,
        284, 285, 298, 311, 312, 342, 343, 353, 354, 355, 366, 367, 368,
        383, 400, 401, 402]),)

The use of the argument df[df[col] > 10] ensured that each selected row had a value of SALARY that was greater than 10.

There are 69 rows in the new table, corresponding to the 69 players who made more than \(10\) million dollars. Arranging these rows in order makes the data easier to analyze. DeMar DeRozan of the Toronto Raptors was the “poorest” of this group, at a salary of just over \(10\) million dollars.

nba[nba['SALARY'] >10].sort_values('SALARY')
PLAYER POSITION TEAM SALARY
368 DeMar DeRozan SG Toronto Raptors 10.050000
298 Gerald Wallace SF Philadelphia 76ers 10.105855
204 Luol Deng SF Miami Heat 10.151612
144 Monta Ellis SG Indiana Pacers 10.300000
95 Wilson Chandler SF Denver Nuggets 10.449438
... ... ... ... ...
131 Dwight Howard C Houston Rockets 22.359364
255 Carmelo Anthony SF New York Knicks 22.875000
72 LeBron James SF Cleveland Cavaliers 22.970500
29 Joe Johnson SF Brooklyn Nets 24.894863
169 Kobe Bryant SF Los Angeles Lakers 25.000000

69 rows × 4 columns

How much did Stephen Curry make? For the answer, we have to access the row where the value of PLAYER is equal to Stephen Curry. That is placed a table consisting of just one line:

nba[nba['PLAYER'] == 'Stephen Curry']
PLAYER POSITION TEAM SALARY
121 Stephen Curry PG Golden State Warriors 11.370786

Curry made just under \(\$11.4\) million dollars. That’s a lot of money, but it’s less than half the salary of LeBron James. You’ll find that salary in the “Top 5” table earlier in this section, or you could find it replacing 'Stephen Curry' by 'LeBron James' in the line of code above.

Thus for example you can get a DataFrame where the ‘TEAM’ is exactly equal to ‘Golden State Warriors’:

nba[nba['TEAM'] == 'Golden State Warriors']
PLAYER POSITION TEAM SALARY
117 Klay Thompson SG Golden State Warriors 15.501000
118 Draymond Green PF Golden State Warriors 14.260870
119 Andrew Bogut C Golden State Warriors 13.800000
120 Andre Iguodala SF Golden State Warriors 11.710456
121 Stephen Curry PG Golden State Warriors 11.370786
122 Jason Thompson PF Golden State Warriors 7.008475
123 Shaun Livingston PG Golden State Warriors 5.543725
124 Harrison Barnes SF Golden State Warriors 3.873398
125 Marreese Speights C Golden State Warriors 3.815000
126 Leandro Barbosa SG Golden State Warriors 2.500000
127 Festus Ezeli C Golden State Warriors 2.008748
128 Brandon Rush SF Golden State Warriors 1.270964
129 Kevon Looney SF Golden State Warriors 1.131960
130 Anderson Varejao PF Golden State Warriors 0.289755

This portion of the table is already sorted by salary, because the original table listed players sorted by salary within the same team. By not using .head() at the end of the line all rows are shown, not just the first 10.

2.1.2. Multiple Features

You can access rows that have multiple specified features, by using the boolean & operator. For example, here is a way to extract all the Point Guards whose salaries were over \(\$15\) million.

nba[(nba['POSITION'] == 'PG') & (nba['SALARY'] > 15)]
PLAYER POSITION TEAM SALARY
60 Derrick Rose PG Chicago Bulls 20.093064
74 Kyrie Irving PG Cleveland Cavaliers 16.407501
156 Chris Paul PG Los Angeles Clippers 21.468695
269 Russell Westbrook PG Oklahoma City Thunder 16.744218
400 John Wall PG Washington Wizards 15.851950

2.1.3. General Form

By now you will have realized that the general way to create a new df by selecting rows with a given feature is to use & or OR with the appropriate condition:

df[df['column_label_string'] condition(<, >, ==, =>, etc) criteria]

nba[(nba['SALARY'] >= 10) & (nba['SALARY'] <=10.3)]
PLAYER POSITION TEAM SALARY
144 Monta Ellis SG Indiana Pacers 10.300000
204 Luol Deng SF Miami Heat 10.151612
298 Gerald Wallace SF Philadelphia 76ers 10.105855
356 Danny Green SG San Antonio Spurs 10.000000
368 DeMar DeRozan SG Toronto Raptors 10.050000

If we specify a condition that isn’t satisfied by any row, we get a table with column labels but no rows.

nba[nba['PLAYER'] == 'Barack Obama']
PLAYER POSITION TEAM SALARY

We end the section with a series of examples.

The use of are.containing can help save some typing. For example, you can just specify Warriors instead of Golden State Warriors:

nba[nba['TEAM'].str.contains('Warriors')]
PLAYER POSITION TEAM SALARY
117 Klay Thompson SG Golden State Warriors 15.501000
118 Draymond Green PF Golden State Warriors 14.260870
119 Andrew Bogut C Golden State Warriors 13.800000
120 Andre Iguodala SF Golden State Warriors 11.710456
121 Stephen Curry PG Golden State Warriors 11.370786
122 Jason Thompson PF Golden State Warriors 7.008475
123 Shaun Livingston PG Golden State Warriors 5.543725
124 Harrison Barnes SF Golden State Warriors 3.873398
125 Marreese Speights C Golden State Warriors 3.815000
126 Leandro Barbosa SG Golden State Warriors 2.500000
127 Festus Ezeli C Golden State Warriors 2.008748
128 Brandon Rush SF Golden State Warriors 1.270964
129 Kevon Looney SF Golden State Warriors 1.131960
130 Anderson Varejao PF Golden State Warriors 0.289755

You can extract data for all the guards, both Point Guards and Shooting Guards:

nba[nba['POSITION'].str.contains('G')]
PLAYER POSITION TEAM SALARY
3 Jeff Teague PG Atlanta Hawks 8.000000
4 Kyle Korver SG Atlanta Hawks 5.746479
8 Dennis Schroder PG Atlanta Hawks 1.763400
9 Tim Hardaway Jr. SG Atlanta Hawks 1.304520
11 Jason Richardson SG Atlanta Hawks 0.947276
... ... ... ... ...
409 Alan Anderson SG Washington Wizards 4.000000
411 Ramon Sessions PG Washington Wizards 2.170465
412 Gary Neal PG Washington Wizards 2.139000
415 Garrett Temple SG Washington Wizards 1.100602
416 Jarell Eddie SG Washington Wizards 0.561716

181 rows × 4 columns

You can get all the players who were not Cleveland Cavaliers and had a salary of no less than \(\$20\) million:

other_than_Cavs = nba[nba['TEAM'] != 'Cleveland Cavaliers']
other_than_Cavs[other_than_Cavs['SALARY']  > 20]
PLAYER POSITION TEAM SALARY
29 Joe Johnson SF Brooklyn Nets 24.894863
60 Derrick Rose PG Chicago Bulls 20.093064
131 Dwight Howard C Houston Rockets 22.359364
156 Chris Paul PG Los Angeles Clippers 21.468695
169 Kobe Bryant SF Los Angeles Lakers 25.000000
201 Chris Bosh PF Miami Heat 22.192730
255 Carmelo Anthony SF New York Knicks 22.875000
268 Kevin Durant SF Oklahoma City Thunder 20.158622

The same table can be created in many ways. Here is another, and no doubt you can think of more.