2. Selecting Rows¶
Often, we would like to extract just those rows that correspond to entries with a particular feature. For example, we might want only the rows corresponding to the Warriors, or to players who earned more than \(\$10\) million. Or we might just want the top five earners.
2.1. Specified Rows¶
The fact that a DataFrame creates an index by default startts to become very useful here as we can specify which rows (by default) we wish to inspect by stating an index or an index range. The argument used a row index or array of indices, and it creates a new DataFrame consisting of only those rows.
For example, if we wanted just the first row of nba
, we could use df.iloc[]
as follows.
nba
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
0 | Paul Millsap | PF | Atlanta Hawks | 18.671659 |
1 | Al Horford | C | Atlanta Hawks | 12.000000 |
2 | Tiago Splitter | C | Atlanta Hawks | 9.756250 |
3 | Jeff Teague | PG | Atlanta Hawks | 8.000000 |
4 | Kyle Korver | SG | Atlanta Hawks | 5.746479 |
... | ... | ... | ... | ... |
412 | Gary Neal | PG | Washington Wizards | 2.139000 |
413 | DeJuan Blair | C | Washington Wizards | 2.000000 |
414 | Kelly Oubre Jr. | SF | Washington Wizards | 1.920240 |
415 | Garrett Temple | SG | Washington Wizards | 1.100602 |
416 | Jarell Eddie | SG | Washington Wizards | 0.561716 |
417 rows × 4 columns
nba.iloc[0]
PLAYER Paul Millsap
POSITION PF
TEAM Atlanta Hawks
SALARY 18.671659
Name: 0, dtype: object
nba.iloc[[0]]
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
0 | Paul Millsap | PF | Atlanta Hawks | 18.671659 |
This is a new table with just the single row that we specified.
We could also get the fourth, fifth, and sixth rows by specifying a range of indices as the argument.
nba.iloc[np.arange(3, 6)]
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
3 | Jeff Teague | PG | Atlanta Hawks | 8.000000 |
4 | Kyle Korver | SG | Atlanta Hawks | 5.746479 |
5 | Thabo Sefolosha | SF | Atlanta Hawks | 4.000000 |
If we want a table of the top 5 highest paid players, we can first sort the list by salary and then df.iloc[]
the first five rows:
nba.sort_values('SALARY', ascending=False).iloc[(np.arange(5))]
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
169 | Kobe Bryant | SF | Los Angeles Lakers | 25.000000 |
29 | Joe Johnson | SF | Brooklyn Nets | 24.894863 |
72 | LeBron James | SF | Cleveland Cavaliers | 22.970500 |
255 | Carmelo Anthony | SF | New York Knicks | 22.875000 |
131 | Dwight Howard | C | Houston Rockets | 22.359364 |
2.1.1. Rows Corresponding to a Specified Feature¶
More often, we will want to access data in a set of rows that have a certain feature, but whose indices we don’t know ahead of time. For example, we might want data on all the players who made more than \(\$10\) million, but we don’t want to spend time counting rows in the sorted table.
Array version - if we wish to work with an array we can use np.where(df['column'] criteria)
.
DataFrame version - to implement a selection criteria the df is called with selection criteria being applied to the df.col i.e. df[df['column_name']criteria]
.
In the first example, we extract the data for all those who earned more than \(\$10\) million.
np.where(nba['SALARY'] > 10)
# or - this is an example of alternatives being available to select,
# this may depend upon preference, the task at hand, the impact of processing time or export requirements
#nba[nba['SALARY'] >10]
(array([ 0, 1, 29, 30, 42, 43, 44, 60, 61, 62, 72, 73, 74,
75, 76, 82, 83, 93, 94, 95, 107, 117, 118, 119, 120, 121,
131, 132, 133, 143, 144, 156, 157, 169, 170, 180, 201, 202, 203,
204, 213, 226, 227, 239, 240, 241, 255, 256, 268, 269, 270, 271,
284, 285, 298, 311, 312, 342, 343, 353, 354, 355, 366, 367, 368,
383, 400, 401, 402]),)
The use of the argument df[df[col] > 10]
ensured that each selected row had a value of SALARY
that was greater than 10.
There are 69 rows in the new table, corresponding to the 69 players who made more than \(10\) million dollars. Arranging these rows in order makes the data easier to analyze. DeMar DeRozan of the Toronto Raptors was the “poorest” of this group, at a salary of just over \(10\) million dollars.
nba[nba['SALARY'] >10].sort_values('SALARY')
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
368 | DeMar DeRozan | SG | Toronto Raptors | 10.050000 |
298 | Gerald Wallace | SF | Philadelphia 76ers | 10.105855 |
204 | Luol Deng | SF | Miami Heat | 10.151612 |
144 | Monta Ellis | SG | Indiana Pacers | 10.300000 |
95 | Wilson Chandler | SF | Denver Nuggets | 10.449438 |
... | ... | ... | ... | ... |
131 | Dwight Howard | C | Houston Rockets | 22.359364 |
255 | Carmelo Anthony | SF | New York Knicks | 22.875000 |
72 | LeBron James | SF | Cleveland Cavaliers | 22.970500 |
29 | Joe Johnson | SF | Brooklyn Nets | 24.894863 |
169 | Kobe Bryant | SF | Los Angeles Lakers | 25.000000 |
69 rows × 4 columns
How much did Stephen Curry make? For the answer, we have to access the row where the value of PLAYER
is equal to Stephen Curry
. That is placed a table consisting of just one line:
nba[nba['PLAYER'] == 'Stephen Curry']
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
121 | Stephen Curry | PG | Golden State Warriors | 11.370786 |
Curry made just under \(\$11.4\) million dollars. That’s a lot of money, but it’s less than half the salary of LeBron James. You’ll find that salary in the “Top 5” table earlier in this section, or you could find it replacing 'Stephen Curry'
by 'LeBron James'
in the line of code above.
Thus for example you can get a DataFrame where the ‘TEAM’ is exactly equal to ‘Golden State Warriors’:
nba[nba['TEAM'] == 'Golden State Warriors']
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
117 | Klay Thompson | SG | Golden State Warriors | 15.501000 |
118 | Draymond Green | PF | Golden State Warriors | 14.260870 |
119 | Andrew Bogut | C | Golden State Warriors | 13.800000 |
120 | Andre Iguodala | SF | Golden State Warriors | 11.710456 |
121 | Stephen Curry | PG | Golden State Warriors | 11.370786 |
122 | Jason Thompson | PF | Golden State Warriors | 7.008475 |
123 | Shaun Livingston | PG | Golden State Warriors | 5.543725 |
124 | Harrison Barnes | SF | Golden State Warriors | 3.873398 |
125 | Marreese Speights | C | Golden State Warriors | 3.815000 |
126 | Leandro Barbosa | SG | Golden State Warriors | 2.500000 |
127 | Festus Ezeli | C | Golden State Warriors | 2.008748 |
128 | Brandon Rush | SF | Golden State Warriors | 1.270964 |
129 | Kevon Looney | SF | Golden State Warriors | 1.131960 |
130 | Anderson Varejao | PF | Golden State Warriors | 0.289755 |
This portion of the table is already sorted by salary, because the original table listed players sorted by salary within the same team. By not using .head()
at the end of the line all rows are shown, not just the first 10.
2.1.2. Multiple Features¶
You can access rows that have multiple specified features, by using the boolean &
operator. For example, here is a way to extract all the Point Guards whose salaries were over \(\$15\) million.
nba[(nba['POSITION'] == 'PG') & (nba['SALARY'] > 15)]
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
60 | Derrick Rose | PG | Chicago Bulls | 20.093064 |
74 | Kyrie Irving | PG | Cleveland Cavaliers | 16.407501 |
156 | Chris Paul | PG | Los Angeles Clippers | 21.468695 |
269 | Russell Westbrook | PG | Oklahoma City Thunder | 16.744218 |
400 | John Wall | PG | Washington Wizards | 15.851950 |
2.1.3. General Form¶
By now you will have realized that the general way to create a new df by selecting rows with a given feature is to use &
or OR
with the appropriate condition:
df[df['column_label_string'] condition(<, >, ==, =>, etc) criteria]
nba[(nba['SALARY'] >= 10) & (nba['SALARY'] <=10.3)]
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
144 | Monta Ellis | SG | Indiana Pacers | 10.300000 |
204 | Luol Deng | SF | Miami Heat | 10.151612 |
298 | Gerald Wallace | SF | Philadelphia 76ers | 10.105855 |
356 | Danny Green | SG | San Antonio Spurs | 10.000000 |
368 | DeMar DeRozan | SG | Toronto Raptors | 10.050000 |
If we specify a condition that isn’t satisfied by any row, we get a table with column labels but no rows.
nba[nba['PLAYER'] == 'Barack Obama']
PLAYER | POSITION | TEAM | SALARY |
---|
We end the section with a series of examples.
The use of are.containing
can help save some typing. For example, you can just specify Warriors
instead of Golden State Warriors
:
nba[nba['TEAM'].str.contains('Warriors')]
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
117 | Klay Thompson | SG | Golden State Warriors | 15.501000 |
118 | Draymond Green | PF | Golden State Warriors | 14.260870 |
119 | Andrew Bogut | C | Golden State Warriors | 13.800000 |
120 | Andre Iguodala | SF | Golden State Warriors | 11.710456 |
121 | Stephen Curry | PG | Golden State Warriors | 11.370786 |
122 | Jason Thompson | PF | Golden State Warriors | 7.008475 |
123 | Shaun Livingston | PG | Golden State Warriors | 5.543725 |
124 | Harrison Barnes | SF | Golden State Warriors | 3.873398 |
125 | Marreese Speights | C | Golden State Warriors | 3.815000 |
126 | Leandro Barbosa | SG | Golden State Warriors | 2.500000 |
127 | Festus Ezeli | C | Golden State Warriors | 2.008748 |
128 | Brandon Rush | SF | Golden State Warriors | 1.270964 |
129 | Kevon Looney | SF | Golden State Warriors | 1.131960 |
130 | Anderson Varejao | PF | Golden State Warriors | 0.289755 |
You can extract data for all the guards, both Point Guards and Shooting Guards:
nba[nba['POSITION'].str.contains('G')]
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
3 | Jeff Teague | PG | Atlanta Hawks | 8.000000 |
4 | Kyle Korver | SG | Atlanta Hawks | 5.746479 |
8 | Dennis Schroder | PG | Atlanta Hawks | 1.763400 |
9 | Tim Hardaway Jr. | SG | Atlanta Hawks | 1.304520 |
11 | Jason Richardson | SG | Atlanta Hawks | 0.947276 |
... | ... | ... | ... | ... |
409 | Alan Anderson | SG | Washington Wizards | 4.000000 |
411 | Ramon Sessions | PG | Washington Wizards | 2.170465 |
412 | Gary Neal | PG | Washington Wizards | 2.139000 |
415 | Garrett Temple | SG | Washington Wizards | 1.100602 |
416 | Jarell Eddie | SG | Washington Wizards | 0.561716 |
181 rows × 4 columns
You can get all the players who were not Cleveland Cavaliers and had a salary of no less than \(\$20\) million:
other_than_Cavs = nba[nba['TEAM'] != 'Cleveland Cavaliers']
other_than_Cavs[other_than_Cavs['SALARY'] > 20]
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
29 | Joe Johnson | SF | Brooklyn Nets | 24.894863 |
60 | Derrick Rose | PG | Chicago Bulls | 20.093064 |
131 | Dwight Howard | C | Houston Rockets | 22.359364 |
156 | Chris Paul | PG | Los Angeles Clippers | 21.468695 |
169 | Kobe Bryant | SF | Los Angeles Lakers | 25.000000 |
201 | Chris Bosh | PF | Miami Heat | 22.192730 |
255 | Carmelo Anthony | SF | New York Knicks | 22.875000 |
268 | Kevin Durant | SF | Oklahoma City Thunder | 20.158622 |
The same table can be created in many ways. Here is another, and no doubt you can think of more.