2. Selecting Rows¶

Often, we would like to extract just those rows that correspond to entries with a particular feature. For example, we might want only the rows corresponding to the Warriors, or to players who earned more than $\$10$ million. Or we might just want the top five earners.

2.1. Specified Rows¶

The fact that a DataFrame creates an index by default startts to become very useful here as we can specify which rows (by default) we wish to inspect by stating an index or an index range. The argument used a row index or array of indices, and it creates a new DataFrame consisting of only those rows.

For example, if we wanted just the first row of nba, we could use df.iloc[] as follows.

nba

	PLAYER	POSITION	TEAM	SALARY
0	Paul Millsap	PF	Atlanta Hawks	18.671659
1	Al Horford	C	Atlanta Hawks	12.000000
2	Tiago Splitter	C	Atlanta Hawks	9.756250
3	Jeff Teague	PG	Atlanta Hawks	8.000000
4	Kyle Korver	SG	Atlanta Hawks	5.746479
...	...	...	...	...
412	Gary Neal	PG	Washington Wizards	2.139000
413	DeJuan Blair	C	Washington Wizards	2.000000
414	Kelly Oubre Jr.	SF	Washington Wizards	1.920240
415	Garrett Temple	SG	Washington Wizards	1.100602
416	Jarell Eddie	SG	Washington Wizards	0.561716

417 rows × 4 columns

nba.iloc[0]

PLAYER       Paul Millsap
POSITION               PF
TEAM        Atlanta Hawks
SALARY          18.671659
Name: 0, dtype: object

nba.iloc[[0]]

	PLAYER	POSITION	TEAM	SALARY
0	Paul Millsap	PF	Atlanta Hawks	18.671659

This is a new table with just the single row that we specified.

We could also get the fourth, fifth, and sixth rows by specifying a range of indices as the argument.

nba.iloc[np.arange(3, 6)]

	PLAYER	POSITION	TEAM	SALARY
3	Jeff Teague	PG	Atlanta Hawks	8.000000
4	Kyle Korver	SG	Atlanta Hawks	5.746479
5	Thabo Sefolosha	SF	Atlanta Hawks	4.000000

If we want a table of the top 5 highest paid players, we can first sort the list by salary and then df.iloc[] the first five rows:

nba.sort_values('SALARY', ascending=False).iloc[(np.arange(5))]

	PLAYER	POSITION	TEAM	SALARY
169	Kobe Bryant	SF	Los Angeles Lakers	25.000000
29	Joe Johnson	SF	Brooklyn Nets	24.894863
72	LeBron James	SF	Cleveland Cavaliers	22.970500
255	Carmelo Anthony	SF	New York Knicks	22.875000
131	Dwight Howard	C	Houston Rockets	22.359364

2.1.1. Rows Corresponding to a Specified Feature¶

More often, we will want to access data in a set of rows that have a certain feature, but whose indices we don’t know ahead of time. For example, we might want data on all the players who made more than $\$10$ million, but we don’t want to spend time counting rows in the sorted table.

Array version - if we wish to work with an array we can use np.where(df['column'] criteria).

np.where()

DataFrame version - to implement a selection criteria the df is called with selection criteria being applied to the df.col i.e. df[df['column_name']criteria].

In the first example, we extract the data for all those who earned more than $\$10$ million.

np.where(nba['SALARY'] > 10)

# or - this is an example of alternatives being available to select,
# this may depend upon preference, the task at hand, the impact of processing time or export requirements

#nba[nba['SALARY'] >10]

(array([  0,   1,  29,  30,  42,  43,  44,  60,  61,  62,  72,  73,  74,
         75,  76,  82,  83,  93,  94,  95, 107, 117, 118, 119, 120, 121,
        131, 132, 133, 143, 144, 156, 157, 169, 170, 180, 201, 202, 203,
        204, 213, 226, 227, 239, 240, 241, 255, 256, 268, 269, 270, 271,
        284, 285, 298, 311, 312, 342, 343, 353, 354, 355, 366, 367, 368,
        383, 400, 401, 402]),)

The use of the argument df[df[col] > 10] ensured that each selected row had a value of SALARY that was greater than 10.

There are 69 rows in the new table, corresponding to the 69 players who made more than $10$ million dollars. Arranging these rows in order makes the data easier to analyze. DeMar DeRozan of the Toronto Raptors was the “poorest” of this group, at a salary of just over $10$ million dollars.

nba[nba['SALARY'] >10].sort_values('SALARY')

	PLAYER	POSITION	TEAM	SALARY
368	DeMar DeRozan	SG	Toronto Raptors	10.050000
298	Gerald Wallace	SF	Philadelphia 76ers	10.105855
204	Luol Deng	SF	Miami Heat	10.151612
144	Monta Ellis	SG	Indiana Pacers	10.300000
95	Wilson Chandler	SF	Denver Nuggets	10.449438
...	...	...	...	...
131	Dwight Howard	C	Houston Rockets	22.359364
255	Carmelo Anthony	SF	New York Knicks	22.875000
72	LeBron James	SF	Cleveland Cavaliers	22.970500
29	Joe Johnson	SF	Brooklyn Nets	24.894863
169	Kobe Bryant	SF	Los Angeles Lakers	25.000000

69 rows × 4 columns

How much did Stephen Curry make? For the answer, we have to access the row where the value of PLAYER is equal to Stephen Curry. That is placed a table consisting of just one line:

nba[nba['PLAYER'] == 'Stephen Curry']

	PLAYER	POSITION	TEAM	SALARY
121	Stephen Curry	PG	Golden State Warriors	11.370786

Curry made just under $\$11.4$ million dollars. That’s a lot of money, but it’s less than half the salary of LeBron James. You’ll find that salary in the “Top 5” table earlier in this section, or you could find it replacing 'Stephen Curry' by 'LeBron James' in the line of code above.

Thus for example you can get a DataFrame where the ‘TEAM’ is exactly equal to ‘Golden State Warriors’:

nba[nba['TEAM'] == 'Golden State Warriors']

	PLAYER	POSITION	TEAM	SALARY
117	Klay Thompson	SG	Golden State Warriors	15.501000
118	Draymond Green	PF	Golden State Warriors	14.260870
119	Andrew Bogut	C	Golden State Warriors	13.800000
120	Andre Iguodala	SF	Golden State Warriors	11.710456
121	Stephen Curry	PG	Golden State Warriors	11.370786
122	Jason Thompson	PF	Golden State Warriors	7.008475
123	Shaun Livingston	PG	Golden State Warriors	5.543725
124	Harrison Barnes	SF	Golden State Warriors	3.873398
125	Marreese Speights	C	Golden State Warriors	3.815000
126	Leandro Barbosa	SG	Golden State Warriors	2.500000
127	Festus Ezeli	C	Golden State Warriors	2.008748
128	Brandon Rush	SF	Golden State Warriors	1.270964
129	Kevon Looney	SF	Golden State Warriors	1.131960
130	Anderson Varejao	PF	Golden State Warriors	0.289755

This portion of the table is already sorted by salary, because the original table listed players sorted by salary within the same team. By not using .head() at the end of the line all rows are shown, not just the first 10.

2.1.2. Multiple Features¶

You can access rows that have multiple specified features, by using the boolean & operator. For example, here is a way to extract all the Point Guards whose salaries were over $\$15$ million.

nba[(nba['POSITION'] == 'PG') & (nba['SALARY'] > 15)]

	PLAYER	POSITION	TEAM	SALARY
60	Derrick Rose	PG	Chicago Bulls	20.093064
74	Kyrie Irving	PG	Cleveland Cavaliers	16.407501
156	Chris Paul	PG	Los Angeles Clippers	21.468695
269	Russell Westbrook	PG	Oklahoma City Thunder	16.744218
400	John Wall	PG	Washington Wizards	15.851950

2.1.3. General Form¶

By now you will have realized that the general way to create a new df by selecting rows with a given feature is to use & or OR with the appropriate condition:

df[df['column_label_string'] condition(<, >, ==, =>, etc) criteria]

nba[(nba['SALARY'] >= 10) & (nba['SALARY'] <=10.3)]

	PLAYER	POSITION	TEAM	SALARY
144	Monta Ellis	SG	Indiana Pacers	10.300000
204	Luol Deng	SF	Miami Heat	10.151612
298	Gerald Wallace	SF	Philadelphia 76ers	10.105855
356	Danny Green	SG	San Antonio Spurs	10.000000
368	DeMar DeRozan	SG	Toronto Raptors	10.050000

If we specify a condition that isn’t satisfied by any row, we get a table with column labels but no rows.

nba[nba['PLAYER'] == 'Barack Obama']

	PLAYER	POSITION	TEAM	SALARY

We end the section with a series of examples.

The use of are.containing can help save some typing. For example, you can just specify Warriors instead of Golden State Warriors:

nba[nba['TEAM'].str.contains('Warriors')]

	PLAYER	POSITION	TEAM	SALARY
117	Klay Thompson	SG	Golden State Warriors	15.501000
118	Draymond Green	PF	Golden State Warriors	14.260870
119	Andrew Bogut	C	Golden State Warriors	13.800000
120	Andre Iguodala	SF	Golden State Warriors	11.710456
121	Stephen Curry	PG	Golden State Warriors	11.370786
122	Jason Thompson	PF	Golden State Warriors	7.008475
123	Shaun Livingston	PG	Golden State Warriors	5.543725
124	Harrison Barnes	SF	Golden State Warriors	3.873398
125	Marreese Speights	C	Golden State Warriors	3.815000
126	Leandro Barbosa	SG	Golden State Warriors	2.500000
127	Festus Ezeli	C	Golden State Warriors	2.008748
128	Brandon Rush	SF	Golden State Warriors	1.270964
129	Kevon Looney	SF	Golden State Warriors	1.131960
130	Anderson Varejao	PF	Golden State Warriors	0.289755

You can extract data for all the guards, both Point Guards and Shooting Guards:

nba[nba['POSITION'].str.contains('G')]

	PLAYER	POSITION	TEAM	SALARY
3	Jeff Teague	PG	Atlanta Hawks	8.000000
4	Kyle Korver	SG	Atlanta Hawks	5.746479
8	Dennis Schroder	PG	Atlanta Hawks	1.763400
9	Tim Hardaway Jr.	SG	Atlanta Hawks	1.304520
11	Jason Richardson	SG	Atlanta Hawks	0.947276
...	...	...	...	...
409	Alan Anderson	SG	Washington Wizards	4.000000
411	Ramon Sessions	PG	Washington Wizards	2.170465
412	Gary Neal	PG	Washington Wizards	2.139000
415	Garrett Temple	SG	Washington Wizards	1.100602
416	Jarell Eddie	SG	Washington Wizards	0.561716

181 rows × 4 columns

You can get all the players who were not Cleveland Cavaliers and had a salary of no less than $\$20$ million:

other_than_Cavs = nba[nba['TEAM'] != 'Cleveland Cavaliers']
other_than_Cavs[other_than_Cavs['SALARY']  > 20]

	PLAYER	POSITION	TEAM	SALARY
29	Joe Johnson	SF	Brooklyn Nets	24.894863
60	Derrick Rose	PG	Chicago Bulls	20.093064
131	Dwight Howard	C	Houston Rockets	22.359364
156	Chris Paul	PG	Los Angeles Clippers	21.468695
169	Kobe Bryant	SF	Los Angeles Lakers	25.000000
201	Chris Bosh	PF	Miami Heat	22.192730
255	Carmelo Anthony	SF	New York Knicks	22.875000
268	Kevin Durant	SF	Oklahoma City Thunder	20.158622

The same table can be created in many ways. Here is another, and no doubt you can think of more.

Fundamentals of Data Science

2. Selecting Rows¶

2.1. Specified Rows¶

2.1.1. Rows Corresponding to a Specified Feature¶

2.1.2. Multiple Features¶

2.1.3. General Form¶