4. Introduction to DataFrames¶

We can now apply Python to analyze data. We will work with data stored in DataFrame structures.

A DataFrames (df) is a fundamental way of representing data sets. A df can be viewed in two ways:

a sequence of named columns that each describe a single attribute of all entries in a data set, or
a sequence of rows that each contain all information about a single individual in a data set.

We will study dfs in great detail in the next several chapters. For now, we will just introduce a few methods without going into technical details.

The df cones has been imported for us; later we will see how, but here we will just work with it. First, let’s take a look at it.

cones.head()

	Flavor	Color	Price
0	strawberry	pink	3.55
1	chocolate	light brown	4.75
2	chocolate	dark brown	5.25
3	strawberry	pink	5.25
4	chocolate	dark brown	5.25

The DataFrame has six rows. Each row corresponds to one ice cream cone. The ice cream cones are the individuals.

Each cone has three attributes: flavor, color, and price. Each column contains the data on one of these attributes, and so all the entries of any single column are of the same kind. Each column has a label. We will refer to columns by their labels.

A df method is just like a function, but it must operate on a df. So the call looks like

name_of_DataFrame.method(arguments)

For example, if you want to see just the first two rows of a df, you can use the df method head.

cones.head(2)

	Flavor	Color	Price
0	strawberry	pink	3.55
1	chocolate	light brown	4.75

You can replace 2 by any number of rows. If you ask for more than six, you will only get six, because cones only has six rows.

4.1. Choosing Sets of Columns¶

The method select creates a new table consisting of only the specified columns. We can state which columns we want to view by using dot ‘.’ notation (not he same as in maths) or hard brackets with quotes. Note that an index is automatically generated, this is a fundamental aspect of the DataFrame as the index allows us to ‘locate’ members of the DataFrame.

# single square brackets

cones['Flavor']

# uncomment (remove the hash mark) the line below to view the 'type()' of the output

#type(cones['Flavor'])

  strawberry
   chocolate
   chocolate
  strawberry
   chocolate
   bubblegum
Name: Flavor, dtype: object

# double square brackets

cones[['Flavor']]

# uncomment the line below to view the 'type()' of the output

# type(cones[['Flavor']])

	Flavor
0	strawberry
1	chocolate
2	chocolate
3	strawberry
4	chocolate
5	bubblegum

cones.Flavor

  strawberry
   chocolate
   chocolate
  strawberry
   chocolate
   bubblegum
Name: Flavor, dtype: object

This leaves the original table unchanged.

cones

	Flavor	Color	Price
0	strawberry	pink	3.55
1	chocolate	light brown	4.75
2	chocolate	dark brown	5.25
3	strawberry	pink	5.25
4	chocolate	dark brown	5.25
5	bubblegum	pink	4.75

You can select more than one column, by separating the column labels by commas. When you wish to view more than one column the ‘hard brackets’ must be used twice.

cones[['Flavor', 'Price']]

	Flavor	Price
0	strawberry	3.55
1	chocolate	4.75
2	chocolate	5.25
3	strawberry	5.25
4	chocolate	5.25
5	bubblegum	4.75

You can also drop columns you don’t want. The table above can be created by dropping the Color column.

cones.drop(columns=['Color'])

	Flavor	Price
0	strawberry	3.55
1	chocolate	4.75
2	chocolate	5.25
3	strawberry	5.25
4	chocolate	5.25
5	bubblegum	4.75

You can name this new table and look at it again by just typing its name.

no_colors = cones.drop(columns=['Color'])

no_colors

	Flavor	Price
0	strawberry	3.55
1	chocolate	4.75
2	chocolate	5.25
3	strawberry	5.25
4	chocolate	5.25
5	bubblegum	4.75

Like selecting columns using hard brackets or dot notation, the drop method creates a smaller table and leaves the original table unchanged. In order to explore your data, you can create any number of smaller tables by using choosing or dropping columns. It will do no harm to your original data table.

4.2. Sorting Rows¶

The sort_values method creates a new table by arranging the rows of the original table in ascending order of the values in the specified column. Here the cones table has been sorted in ascending order of the price of the cones.

pandas sort_values

cones.sort_values('Price')

	Flavor	Color	Price
0	strawberry	pink	3.55
1	chocolate	light brown	4.75
5	bubblegum	pink	4.75
2	chocolate	dark brown	5.25
3	strawberry	pink	5.25
4	chocolate	dark brown	5.25

To sort in descending order, you can use an optional argument to sort. As the name implies, optional arguments don’t have to be used, but they can be used if you want to change the default behavior of a method.

By default, sort sorts in increasing order of the values in the specified column. To sort in decreasing order, use the optional argument ascending=False, the default value for ascending is True.

cones.sort_values('Price', ascending=False)

	Flavor	Color	Price
2	chocolate	dark brown	5.25
3	strawberry	pink	5.25
4	chocolate	dark brown	5.25
1	chocolate	light brown	4.75
5	bubblegum	pink	4.75
0	strawberry	pink	3.55

As when selecting and droping the sort method leaves the original table unchanged.

4.2.1. Selecting Rows that Satisfy a Condition¶

Creating a new DataFrame (in database world this wold be a ‘view’), consisting only of the rows that satisfy a given condition we use the ‘exactly equal to’ ==. In this section we will work with a very simple condition, which is that the value in a specified column must be exactly equal to a value that we also specify. Thus the == method has two arguments.

The code in the cell below creates a df consisting only of the rows corresponding to chocolate cones.

cones[cones['Flavor']=='chocolate']

	Flavor	Color	Price
1	chocolate	light brown	4.75
2	chocolate	dark brown	5.25
4	chocolate	dark brown	5.25

The arguments are the label of the column and the value we are looking for in that column. The == method can also be used when the condition that the rows must satisfy is more complicated. In those situations the call will be a little more complicated as well.

It is important to provide the value exactly. For example, if we specify Chocolate instead of chocolate, then where correctly finds no rows where the flavor is Chocolate.

cones[cones['Flavor'] == 'Chocolate']

	Flavor	Color	Price

Like all the other table methods in this section, == leaves the original table unchanged.

4.2.2. Example: Salaries in the NBA¶

“The NBA is the highest paying professional sports league in the world,” reported CNN in March 2016. The table nba contains the salaries of all National Basketball Association players in 2015-2016.

Each row represents one player. The columns are:

Column Label	Description
`PLAYER`	Player’s name
`POSITION`	Player’s position on team
`TEAM`	Team name
`SALARY`	Player’s salary in 2015-2016, in millions of dollars

The code for the positions is PG (Point Guard), SG (Shooting Guard), PF (Power Forward), SF (Small Forward), and C (Center). But what follows doesn’t involve details about how basketball is played.

The first row shows that Paul Millsap, Power Forward for the Atlanta Hawks, had a salary of almost $\$18.7$ million in 2015-2016.

nba

	PLAYER	POSITION	TEAM	SALARY
0	Paul Millsap	PF	Atlanta Hawks	18.671659
1	Al Horford	C	Atlanta Hawks	12.000000
2	Tiago Splitter	C	Atlanta Hawks	9.756250
3	Jeff Teague	PG	Atlanta Hawks	8.000000
4	Kyle Korver	SG	Atlanta Hawks	5.746479
...	...	...	...	...
412	Gary Neal	PG	Washington Wizards	2.139000
413	DeJuan Blair	C	Washington Wizards	2.000000
414	Kelly Oubre Jr.	SF	Washington Wizards	1.920240
415	Garrett Temple	SG	Washington Wizards	1.100602
416	Jarell Eddie	SG	Washington Wizards	0.561716

417 rows × 4 columns

Fans of Stephen Curry can find his row by using where.

nba[nba['PLAYER'] == 'Stephen Curry']

	PLAYER	POSITION	TEAM	SALARY
121	Stephen Curry	PG	Golden State Warriors	11.370786

We can also create a new table called warriors consisting of just the data for the Golden State Warriors.

warriors = nba[nba['TEAM'] =='Golden State Warriors']
warriors

	PLAYER	POSITION	TEAM	SALARY
117	Klay Thompson	SG	Golden State Warriors	15.501000
118	Draymond Green	PF	Golden State Warriors	14.260870
119	Andrew Bogut	C	Golden State Warriors	13.800000
120	Andre Iguodala	SF	Golden State Warriors	11.710456
121	Stephen Curry	PG	Golden State Warriors	11.370786
122	Jason Thompson	PF	Golden State Warriors	7.008475
123	Shaun Livingston	PG	Golden State Warriors	5.543725
124	Harrison Barnes	SF	Golden State Warriors	3.873398
125	Marreese Speights	C	Golden State Warriors	3.815000
126	Leandro Barbosa	SG	Golden State Warriors	2.500000
127	Festus Ezeli	C	Golden State Warriors	2.008748
128	Brandon Rush	SF	Golden State Warriors	1.270964
129	Kevon Looney	SF	Golden State Warriors	1.131960
130	Anderson Varejao	PF	Golden State Warriors	0.289755

By default, the first 10 lines of a table are displayed. You can use head() to display more or fewer. To display the entire table type the name of the DataFrame.

warriors

	PLAYER	POSITION	TEAM	SALARY
117	Klay Thompson	SG	Golden State Warriors	15.501000
118	Draymond Green	PF	Golden State Warriors	14.260870
119	Andrew Bogut	C	Golden State Warriors	13.800000
120	Andre Iguodala	SF	Golden State Warriors	11.710456
121	Stephen Curry	PG	Golden State Warriors	11.370786
122	Jason Thompson	PF	Golden State Warriors	7.008475
123	Shaun Livingston	PG	Golden State Warriors	5.543725
124	Harrison Barnes	SF	Golden State Warriors	3.873398
125	Marreese Speights	C	Golden State Warriors	3.815000
126	Leandro Barbosa	SG	Golden State Warriors	2.500000
127	Festus Ezeli	C	Golden State Warriors	2.008748
128	Brandon Rush	SF	Golden State Warriors	1.270964
129	Kevon Looney	SF	Golden State Warriors	1.131960
130	Anderson Varejao	PF	Golden State Warriors	0.289755

The nba table is sorted in alphabetical order of the team names. To see how the players were paid in 2015-2016, it is useful to sort the data by salary. Remember that by default, the sorting is in increasing order.

nba.sort_values('SALARY')

	PLAYER	POSITION	TEAM	SALARY
267	Thanasis Antetokounmpo	SF	New York Knicks	0.030888
327	Cory Jefferson	PF	Phoenix Suns	0.049709
326	Jordan McRae	SG	Phoenix Suns	0.049709
324	Orlando Johnson	SG	Phoenix Suns	0.055722
325	Phil Pressey	PG	Phoenix Suns	0.055722
...	...	...	...	...
131	Dwight Howard	C	Houston Rockets	22.359364
255	Carmelo Anthony	SF	New York Knicks	22.875000
72	LeBron James	SF	Cleveland Cavaliers	22.970500
29	Joe Johnson	SF	Brooklyn Nets	24.894863
169	Kobe Bryant	SF	Los Angeles Lakers	25.000000

417 rows × 4 columns

These figures are somewhat difficult to compare as some of these players changed teams during the season and received salaries from more than one team; only the salary from the last team appears in the table.

The CNN report is about the other end of the salary scale – the players who are among the highest paid in the world. To identify these players we can sort in descending order of salary and look at the top few rows.

nba.sort_values('SALARY', ascending=False)

	PLAYER	POSITION	TEAM	SALARY
169	Kobe Bryant	SF	Los Angeles Lakers	25.000000
29	Joe Johnson	SF	Brooklyn Nets	24.894863
72	LeBron James	SF	Cleveland Cavaliers	22.970500
255	Carmelo Anthony	SF	New York Knicks	22.875000
131	Dwight Howard	C	Houston Rockets	22.359364
...	...	...	...	...
200	Elliot Williams	SG	Memphis Grizzlies	0.055722
324	Orlando Johnson	SG	Phoenix Suns	0.055722
327	Cory Jefferson	PF	Phoenix Suns	0.049709
326	Jordan McRae	SG	Phoenix Suns	0.049709
267	Thanasis Antetokounmpo	SF	New York Knicks	0.030888

417 rows × 4 columns

Kobe Bryant, since retired, was the highest earning NBA player in 2015-2016.

Fundamentals of Data Science

4. Introduction to DataFrames¶

4.1. Choosing Sets of Columns¶

4.2. Sorting Rows¶

4.2.1. Selecting Rows that Satisfy a Condition¶

4.2.2. Example: Salaries in the NBA¶