6. DataFrames¶
DataFrames (df’s) are a fundamental object type for representing data sets. A df can be viewed in two ways:
a sequence of named columns that each describe a single aspect of all entries in a data set, or
a sequence of rows that each contain all information about a single entry in a data set.
In order to use a DataFrame, import all of the module called pandas
, by convention this is usually imported and as pd
.
import pandas as pd
Empty tables can be created using the pd.DataFrame()
function. An empty table is usefuly because it can be extended to contain new rows and columns.
pd.DataFrame(columns=['A', 'B', 'C'], index=(0,1,2,3,4)).fillna(0)
A | B | C | |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 |
2 | 0 | 0 | 0 |
3 | 0 | 0 | 0 |
4 | 0 | 0 | 0 |
A dictionary is used to construct a new table with labeled columns. Each column of a df is an array.
Below, we begin each example with an empty table that has no columns.
Notice:
the column heading is supplied as a
key
with the corresponding column content added as avalue
in a{key:value}
dictionarywhen using a list as a data source for the
np.array()
function the list must be placed within square orhard
brackets.an
index
has been added automatically
pd.DataFrame({'Number of petals': np.array([8, 34, 5])})
Number of petals | |
---|---|
0 | 8 |
1 | 34 |
2 | 5 |
To add two (or more) new columns, provide the label and array for each column. All columns must have the same length, or an error will occur.
pd.DataFrame({
'Number of petals': np.array([8, 34, 5]),
'Name':np.array(['lotus', 'sunflower', 'rose'])
})
Number of petals | Name | |
---|---|---|
0 | 8 | lotus |
1 | 34 | sunflower |
2 | 5 | rose |
We can give this table a name, and then extend the table with another column.
flowers = pd.DataFrame({'Number of petals':np.array([8, 34, 5]),'Name':np.array(['lotus', 'sunflower', 'rose'])})
flowers_two_col = flowers.copy()
flowers['Color'] = np.array({'pink', 'yellow', 'red'})
flowers
Number of petals | Name | Color | |
---|---|---|---|
0 | 8 | lotus | {pink, yellow, red} |
1 | 34 | sunflower | {pink, yellow, red} |
2 | 5 | rose | {pink, yellow, red} |
When a new column is added to a Dataframe a new DatFrame is not created, so the original DataFrame is affected. For example, the original DatFrame flowers
before the third was added.
flowers
Number of petals | Name | Color | |
---|---|---|---|
0 | 8 | lotus | {pink, yellow, red} |
1 | 34 | sunflower | {pink, yellow, red} |
2 | 5 | rose | {pink, yellow, red} |
Before adding a third column a copy of df ‘flowers’ is created, in this case the new df created is called flowers_two_col. ‘flowers_two_col = flowers.copy()
’
flowers_two_col
Number of petals | Name | |
---|---|---|
0 | 8 | lotus |
1 | 34 | sunflower |
2 | 5 | rose |
Creating dfs in this way involves a lot of typing. If the data have already been entered somewhere, it is usually possible to use Python to read it into a table, instead of typing it all in cell by cell.
Often, dfs are created from files that contain comma-separated values. Such files are called CSV files.
Below, we use the Table method pd.read_csv()
to read a CSV file that contains some of the data used by Minard in his graphic about Napoleon’s Russian campaign. The data are placed in a df named minard
.
minard = pd.read_csv(path_data + 'minard.csv')
minard
Longitude | Latitude | City | Direction | Survivors | |
---|---|---|---|---|---|
0 | 32.0 | 54.8 | Smolensk | Advance | 145000 |
1 | 33.2 | 54.9 | Dorogobouge | Advance | 140000 |
2 | 34.4 | 55.5 | Chjat | Advance | 127100 |
3 | 37.6 | 55.8 | Moscou | Advance | 100000 |
4 | 34.3 | 55.2 | Wixma | Retreat | 55000 |
5 | 32.0 | 54.6 | Smolensk | Retreat | 24000 |
6 | 30.4 | 54.4 | Orscha | Retreat | 20000 |
7 | 26.8 | 54.3 | Moiodexno | Retreat | 12000 |
We will use this small df to demonstrate some useful DataFrame methods. We will then use those same methods, and develop other methods, on much larger DataFrames.
The Size of the Table¶
The method df.shape(1)
gives the number of columns in the table, and df.shape(0)
the number of rows.
minard.shape
(8, 5)
minard.shape[1]
5
minard.shape[0]
8
len( )¶
The number of rows in a df can also be found by using the len()
function. For number of rows len(df.rows)
, and number of columns len(df.columns)
. As the default parameter for the len()
function is set for number of rows and if we want to know the number of rows we don’t usually add ‘.rows’
len(minard.columns)
5
len(minard)
8
Column Labels¶
The method .columns
can be used to list the labels of all the columns. With minard
we don’t gain much by this, but it can be very useful for tables that are so large that not all columns are visible on the screen.
minard.columns
Index(['Longitude', 'Latitude', 'City', 'Direction', 'Survivors'], dtype='object')
We can change column labels using the rename(columns={})
method. This creates a new df and leaves minard
unchanged.
minard.rename(columns={'City':'City Name'})
Longitude | Latitude | City Name | Direction | Survivors | |
---|---|---|---|---|---|
0 | 32.0 | 54.8 | Smolensk | Advance | 145000 |
1 | 33.2 | 54.9 | Dorogobouge | Advance | 140000 |
2 | 34.4 | 55.5 | Chjat | Advance | 127100 |
3 | 37.6 | 55.8 | Moscou | Advance | 100000 |
4 | 34.3 | 55.2 | Wixma | Retreat | 55000 |
5 | 32.0 | 54.6 | Smolensk | Retreat | 24000 |
6 | 30.4 | 54.4 | Orscha | Retreat | 20000 |
7 | 26.8 | 54.3 | Moiodexno | Retreat | 12000 |
However, this method does not change the original DataFrame.
minard
Longitude | Latitude | City | Direction | Survivors | |
---|---|---|---|---|---|
0 | 32.0 | 54.8 | Smolensk | Advance | 145000 |
1 | 33.2 | 54.9 | Dorogobouge | Advance | 140000 |
2 | 34.4 | 55.5 | Chjat | Advance | 127100 |
3 | 37.6 | 55.8 | Moscou | Advance | 100000 |
4 | 34.3 | 55.2 | Wixma | Retreat | 55000 |
5 | 32.0 | 54.6 | Smolensk | Retreat | 24000 |
6 | 30.4 | 54.4 | Orscha | Retreat | 20000 |
7 | 26.8 | 54.3 | Moiodexno | Retreat | 12000 |
A common pattern is to assign the original name minard
to the new table, so that all future uses of minard
will refer to the relabeled table.
minard = minard.rename(columns={'City':'City Name'})
minard
Longitude | Latitude | City Name | Direction | Survivors | |
---|---|---|---|---|---|
0 | 32.0 | 54.8 | Smolensk | Advance | 145000 |
1 | 33.2 | 54.9 | Dorogobouge | Advance | 140000 |
2 | 34.4 | 55.5 | Chjat | Advance | 127100 |
3 | 37.6 | 55.8 | Moscou | Advance | 100000 |
4 | 34.3 | 55.2 | Wixma | Retreat | 55000 |
5 | 32.0 | 54.6 | Smolensk | Retreat | 24000 |
6 | 30.4 | 54.4 | Orscha | Retreat | 20000 |
7 | 26.8 | 54.3 | Moiodexno | Retreat | 12000 |
Accessing the Data in a Column¶
We can use a column’s label to access the array of data in the column.
minard['Survivors']
0 145000
1 140000
2 127100
3 100000
4 55000
5 24000
6 20000
7 12000
Name: Survivors, dtype: int64
type( )¶
To determine the tupe of object created we can use the type()
function.
type(minard)
pandas.core.frame.DataFrame
Using two sets of square brackets the output is displayed in DataFrame format.
minard[['Survivors']]
Survivors | |
---|---|
0 | 145000 |
1 | 140000 |
2 | 127100 |
3 | 100000 |
4 | 55000 |
5 | 24000 |
6 | 20000 |
7 | 12000 |
type(minard)
pandas.core.frame.DataFrame
iLoc[ ]¶
(index location)
The 5 columns are indexed 0, 1, 2, 3, and 4. The column Survivors
can also be accessed by using the iloc[]
method with the required column index. Notice that to select a column using the iloc[]
method we have to first place a colon followed by a comma in the swuare brackets due to the default setting for iloc[]
being set to ‘rows’.
minard.iloc[:,4]
0 145000
1 140000
2 127100
3 100000
4 55000
5 24000
6 20000
7 12000
Name: Survivors, dtype: int64
The 8 items in the array are indexed 0, 1, 2, and so on, up to 7. The items in the column can be accessed using item
, as with any array.
minard.iloc[:,4][0]
145000
minard.iloc[:,4][5]
24000
Alternatively¶
if we wish to find a particular member of a row we select a row rather than a column. Notice that in this instance we have selected the 4th row and the 4th column, remembering that though there are 5 columns Pandas refers to the first column as column 0 and first row as row 0.
minard.iloc[5][4]
24000
Working with the Data in a Column¶
Because columns are arrays, we can use array operations on them to discover new information. For example, we can create a new column that contains the percent of all survivors at each city after Smolensk.
initial = minard['Survivors'][0]
minard['Percent Surviving'] = minard['Survivors']/initial
minard
Longitude | Latitude | City Name | Direction | Survivors | Percent Surviving | |
---|---|---|---|---|---|---|
0 | 32.0 | 54.8 | Smolensk | Advance | 145000 | 1.000000 |
1 | 33.2 | 54.9 | Dorogobouge | Advance | 140000 | 0.965517 |
2 | 34.4 | 55.5 | Chjat | Advance | 127100 | 0.876552 |
3 | 37.6 | 55.8 | Moscou | Advance | 100000 | 0.689655 |
4 | 34.3 | 55.2 | Wixma | Retreat | 55000 | 0.379310 |
5 | 32.0 | 54.6 | Smolensk | Retreat | 24000 | 0.165517 |
6 | 30.4 | 54.4 | Orscha | Retreat | 20000 | 0.137931 |
7 | 26.8 | 54.3 | Moiodexno | Retreat | 12000 | 0.082759 |
Formatting¶
To make the proportions in the new columns appear as percents, we can use the method style.format()
with the option.
minard.style.format({'Percent Surviving': "{:.2%}"})
Longitude | Latitude | City Name | Direction | Survivors | Percent Surviving | |
---|---|---|---|---|---|---|
0 | 32.000000 | 54.800000 | Smolensk | Advance | 145000 | 100.00% |
1 | 33.200000 | 54.900000 | Dorogobouge | Advance | 140000 | 96.55% |
2 | 34.400000 | 55.500000 | Chjat | Advance | 127100 | 87.66% |
3 | 37.600000 | 55.800000 | Moscou | Advance | 100000 | 68.97% |
4 | 34.300000 | 55.200000 | Wixma | Retreat | 55000 | 37.93% |
5 | 32.000000 | 54.600000 | Smolensk | Retreat | 24000 | 16.55% |
6 | 30.400000 | 54.400000 | Orscha | Retreat | 20000 | 13.79% |
7 | 26.800000 | 54.300000 | Moiodexno | Retreat | 12000 | 8.28% |
N.B. a peculiarity of the Jupyter notebook is that if you make a mistake e.g. misspelling a column name, when you run the formatting function a nwe column will be created. to remive this colummn you must retart the kernel.
Toolbar - Kernel - Restart & Clear Output
Choosing Sets of Columns¶
To select particular columns we can use df.['col1', 'col2']
which creates a new table that contains only the specified columns. When selecting a single column we can use one set of square brackets, when selecting multiple columns two sets of swuare brackets are required.
minard[['Longitude', 'Latitude']]
Longitude | Latitude | |
---|---|---|
0 | 32.0 | 54.8 |
1 | 33.2 | 54.9 |
2 | 34.4 | 55.5 |
3 | 37.6 | 55.8 |
4 | 34.3 | 55.2 |
5 | 32.0 | 54.6 |
6 | 30.4 | 54.4 |
7 | 26.8 | 54.3 |
The same selection can be made using column indices instead of labels.
N.B. the column range selected is 0:2 with the range being bottom heavy. Though the range bottom limit is 0 and the top limit is 2 instead of processing elements 0, 1 and 2 only elements 0 and 1 will be processed i.e. bottom heavy or top light
minard.iloc[:, 0:2]
Longitude | Latitude | |
---|---|---|
0 | 32.0 | 54.8 |
1 | 33.2 | 54.9 |
2 | 34.4 | 55.5 |
3 | 37.6 | 55.8 |
4 | 34.3 | 55.2 |
5 | 32.0 | 54.6 |
6 | 30.4 | 54.4 |
7 | 26.8 | 54.3 |
The result of using df.[' ']
is a new DataFrame, even when you select just one column.
minard['Survivors']
0 145000
1 140000
2 127100
3 100000
4 55000
5 24000
6 20000
7 12000
Name: Survivors, dtype: int64
Another way to create a new table consisting of a set of columns is to drop
the columns you don’t want.
minard.drop(columns=['Longitude', 'Latitude', 'Direction'])
City Name | Survivors | Percent Surviving | |
---|---|---|---|
0 | Smolensk | 145000 | 1.000000 |
1 | Dorogobouge | 140000 | 0.965517 |
2 | Chjat | 127100 | 0.876552 |
3 | Moscou | 100000 | 0.689655 |
4 | Wixma | 55000 | 0.379310 |
5 | Smolensk | 24000 | 0.165517 |
6 | Orscha | 20000 | 0.137931 |
7 | Moiodexno | 12000 | 0.082759 |
Neither df.[' ']
nor drop
change the original DataFrame. Instead, they create new smaller DataFrames that share the same data. The fact that the original DataFrame is preserved is useful! You can generate multiple different tables that only consider certain columns without worrying that one analysis will affect the other.
minard
Longitude | Latitude | City Name | Direction | Survivors | Percent Surviving | |
---|---|---|---|---|---|---|
0 | 32.0 | 54.8 | Smolensk | Advance | 145000 | 1.000000 |
1 | 33.2 | 54.9 | Dorogobouge | Advance | 140000 | 0.965517 |
2 | 34.4 | 55.5 | Chjat | Advance | 127100 | 0.876552 |
3 | 37.6 | 55.8 | Moscou | Advance | 100000 | 0.689655 |
4 | 34.3 | 55.2 | Wixma | Retreat | 55000 | 0.379310 |
5 | 32.0 | 54.6 | Smolensk | Retreat | 24000 | 0.165517 |
6 | 30.4 | 54.4 | Orscha | Retreat | 20000 | 0.137931 |
7 | 26.8 | 54.3 | Moiodexno | Retreat | 12000 | 0.082759 |
All of the methods that we have used above can be applied to any DataFrame.