6. DataFrames

DataFrames (df’s) are a fundamental object type for representing data sets. A df can be viewed in two ways:

  • a sequence of named columns that each describe a single aspect of all entries in a data set, or

  • a sequence of rows that each contain all information about a single entry in a data set.

In order to use a DataFrame, import all of the module called pandas, by convention this is usually imported and as pd.

import pandas as pd

Empty tables can be created using the pd.DataFrame() function. An empty table is usefuly because it can be extended to contain new rows and columns.

pd.DataFrame(columns=['A', 'B', 'C'], index=(0,1,2,3,4)).fillna(0)
A B C
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0

A dictionary is used to construct a new table with labeled columns. Each column of a df is an array.

Below, we begin each example with an empty table that has no columns.

Notice:

  • the column heading is supplied as a key with the corresponding column content added as avalue in a {key:value} dictionary

  • when using a list as a data source for the np.array() function the list must be placed within square or hard brackets.

  • an index has been added automatically

pd.DataFrame({'Number of petals': np.array([8, 34, 5])})
Number of petals
0 8
1 34
2 5

To add two (or more) new columns, provide the label and array for each column. All columns must have the same length, or an error will occur.

pd.DataFrame({
    'Number of petals': np.array([8, 34, 5]),
    'Name':np.array(['lotus', 'sunflower', 'rose'])
})
Number of petals Name
0 8 lotus
1 34 sunflower
2 5 rose

We can give this table a name, and then extend the table with another column.

flowers = pd.DataFrame({'Number of petals':np.array([8, 34, 5]),'Name':np.array(['lotus', 'sunflower', 'rose'])})

flowers_two_col = flowers.copy()

flowers['Color'] = np.array({'pink', 'yellow', 'red'})

    
flowers
Number of petals Name Color
0 8 lotus {pink, yellow, red}
1 34 sunflower {pink, yellow, red}
2 5 rose {pink, yellow, red}

When a new column is added to a Dataframe a new DatFrame is not created, so the original DataFrame is affected. For example, the original DatFrame flowers before the third was added.

flowers
Number of petals Name Color
0 8 lotus {pink, yellow, red}
1 34 sunflower {pink, yellow, red}
2 5 rose {pink, yellow, red}

Before adding a third column a copy of df ‘flowers’ is created, in this case the new df created is called flowers_two_col. ‘flowers_two_col = flowers.copy()

Pandas ‘df.copy()’

flowers_two_col
Number of petals Name
0 8 lotus
1 34 sunflower
2 5 rose

Creating dfs in this way involves a lot of typing. If the data have already been entered somewhere, it is usually possible to use Python to read it into a table, instead of typing it all in cell by cell.

Often, dfs are created from files that contain comma-separated values. Such files are called CSV files.

Below, we use the Table method pd.read_csv() to read a CSV file that contains some of the data used by Minard in his graphic about Napoleon’s Russian campaign. The data are placed in a df named minard.

pd.read_csv()

minard = pd.read_csv(path_data + 'minard.csv')
minard
Longitude Latitude City Direction Survivors
0 32.0 54.8 Smolensk Advance 145000
1 33.2 54.9 Dorogobouge Advance 140000
2 34.4 55.5 Chjat Advance 127100
3 37.6 55.8 Moscou Advance 100000
4 34.3 55.2 Wixma Retreat 55000
5 32.0 54.6 Smolensk Retreat 24000
6 30.4 54.4 Orscha Retreat 20000
7 26.8 54.3 Moiodexno Retreat 12000

We will use this small df to demonstrate some useful DataFrame methods. We will then use those same methods, and develop other methods, on much larger DataFrames.

The Size of the Table

The method df.shape(1) gives the number of columns in the table, and df.shape(0) the number of rows.

df.shape[]

minard.shape
(8, 5)
minard.shape[1]
5
minard.shape[0]
8

len( )

The number of rows in a df can also be found by using the len() function. For number of rows len(df.rows), and number of columns len(df.columns). As the default parameter for the len() function is set for number of rows and if we want to know the number of rows we don’t usually add ‘.rows’

len(minard.columns)
5
len(minard)
8

Column Labels

The method .columns can be used to list the labels of all the columns. With minard we don’t gain much by this, but it can be very useful for tables that are so large that not all columns are visible on the screen.

minard.columns
Index(['Longitude', 'Latitude', 'City', 'Direction', 'Survivors'], dtype='object')

We can change column labels using the rename(columns={}) method. This creates a new df and leaves minard unchanged.

minard.rename(columns={'City':'City Name'})
Longitude Latitude City Name Direction Survivors
0 32.0 54.8 Smolensk Advance 145000
1 33.2 54.9 Dorogobouge Advance 140000
2 34.4 55.5 Chjat Advance 127100
3 37.6 55.8 Moscou Advance 100000
4 34.3 55.2 Wixma Retreat 55000
5 32.0 54.6 Smolensk Retreat 24000
6 30.4 54.4 Orscha Retreat 20000
7 26.8 54.3 Moiodexno Retreat 12000

However, this method does not change the original DataFrame.

minard
Longitude Latitude City Direction Survivors
0 32.0 54.8 Smolensk Advance 145000
1 33.2 54.9 Dorogobouge Advance 140000
2 34.4 55.5 Chjat Advance 127100
3 37.6 55.8 Moscou Advance 100000
4 34.3 55.2 Wixma Retreat 55000
5 32.0 54.6 Smolensk Retreat 24000
6 30.4 54.4 Orscha Retreat 20000
7 26.8 54.3 Moiodexno Retreat 12000

A common pattern is to assign the original name minard to the new table, so that all future uses of minard will refer to the relabeled table.

minard = minard.rename(columns={'City':'City Name'})
minard
Longitude Latitude City Name Direction Survivors
0 32.0 54.8 Smolensk Advance 145000
1 33.2 54.9 Dorogobouge Advance 140000
2 34.4 55.5 Chjat Advance 127100
3 37.6 55.8 Moscou Advance 100000
4 34.3 55.2 Wixma Retreat 55000
5 32.0 54.6 Smolensk Retreat 24000
6 30.4 54.4 Orscha Retreat 20000
7 26.8 54.3 Moiodexno Retreat 12000

Accessing the Data in a Column

We can use a column’s label to access the array of data in the column.

minard['Survivors']
0    145000
1    140000
2    127100
3    100000
4     55000
5     24000
6     20000
7     12000
Name: Survivors, dtype: int64

type( )

To determine the tupe of object created we can use the type() function.

type(minard)
pandas.core.frame.DataFrame

Using two sets of square brackets the output is displayed in DataFrame format.

minard[['Survivors']]
Survivors
0 145000
1 140000
2 127100
3 100000
4 55000
5 24000
6 20000
7 12000
type(minard)
pandas.core.frame.DataFrame

iLoc[ ]

(index location)

The 5 columns are indexed 0, 1, 2, 3, and 4. The column Survivors can also be accessed by using the iloc[] method with the required column index. Notice that to select a column using the iloc[] method we have to first place a colon followed by a comma in the swuare brackets due to the default setting for iloc[] being set to ‘rows’.

Pandas iloc []

minard.iloc[:,4]
0    145000
1    140000
2    127100
3    100000
4     55000
5     24000
6     20000
7     12000
Name: Survivors, dtype: int64

The 8 items in the array are indexed 0, 1, 2, and so on, up to 7. The items in the column can be accessed using item, as with any array.

minard.iloc[:,4][0]
145000
minard.iloc[:,4][5]
24000

Alternatively

if we wish to find a particular member of a row we select a row rather than a column. Notice that in this instance we have selected the 4th row and the 4th column, remembering that though there are 5 columns Pandas refers to the first column as column 0 and first row as row 0.

minard.iloc[5][4]
24000

Working with the Data in a Column

Because columns are arrays, we can use array operations on them to discover new information. For example, we can create a new column that contains the percent of all survivors at each city after Smolensk.

initial = minard['Survivors'][0]

minard['Percent Surviving'] = minard['Survivors']/initial

minard
Longitude Latitude City Name Direction Survivors Percent Surviving
0 32.0 54.8 Smolensk Advance 145000 1.000000
1 33.2 54.9 Dorogobouge Advance 140000 0.965517
2 34.4 55.5 Chjat Advance 127100 0.876552
3 37.6 55.8 Moscou Advance 100000 0.689655
4 34.3 55.2 Wixma Retreat 55000 0.379310
5 32.0 54.6 Smolensk Retreat 24000 0.165517
6 30.4 54.4 Orscha Retreat 20000 0.137931
7 26.8 54.3 Moiodexno Retreat 12000 0.082759

Formatting

To make the proportions in the new columns appear as percents, we can use the method style.format() with the option.

style.format()

minard.style.format({'Percent Surviving': "{:.2%}"})
Longitude Latitude City Name Direction Survivors Percent Surviving
0 32.000000 54.800000 Smolensk Advance 145000 100.00%
1 33.200000 54.900000 Dorogobouge Advance 140000 96.55%
2 34.400000 55.500000 Chjat Advance 127100 87.66%
3 37.600000 55.800000 Moscou Advance 100000 68.97%
4 34.300000 55.200000 Wixma Retreat 55000 37.93%
5 32.000000 54.600000 Smolensk Retreat 24000 16.55%
6 30.400000 54.400000 Orscha Retreat 20000 13.79%
7 26.800000 54.300000 Moiodexno Retreat 12000 8.28%

N.B. a peculiarity of the Jupyter notebook is that if you make a mistake e.g. misspelling a column name, when you run the formatting function a nwe column will be created. to remive this colummn you must retart the kernel.

Toolbar - Kernel - Restart & Clear Output

Choosing Sets of Columns

To select particular columns we can use df.['col1', 'col2'] which creates a new table that contains only the specified columns. When selecting a single column we can use one set of square brackets, when selecting multiple columns two sets of swuare brackets are required.

minard[['Longitude', 'Latitude']]
Longitude Latitude
0 32.0 54.8
1 33.2 54.9
2 34.4 55.5
3 37.6 55.8
4 34.3 55.2
5 32.0 54.6
6 30.4 54.4
7 26.8 54.3

The same selection can be made using column indices instead of labels.

N.B. the column range selected is 0:2 with the range being bottom heavy. Though the range bottom limit is 0 and the top limit is 2 instead of processing elements 0, 1 and 2 only elements 0 and 1 will be processed i.e. bottom heavy or top light

minard.iloc[:, 0:2]
Longitude Latitude
0 32.0 54.8
1 33.2 54.9
2 34.4 55.5
3 37.6 55.8
4 34.3 55.2
5 32.0 54.6
6 30.4 54.4
7 26.8 54.3

The result of using df.[' '] is a new DataFrame, even when you select just one column.

minard['Survivors']
0    145000
1    140000
2    127100
3    100000
4     55000
5     24000
6     20000
7     12000
Name: Survivors, dtype: int64

Another way to create a new table consisting of a set of columns is to drop the columns you don’t want.

minard.drop(columns=['Longitude', 'Latitude', 'Direction'])
City Name Survivors Percent Surviving
0 Smolensk 145000 1.000000
1 Dorogobouge 140000 0.965517
2 Chjat 127100 0.876552
3 Moscou 100000 0.689655
4 Wixma 55000 0.379310
5 Smolensk 24000 0.165517
6 Orscha 20000 0.137931
7 Moiodexno 12000 0.082759

Neither df.[' '] nor drop change the original DataFrame. Instead, they create new smaller DataFrames that share the same data. The fact that the original DataFrame is preserved is useful! You can generate multiple different tables that only consider certain columns without worrying that one analysis will affect the other.

minard
Longitude Latitude City Name Direction Survivors Percent Surviving
0 32.0 54.8 Smolensk Advance 145000 1.000000
1 33.2 54.9 Dorogobouge Advance 140000 0.965517
2 34.4 55.5 Chjat Advance 127100 0.876552
3 37.6 55.8 Moscou Advance 100000 0.689655
4 34.3 55.2 Wixma Retreat 55000 0.379310
5 32.0 54.6 Smolensk Retreat 24000 0.165517
6 30.4 54.4 Orscha Retreat 20000 0.137931
7 26.8 54.3 Moiodexno Retreat 12000 0.082759

All of the methods that we have used above can be applied to any DataFrame.