6. DataFrames¶

DataFrames (df’s) are a fundamental object type for representing data sets. A df can be viewed in two ways:

a sequence of named columns that each describe a single aspect of all entries in a data set, or
a sequence of rows that each contain all information about a single entry in a data set.

In order to use a DataFrame, import all of the module called pandas, by convention this is usually imported and as pd.

import pandas as pd

Empty tables can be created using the pd.DataFrame() function. An empty table is usefuly because it can be extended to contain new rows and columns.

pd.DataFrame(columns=['A', 'B', 'C'], index=(0,1,2,3,4)).fillna(0)

	A	B	C
0	0	0	0
1	0	0	0
2	0	0	0
3	0	0	0
4	0	0	0

A dictionary is used to construct a new table with labeled columns. Each column of a df is an array.

Below, we begin each example with an empty table that has no columns.

Notice:

the column heading is supplied as a key with the corresponding column content added as avalue in a {key:value} dictionary
when using a list as a data source for the np.array() function the list must be placed within square or hard brackets.
an index has been added automatically

pd.DataFrame({'Number of petals': np.array([8, 34, 5])})

	Number of petals
0	8
1	34
2	5

To add two (or more) new columns, provide the label and array for each column. All columns must have the same length, or an error will occur.

pd.DataFrame({
    'Number of petals': np.array([8, 34, 5]),
    'Name':np.array(['lotus', 'sunflower', 'rose'])
})

	Number of petals	Name
0	8	lotus
1	34	sunflower
2	5	rose

We can give this table a name, and then extend the table with another column.

flowers = pd.DataFrame({'Number of petals':np.array([8, 34, 5]),'Name':np.array(['lotus', 'sunflower', 'rose'])})

flowers_two_col = flowers.copy()

flowers['Color'] = np.array({'pink', 'yellow', 'red'})

flowers

	Number of petals	Name	Color
0	8	lotus	{pink, yellow, red}
1	34	sunflower	{pink, yellow, red}
2	5	rose	{pink, yellow, red}

When a new column is added to a Dataframe a new DatFrame is not created, so the original DataFrame is affected. For example, the original DatFrame flowers before the third was added.

flowers

	Number of petals	Name	Color
0	8	lotus	{pink, yellow, red}
1	34	sunflower	{pink, yellow, red}
2	5	rose	{pink, yellow, red}

Before adding a third column a copy of df ‘flowers’ is created, in this case the new df created is called flowers_two_col. ‘flowers_two_col = flowers.copy()’

Pandas ‘df.copy()’

flowers_two_col

	Number of petals	Name
0	8	lotus
1	34	sunflower
2	5	rose

Creating dfs in this way involves a lot of typing. If the data have already been entered somewhere, it is usually possible to use Python to read it into a table, instead of typing it all in cell by cell.

Often, dfs are created from files that contain comma-separated values. Such files are called CSV files.

Below, we use the Table method pd.read_csv() to read a CSV file that contains some of the data used by Minard in his graphic about Napoleon’s Russian campaign. The data are placed in a df named minard.

pd.read_csv()

minard = pd.read_csv(path_data + 'minard.csv')
minard

	Longitude	Latitude	City	Direction	Survivors
0	32.0	54.8	Smolensk	Advance	145000
1	33.2	54.9	Dorogobouge	Advance	140000
2	34.4	55.5	Chjat	Advance	127100
3	37.6	55.8	Moscou	Advance	100000
4	34.3	55.2	Wixma	Retreat	55000
5	32.0	54.6	Smolensk	Retreat	24000
6	30.4	54.4	Orscha	Retreat	20000
7	26.8	54.3	Moiodexno	Retreat	12000

We will use this small df to demonstrate some useful DataFrame methods. We will then use those same methods, and develop other methods, on much larger DataFrames.

The Size of the Table¶

The method df.shape(1) gives the number of columns in the table, and df.shape(0) the number of rows.

df.shape[]

minard.shape

(8, 5)

minard.shape[1]

minard.shape[0]

len( )¶

The number of rows in a df can also be found by using the len() function. For number of rows len(df.rows), and number of columns len(df.columns). As the default parameter for the len() function is set for number of rows and if we want to know the number of rows we don’t usually add ‘.rows’

len(minard.columns)

len(minard)

Column Labels¶

The method .columns can be used to list the labels of all the columns. With minard we don’t gain much by this, but it can be very useful for tables that are so large that not all columns are visible on the screen.

minard.columns

Index(['Longitude', 'Latitude', 'City', 'Direction', 'Survivors'], dtype='object')

We can change column labels using the rename(columns={}) method. This creates a new df and leaves minard unchanged.

minard.rename(columns={'City':'City Name'})

	Longitude	Latitude	City Name	Direction	Survivors
0	32.0	54.8	Smolensk	Advance	145000
1	33.2	54.9	Dorogobouge	Advance	140000
2	34.4	55.5	Chjat	Advance	127100
3	37.6	55.8	Moscou	Advance	100000
4	34.3	55.2	Wixma	Retreat	55000
5	32.0	54.6	Smolensk	Retreat	24000
6	30.4	54.4	Orscha	Retreat	20000
7	26.8	54.3	Moiodexno	Retreat	12000

However, this method does not change the original DataFrame.

minard

	Longitude	Latitude	City	Direction	Survivors
0	32.0	54.8	Smolensk	Advance	145000
1	33.2	54.9	Dorogobouge	Advance	140000
2	34.4	55.5	Chjat	Advance	127100
3	37.6	55.8	Moscou	Advance	100000
4	34.3	55.2	Wixma	Retreat	55000
5	32.0	54.6	Smolensk	Retreat	24000
6	30.4	54.4	Orscha	Retreat	20000
7	26.8	54.3	Moiodexno	Retreat	12000

A common pattern is to assign the original name minard to the new table, so that all future uses of minard will refer to the relabeled table.

minard = minard.rename(columns={'City':'City Name'})
minard

	Longitude	Latitude	City Name	Direction	Survivors
0	32.0	54.8	Smolensk	Advance	145000
1	33.2	54.9	Dorogobouge	Advance	140000
2	34.4	55.5	Chjat	Advance	127100
3	37.6	55.8	Moscou	Advance	100000
4	34.3	55.2	Wixma	Retreat	55000
5	32.0	54.6	Smolensk	Retreat	24000
6	30.4	54.4	Orscha	Retreat	20000
7	26.8	54.3	Moiodexno	Retreat	12000

Accessing the Data in a Column¶

We can use a column’s label to access the array of data in the column.

minard['Survivors']

  145000
  140000
  127100
  100000
   55000
   24000
   20000
   12000
Name: Survivors, dtype: int64

type( )¶

To determine the tupe of object created we can use the type() function.

type(minard)

pandas.core.frame.DataFrame

Using two sets of square brackets the output is displayed in DataFrame format.

minard[['Survivors']]

	Survivors
0	145000
1	140000
2	127100
3	100000
4	55000
5	24000
6	20000
7	12000

type(minard)

pandas.core.frame.DataFrame

iLoc[ ]¶

(index location)

The 5 columns are indexed 0, 1, 2, 3, and 4. The column Survivors can also be accessed by using the iloc[] method with the required column index. Notice that to select a column using the iloc[] method we have to first place a colon followed by a comma in the swuare brackets due to the default setting for iloc[] being set to ‘rows’.

Pandas iloc []

minard.iloc[:,4]

  145000
  140000
  127100
  100000
   55000
   24000
   20000
   12000
Name: Survivors, dtype: int64

The 8 items in the array are indexed 0, 1, 2, and so on, up to 7. The items in the column can be accessed using item, as with any array.

minard.iloc[:,4][0]

minard.iloc[:,4][5]

Alternatively¶

if we wish to find a particular member of a row we select a row rather than a column. Notice that in this instance we have selected the 4th row and the 4th column, remembering that though there are 5 columns Pandas refers to the first column as column 0 and first row as row 0.

minard.iloc[5][4]

Working with the Data in a Column¶

Because columns are arrays, we can use array operations on them to discover new information. For example, we can create a new column that contains the percent of all survivors at each city after Smolensk.

initial = minard['Survivors'][0]

minard['Percent Surviving'] = minard['Survivors']/initial

minard

	Longitude	Latitude	City Name	Direction	Survivors	Percent Surviving
0	32.0	54.8	Smolensk	Advance	145000	1.000000
1	33.2	54.9	Dorogobouge	Advance	140000	0.965517
2	34.4	55.5	Chjat	Advance	127100	0.876552
3	37.6	55.8	Moscou	Advance	100000	0.689655
4	34.3	55.2	Wixma	Retreat	55000	0.379310
5	32.0	54.6	Smolensk	Retreat	24000	0.165517
6	30.4	54.4	Orscha	Retreat	20000	0.137931
7	26.8	54.3	Moiodexno	Retreat	12000	0.082759

Formatting¶

To make the proportions in the new columns appear as percents, we can use the method style.format() with the option.

style.format()

minard.style.format({'Percent Surviving': "{:.2%}"})

	Longitude	Latitude	City Name	Direction	Survivors	Percent Surviving
0	32.000000	54.800000	Smolensk	Advance	145000	100.00%
1	33.200000	54.900000	Dorogobouge	Advance	140000	96.55%
2	34.400000	55.500000	Chjat	Advance	127100	87.66%
3	37.600000	55.800000	Moscou	Advance	100000	68.97%
4	34.300000	55.200000	Wixma	Retreat	55000	37.93%
5	32.000000	54.600000	Smolensk	Retreat	24000	16.55%
6	30.400000	54.400000	Orscha	Retreat	20000	13.79%
7	26.800000	54.300000	Moiodexno	Retreat	12000	8.28%

N.B. a peculiarity of the Jupyter notebook is that if you make a mistake e.g. misspelling a column name, when you run the formatting function a nwe column will be created. to remive this colummn you must retart the kernel.

Toolbar - Kernel - Restart & Clear Output

Choosing Sets of Columns¶

To select particular columns we can use df.['col1', 'col2'] which creates a new table that contains only the specified columns. When selecting a single column we can use one set of square brackets, when selecting multiple columns two sets of swuare brackets are required.

minard[['Longitude', 'Latitude']]

	Longitude	Latitude
0	32.0	54.8
1	33.2	54.9
2	34.4	55.5
3	37.6	55.8
4	34.3	55.2
5	32.0	54.6
6	30.4	54.4
7	26.8	54.3

The same selection can be made using column indices instead of labels.

N.B. the column range selected is 0:2 with the range being bottom heavy. Though the range bottom limit is 0 and the top limit is 2 instead of processing elements 0, 1 and 2 only elements 0 and 1 will be processed i.e. bottom heavy or top light

minard.iloc[:, 0:2]

	Longitude	Latitude
0	32.0	54.8
1	33.2	54.9
2	34.4	55.5
3	37.6	55.8
4	34.3	55.2
5	32.0	54.6
6	30.4	54.4
7	26.8	54.3

The result of using df.[' '] is a new DataFrame, even when you select just one column.

minard['Survivors']

  145000
  140000
  127100
  100000
   55000
   24000
   20000
   12000
Name: Survivors, dtype: int64

Another way to create a new table consisting of a set of columns is to drop the columns you don’t want.

minard.drop(columns=['Longitude', 'Latitude', 'Direction'])

	City Name	Survivors	Percent Surviving
0	Smolensk	145000	1.000000
1	Dorogobouge	140000	0.965517
2	Chjat	127100	0.876552
3	Moscou	100000	0.689655
4	Wixma	55000	0.379310
5	Smolensk	24000	0.165517
6	Orscha	20000	0.137931
7	Moiodexno	12000	0.082759

Neither df.[' '] nor drop change the original DataFrame. Instead, they create new smaller DataFrames that share the same data. The fact that the original DataFrame is preserved is useful! You can generate multiple different tables that only consider certain columns without worrying that one analysis will affect the other.

minard

	Longitude	Latitude	City Name	Direction	Survivors	Percent Surviving
0	32.0	54.8	Smolensk	Advance	145000	1.000000
1	33.2	54.9	Dorogobouge	Advance	140000	0.965517
2	34.4	55.5	Chjat	Advance	127100	0.876552
3	37.6	55.8	Moscou	Advance	100000	0.689655
4	34.3	55.2	Wixma	Retreat	55000	0.379310
5	32.0	54.6	Smolensk	Retreat	24000	0.165517
6	30.4	54.4	Orscha	Retreat	20000	0.137931
7	26.8	54.3	Moiodexno	Retreat	12000	0.082759

All of the methods that we have used above can be applied to any DataFrame.

Fundamentals of Data Science