10. Sampling and Empirical Distributions

An important part of data science consists of drawing conclusions based on the data in random samples. In order to interpret their results correctly, data scientists first have to understand exactly what random samples are.

In this chapter we will take a more careful look at sampling, with special attention to the properties of large random samples.

Let’s start by drawing some samples. Our examples are based on the top_movies.csv data set.

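import numpy as np
import pandas as pd

# path_data is assumed to have been defined earlier as the folder that holds the data files
top_raw = pd.read_csv(path_data + 'top_movies.csv')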

top1 = top_raw.copy()

top1['Row Index'] = np.arange(len(top1))

top1.head(5)
Title Studio Gross Gross (Adjusted) Year Row Index
0 Star Wars: The Force Awakens Buena Vista (Disney) 906723418 906723400 2015 0
1 Avatar Fox 760507625 846120800 2009 1
2 Titanic Paramount 658672302 1178627900 1997 2
3 Jurassic World Universal 652270625 687728000 2015 3
4 Marvel's The Avengers Buena Vista (Disney) 623357910 668866600 2012 4

Column Position

Notice that the column we have created, ‘Row Index’, is positioned last in the df. To make life easier, we would like this column to come first. There are several ways to move it: we could pop the column out of the df and re-insert it at the desired position, or we could drop the column and then insert it again. Yet another method is to insert the ‘Row Index’ column directly at the desired position in the df.

Pandas ‘pop’

Pandas ‘drop’

Pandas ‘insert’
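The pop-and-reinsert route mentioned above can be sketched as follows. This is a minimal illustration on a small made-up table (the name demo and its values are hypothetical stand-ins, not part of top_movies.csv); it assumes only that the column to move already exists in the df.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in table; the real chapter uses top1
demo = pd.DataFrame({'Title': ['A', 'B', 'C'], 'Year': [2001, 2002, 2003]})
demo['Row Index'] = np.arange(len(demo))   # new column lands in the last position

# pop removes the column from demo in place and returns it as a Series
row_index = demo.pop('Row Index')

# insert places the Series back at position 0, again in place
demo.insert(0, 'Row Index', row_index)

print(list(demo.columns))
```

Both pop and insert modify the df in place, so no reassignment is needed.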

Insert

top2 = top1.drop(columns=['Row Index'])

top2.insert(0, 'Row Index', np.arange(len(top2)))

top2
Row Index Title Studio Gross Gross (Adjusted) Year
0 0 Star Wars: The Force Awakens Buena Vista (Disney) 906723418 906723400 2015
1 1 Avatar Fox 760507625 846120800 2009
2 2 Titanic Paramount 658672302 1178627900 1997
3 3 Jurassic World Universal 652270625 687728000 2015
4 4 Marvel's The Avengers Buena Vista (Disney) 623357910 668866600 2012
... ... ... ... ... ... ...
195 195 The Caine Mutiny Columbia 21750000 386173500 1954
196 196 The Bells of St. Mary's RKO 21333333 545882400 1945
197 197 Duel in the Sun Selz. 20408163 443877500 1946
198 198 Sergeant York Warner Bros. 16361885 418671800 1941
199 199 The Four Horsemen of the Apocalypse MPC 9183673 399489800 1921

200 rows × 6 columns

Rename Index

Rather than creating a new column, we can name the existing axis. However, if we do this we must remember that the values shown under ‘Row Index’ are the actual df index and not a simple column.

top = top1.drop(columns=['Row Index'])

top = top.rename_axis('Row Index', axis='columns')

top
Row Index Title Studio Gross Gross (Adjusted) Year
0 Star Wars: The Force Awakens Buena Vista (Disney) 906723418 906723400 2015
1 Avatar Fox 760507625 846120800 2009
2 Titanic Paramount 658672302 1178627900 1997
3 Jurassic World Universal 652270625 687728000 2015
4 Marvel's The Avengers Buena Vista (Disney) 623357910 668866600 2012
... ... ... ... ... ...
195 The Caine Mutiny Columbia 21750000 386173500 1954
196 The Bells of St. Mary's RKO 21333333 545882400 1945
197 Duel in the Sun Selz. 20408163 443877500 1946
198 Sergeant York Warner Bros. 16361885 418671800 1941
199 The Four Horsemen of the Apocalypse MPC 9183673 399489800 1921

200 rows × 5 columns
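A quick way to confirm that the label is not a column is to inspect the axis names directly. The sketch below uses a small hypothetical table (demo) rather than top, and assumes the same rename_axis call as above.

```python
import pandas as pd

# Hypothetical stand-in table
demo = pd.DataFrame({'Title': ['A', 'B'], 'Year': [2001, 2002]})
named = demo.rename_axis('Row Index', axis='columns')

# The label is stored as the name of the columns axis
print(named.columns.name)

# It is not one of the columns, so named['Row Index'] would raise a KeyError
print('Row Index' in named.columns)

# The number of columns is unchanged
print(named.shape)
```

This matches the output above, which reports 5 columns rather than 6.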

Number Formatting

Before going on to process the data, we may wish to adjust the display format of the numbers (as we have previously). To achieve this we can employ Pandas ‘Display Values’, which allows us to apply a format to an entire df or to specific columns.

Pandas format

top.head(10).style.format({'Gross': "{:,}", 'Gross (Adjusted)': '{:,}'})
Row Index Title Studio Gross Gross (Adjusted) Year
0 Star Wars: The Force Awakens Buena Vista (Disney) 906,723,418 906,723,400 2015
1 Avatar Fox 760,507,625 846,120,800 2009
2 Titanic Paramount 658,672,302 1,178,627,900 1997
3 Jurassic World Universal 652,270,625 687,728,000 2015
4 Marvel's The Avengers Buena Vista (Disney) 623,357,910 668,866,600 2012
5 The Dark Knight Warner Bros. 534,858,444 647,761,600 2008
6 Star Wars: Episode I - The Phantom Menace Fox 474,544,677 785,715,000 1999
7 Star Wars Fox 460,998,007 1,549,640,500 1977
8 Avengers: Age of Ultron Buena Vista (Disney) 459,005,868 465,684,200 2015
9 The Dark Knight Rises Warner Bros. 448,139,099 500,961,700 2012
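The style formatting above changes only how the df is displayed; the stored values are still integers. As a minimal sketch, assuming we wanted the same comma-separated numbers as plain strings (for example when printing outside a notebook), the same format specification can be applied with Series.map. The table demo is a hypothetical stand-in holding the first three rows shown above.

```python
import pandas as pd

# Hypothetical stand-in with the first three Gross values from the table above
demo = pd.DataFrame({
    'Title': ['Star Wars: The Force Awakens', 'Avatar', 'Titanic'],
    'Gross': [906723418, 760507625, 658672302],
})

# Apply the '{:,}' format to each value; the result is a Series of strings
gross_text = demo['Gross'].map('{:,}'.format)
print(gross_text.tolist())

# The original column is untouched and still numeric
print(demo['Gross'].sum())
```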

Sampling Rows of a Table

Each row of a data table represents an individual; in top, each individual is a movie. Sampling individuals can thus be achieved by sampling the rows of a table.

The contents of a row are the values of different variables measured on the same individual. So the contents of the sampled rows form samples of values of each of the variables.

Deterministic Samples

When you simply specify which elements of a set you want to choose, without any chances involved, you create a deterministic sample.

You have done this many times, for example by using df.iloc (integer location) with the positions of the rows you want inside [ ]:

Pandas iloc

top.iloc[np.array([3, 18, 100])]
Row Index Title Studio Gross Gross (Adjusted) Year
3 Jurassic World Universal 652270625 687728000 2015
18 Spider-Man Sony 403706375 604517300 2002
100 Gone with the Wind MGM 198676459 1757788200 1939

We can also use Pandas str.contains to build a condition that selects rows:

Pandas where

top[top['Title'].str.contains('Harry Potter')]
Row Index Title Studio Gross Gross (Adjusted) Year
22 Harry Potter and the Deathly Hallows Part 2 Warner Bros. 381011219 417512200 2011
43 Harry Potter and the Sorcerer's Stone Warner Bros. 317575550 486442900 2001
54 Harry Potter and the Half-Blood Prince Warner Bros. 301959197 352098800 2009
59 Harry Potter and the Order of the Phoenix Warner Bros. 292004738 369250200 2007
62 Harry Potter and the Goblet of Fire Warner Bros. 290013036 393024800 2005
69 Harry Potter and the Chamber of Secrets Warner Bros. 261988482 390768100 2002
76 Harry Potter and the Prisoner of Azkaban Warner Bros. 249541069 349598600 2004

While these are samples, they are not random samples. They don’t involve chance.

Probability Samples

For describing random samples, some terminology will be helpful.

A population is the set of all elements from whom a sample will be drawn.

A probability sample is one for which it is possible to calculate, before the sample is drawn, the chance with which any subset of elements will enter the sample.

In a probability sample, all elements need not have the same chance of being chosen.

A Random Sampling Scheme

For example, suppose you choose two people from a population that consists of three people A, B, and C, according to the following scheme:

  • Person A is chosen with probability 1.

  • One of Persons B or C is chosen according to the toss of a coin: if the coin lands heads, you choose B, and if it lands tails you choose C.

This is a probability sample of size 2. Here are the chances of entry for all non-empty subsets:

A: 1 
B: 1/2
C: 1/2
AB: 1/2
AC: 1/2
BC: 0
ABC: 0

Person A has a higher chance of being selected than Persons B or C; indeed, Person A is certain to be selected. Since these differences are known and quantified, they can be taken into account when working with the sample.
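The chances listed above can be checked by simulation. The following is a minimal sketch, assuming a fair coin; the function name draw_sample, the seed and the number of repetitions are illustrative choices, not part of the scheme itself.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable

def draw_sample(rng):
    """Person A is always chosen; B on heads, C on tails."""
    second = 'B' if rng.random() < 0.5 else 'C'
    return frozenset({'A', second})

repetitions = 10_000
samples = [draw_sample(rng) for _ in range(repetitions)]

# Proportion of samples that contain each subset of interest
prop_A = np.mean(['A' in s for s in samples])
prop_AB = np.mean([{'A', 'B'} <= s for s in samples])
prop_BC = np.mean([{'B', 'C'} <= s for s in samples])

print(prop_A, prop_AB, prop_BC)
```

The proportion for A is exactly 1 and the proportion for BC is exactly 0, while the proportion for AB should be close to 1/2.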

A Systematic Sample

Imagine all the elements of the population listed in a sequence. One method of sampling starts by choosing a random position early in the list, and then evenly spaced positions after that. The sample consists of the elements in those positions. Such a sample is called a systematic sample.

Here we will choose a systematic sample of the rows of top. We will start by picking one of the first 10 rows at random, and then we will pick every 10th row after that, using the take method.

Pandas take

"""Choose a random start among rows 0 through 9;
then take every 10th row."""

start = np.random.choice(np.arange(10))
top.take(np.arange(start, len(top), 10))
Row Index Title Studio Gross Gross (Adjusted) Year
0 Star Wars: The Force Awakens Buena Vista (Disney) 906723418 906723400 2015
10 Shrek 2 Dreamworks 441226247 618143100 2004
20 Transformers: Revenge of the Fallen Paramount/Dreamworks 402111870 468938100 2009
30 Furious 7 Universal 353007020 356907000 2015
40 Shrek the Third Paramount/Dreamworks 322719944 408090600 2007
50 Independence Day Fox 306169268 602639200 1996
60 The Chronicles of Narnia: The Lion, the Witch ... Buena Vista (Disney) 291710957 393033100 2005
70 The Incredibles Buena Vista (Disney) 261441092 365660600 2004
80 Bruce Almighty Universal 242829261 350350700 2003
90 Mrs. Doubtfire Fox 219195243 458354100 1993
100 Gone with the Wind MGM 198676459 1757788200 1939
110 Indiana Jones and the Temple of Doom Paramount 179870271 465735500 1984
120 Three Men and a Baby Buena Vista (Disney) 167780960 362822900 1987
130 Rambo: First Blood Part II TriS 150415432 368623700 1985
140 Rocky III UA 125049125 369865300 1982
150 Superman II Warner Bros. 108185706 338566800 1981
160 Saturday Night Fever Paramount 94213184 353261200 1977
170 Fantasia Disney 76408097 722478200 1941
180 Funny Girl Columbia 52223306 348343200 1968
190 The Greatest Show on Earth Paramount 36000000 522000000 1952

Run the cell a few times to see how the output varies.

This systematic sample is a probability sample. In this scheme, all rows have chance \(1/10\) of being chosen. For example, Row 23 is chosen if and only if Row 3 is chosen, and the chance of that is \(1/10\).

But not all subsets have the same chance of being chosen. Because the selected rows are evenly spaced, most subsets of rows have no chance of being chosen. The only subsets that are possible are those that consist of rows all separated by multiples of 10. Any of those subsets is selected with chance 1/10. Other subsets, like the subset containing the first 11 rows of the table, are selected with chance 0.
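Because there are only 10 possible starting rows, all the possible systematic samples can be listed and checked directly. The sketch below assumes a table of 200 rows, as in top, and works with row positions only.

```python
import numpy as np

n_rows = 200   # number of rows in top
step = 10

# One possible sample for each of the 10 equally likely starting rows
possible_samples = [np.arange(start, n_rows, step) for start in range(step)]

# Count how many of the possible samples each row belongs to
appearances = np.zeros(n_rows, dtype=int)
for sample in possible_samples:
    appearances[sample] += 1

# Every row appears in exactly one of the 10 samples, so its chance is 1/10
print(np.all(appearances == 1))

# No possible sample contains all of the first 11 rows
first_11 = set(range(11))
contains_first_11 = any(first_11 <= set(sample.tolist()) for sample in possible_samples)
print(contains_first_11)
```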

Random Samples Drawn With or Without Replacement

In this course, we will mostly deal with the two most straightforward methods of sampling.

The first is random sampling with replacement, which (as we have seen earlier) is the default behavior of np.random.choice when it samples from an array.

The other, called a “simple random sample”, is a sample drawn at random without replacement. Sampled individuals are not replaced in the population before the next individual is drawn. This is the kind of sampling that happens when you deal a hand from a deck of cards, for example.
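The two methods differ only in whether an individual can be drawn more than once. As a minimal sketch, assuming row positions 0 through 199 as the population, np.random.choice covers both cases through its replace argument, and the df sample method draws rows without replacement by default. The seed, the sample size of 20 and the table demo are illustrative assumptions.

```python
import numpy as np
import pandas as pd

np.random.seed(0)   # fixed seed so the sketch is repeatable

row_positions = np.arange(200)

# With replacement (the default): the same position may appear more than once
with_replacement = np.random.choice(row_positions, 20)

# Without replacement: a simple random sample, so all positions are distinct
without_replacement = np.random.choice(row_positions, 20, replace=False)
print(len(np.unique(without_replacement)))

# The same idea applied to the rows of a (hypothetical) df
demo = pd.DataFrame({'Row': row_positions})
simple_random_sample = demo.sample(20, replace=False, random_state=0)
print(simple_random_sample.index.is_unique)
```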

In this chapter, we will use simulation to study the behavior of large samples drawn at random with or without replacement.

Numpy random.choice

Drawing a random sample requires care and precision. It is not haphazard, even though that is a colloquial meaning of the word “random”. If you stand at a street corner and take as your sample the first ten people who pass by, you might think you’re sampling at random because you didn’t choose who walked by. But it’s not a random sample – it’s a sample of convenience. You didn’t know ahead of time the probability of each person entering the sample; perhaps you hadn’t even specified exactly who was in the population.