10. Sampling and Empirical Distributions¶

An important part of data science consists of making conclusions based on the data in random samples. In order to correctly interpret their results, data scientists have to first understand exactly what random samples are.

In this chapter we will take a more careful look at sampling, with special attention to the properties of large random samples.

Let’s start by drawing some samples. Our examples are based on the top_movies.csv data set.

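import numpy as np
import pandas as pd

# path_data is assumed to be defined earlier as the folder that holds the data files
top_raw = pd.read_csv(path_data + 'top_movies.csv')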

top1 = top_raw.copy()

top1['Row Index'] = np.arange(len(top1))

top1.head(5)

	Title	Studio	Gross	Gross (Adjusted)	Year	Row Index
0	Star Wars: The Force Awakens	Buena Vista (Disney)	906723418	906723400	2015	0
1	Avatar	Fox	760507625	846120800	2009	1
2	Titanic	Paramount	658672302	1178627900	1997	2
3	Jurassic World	Universal	652270625	687728000	2015	3
4	Marvel's The Avengers	Buena Vista (Disney)	623357910	668866600	2012	4

Column Position¶

Notice that the column we have created, ‘Row Index’, is positioned last in the df. To make life easier we would like this column to be first in the df. There are several ways to move it: we could pop the column out of the df and then re-insert it in the desired position, or we could drop the column and then re-insert it into the df. Yet another method would be to insert the column ‘Row Index’ directly at the desired df position.

Pandas ‘pop’

Pandas ‘drop’

Pandas ‘insert’
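The pop-and-re-insert route described above can be sketched as follows. This is a minimal illustration on a small stand-in table whose titles and years are made up; it is not the top_movies.csv data.

```python
import numpy as np
import pandas as pd

# Small stand-in table (hypothetical values, not top_movies.csv)
demo = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'Year': [2001, 2002, 2003],
})
demo['Row Index'] = np.arange(len(demo))  # new column lands in last position

# pop removes the column from demo and returns it as a Series
row_index = demo.pop('Row Index')

# insert puts the Series back, in place, at position 0 (the first column)
demo.insert(0, 'Row Index', row_index)

print(list(demo.columns))
```

Both pop and insert modify the df in place, so there is no need to reassign demo.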

Insert¶

top2 = top1.drop(columns=['Row Index'])

top2.insert(0, 'Row Index', np.arange(len(top2)))

top2

	Row Index	Title	Studio	Gross	Gross (Adjusted)	Year
0	0	Star Wars: The Force Awakens	Buena Vista (Disney)	906723418	906723400	2015
1	1	Avatar	Fox	760507625	846120800	2009
2	2	Titanic	Paramount	658672302	1178627900	1997
3	3	Jurassic World	Universal	652270625	687728000	2015
4	4	Marvel's The Avengers	Buena Vista (Disney)	623357910	668866600	2012
...	...	...	...	...	...	...
195	195	The Caine Mutiny	Columbia	21750000	386173500	1954
196	196	The Bells of St. Mary's	RKO	21333333	545882400	1945
197	197	Duel in the Sun	Selz.	20408163	443877500	1946
198	198	Sergeant York	Warner Bros.	16361885	418671800	1941
199	199	The Four Horsemen of the Apocalypse	MPC	9183673	399489800	1921

200 rows × 6 columns

Rename Index¶

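Rather than creating a new column, we can name an existing axis. Calling rename_axis with axis='columns' attaches the label ‘Row Index’ to the column axis, so it is displayed in the header above the index values. However, we must remember that the numbers displayed beneath ‘Row Index’ are the actual df index and not a simple ‘column’.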

top = top1.drop(columns=['Row Index'])

top = top.rename_axis('Row Index', axis='columns')

top

Row Index	Title	Studio	Gross	Gross (Adjusted)	Year
0	Star Wars: The Force Awakens	Buena Vista (Disney)	906723418	906723400	2015
1	Avatar	Fox	760507625	846120800	2009
2	Titanic	Paramount	658672302	1178627900	1997
3	Jurassic World	Universal	652270625	687728000	2015
4	Marvel's The Avengers	Buena Vista (Disney)	623357910	668866600	2012
...	...	...	...	...	...
195	The Caine Mutiny	Columbia	21750000	386173500	1954
196	The Bells of St. Mary's	RKO	21333333	545882400	1945
197	Duel in the Sun	Selz.	20408163	443877500	1946
198	Sergeant York	Warner Bros.	16361885	418671800	1941
199	The Four Horsemen of the Apocalypse	MPC	9183673	399489800	1921

200 rows × 5 columns
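To see that the label added by rename_axis is only a name and not a column, the following sketch compares naming the column axis with naming the index. The small table is a made-up stand-in, not the movies data.

```python
import pandas as pd

# Small stand-in table (hypothetical values)
demo = pd.DataFrame({'Title': ['A', 'B'], 'Year': [2001, 2002]})

# Name the column axis, as in the cell above
named_columns = demo.rename_axis('Row Index', axis='columns')

# Name the index (row axis) instead
named_index = demo.rename_axis('Row Index', axis='index')

print(named_columns.columns.name)            # name attached to the column axis
print(named_index.index.name)                # name attached to the index
print('Row Index' in named_columns.columns)  # the label is not a column
print(named_columns.shape)                   # no column was added
```

In both cases the table still has only the columns Title and Year, which is why an expression such as named_columns['Row Index'] would raise a KeyError.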

Number Formatting¶

Before going on to process the data we may wish to adjust the display format of the data (as we have previously). To achieve this we can use the Pandas Styler, reached through df.style, whose format method lets us format an entire df or specific columns.

Pandas format

top.head(10).style.format({'Gross': "{:,}", 'Gross (Adjusted)': '{:,}'})

Row Index	Title	Studio	Gross	Gross (Adjusted)	Year
0	Star Wars: The Force Awakens	Buena Vista (Disney)	906,723,418	906,723,400	2015
1	Avatar	Fox	760,507,625	846,120,800	2009
2	Titanic	Paramount	658,672,302	1,178,627,900	1997
3	Jurassic World	Universal	652,270,625	687,728,000	2015
4	Marvel's The Avengers	Buena Vista (Disney)	623,357,910	668,866,600	2012
5	The Dark Knight	Warner Bros.	534,858,444	647,761,600	2008
6	Star Wars: Episode I - The Phantom Menace	Fox	474,544,677	785,715,000	1999
7	Star Wars	Fox	460,998,007	1,549,640,500	1977
8	Avengers: Age of Ultron	Buena Vista (Disney)	459,005,868	465,684,200	2015
9	The Dark Knight Rises	Warner Bros.	448,139,099	500,961,700	2012
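The strings passed to format above are ordinary Python format specifications, and the Styler changes only how values are displayed. The sketch below applies the same "{:,}" specification with plain Python, using the first two Gross values from the table, to show that the stored numbers are left untouched. The name as_text is introduced here only for illustration.

```python
import pandas as pd

gross = 906723418
print('{:,}'.format(gross))  # comma as thousands separator

demo = pd.DataFrame({'Gross': [906723418, 760507625]})

# map applies the format to each value and returns a Series of strings
as_text = demo['Gross'].map('{:,}'.format)
print(as_text.tolist())

# The original column is still numeric, so arithmetic keeps working
print(demo['Gross'].sum())
```

Keeping the numeric column and formatting only for display avoids having to convert strings back to numbers before later calculations.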

Sampling Rows of a Table¶

Each row of a data table represents an individual; in top, each individual is a movie. Sampling individuals can thus be achieved by sampling the rows of a table.

The contents of a row are the values of different variables measured on the same individual. So the contents of the sampled rows form samples of values of each of the variables.

Deterministic Samples¶

When you simply specify which elements of a set you want to choose, without any chances involved, you create a deterministic sample.

You have done this many times, for example by using df.iloc (integer location) with the positions of the rows listed inside [ ]:

Pandas iloc

top.iloc[np.array([3, 18, 100])]

Row Index	Title	Studio	Gross	Gross (Adjusted)	Year
3	Jurassic World	Universal	652270625	687728000	2015
18	Spider-Man	Sony	403706375	604517300	2002
100	Gone with the Wind	MGM	198676459	1757788200	1939

We can also use Pandas str.contains as a condition to select rows:

Pandas where

top[top['Title'].str.contains('Harry Potter')]

Row Index	Title	Studio	Gross	Gross (Adjusted)	Year
22	Harry Potter and the Deathly Hallows Part 2	Warner Bros.	381011219	417512200	2011
43	Harry Potter and the Sorcerer's Stone	Warner Bros.	317575550	486442900	2001
54	Harry Potter and the Half-Blood Prince	Warner Bros.	301959197	352098800	2009
59	Harry Potter and the Order of the Phoenix	Warner Bros.	292004738	369250200	2007
62	Harry Potter and the Goblet of Fire	Warner Bros.	290013036	393024800	2005
69	Harry Potter and the Chamber of Secrets	Warner Bros.	261988482	390768100	2002
76	Harry Potter and the Prisoner of Azkaban	Warner Bros.	249541069	349598600	2004

While these are samples, they are not random samples. They don’t involve chance.
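The two kinds of deterministic selection above can be sketched on a small stand-in table. The titles and years here are hypothetical and are not taken from top_movies.csv.

```python
import numpy as np
import pandas as pd

# Small stand-in table (hypothetical titles and years)
demo = pd.DataFrame({
    'Title': ['Harry Potter X', 'Film Y', 'Harry Potter Z', 'Film W'],
    'Year': [2001, 2003, 2002, 2005],
})

# Selection by position: the rows at positions 0 and 3
by_position = demo.iloc[np.array([0, 3])]

# str.contains returns a Boolean Series with one True/False per row
mask = demo['Title'].str.contains('Harry Potter')

# Indexing with the mask keeps only the rows where it is True
by_condition = demo[mask]

print(by_position.index.tolist())
print(mask.tolist())
print(by_condition.index.tolist())
```

Running this cell repeatedly always selects the same rows, which is what makes the samples deterministic.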

Probability Samples¶

For describing random samples, some terminology will be helpful.

A population is the set of all elements from which a sample will be drawn.

A probability sample is one for which it is possible to calculate, before the sample is drawn, the chance with which any subset of elements will enter the sample.

In a probability sample, all elements need not have the same chance of being chosen.

A Random Sampling Scheme¶

For example, suppose you choose two people from a population that consists of three people A, B, and C, according to the following scheme:

Person A is chosen with probability 1.
One of Persons B or C is chosen according to the toss of a coin: if the coin lands heads, you choose B, and if it lands tails you choose C.

This is a probability sample of size 2. Here are the chances of entry for all non-empty subsets:

A: 1 
B: 1/2
C: 1/2
AB: 1/2
AC: 1/2
BC: 0
ABC: 0

Person A has a higher chance of being selected than Persons B or C; indeed, Person A is certain to be selected. Since these differences are known and quantified, they can be taken into account when working with the sample.
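The chances listed above can be checked by enumerating the two equally likely outcomes of the coin toss. The sketch below assumes that "chance of entry" for a subset means the chance that every person in the subset ends up in the sample; the variable names are introduced here for illustration.

```python
from fractions import Fraction
from itertools import combinations

people = ['A', 'B', 'C']

# The coin has two equally likely outcomes; each one fixes the whole sample
samples_by_outcome = {'heads': {'A', 'B'}, 'tails': {'A', 'C'}}
outcome_chance = Fraction(1, 2)

chances = {}
for size in range(1, len(people) + 1):
    for subset in combinations(people, size):
        # Add up the chances of the outcomes whose sample contains the whole subset
        chance = sum(
            (outcome_chance for sample in samples_by_outcome.values()
             if set(subset) <= sample),
            Fraction(0),
        )
        chances[''.join(subset)] = chance

for name, chance in chances.items():
    print(name, chance)
```

Because every chance can be computed before any coin is tossed, the scheme meets the definition of a probability sample.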

A Systematic Sample¶

Imagine all the elements of the population listed in a sequence. One method of sampling starts by choosing a random position early in the list, and then evenly spaced positions after that. The sample consists of the elements in those positions. Such a sample is called a systematic sample.

Here we will choose a systematic sample of the rows of top. We will start by picking one of the first 10 rows at random, and then we will use the take method to pick every 10th row after that.

Pandas take

"""Choose a random start among rows 0 through 9;
then take every 10th row."""

start = np.random.choice(np.arange(10))
top.take(np.arange(start, len(top), 10))

Row Index	Title	Studio	Gross	Gross (Adjusted)	Year
0	Star Wars: The Force Awakens	Buena Vista (Disney)	906723418	906723400	2015
10	Shrek 2	Dreamworks	441226247	618143100	2004
20	Transformers: Revenge of the Fallen	Paramount/Dreamworks	402111870	468938100	2009
30	Furious 7	Universal	353007020	356907000	2015
40	Shrek the Third	Paramount/Dreamworks	322719944	408090600	2007
50	Independence Day	Fox	306169268	602639200	1996
60	The Chronicles of Narnia: The Lion, the Witch ...	Buena Vista (Disney)	291710957	393033100	2005
70	The Incredibles	Buena Vista (Disney)	261441092	365660600	2004
80	Bruce Almighty	Universal	242829261	350350700	2003
90	Mrs. Doubtfire	Fox	219195243	458354100	1993
100	Gone with the Wind	MGM	198676459	1757788200	1939
110	Indiana Jones and the Temple of Doom	Paramount	179870271	465735500	1984
120	Three Men and a Baby	Buena Vista (Disney)	167780960	362822900	1987
130	Rambo: First Blood Part II	TriS	150415432	368623700	1985
140	Rocky III	UA	125049125	369865300	1982
150	Superman II	Warner Bros.	108185706	338566800	1981
160	Saturday Night Fever	Paramount	94213184	353261200	1977
170	Fantasia	Disney	76408097	722478200	1941
180	Funny Girl	Columbia	52223306	348343200	1968
190	The Greatest Show on Earth	Paramount	36000000	522000000	1952

Run the cell a few times to see how the output varies.

This systematic sample is a probability sample. In this scheme, all rows have chance \(1/10\) of being chosen. For example, Row 23 is chosen if and only if Row 3 is chosen, and the chance of that is \(1/10\).

But not all subsets have the same chance of being chosen. Because the selected rows are evenly spaced, most subsets of rows have no chance of being chosen. The only subsets that are possible are those that consist of rows all separated by multiples of 10. Any of those subsets is selected with chance 1/10. Other subsets, like the subset containing the first 11 rows of the table, are selected with chance 0.
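These claims can be verified by listing every sample the scheme can produce. The sketch below works with row positions only and assumes a table of 200 rows, as in top.

```python
import numpy as np

n_rows = 200  # number of rows in top
step = 10

# One possible sample for each of the 10 equally likely starting rows
possible_samples = [
    set(np.arange(start, n_rows, step).tolist()) for start in range(step)
]

# For each row, count how many of the 10 possible samples contain it
counts = [sum(row in sample for sample in possible_samples)
          for row in range(n_rows)]

# Count the possible samples that contain all of the first 11 rows
first_11 = set(range(11))
n_with_first_11 = sum(first_11 <= sample for sample in possible_samples)

print(set(counts))       # every row is in exactly one of the 10 samples
print(n_with_first_11)   # no possible sample contains the first 11 rows
```

Each row appears in exactly one of 10 equally likely samples, which gives the chance of 1/10 per row, while the subset of the first 11 rows is contained in none of them.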

Random Samples Drawn With or Without Replacement¶

In this course, we will mostly deal with the two most straightforward methods of sampling.

The first is random sampling with replacement, which (as we have seen earlier) is the default behavior of np.random.choice when it samples from an array.

The other, called a “simple random sample”, is a sample drawn at random without replacement. Sampled individuals are not replaced in the population before the next individual is drawn. This is the kind of sampling that happens when you deal a hand from a deck of cards, for example.

In this chapter, we will use simulation to study the behavior of large samples drawn at random with or without replacement.

Numpy random.choice
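The difference between the two methods can be sketched with np.random.choice on an array of row positions. The sizes 300 and 20 are arbitrary choices made for this illustration.

```python
import numpy as np

row_numbers = np.arange(200)

# Default behavior: sampling with replacement, so a row can be drawn more than once.
# Drawing 300 times from 200 rows guarantees at least one repeat.
with_replacement = np.random.choice(row_numbers, size=300)

# replace=False gives a simple random sample: no row can appear twice
without_replacement = np.random.choice(row_numbers, size=20, replace=False)

print(len(set(with_replacement.tolist())))     # fewer than 300 distinct rows
print(len(set(without_replacement.tolist())))  # exactly 20 distinct rows
```

Without replacement the sample can never be larger than the population; asking for 300 rows with replace=False would raise a ValueError.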

Drawing a random sample requires care and precision. It is not haphazard, even though that is a colloquial meaning of the word “random”. If you stand at a street corner and take as your sample the first ten people who pass by, you might think you’re sampling at random because you didn’t choose who walked by. But it’s not a random sample – it’s a sample of convenience. You didn’t know ahead of time the probability of each person entering the sample; perhaps you hadn’t even specified exactly who was in the population.

Fundamentals of Data Science