3. Example: Population Trends

We are now ready to work with large tables of data. The file below contains “Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States.” Notice that read_csv can read data directly from a URL.

# As of Jan 2017, this census file is online here: 
data = 'http://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/asrh/nc-est2015-agesex-res.csv'

# A local copy can be accessed here in case census.gov moves the file:
# data = path_data + 'nc-est2015-agesex-res.csv'

full_census_table = pd.read_csv(data)
full_census_table

	SEX	AGE	CENSUS2010POP	ESTIMATESBASE2010	POPESTIMATE2010	POPESTIMATE2011	POPESTIMATE2012	POPESTIMATE2013	POPESTIMATE2014	POPESTIMATE2015
0	0	0	3944153	3944160	3951330	3963087	3926540	3931141	3949775	3978038
1	0	1	3978070	3978090	3957888	3966551	3977939	3942872	3949776	3968564
2	0	2	4096929	4096939	4090862	3971565	3980095	3992720	3959664	3966583
3	0	3	4119040	4119051	4111920	4102470	3983157	3992734	4007079	3974061
4	0	4	4063170	4063186	4077551	4122294	4112849	3994449	4005716	4020035
...	...	...	...	...	...	...	...	...	...	...
301	2	97	53582	53605	54118	57159	59533	61255	62779	69285
302	2	98	36641	36675	37532	40116	42857	44359	46208	47272
303	2	99	26193	26214	26074	27030	29320	31112	32517	34064
304	2	100	44202	44246	45058	47556	50661	53902	58008	61886
305	2	999	156964212	156969328	157258820	158427085	159581546	160720625	161952064	163189523

306 rows × 10 columns

Only the first 5 and last 5 rows of the DataFrame are displayed. Later we will see how to display the entire DataFrame; however, this is typically not useful with large tables.
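Besides a URL, `pd.read_csv` also accepts a local path or a file-like object. The sketch below avoids a network request by parsing a few rows and columns of the table above from an in-memory string; the names `csv_text` and `sample` are chosen here for illustration only.

```python
import io
import pandas as pd

# A few rows and columns copied from the table above, written as CSV text.
csv_text = """SEX,AGE,POPESTIMATE2010,POPESTIMATE2014
0,0,3951330,3949775
0,1,3957888,3949776
2,999,157258820,161952064
"""

# read_csv parses a file-like object the same way it parses a URL or a path.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (3, 4)
```

This can be handy for experimenting with a handful of rows before loading the full file.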

A description of the table appears online. The SEX column contains numeric codes: 0 stands for the total, 1 for male, and 2 for female. The AGE column contains ages in completed years, but the special value 999 marks a row that sums the population over all ages. The rest of the columns contain estimates of the US population.
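The coding scheme described above can be made explicit in code. The sketch below uses a small made-up table with the same SEX and AGE column names; the POP values are invented, and the names `toy`, `sex_labels`, `totals`, and `by_age` are hypothetical. It shows one possible way to label the codes and to separate the AGE 999 summary rows from the single-year rows.

```python
import pandas as pd

# Toy rows that mimic the layout of the census table; the POP values are made up.
toy = pd.DataFrame({
    "SEX": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "AGE": [0, 1, 999, 0, 1, 999, 0, 1, 999],
    "POP": [30, 50, 80, 16, 26, 42, 14, 24, 38],
})

# Codes as described in the text: 0 = total, 1 = male, 2 = female.
sex_labels = {0: "Total", 1: "Male", 2: "Female"}
toy["SEX_LABEL"] = toy["SEX"].map(sex_labels)

# AGE == 999 marks a summary row (all ages combined), not a real age.
totals = toy[toy["AGE"] == 999]
by_age = toy[toy["AGE"] != 999]

# In this toy table each summary row equals the sum of the single-year rows.
check = by_age.groupby("SEX")["POP"].sum()
print(check.tolist())  # [80, 42, 38]
```

Separating the summary rows early helps avoid counting the same people twice in later sums.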

Typically, a public table will contain more information than necessary for a particular investigation or analysis. In this case, let us suppose that we are only interested in the population changes from 2010 to 2014. Let us select the relevant columns.

partial_census_table = full_census_table[['AGE', 'POPESTIMATE2010', 'POPESTIMATE2014']]
partial_census_table.head(10)

	AGE	POPESTIMATE2010	POPESTIMATE2014
0	0	3951330	3949775
1	1	3957888	3949776
2	2	4090862	3959664
3	3	4111920	4007079
4	4	4077551	4005716
5	5	4064653	4006900
6	6	4073013	4135930
7	7	4043046	4155326
8	8	4025604	4120903
9	9	4125415	4108349

We can also simplify the labels of the selected columns.

us_pop = partial_census_table.rename(columns={'POPESTIMATE2010': '2010', 'POPESTIMATE2014':'2014'})
us_pop.head(10)

	AGE	2010	2014
0	0	3951330	3949775
1	1	3957888	3949776
2	2	4090862	3959664
3	3	4111920	4007079
4	4	4077551	4005716
5	5	4064653	4006900
6	6	4073013	4135930
7	7	4043046	4155326
8	8	4025604	4120903
9	9	4125415	4108349

We now have a table that is easy to work with. Each column of the DataFrame is a Series of the same length, so columns can be combined element by element using arithmetic. Here is the change in population between 2010 and 2014.

us_pop['2014'] - us_pop['2010']

0        -1555
1        -8112
2      -131198
3      -104841
4       -71835
        ...
301       8661
302       8676
303       6443
304      12950
305    4693244
Length: 306, dtype: int64
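Arithmetic between two columns is carried out row by row, with values matched by their index labels. A minimal sketch using the first three rows of the 2010 and 2014 estimates shown above (the names `toy` and `diff` are hypothetical):

```python
import pandas as pd

# First three rows of the 2010 and 2014 columns from the table above.
toy = pd.DataFrame({
    "2010": [3951330, 3957888, 4090862],
    "2014": [3949775, 3949776, 3959664],
})

# Subtraction pairs up the entries that share an index label (0, 1, 2).
diff = toy["2014"] - toy["2010"]
print(diff.tolist())  # [-1555, -8112, -131198]
```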

Let us augment us_pop with two columns that contain these changes: one in absolute terms and one as a percent of the 2010 value.

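# Absolute change in each row from 2010 to 2014
change = us_pop['2014'] - us_pop['2010']

# census refers to the same DataFrame as us_pop, so the new columns are added to us_pop as well
census = us_pop
census['Change'] = change

# Stored as a fraction of the 2010 value; the format string below displays it as a percent
census['Percent Change'] = change / us_pop['2010']

census.head().style.format({'Percent Change': "{:,.2%}"})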

	AGE	2010	2014	Change	Percent Change
0	0	3951330	3949775	-1555	-0.04%
1	1	3957888	3949776	-8112	-0.20%
2	2	4090862	3959664	-131198	-3.21%
3	3	4111920	4007079	-104841	-2.55%
4	4	4077551	4005716	-71835	-1.76%
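The Percent Change column stores fractions rather than percentages; the `%` format code multiplies by 100 only when the value is displayed. A quick check in plain Python, using the age 0 values from the first row of the table above (the variable names are chosen here for illustration):

```python
# Values for age 0 taken from the first row of the table.
pop_2010 = 3951330
pop_2014 = 3949775

change = pop_2014 - pop_2010   # absolute change
fraction = change / pop_2010   # relative change as a fraction

# "{:,.2%}" multiplies by 100, rounds to two decimals, and appends "%".
formatted = "{:,.2%}".format(fraction)
print(change, formatted)  # -1555 -0.04%
```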

Sorting the data. Let us sort the table in decreasing order of the absolute change in population.

census.sort_values('Change', ascending=False).head(10).style.format({'Percent Change': "{:,.2%}"})

Not surprisingly, the top row of the sorted table is the line that corresponds to the entire population: both sexes and all age groups. From 2010 to 2014, the population of the United States increased by about 9.5 million people, a change of just over 3%.

The next two rows correspond to all the men and all the women respectively. The male population grew more than the female population, both in absolute and percentage terms. Both percent changes were around 3%.

Now take a look at the next few rows. The percent change jumps from about 3% for the overall population to almost 30% for the people in their late sixties and early seventies. This stunning change contributes to what is known as the greying of America.
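When `sort_values` is combined with `head`, the order of the calls matters. Sorting the whole table and then taking the top rows gives the largest changes overall, whereas taking the first rows and then sorting only reorders those rows. A small sketch with invented numbers (the table `toy` and the result names are hypothetical):

```python
import pandas as pd

# Invented values; the largest change sits in the last row.
toy = pd.DataFrame({
    "AGE": [0, 1, 2, 3, 999],
    "Change": [-5, 12, -30, 7, 400],
})

# Sort all rows first, then keep the top two.
top_overall = toy.sort_values("Change", ascending=False).head(2)

# Keep the first two rows, then sort just those two.
top_of_first = toy.head(2).sort_values("Change", ascending=False)

print(top_overall["AGE"].tolist())   # [999, 1]
print(top_of_first["AGE"].tolist())  # [1, 0]
```

Only the first form can surface a row, such as the AGE 999 total, that sits far down the original table.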

Setting aside the three total rows, by far the greatest absolute change was among those aged 64 to 67 in 2014. What could explain this large increase? We can explore this question by examining the years in which the relevant groups were born.

Those who were in the 64-67 age group in 2010 were born in the years 1943 to 1946. The attack on Pearl Harbor was in late 1941, and by 1942 U.S. forces were heavily engaged in a massive war that ended in 1945.
Those who were 64 to 67 years old in 2014 were born in the years 1947 to 1950, at the height of the post-WWII baby boom in the United States.

The post-war jump in births is the major reason for the large changes that we have observed.
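The birth years quoted above follow from subtracting age from the year of the estimate. A rough check in plain Python; it ignores where birthdays fall relative to the date of the estimate, so an individual's actual birth year can differ by one, and the function name `birth_years` is hypothetical:

```python
def birth_years(estimate_year, ages):
    """Approximate birth year for each age: estimate year minus age."""
    return sorted(estimate_year - age for age in ages)

ages = range(64, 68)  # ages 64, 65, 66, 67

print(birth_years(2010, ages))  # [1943, 1944, 1945, 1946]
print(birth_years(2014, ages))  # [1947, 1948, 1949, 1950]
```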

