{ "cells": [ { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "path_data = '../../data/'\n", "\n", "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Example: Population Trends\n", "\n", "We are now ready to work with large tables of data. The file below contains \"Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States.\" Notice that `read_table` can read data directly from a URL." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SEXAGECENSUS2010POPESTIMATESBASE2010POPESTIMATE2010POPESTIMATE2011POPESTIMATE2012POPESTIMATE2013POPESTIMATE2014POPESTIMATE2015
00039441533944160395133039630873926540393114139497753978038
10139780703978090395788839665513977939394287239497763968564
20240969294096939409086239715653980095399272039596643966583
30341190404119051411192041024703983157399273440070793974061
40440631704063186407755141222944112849399444940057164020035
.................................
3012975358253605541185715959533612556277969285
3022983664136675375324011642857443594620847272
3032992619326214260742703029320311123251734064
30421004420244246450584755650661539025800861886
3052999156964212156969328157258820158427085159581546160720625161952064163189523
\n", "

306 rows × 10 columns

\n", "
" ], "text/plain": [ " SEX AGE CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 \\\n", "0 0 0 3944153 3944160 3951330 \n", "1 0 1 3978070 3978090 3957888 \n", "2 0 2 4096929 4096939 4090862 \n", "3 0 3 4119040 4119051 4111920 \n", "4 0 4 4063170 4063186 4077551 \n", ".. ... ... ... ... ... \n", "301 2 97 53582 53605 54118 \n", "302 2 98 36641 36675 37532 \n", "303 2 99 26193 26214 26074 \n", "304 2 100 44202 44246 45058 \n", "305 2 999 156964212 156969328 157258820 \n", "\n", " POPESTIMATE2011 POPESTIMATE2012 POPESTIMATE2013 POPESTIMATE2014 \\\n", "0 3963087 3926540 3931141 3949775 \n", "1 3966551 3977939 3942872 3949776 \n", "2 3971565 3980095 3992720 3959664 \n", "3 4102470 3983157 3992734 4007079 \n", "4 4122294 4112849 3994449 4005716 \n", ".. ... ... ... ... \n", "301 57159 59533 61255 62779 \n", "302 40116 42857 44359 46208 \n", "303 27030 29320 31112 32517 \n", "304 47556 50661 53902 58008 \n", "305 158427085 159581546 160720625 161952064 \n", "\n", " POPESTIMATE2015 \n", "0 3978038 \n", "1 3968564 \n", "2 3966583 \n", "3 3974061 \n", "4 4020035 \n", ".. ... \n", "301 69285 \n", "302 47272 \n", "303 34064 \n", "304 61886 \n", "305 163189523 \n", "\n", "[306 rows x 10 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# As of Jan 2017, this census file is online here: \n", "data = 'http://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/asrh/nc-est2015-agesex-res.csv'\n", "\n", "# A local copy can be accessed here in case census.gov moves the file:\n", "# data = path_data + 'nc-est2015-agesex-res.csv'\n", "\n", "full_census_table = pd.read_csv(data)\n", "full_census_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only the first 5 and last 5 rows of the DataFrame are displayed. Later we will see how to display the entire DataFrame; however, this is typically not useful with large tables.\n", "\n", "a [description of the table](http://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/asrh/nc-est2015-agesex-res.pdf) appears online. The `SEX` column contains numeric codes: `0` stands for the total, `1` for male, and `2` for female. The `AGE` column contains ages in completed years, but the special value `999` is a sum of the total population. The rest of the columns contain estimates of the US population.\n", "\n", "Typically, a public table will contain more information than necessary for a particular investigation or analysis. In this case, let us suppose that we are only interested in the population changes from 2010 to 2014. Let us `select` the relevant columns." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SEXAGEPOPESTIMATE2010POPESTIMATE2014
00039513303949775
10139578883949776
20240908623959664
30341119204007079
40440775514005716
50540646534006900
60640730134135930
70740430464155326
80840256044120903
90941254154108349
\n", "
" ], "text/plain": [ " SEX AGE POPESTIMATE2010 POPESTIMATE2014\n", "0 0 0 3951330 3949775\n", "1 0 1 3957888 3949776\n", "2 0 2 4090862 3959664\n", "3 0 3 4111920 4007079\n", "4 0 4 4077551 4005716\n", "5 0 5 4064653 4006900\n", "6 0 6 4073013 4135930\n", "7 0 7 4043046 4155326\n", "8 0 8 4025604 4120903\n", "9 0 9 4125415 4108349" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "partial_census_table = full_census_table[['SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2014']]\n", "partial_census_table.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also simplify the labels of the selected columns." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SEXAGE20102014
00039513303949775
10139578883949776
20240908623959664
30341119204007079
40440775514005716
50540646534006900
60640730134135930
70740430464155326
80840256044120903
90941254154108349
\n", "
" ], "text/plain": [ " SEX AGE 2010 2014\n", "0 0 0 3951330 3949775\n", "1 0 1 3957888 3949776\n", "2 0 2 4090862 3959664\n", "3 0 3 4111920 4007079\n", "4 0 4 4077551 4005716\n", "5 0 5 4064653 4006900\n", "6 0 6 4073013 4135930\n", "7 0 7 4043046 4155326\n", "8 0 8 4025604 4120903\n", "9 0 9 4125415 4108349" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "us_pop = partial_census_table.rename(columns={'POPESTIMATE2010': '2010', 'POPESTIMATE2014':'2014'})\n", "us_pop.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a table that is easy to work with. Each column of the table is an array of the same length, and so columns can be combined using arithmetic. Here is the change in population between 2010 and 2014." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 -1555\n", "1 -8112\n", "2 -131198\n", "3 -104841\n", "4 -71835\n", " ... \n", "301 8661\n", "302 8676\n", "303 6443\n", "304 12950\n", "305 4693244\n", "Length: 306, dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "us_pop['2014'] - us_pop['2010']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us augment `us_pop` with a column that contains these changes, both in absolute terms and as percents relative to the value in 2010." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SEX AGE 2010 2014 Change Percent Change
00039513303949775-1555-0.04%
10139578883949776-8112-0.20%
20240908623959664-131198-3.21%
30341119204007079-104841-2.55%
40440775514005716-71835-1.76%
" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "change = us_pop['2014'] - us_pop['2010']\n", "\n", "\n", "census = us_pop\n", "\n", "census['Change'] = change\n", "\n", "census['Percent Change'] = change/us_pop['2010']\n", "\n", "census.head().style.format({'Percent Change': \"{:,.2%}\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Sorting the data.** Let us sort the table in decreasing order of the absolute change in population." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SEX AGE 2010 2014 Change Percent Change
00039513303949775-1555-0.04%
10139578883949776-8112-0.20%
40440775514005716-71835-1.76%
30341119204007079-104841-2.55%
20240908623959664-131198-3.21%
" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "census.head().sort_values('Change', ascending=False).style.format({'Percent Change': \"{:,.2%}\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not surprisingly, the top row of the sorted table is the line that corresponds to the entire population: both sexes and all age groups. From 2010 to 2014, the population of the United States increased by about 9.5 million people, a change of just over 3%.\n", "\n", "The next two rows correspond to all the men and all the women respectively. The male population grew more than the female population, both in absolute and percentage terms. Both percent changes were around 3%.\n", "\n", "Now take a look at the next few rows. The percent change jumps from about 3% for the overall population to almost 30% for the people in their late sixties and early seventies. This stunning change contributes to what is known as the greying of America.\n", "\n", "By far the greatest absolute change was among those in the 64-67 agegroup in 2014. What could explain this large increase? We can explore this question by examining the years in which the relevant groups were born.\n", "\n", "- Those who were in the 64-67 age group in 2010 were born in the years 1943 to 1946. The attack on Pearl Harbor was in late 1941, and by 1942 U.S. forces were heavily engaged in a massive war that ended in 1945. \n", "\n", "- Those who were 64 to 67 years old in 2014 were born in the years 1947 to 1950, at the height of the post-WWII baby boom in the United States. \n", "\n", "The post-war jump in births is the major reason for the large changes that we have observed." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.12" } }, "nbformat": 4, "nbformat_minor": 2 }