1. Visualizing Categorical Distributions
Data come in many forms that are not numerical. Data can be pieces of music, or places on a map. They can also be categories into which you can place individuals. Here are some examples of categorical variables.
The individuals are cartons of ice-cream, and the variable is the flavor in the carton.
The individuals are professional basketball players, and the variable is the player’s team.
The individuals are years, and the variable is the genre of the highest grossing movie of the year.
The individuals are survey respondents, and the variable is the response they choose from among “Not at all satisfied,” “Somewhat satisfied,” and “Very satisfied.”
The DataFrame icecream contains data on 30 cartons of ice-cream.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

icecream = pd.DataFrame({
    'Flavor': np.array(['Chocolate', 'Strawberry', 'Vanilla']),
    'Number of Cartons': np.array([16, 5, 9])
})
icecream
 | Flavor | Number of Cartons |
---|---|---|
0 | Chocolate | 16 |
1 | Strawberry | 5 |
2 | Vanilla | 9 |
The values of the categorical variable “flavor” are chocolate, strawberry, and vanilla. The DataFrame shows the number of cartons of each flavor. We call this a distribution table. A distribution shows all the values of a variable, along with the frequency of each one.
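As a small sketch of what the distribution table encodes, the frequencies can be summed to recover the number of individuals and divided by that total to give proportions. The column name Proportion and the variable names below are chosen here for illustration and are not part of the original table.

```python
import pandas as pd

icecream = pd.DataFrame({
    'Flavor': ['Chocolate', 'Strawberry', 'Vanilla'],
    'Number of Cartons': [16, 5, 9],
})

# The frequencies in a distribution table account for every individual (carton).
total = icecream['Number of Cartons'].sum()

# Add a column of proportions; 'Proportion' is an illustrative name.
icecream_with_proportions = icecream.assign(
    Proportion=icecream['Number of Cartons'] / total
)

print(total)
print(icecream_with_proportions)
```

The proportions add up to 1, just as the frequencies add up to the 30 cartons.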
1.1. Bar Chart (pandas)
The bar chart (as opposed to the histogram) is a familiar way of visualizing categorical distributions. It displays a bar for each category. The bars are equally spaced and equally wide. The length of each bar is proportional to the frequency of the corresponding category.
We will draw bar charts with horizontal bars because it’s easier to label the bars that way. The pandas plotting method is therefore called barh. It takes two arguments: the first is the column label of the categories, and the second is the column label of the frequencies.
icecream.plot.barh('Flavor', 'Number of Cartons')
plt.show()
If the table consists just of a column of categories and a column of frequencies, as in icecream, the method call is even simpler. You can just specify the column containing the categories, and barh will use the values in the other column as frequencies.
icecream.plot.barh('Flavor')
plt.show()
1.1.1. Features of Categorical Distributions
Apart from purely visual differences, there is an important fundamental distinction between bar charts and the two graphs that we saw in the previous sections. Those were the scatter plot and the line plot, both of which display two numerical variables – the variables on both axes are numerical. In contrast, the bar chart has categories on one axis and numerical frequencies on the other.
This has consequences for the chart. First, the width of each bar and the space between consecutive bars is entirely up to the person who is producing the graph, or to the program being used to produce it. Python made those choices for us. If you were to draw the bar graph by hand, you could make completely different choices and still have a perfectly correct bar graph, provided you drew all the bars with the same width and kept all the spaces the same.
Most importantly, the bars can be drawn in any order. The categories “chocolate,” “vanilla,” and “strawberry” have no universal rank order, unlike for example the numbers 5, 7, and 10.
This means that we can draw a bar chart that is easier to interpret, by rearranging the bars in decreasing order. To do this, we first rearrange the rows of icecream in decreasing order of Number of Cartons, and then draw the bar chart.
icecream.sort_values('Number of Cartons', ascending=False).plot.barh('Flavor')
plt.show()
This bar chart contains exactly the same information as the previous ones, but it is a little easier to read. While this is not a huge gain in reading a chart with just three bars, it can be quite significant when the number of categories is large.
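One detail worth knowing when sorting: barh places the first row of the DataFrame at the bottom of the chart, so after a decreasing sort the longest bar sits at the bottom. The sketch below shows one possible way to put the longest bar at the top instead, by flipping the vertical axis with matplotlib's invert_yaxis. The Agg backend line is an assumption made so that the example runs without a display; it is not needed in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

icecream = pd.DataFrame({
    'Flavor': ['Chocolate', 'Strawberry', 'Vanilla'],
    'Number of Cartons': [16, 5, 9],
})

# Sort in decreasing order of frequency, then draw horizontal bars.
ax = icecream.sort_values('Number of Cartons', ascending=False).plot.barh('Flavor')

# Flip the vertical axis so the first row (the largest count) is drawn at the top.
ax.invert_yaxis()
plt.show()
```

Sorting in increasing order and leaving the axis alone would give the same picture; which to use is a matter of taste.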
1.1.2. Grouping Categorical Data
To construct the DataFrame icecream, someone had to look at all 30 cartons of ice-cream and count the number of each flavor. But if our data does not already include frequencies, we have to compute the frequencies before we can draw a bar chart. Here is an example where this is necessary.
The DataFrame top contains the top-grossing movies of all time in the U.S.A. The first column contains the title of the movie; Star Wars: The Force Awakens has the top rank, with a box office gross amount of more than 900 million dollars in the United States. The second column contains the name of the studio that produced the movie. The third contains the domestic box office gross in dollars, and the fourth contains the gross amount that would have been earned from ticket sales at 2016 prices. The fifth contains the release year of the movie.
There are 200 movies on the list, ordered by unadjusted gross receipts. The display below shows the first five and the last five rows.
top = pd.read_csv(path_data + 'top_movies.csv')
top
 | Title | Studio | Gross | Gross (Adjusted) | Year |
---|---|---|---|---|---|
0 | Star Wars: The Force Awakens | Buena Vista (Disney) | 906723418 | 906723400 | 2015 |
1 | Avatar | Fox | 760507625 | 846120800 | 2009 |
2 | Titanic | Paramount | 658672302 | 1178627900 | 1997 |
3 | Jurassic World | Universal | 652270625 | 687728000 | 2015 |
4 | Marvel's The Avengers | Buena Vista (Disney) | 623357910 | 668866600 | 2012 |
... | ... | ... | ... | ... | ... |
195 | The Caine Mutiny | Columbia | 21750000 | 386173500 | 1954 |
196 | The Bells of St. Mary's | RKO | 21333333 | 545882400 | 1945 |
197 | Duel in the Sun | Selz. | 20408163 | 443877500 | 1946 |
198 | Sergeant York | Warner Bros. | 16361885 | 418671800 | 1941 |
199 | The Four Horsemen of the Apocalypse | MPC | 9183673 | 399489800 | 1921 |
200 rows × 5 columns
The Disney subsidiary Buena Vista shows up frequently in the top ten, as do Fox and Warner Brothers. Which studios will appear most frequently if we look among all 200 rows?
To figure this out, first notice that all we need is a table with the movies and the studios; the other information is unnecessary. Again, notice that we use two square brackets because we wish to select more than one column.
movies_and_studios = top[['Title', 'Studio']]
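The difference between single and double brackets can be seen in a minimal sketch. The three-row DataFrame named sample below is a stand-in built from rows shown in the display above, since the full top_movies.csv file is not loaded here.

```python
import pandas as pd

# Three rows copied from the display above, for illustration only.
sample = pd.DataFrame({
    'Title': ['Avatar', 'Titanic', 'Jurassic World'],
    'Studio': ['Fox', 'Paramount', 'Universal'],
    'Year': [2009, 1997, 2015],
})

# Single brackets with one label select one column as a Series.
one_column = sample['Title']

# A list of labels inside the brackets selects several columns as a DataFrame.
two_columns = sample[['Title', 'Studio']]

print(type(one_column).__name__)
print(type(two_columns).__name__)
```

The outer brackets perform the selection, and the inner brackets are an ordinary Python list of column labels.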
The DataFrame methods groupby() and count() allow us to count how frequently each studio appears in the table, by calling each studio a category and assigning each row to one category. The groupby method takes as its argument the label of the column that contains the categories, and count() returns a table of counts of the rows in each category. More precisely, count() counts the non-missing values in every column other than the grouping column; movies_and_studios has only one such column, Title, so the result has a single column of counts labeled Title. Notice how the name of the grouping column (the category) is shown below the name of the column being counted. There are a number of ways by which we could ‘remove’ the name of the count column. Can you think of any?
movies_and_studios.groupby(['Studio']).count()
 | Title |
---|---|
Studio | |
AVCO | 1 |
Buena Vista (Disney) | 29 |
Columbia | 10 |
Disney | 11 |
Dreamworks | 3 |
Fox | 26 |
IFC | 1 |
Lionsgate | 3 |
MGM | 7 |
MPC | 1 |
NM | 1 |
New Line | 5 |
Orion | 1 |
Paramount | 25 |
Paramount/Dreamworks | 4 |
RKO | 3 |
Selz. | 1 |
Sony | 6 |
Sum. | 2 |
TriS | 2 |
UA | 6 |
Universal | 22 |
Warner Bros. | 29 |
Warner Bros. (New Line) | 1 |
Thus groupby creates a distribution table that shows how the movies are distributed among the categories (studios).
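As one possible answer to the question about the count column's name, the sketch below compares three ways of producing the same frequencies. It uses only the ten rows displayed earlier (the DataFrame named sample is a stand-in for the full table), so its counts are not the counts for all 200 movies.

```python
import pandas as pd

# The ten rows displayed earlier; a stand-in for the full 200-row table.
sample = pd.DataFrame({
    'Title': [
        'Star Wars: The Force Awakens', 'Avatar', 'Titanic', 'Jurassic World',
        "Marvel's The Avengers", 'The Caine Mutiny', "The Bells of St. Mary's",
        'Duel in the Sun', 'Sergeant York', 'The Four Horsemen of the Apocalypse',
    ],
    'Studio': [
        'Buena Vista (Disney)', 'Fox', 'Paramount', 'Universal',
        'Buena Vista (Disney)', 'Columbia', 'RKO', 'Selz.', 'Warner Bros.', 'MPC',
    ],
})

# count() on a selected column: non-missing values of Title in each group.
by_count = sample.groupby('Studio')['Title'].count()

# size(): the number of rows in each group, regardless of missing values.
by_size = sample.groupby('Studio').size()

# value_counts(): frequencies of each studio, sorted by decreasing count.
by_value_counts = sample['Studio'].value_counts()

print(by_size)
```

Because this sample has no missing titles, all three give the same frequencies; they differ only in how the result is labeled and ordered.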
We can now use this table, along with the graphing skills that we acquired above, to draw a bar chart that shows which studios are most frequent among the 200 highest grossing movies.
# We could also use the .size() method, which returns the number of rows in each group.
studio_distribution1 = movies_and_studios.groupby(['Studio'], as_index=False).count()
studio_distribution = studio_distribution1.copy()
studio_distribution.sort_values(by=['Title']).plot.barh('Studio', width=0.8, figsize=(10,18))
plt.xlabel('count')
plt.show()
Warner Brothers and Buena Vista are the most common studios among the top 200 movies. Warner Brothers produces the Harry Potter movies and Buena Vista produces Star Wars.
Because total gross receipts are being measured in unadjusted dollars, it is not very surprising that the top movies are more frequently from recent years than from bygone decades. In absolute terms, movie tickets cost more now than they used to, and thus gross receipts are higher. This is borne out by a bar chart that shows the distribution of the 200 movies by year of release.
movies_and_years = top[['Title', 'Year']]
movies_and_years1 = movies_and_years.copy()
movies_and_years = movies_and_years.groupby(['Year'], as_index=False).count()
movies_and_years.sort_values(by=['Title']).plot.barh('Year', width=0.8, figsize=(10,38), color="blue")
plt.show()
All of the longest bars correspond to years after 2000. This is consistent with our observation that recent years should be among the most frequent.
1.1.3. Towards numerical variables
There is something unsettling about this chart. Though it does answer the question of which release years appear most frequently among the 200 top grossing movies, it doesn’t list all the years in chronological order. It is treating Year as a categorical variable.
But years are fixed chronological units that do have an order. They also have fixed numerical spacings relative to each other. Let’s see what happens when we try to take that into account.
By default, groupby sorts the categories (years) from lowest to highest, and barh draws the rows in the order in which they appear. So we will run the code without sorting by count.
# Regroup by year; groupby returns the years in increasing order.
movies_and_years = movies_and_years1.groupby(['Year'], as_index=False).count()
movies_and_years.plot.barh('Year', width=0.8, figsize=(10,38), color='blue')
plt.show()
Now the years are in increasing order. But there is still something disquieting about this bar chart. The bars at 1921 and 1937 are just as far apart from each other as the bars at 1937 and 1939. The bar chart doesn’t show that none of the 200 movies were released in the years 1922 through 1936, nor in 1938. Such inconsistencies and omissions make the distribution in the early years hard to understand based on this visualization.
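To make the omissions concrete, the sketch below counts movies per year and then fills in every missing year with a count of zero. It uses only the release years of the ten rows displayed earlier, so it is an illustration under that assumption rather than the distribution of all 200 movies, and it is not the approach taken in the next section.

```python
import pandas as pd

# Release years of the ten rows displayed earlier (a stand-in for all 200 movies).
years = pd.Series([2015, 2009, 1997, 2015, 2012, 1954, 1945, 1946, 1941, 1921],
                  name='Year')

# Frequency of each year that appears, in chronological order.
counts = years.value_counts().sort_index()

# Every year from the earliest to the latest, including years with no movies.
all_years = range(counts.index.min(), counts.index.max() + 1)

# Reindex so that absent years appear explicitly with a count of zero.
counts_with_gaps = counts.reindex(all_years, fill_value=0)

print(len(counts), len(counts_with_gaps))
```

The categorical bar chart draws only the years in counts; the zeros in counts_with_gaps are exactly the information it leaves out.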
Bar charts are intended as visualizations of categorical variables. When the variable is numerical, the numerical relations between its values have to be taken into account when we create visualizations. That is the topic of the next section.