1. Visualizing Categorical Distributions

Data come in many forms that are not numerical. Data can be pieces of music, or places on a map. They can also be categories into which you can place individuals. Here are some examples of categorical variables.

  • The individuals are cartons of ice-cream, and the variable is the flavor in the carton.

  • The individuals are professional basketball players, and the variable is the player’s team.

  • The individuals are years, and the variable is the genre of the highest grossing movie of the year.

  • The individuals are survey respondents, and the variable is the response they choose from among “Not at all satisfied,” “Somewhat satisfied,” and “Very satisfied.”

The DataFrame icecream contains data on 30 cartons of ice-cream.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

icecream = pd.DataFrame({
    'Flavor':np.array(['Chocolate', 'Strawberry', 'Vanilla']),
    'Number of Cartons':np.array([16, 5, 9])}
)
icecream
       Flavor  Number of Cartons
0   Chocolate                 16
1  Strawberry                  5
2     Vanilla                  9

The values of the categorical variable “flavor” are chocolate, strawberry, and vanilla. The DataFrame shows the number of cartons of each flavor. We call this a distribution table. A distribution shows all the values of a variable, along with the frequency of each one.
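As a quick check on the idea of a distribution, the frequencies in a distribution table should add up to the number of individuals. The sketch below rebuilds the same table and also computes the proportion of cartons of each flavor. The column label Proportion and the name with_proportions are introduced here only for illustration.

```python
import pandas as pd

icecream = pd.DataFrame({
    'Flavor': ['Chocolate', 'Strawberry', 'Vanilla'],
    'Number of Cartons': [16, 5, 9],
})

# The frequencies in a distribution table add up to the number of individuals
total = icecream['Number of Cartons'].sum()

# 'Proportion' is a column added for illustration; it is not part of the original table
with_proportions = icecream.assign(
    Proportion=icecream['Number of Cartons'] / total
)

print(total)
print(with_proportions)
```

Dividing by the total turns counts into proportions, which makes distributions over different numbers of individuals easier to compare.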

1.1. Bar Chart (pandas)

The bar chart (not to be confused with the histogram) is a familiar way of visualizing categorical distributions. It displays a bar for each category. The bars are equally spaced and equally wide. The length of each bar is proportional to the frequency of the corresponding category.

We will draw bar charts with horizontal bars because it’s easier to label the bars that way. The pandas plotting method is therefore barh, called as DataFrame.plot.barh. It takes two arguments: the first is the column label of the categories, and the second is the column label of the frequencies.

icecream.plot.barh('Flavor', 'Number of Cartons')
plt.show()
[Figure: horizontal bar chart of Number of Cartons for each Flavor]

If the table consists just of a column of categories and a column of frequencies, as in icecream, the method call is even simpler. You can just specify the column containing the categories, and barh will use the values in the other column as frequencies.

icecream.plot.barh('Flavor')
plt.show()
[Figure: the same horizontal bar chart of Number of Cartons for each Flavor]
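The two column labels can also be passed by keyword, and the call returns a Matplotlib Axes object that can be inspected or relabeled. The sketch below is one way to confirm that each bar's length equals the frequency of its category. The Agg backend line is an assumption made so that the code runs without a display; it is not needed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

icecream = pd.DataFrame({
    'Flavor': ['Chocolate', 'Strawberry', 'Vanilla'],
    'Number of Cartons': [16, 5, 9],
})

# Same chart as above, with the two column labels passed by keyword
ax = icecream.plot.barh(x='Flavor', y='Number of Cartons', legend=False)
ax.set_xlabel('Number of Cartons')

# One rectangle per category; for a horizontal bar, get_width() is its length
bar_lengths = [bar.get_width() for bar in ax.patches]
print(bar_lengths)
```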

1.1.1. Features of Categorical Distributions

Apart from purely visual differences, there is an important fundamental distinction between bar charts and the two graphs that we saw in the previous sections. Those were the scatter plot and the line plot, both of which display two numerical variables – the variables on both axes are numerical. In contrast, the bar chart has categories on one axis and numerical frequencies on the other.

This has consequences for the chart. First, the width of each bar and the space between consecutive bars are entirely up to the person who is producing the graph, or to the program being used to produce it. Python made those choices for us. If you were to draw the bar graph by hand, you could make completely different choices and still have a perfectly correct bar graph, provided you drew all the bars with the same width and kept all the spaces the same.
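To illustrate that bar thickness is a free choice, the sketch below draws the same distribution twice with different values of the width argument and checks that the bar lengths are unchanged. The values 0.3 and 0.9 and the names ax_thin and ax_thick are arbitrary choices made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

icecream = pd.DataFrame({
    'Flavor': ['Chocolate', 'Strawberry', 'Vanilla'],
    'Number of Cartons': [16, 5, 9],
})

fig, (ax_thin, ax_thick) = plt.subplots(1, 2, figsize=(10, 3))

# Same data drawn twice; only the thickness of the bars differs
icecream.plot.barh('Flavor', 'Number of Cartons', width=0.3, ax=ax_thin, legend=False)
icecream.plot.barh('Flavor', 'Number of Cartons', width=0.9, ax=ax_thick, legend=False)

# The bar lengths, which carry the frequencies, are the same in both charts
thin_lengths = [bar.get_width() for bar in ax_thin.patches]
thick_lengths = [bar.get_width() for bar in ax_thick.patches]
print(thin_lengths == thick_lengths)
```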

Most importantly, the bars can be drawn in any order. The categories “chocolate,” “vanilla,” and “strawberry” have no universal rank order, unlike for example the numbers 5, 7, and 10.

This means that we can draw a bar chart that is easier to interpret, by rearranging the bars in decreasing order. To do this, we first rearrange the rows of icecream in decreasing order of Number of Cartons, and then draw the bar chart.

icecream.sort_values('Number of Cartons', ascending=False).plot.barh('Flavor')
plt.show()
[Figure: horizontal bar chart of Number of Cartons for each Flavor, with the bars sorted by length]

This bar chart contains exactly the same information as the previous ones, but it is a little easier to read. While this is not a huge gain in reading a chart with just three bars, it can be quite significant when the number of categories is large.
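One detail to keep in mind when reading the sorted chart: pandas draws the first row of the DataFrame at the bottom of a horizontal bar chart, so sorting in decreasing order places the longest bar at the bottom. If you prefer the longest bar at the top, one option is to sort in increasing order instead, as in this sketch. The Agg backend line is again only an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

icecream = pd.DataFrame({
    'Flavor': ['Chocolate', 'Strawberry', 'Vanilla'],
    'Number of Cartons': [16, 5, 9],
})

# The first row is drawn at the bottom, so an ascending sort
# puts the most frequent category at the top of the chart
ascending = icecream.sort_values('Number of Cartons')
ax = ascending.plot.barh('Flavor', legend=False)

# Find the bar with the highest vertical position and report its length
top_bar = max(ax.patches, key=lambda bar: bar.get_y())
print(top_bar.get_width())
```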

1.1.2. Grouping Categorical Data

To construct the DataFrame icecream, someone had to look at all 30 cartons of ice-cream and count the number of each flavor. But if our data does not already include frequencies, we have to compute the frequencies before we can draw a bar chart. Here is an example where this is necessary.

The DataFrame top consists of the top grossing movies of all time in the U.S.A. The first column contains the title of the movie; Star Wars: The Force Awakens has the top rank, with a box office gross amount of more than 900 million dollars in the United States. The second column contains the name of the studio that produced the movie. The third contains the domestic box office gross in dollars, and the fourth contains the gross amount that would have been earned from ticket sales at 2016 prices. The fifth contains the release year of the movie.

There are 200 movies on the list, ranked by unadjusted gross receipts. pandas abbreviates the display to the first five and the last five rows.

top = pd.read_csv(path_data + 'top_movies.csv')
top
     Title                                Studio                    Gross  Gross (Adjusted)  Year
0    Star Wars: The Force Awakens         Buena Vista (Disney)  906723418         906723400  2015
1    Avatar                               Fox                   760507625         846120800  2009
2    Titanic                              Paramount             658672302        1178627900  1997
3    Jurassic World                       Universal             652270625         687728000  2015
4    Marvel's The Avengers                Buena Vista (Disney)  623357910         668866600  2012
...  ...                                  ...                         ...               ...   ...
195  The Caine Mutiny                     Columbia               21750000         386173500  1954
196  The Bells of St. Mary's              RKO                    21333333         545882400  1945
197  Duel in the Sun                      Selz.                  20408163         443877500  1946
198  Sergeant York                        Warner Bros.           16361885         418671800  1941
199  The Four Horsemen of the Apocalypse  MPC                     9183673         399489800  1921

200 rows × 5 columns

The Disney subsidiary Buena Vista shows up frequently in the top ten, as do Fox and Warner Brothers. Which studios will appear most frequently if we look among all 200 rows?

To figure this out, first notice that all we need is a table with the movies and the studios; the other information is unnecessary. Notice that we use two pairs of square brackets because we wish to select more than one column: the inner pair is a list of column labels.

movies_and_studios = top[['Title', 'Studio']]
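The difference between single and double square brackets can be seen on a small table. The sketch below uses three rows copied from the table above; the names top_sample, one_column, and two_columns are introduced only for this example.

```python
import pandas as pd

# Three rows copied from the table above; the real table has 200 rows
top_sample = pd.DataFrame({
    'Title': ['Avatar', 'Titanic', 'Jurassic World'],
    'Studio': ['Fox', 'Paramount', 'Universal'],
    'Year': [2009, 1997, 2015],
})

one_column = top_sample['Title']               # a single label gives a Series
two_columns = top_sample[['Title', 'Studio']]  # a list of labels gives a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(list(two_columns.columns))
```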

The DataFrame methods groupby() and count() allow us to count how frequently each studio appears in the table, by calling each studio a category and assigning each row to one category. The groupby method takes as its argument the label of the column that contains the categories, and count() then returns a table of counts of rows in each category. More precisely, count() counts the non-missing values in every column other than the grouping column; movies_and_studios has only one such column, Title, so the result has a single column of counts labeled Title. Notice how the name of the grouping column (the category) is shown below the name of the column being counted. There are a number of ways by which we could ‘remove’ the name of the count column. Can you think of any?

movies_and_studios.groupby(['Studio']).count()
                         Title
Studio
AVCO                         1
Buena Vista (Disney)        29
Columbia                    10
Disney                      11
Dreamworks                   3
Fox                         26
IFC                          1
Lionsgate                    3
MGM                          7
MPC                          1
NM                           1
New Line                     5
Orion                        1
Paramount                   25
Paramount/Dreamworks         4
RKO                          3
Selz.                        1
Sony                         6
Sum.                         2
TriS                         2
UA                           6
Universal                   22
Warner Bros.                29
Warner Bros. (New Line)      1

Thus groupby, followed by count, creates a distribution table that shows how the movies are distributed among the categories (studios).
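There is more than one way to build such a distribution table in pandas. The sketch below compares three of them on the first five rows of the table above, and shows one possible way to relabel the count column. The names sample, by_count, by_size, by_value_counts, and renamed, and the label Number of Movies, are all introduced here for illustration.

```python
import pandas as pd

# The first five rows of the table above; the real table has 200 rows
sample = pd.DataFrame({
    'Title': ['Star Wars: The Force Awakens', 'Avatar', 'Titanic',
              'Jurassic World', "Marvel's The Avengers"],
    'Studio': ['Buena Vista (Disney)', 'Fox', 'Paramount',
               'Universal', 'Buena Vista (Disney)'],
})

# count(): non-missing values in each remaining column, one row per studio
by_count = sample.groupby('Studio').count()

# size(): number of rows in each group, returned as a Series
by_size = sample.groupby('Studio').size()

# value_counts(): the same tallies, sorted from most to least frequent
by_value_counts = sample['Studio'].value_counts()

# One way to replace the 'Title' label with a more descriptive one
renamed = by_count.rename(columns={'Title': 'Number of Movies'})

print(by_count)
print(by_size)
print(by_value_counts)
print(renamed)
```

When the table has no missing titles, count() and size() give the same numbers; they differ only when some values are missing, because count() skips them.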

We can now use this table, along with the graphing skills that we acquired above, to draw a bar chart that shows which studios are most frequent among the 200 highest grossing movies.

# We could also use the .size() method in place of .count():
# it returns the number of rows in each group.

studio_distribution1 = movies_and_studios.groupby(['Studio'], as_index=False).count()
studio_distribution = studio_distribution1.copy()
studio_distribution.sort_values(by=['Title']).plot.barh('Studio', width=0.8, figsize=(10,18))

plt.xlabel('count')

plt.show()
[Figure: horizontal bar chart of the number of movies for each Studio, sorted by count]

Warner Brothers and Buena Vista are the most common studios among the top 200 movies. Warner Brothers produces the Harry Potter movies and Buena Vista produces Star Wars.

Because total gross receipts are being measured in unadjusted dollars, it is not very surprising that the top movies are more frequently from recent years than from bygone decades. In absolute terms, movie tickets cost more now than they used to, and thus gross receipts are higher. This is borne out by a bar chart that shows the distribution of the 200 movies by year of release.

movies_and_years = top[['Title', 'Year']]

movies_and_years1 = movies_and_years.copy()

movies_and_years = movies_and_years.groupby(['Year'], as_index=False).count()

movies_and_years.sort_values(by=['Title']).plot.barh('Year', width=0.8, figsize=(10,38), color="blue")

plt.show()
[Figure: horizontal bar chart of the number of movies for each Year, sorted by count]

All of the longest bars correspond to years after 2000. This is consistent with our observation that recent years should be among the most frequent.

1.1.3. Towards numerical variables

There is something unsettling about this chart. Though it does answer the question of which release years appear most frequently among the 200 top grossing movies, it doesn’t list all the years in chronological order. It is treating Year as a categorical variable.

But years are fixed chronological units that do have an order. They also have fixed numerical spacings relative to each other. Let’s see what happens when we try to take that into account.

By default, groupby sorts the categories (years) from lowest to highest, and barh draws the bars in the order of the rows. So we will run the code without sorting by count.

# groupby sorts the years in increasing order, and barh keeps the row order
years_in_order = movies_and_years1.groupby(['Year'], as_index=False).count()

years_in_order.plot.barh('Year', width=0.8, figsize=(10,38), color='blue')

plt.show()
[Figure: horizontal bar chart of the number of movies for each Year, in increasing order of Year]

Now the years are in increasing order. But there is still something disquieting about this bar chart. The bars at 1921 and 1937 are just as far apart from each other as the bars at 1937 and 1939. The bar chart doesn’t show that none of the 200 movies were released in the years 1922 through 1936, nor in 1938. Such inconsistencies and omissions make the distribution in the early years hard to understand based on this visualization.
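To see how much the categorical treatment hides, one can list every year in the range and give a count of 0 to the years that have no movies. The sketch below does this with reindex on a handful of made-up counts; the numbers are invented for illustration and are not the real movie data, and the names counts, all_years, and filled are hypothetical.

```python
import pandas as pd

# Made-up counts for a few years; NOT the real movie data
counts = pd.Series({1921: 1, 1937: 1, 1939: 2, 1940: 1}, name='Number of Movies')

# Every year from the first to the last, including years with no movies
all_years = range(counts.index.min(), counts.index.max() + 1)

# Years that are absent from the data get a count of 0
filled = counts.reindex(all_years, fill_value=0)

print(len(counts), len(filled))
print(filled.loc[1936:1940])
```

A bar chart of filled would leave visible gaps for the empty years, which is a first step towards treating Year as a numerical variable.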

Bar charts are intended as visualizations of categorical variables. When the variable is numerical, the numerical relations between its values have to be taken into account when we create visualizations. That is the topic of the next section.