1. Sorting Rows

“The NBA is the highest paying professional sports league in the world,” reported CNN in March 2016. The table nba_salaries contains the salaries of all National Basketball Association players in 2015-2016.

Each row represents one player. The columns are:

Column Label

Description

PLAYER

Player’s name

POSITION

Player’s position on team

TEAM

Team name

'15-'16 SALARY

Player’s salary in 2015-2016, in millions of dollars

The code for the positions is PG (Point Guard), SG (Shooting Guard), PF (Power Forward), SF (Small Forward), and C (Center). But what follows doesn’t involve details about how basketball is played.

The first row shows that Paul Millsap, Power Forward for the Atlanta Hawks, had a salary of almost \(\$18.7\) million in 2015-2016.

# This table can be found online: https://www.statcrunch.com/app/index.php?dataid=1843341
nba_salaries = pd.read_csv(path_data + 'nba_salaries.csv')
nba_salaries
PLAYER POSITION TEAM '15-'16 SALARY
0 Paul Millsap PF Atlanta Hawks 18.671659
1 Al Horford C Atlanta Hawks 12.000000
2 Tiago Splitter C Atlanta Hawks 9.756250
3 Jeff Teague PG Atlanta Hawks 8.000000
4 Kyle Korver SG Atlanta Hawks 5.746479
... ... ... ... ...
412 Gary Neal PG Washington Wizards 2.139000
413 DeJuan Blair C Washington Wizards 2.000000
414 Kelly Oubre Jr. SF Washington Wizards 1.920240
415 Garrett Temple SG Washington Wizards 1.100602
416 Jarell Eddie SG Washington Wizards 0.561716

417 rows × 4 columns

The table contains 417 rows, one for each player. Only 10 of the rows are displayed. The show method allows us to specify the number of rows, with the default (no specification) being all the rows of the table.

nba_salaries.head(3)
PLAYER POSITION TEAM '15-'16 SALARY
0 Paul Millsap PF Atlanta Hawks 18.671659
1 Al Horford C Atlanta Hawks 12.000000
2 Tiago Splitter C Atlanta Hawks 9.756250

Glance through about 20 rows or so, and you will see that the rows are in alphabetical order by team name. It’s also possible to list the same rows in alphabetical order by player name using the sort method. The argument to sort is a column label or index.

nba_salaries.sort_values('PLAYER').head(5)
PLAYER POSITION TEAM '15-'16 SALARY
68 Aaron Brooks PG Chicago Bulls 2.250000
291 Aaron Gordon PF Orlando Magic 4.171680
59 Aaron Harrison SG Charlotte Hornets 0.525093
235 Adreian Payne PF Minnesota Timberwolves 1.938840
1 Al Horford C Atlanta Hawks 12.000000

To examine the players’ salaries, it would be much more helpful if the data were ordered by salary.

To do this, we will first simplify the label of the column of salaries (just for convenience), and then sort by the new label SALARY.

This arranges all the rows of the table in increasing order of salary, with the lowest salary appearing first. The output is a new table with the same columns as the original but with the rows rearranged.

nba = nba_salaries.rename(columns={"'15-'16 SALARY": 'SALARY'})
nba.sort_values('SALARY')
PLAYER POSITION TEAM SALARY
267 Thanasis Antetokounmpo SF New York Knicks 0.030888
327 Cory Jefferson PF Phoenix Suns 0.049709
326 Jordan McRae SG Phoenix Suns 0.049709
324 Orlando Johnson SG Phoenix Suns 0.055722
325 Phil Pressey PG Phoenix Suns 0.055722
... ... ... ... ...
131 Dwight Howard C Houston Rockets 22.359364
255 Carmelo Anthony SF New York Knicks 22.875000
72 LeBron James SF Cleveland Cavaliers 22.970500
29 Joe Johnson SF Brooklyn Nets 24.894863
169 Kobe Bryant SF Los Angeles Lakers 25.000000

417 rows × 4 columns

These figures are somewhat difficult to compare as some of these players changed teams during the season and received salaries from more than one team; only the salary from the last team appears in the table. Point Guard Phil Pressey, for example, moved from Philadelphia to Phoenix during the year, and might be moving yet again to the Golden State Warriors.

The CNN report is about the other end of the salary scale – the players who are among the highest paid in the world.

To order the rows of the table in decreasing order of salary, we must use sort with the option ascending=False.

nba.sort_values('SALARY', ascending=False)
PLAYER POSITION TEAM SALARY
169 Kobe Bryant SF Los Angeles Lakers 25.000000
29 Joe Johnson SF Brooklyn Nets 24.894863
72 LeBron James SF Cleveland Cavaliers 22.970500
255 Carmelo Anthony SF New York Knicks 22.875000
131 Dwight Howard C Houston Rockets 22.359364
... ... ... ... ...
200 Elliot Williams SG Memphis Grizzlies 0.055722
324 Orlando Johnson SG Phoenix Suns 0.055722
327 Cory Jefferson PF Phoenix Suns 0.049709
326 Jordan McRae SG Phoenix Suns 0.049709
267 Thanasis Antetokounmpo SF New York Knicks 0.030888

417 rows × 4 columns

Kobe Bryant, in his final season with the Lakers, was the highest paid at a salary of \(\$25\) million. Notice that the MVP Stephen Curry doesn’t appear among the top 10. He is quite a bit further down the list, as we will see later.

1.1. Named Arguments

The descending=True portion of this call expression is called a named argument. When a function or method is called, each argument has both a position and a name. Both are evident from the help text of a function or method.

help(nba.sort_values)
Help on method sort_values in module pandas.core.frame:

sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key: 'ValueKeyFunc' = None) method of pandas.core.frame.DataFrame instance
    Sort by the values along either axis.
    
    Parameters
    ----------
            by : str or list of str
                Name or list of names to sort by.
    
                - if `axis` is 0 or `'index'` then `by` may contain index
                  levels and/or column labels.
                - if `axis` is 1 or `'columns'` then `by` may contain column
                  levels and/or index labels.
    axis : {0 or 'index', 1 or 'columns'}, default 0
         Axis to be sorted.
    ascending : bool or list of bool, default True
         Sort ascending vs. descending. Specify list for multiple sort
         orders.  If this is a list of bools, must match the length of
         the by.
    inplace : bool, default False
         If True, perform operation in-place.
    kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
         Choice of sorting algorithm. See also ndarray.np.sort for more
         information.  `mergesort` is the only stable algorithm. For
         DataFrames, this option is only applied when sorting on a single
         column or label.
    na_position : {'first', 'last'}, default 'last'
         Puts NaNs at the beginning if `first`; `last` puts NaNs at the
         end.
    ignore_index : bool, default False
         If True, the resulting axis will be labeled 0, 1, …, n - 1.
    
         .. versionadded:: 1.0.0
    
    key : callable, optional
        Apply the key function to the values
        before sorting. This is similar to the `key` argument in the
        builtin :meth:`sorted` function, with the notable difference that
        this `key` function should be *vectorized*. It should expect a
        ``Series`` and return a Series with the same shape as the input.
        It will be applied to each column in `by` independently.
    
        .. versionadded:: 1.1.0
    
    Returns
    -------
    DataFrame or None
        DataFrame with sorted values or None if ``inplace=True``.
    
    See Also
    --------
    DataFrame.sort_index : Sort a DataFrame by the index.
    Series.sort_values : Similar method for a Series.
    
    Examples
    --------
    >>> df = pd.DataFrame({
    ...     'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    ...     'col2': [2, 1, 9, 8, 7, 4],
    ...     'col3': [0, 1, 9, 4, 2, 3],
    ...     'col4': ['a', 'B', 'c', 'D', 'e', 'F']
    ... })
    >>> df
      col1  col2  col3 col4
    0    A     2     0    a
    1    A     1     1    B
    2    B     9     9    c
    3  NaN     8     4    D
    4    D     7     2    e
    5    C     4     3    F
    
    Sort by col1
    
    >>> df.sort_values(by=['col1'])
      col1  col2  col3 col4
    0    A     2     0    a
    1    A     1     1    B
    2    B     9     9    c
    5    C     4     3    F
    4    D     7     2    e
    3  NaN     8     4    D
    
    Sort by multiple columns
    
    >>> df.sort_values(by=['col1', 'col2'])
      col1  col2  col3 col4
    1    A     1     1    B
    0    A     2     0    a
    2    B     9     9    c
    5    C     4     3    F
    4    D     7     2    e
    3  NaN     8     4    D
    
    Sort Descending
    
    >>> df.sort_values(by='col1', ascending=False)
      col1  col2  col3 col4
    4    D     7     2    e
    5    C     4     3    F
    2    B     9     9    c
    0    A     2     0    a
    1    A     1     1    B
    3  NaN     8     4    D
    
    Putting NAs first
    
    >>> df.sort_values(by='col1', ascending=False, na_position='first')
      col1  col2  col3 col4
    3  NaN     8     4    D
    4    D     7     2    e
    5    C     4     3    F
    2    B     9     9    c
    0    A     2     0    a
    1    A     1     1    B
    
    Sorting with a key function
    
    >>> df.sort_values(by='col4', key=lambda col: col.str.lower())
       col1  col2  col3 col4
    0    A     2     0    a
    1    A     1     1    B
    2    B     9     9    c
    3  NaN     8     4    D
    4    D     7     2    e
    5    C     4     3    F
    
    Natural sort with the key argument,
    using the `natsort <https://github.com/SethMMorton/natsort>` package.
    
    >>> df = pd.DataFrame({
    ...    "time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
    ...    "value": [10, 20, 30, 40, 50]
    ... })
    >>> df
        time  value
    0    0hr     10
    1  128hr     20
    2   72hr     30
    3   48hr     40
    4   96hr     50
    >>> from natsort import index_natsorted
    >>> df.sort_values(
    ...    by="time",
    ...    key=lambda x: np.argsort(index_natsorted(df["time"]))
    ... )
        time  value
    0    0hr     10
    3   48hr     40
    2   72hr     30
    4   96hr     50
    1  128hr     20

At the very top of this help text, the signature of the sort_value method appears:

sort_value(column_or_label, descending=False, distinct=False)

This describes the positions, names, and default values of the three arguments to sort_value. When calling this method, you can use either positional arguments or named arguments, so the following three calls do exactly the same thing.

sort_value('SALARY', True)
sort_value('SALARY', ascending=False)
sort_value(column_or_label='SALARY', ascending=False)

When an argument is simply True or False, it’s a useful convention to include the argument name so that it’s more obvious what the argument value means.