1. Sorting Rows¶

“The NBA is the highest paying professional sports league in the world,” reported CNN in March 2016. The table nba_salaries contains the salaries of all National Basketball Association players in 2015-2016.

Each row represents one player. The columns are:

Column Label	Description
`PLAYER`	Player’s name
`POSITION`	Player’s position on team
`TEAM`	Team name
`'15-'16 SALARY`	Player’s salary in 2015-2016, in millions of dollars

The code for the positions is PG (Point Guard), SG (Shooting Guard), PF (Power Forward), SF (Small Forward), and C (Center). But what follows doesn’t involve details about how basketball is played.

The first row shows that Paul Millsap, Power Forward for the Atlanta Hawks, had a salary of almost $\$18.7$ million in 2015-2016.

# This table can be found online: https://www.statcrunch.com/app/index.php?dataid=1843341
nba_salaries = pd.read_csv(path_data + 'nba_salaries.csv')
nba_salaries

	PLAYER	POSITION	TEAM	'15-'16 SALARY
0	Paul Millsap	PF	Atlanta Hawks	18.671659
1	Al Horford	C	Atlanta Hawks	12.000000
2	Tiago Splitter	C	Atlanta Hawks	9.756250
3	Jeff Teague	PG	Atlanta Hawks	8.000000
4	Kyle Korver	SG	Atlanta Hawks	5.746479
...	...	...	...	...
412	Gary Neal	PG	Washington Wizards	2.139000
413	DeJuan Blair	C	Washington Wizards	2.000000
414	Kelly Oubre Jr.	SF	Washington Wizards	1.920240
415	Garrett Temple	SG	Washington Wizards	1.100602
416	Jarell Eddie	SG	Washington Wizards	0.561716

417 rows × 4 columns

The table contains 417 rows, one for each player. Only 10 of the rows are displayed. The show method allows us to specify the number of rows, with the default (no specification) being all the rows of the table.

nba_salaries.head(3)

	PLAYER	POSITION	TEAM	'15-'16 SALARY
0	Paul Millsap	PF	Atlanta Hawks	18.671659
1	Al Horford	C	Atlanta Hawks	12.000000
2	Tiago Splitter	C	Atlanta Hawks	9.756250

Glance through about 20 rows or so, and you will see that the rows are in alphabetical order by team name. It’s also possible to list the same rows in alphabetical order by player name using the sort method. The argument to sort is a column label or index.

nba_salaries.sort_values('PLAYER').head(5)

	PLAYER	POSITION	TEAM	'15-'16 SALARY
68	Aaron Brooks	PG	Chicago Bulls	2.250000
291	Aaron Gordon	PF	Orlando Magic	4.171680
59	Aaron Harrison	SG	Charlotte Hornets	0.525093
235	Adreian Payne	PF	Minnesota Timberwolves	1.938840
1	Al Horford	C	Atlanta Hawks	12.000000

To examine the players’ salaries, it would be much more helpful if the data were ordered by salary.

To do this, we will first simplify the label of the column of salaries (just for convenience), and then sort by the new label SALARY.

This arranges all the rows of the table in increasing order of salary, with the lowest salary appearing first. The output is a new table with the same columns as the original but with the rows rearranged.

nba = nba_salaries.rename(columns={"'15-'16 SALARY": 'SALARY'})
nba.sort_values('SALARY')

	PLAYER	POSITION	TEAM	SALARY
267	Thanasis Antetokounmpo	SF	New York Knicks	0.030888
327	Cory Jefferson	PF	Phoenix Suns	0.049709
326	Jordan McRae	SG	Phoenix Suns	0.049709
324	Orlando Johnson	SG	Phoenix Suns	0.055722
325	Phil Pressey	PG	Phoenix Suns	0.055722
...	...	...	...	...
131	Dwight Howard	C	Houston Rockets	22.359364
255	Carmelo Anthony	SF	New York Knicks	22.875000
72	LeBron James	SF	Cleveland Cavaliers	22.970500
29	Joe Johnson	SF	Brooklyn Nets	24.894863
169	Kobe Bryant	SF	Los Angeles Lakers	25.000000

417 rows × 4 columns

These figures are somewhat difficult to compare as some of these players changed teams during the season and received salaries from more than one team; only the salary from the last team appears in the table. Point Guard Phil Pressey, for example, moved from Philadelphia to Phoenix during the year, and might be moving yet again to the Golden State Warriors.

The CNN report is about the other end of the salary scale – the players who are among the highest paid in the world.

To order the rows of the table in decreasing order of salary, we must use sort with the option ascending=False.

nba.sort_values('SALARY', ascending=False)

	PLAYER	POSITION	TEAM	SALARY
169	Kobe Bryant	SF	Los Angeles Lakers	25.000000
29	Joe Johnson	SF	Brooklyn Nets	24.894863
72	LeBron James	SF	Cleveland Cavaliers	22.970500
255	Carmelo Anthony	SF	New York Knicks	22.875000
131	Dwight Howard	C	Houston Rockets	22.359364
...	...	...	...	...
200	Elliot Williams	SG	Memphis Grizzlies	0.055722
324	Orlando Johnson	SG	Phoenix Suns	0.055722
327	Cory Jefferson	PF	Phoenix Suns	0.049709
326	Jordan McRae	SG	Phoenix Suns	0.049709
267	Thanasis Antetokounmpo	SF	New York Knicks	0.030888

417 rows × 4 columns

Kobe Bryant, in his final season with the Lakers, was the highest paid at a salary of $\$25$ million. Notice that the MVP Stephen Curry doesn’t appear among the top 10. He is quite a bit further down the list, as we will see later.

1.1. Named Arguments¶

The descending=True portion of this call expression is called a named argument. When a function or method is called, each argument has both a position and a name. Both are evident from the help text of a function or method.

help(nba.sort_values)

Help on method sort_values in module pandas.core.frame:

sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key: 'ValueKeyFunc' = None) method of pandas.core.frame.DataFrame instance
    Sort by the values along either axis.
    
    Parameters
    ----------
            by : str or list of str
                Name or list of names to sort by.
    
                - if `axis` is 0 or `'index'` then `by` may contain index
                  levels and/or column labels.
                - if `axis` is 1 or `'columns'` then `by` may contain column
                  levels and/or index labels.
    axis : {0 or 'index', 1 or 'columns'}, default 0
         Axis to be sorted.
    ascending : bool or list of bool, default True
         Sort ascending vs. descending. Specify list for multiple sort
         orders.  If this is a list of bools, must match the length of
         the by.
    inplace : bool, default False
         If True, perform operation in-place.
    kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
         Choice of sorting algorithm. See also ndarray.np.sort for more
         information.  `mergesort` is the only stable algorithm. For
         DataFrames, this option is only applied when sorting on a single
         column or label.
    na_position : {'first', 'last'}, default 'last'
         Puts NaNs at the beginning if `first`; `last` puts NaNs at the
         end.
    ignore_index : bool, default False
         If True, the resulting axis will be labeled 0, 1, …, n - 1.
    
         .. versionadded:: 1.0.0
    
    key : callable, optional
        Apply the key function to the values
        before sorting. This is similar to the `key` argument in the
        builtin :meth:`sorted` function, with the notable difference that
        this `key` function should be *vectorized*. It should expect a
        ``Series`` and return a Series with the same shape as the input.
        It will be applied to each column in `by` independently.
    
        .. versionadded:: 1.1.0
    
    Returns
    -------
    DataFrame or None
        DataFrame with sorted values or None if ``inplace=True``.
    
    See Also
    --------
    DataFrame.sort_index : Sort a DataFrame by the index.
    Series.sort_values : Similar method for a Series.
    
    Examples
    --------
    >>> df = pd.DataFrame({
    ...     'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    ...     'col2': [2, 1, 9, 8, 7, 4],
    ...     'col3': [0, 1, 9, 4, 2, 3],
    ...     'col4': ['a', 'B', 'c', 'D', 'e', 'F']
    ... })
    >>> df
      col1  col2  col3 col4
    0    A     2     0    a
    1    A     1     1    B
    2    B     9     9    c
    3  NaN     8     4    D
    4    D     7     2    e
    5    C     4     3    F
    
    Sort by col1
    
    >>> df.sort_values(by=['col1'])
      col1  col2  col3 col4
    0    A     2     0    a
    1    A     1     1    B
    2    B     9     9    c
    5    C     4     3    F
    4    D     7     2    e
    3  NaN     8     4    D
    
    Sort by multiple columns
    
    >>> df.sort_values(by=['col1', 'col2'])
      col1  col2  col3 col4
    1    A     1     1    B
    0    A     2     0    a
    2    B     9     9    c
    5    C     4     3    F
    4    D     7     2    e
    3  NaN     8     4    D
    
    Sort Descending
    
    >>> df.sort_values(by='col1', ascending=False)
      col1  col2  col3 col4
    4    D     7     2    e
    5    C     4     3    F
    2    B     9     9    c
    0    A     2     0    a
    1    A     1     1    B
    3  NaN     8     4    D
    
    Putting NAs first
    
    >>> df.sort_values(by='col1', ascending=False, na_position='first')
      col1  col2  col3 col4
    3  NaN     8     4    D
    4    D     7     2    e
    5    C     4     3    F
    2    B     9     9    c
    0    A     2     0    a
    1    A     1     1    B
    
    Sorting with a key function
    
    >>> df.sort_values(by='col4', key=lambda col: col.str.lower())
       col1  col2  col3 col4
    0    A     2     0    a
    1    A     1     1    B
    2    B     9     9    c
    3  NaN     8     4    D
    4    D     7     2    e
    5    C     4     3    F
    
    Natural sort with the key argument,
    using the `natsort <https://github.com/SethMMorton/natsort>` package.
    
    >>> df = pd.DataFrame({
    ...    "time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
    ...    "value": [10, 20, 30, 40, 50]
    ... })
    >>> df
        time  value
    0    0hr     10
    1  128hr     20
    2   72hr     30
    3   48hr     40
    4   96hr     50
    >>> from natsort import index_natsorted
    >>> df.sort_values(
    ...    by="time",
    ...    key=lambda x: np.argsort(index_natsorted(df["time"]))
    ... )
        time  value
    0    0hr     10
    3   48hr     40
    2   72hr     30
    4   96hr     50
    1  128hr     20

At the very top of this help text, the signature of the sort_value method appears:

sort_value(column_or_label, descending=False, distinct=False)

This describes the positions, names, and default values of the three arguments to sort_value. When calling this method, you can use either positional arguments or named arguments, so the following three calls do exactly the same thing.

sort_value('SALARY', True)
sort_value('SALARY', ascending=False)
sort_value(column_or_label='SALARY', ascending=False)

When an argument is simply True or False, it’s a useful convention to include the argument name so that it’s more obvious what the argument value means.

Fundamentals of Data Science

1. Sorting Rows¶

1.1. Named Arguments¶