1. Sorting Rows¶
“The NBA is the highest paying professional sports league in the world,” reported CNN in March 2016. The table nba_salaries
contains the salaries of all National Basketball Association players in 2015-2016.
Each row represents one player. The columns are:
Column Label |
Description |
---|---|
|
Player’s name |
|
Player’s position on team |
|
Team name |
|
Player’s salary in 2015-2016, in millions of dollars |
The code for the positions is PG (Point Guard), SG (Shooting Guard), PF (Power Forward), SF (Small Forward), and C (Center). But what follows doesn’t involve details about how basketball is played.
The first row shows that Paul Millsap, Power Forward for the Atlanta Hawks, had a salary of almost \(\$18.7\) million in 2015-2016.
# This table can be found online: https://www.statcrunch.com/app/index.php?dataid=1843341
nba_salaries = pd.read_csv(path_data + 'nba_salaries.csv')
nba_salaries
PLAYER | POSITION | TEAM | '15-'16 SALARY | |
---|---|---|---|---|
0 | Paul Millsap | PF | Atlanta Hawks | 18.671659 |
1 | Al Horford | C | Atlanta Hawks | 12.000000 |
2 | Tiago Splitter | C | Atlanta Hawks | 9.756250 |
3 | Jeff Teague | PG | Atlanta Hawks | 8.000000 |
4 | Kyle Korver | SG | Atlanta Hawks | 5.746479 |
... | ... | ... | ... | ... |
412 | Gary Neal | PG | Washington Wizards | 2.139000 |
413 | DeJuan Blair | C | Washington Wizards | 2.000000 |
414 | Kelly Oubre Jr. | SF | Washington Wizards | 1.920240 |
415 | Garrett Temple | SG | Washington Wizards | 1.100602 |
416 | Jarell Eddie | SG | Washington Wizards | 0.561716 |
417 rows × 4 columns
The table contains 417 rows, one for each player. Only 10 of the rows are displayed. The show
method allows us to specify the number of rows, with the default (no specification) being all the rows of the table.
nba_salaries.head(3)
PLAYER | POSITION | TEAM | '15-'16 SALARY | |
---|---|---|---|---|
0 | Paul Millsap | PF | Atlanta Hawks | 18.671659 |
1 | Al Horford | C | Atlanta Hawks | 12.000000 |
2 | Tiago Splitter | C | Atlanta Hawks | 9.756250 |
Glance through about 20 rows or so, and you will see that the rows are in alphabetical order by team name. It’s also possible to list the same rows in alphabetical order by player name using the sort
method. The argument to sort
is a column label or index.
nba_salaries.sort_values('PLAYER').head(5)
PLAYER | POSITION | TEAM | '15-'16 SALARY | |
---|---|---|---|---|
68 | Aaron Brooks | PG | Chicago Bulls | 2.250000 |
291 | Aaron Gordon | PF | Orlando Magic | 4.171680 |
59 | Aaron Harrison | SG | Charlotte Hornets | 0.525093 |
235 | Adreian Payne | PF | Minnesota Timberwolves | 1.938840 |
1 | Al Horford | C | Atlanta Hawks | 12.000000 |
To examine the players’ salaries, it would be much more helpful if the data were ordered by salary.
To do this, we will first simplify the label of the column of salaries (just for convenience), and then sort by the new label SALARY
.
This arranges all the rows of the table in increasing order of salary, with the lowest salary appearing first. The output is a new table with the same columns as the original but with the rows rearranged.
nba = nba_salaries.rename(columns={"'15-'16 SALARY": 'SALARY'})
nba.sort_values('SALARY')
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
267 | Thanasis Antetokounmpo | SF | New York Knicks | 0.030888 |
327 | Cory Jefferson | PF | Phoenix Suns | 0.049709 |
326 | Jordan McRae | SG | Phoenix Suns | 0.049709 |
324 | Orlando Johnson | SG | Phoenix Suns | 0.055722 |
325 | Phil Pressey | PG | Phoenix Suns | 0.055722 |
... | ... | ... | ... | ... |
131 | Dwight Howard | C | Houston Rockets | 22.359364 |
255 | Carmelo Anthony | SF | New York Knicks | 22.875000 |
72 | LeBron James | SF | Cleveland Cavaliers | 22.970500 |
29 | Joe Johnson | SF | Brooklyn Nets | 24.894863 |
169 | Kobe Bryant | SF | Los Angeles Lakers | 25.000000 |
417 rows × 4 columns
These figures are somewhat difficult to compare as some of these players changed teams during the season and received salaries from more than one team; only the salary from the last team appears in the table. Point Guard Phil Pressey, for example, moved from Philadelphia to Phoenix during the year, and might be moving yet again to the Golden State Warriors.
The CNN report is about the other end of the salary scale – the players who are among the highest paid in the world.
To order the rows of the table in decreasing order of salary, we must use sort
with the option ascending=False
.
nba.sort_values('SALARY', ascending=False)
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
169 | Kobe Bryant | SF | Los Angeles Lakers | 25.000000 |
29 | Joe Johnson | SF | Brooklyn Nets | 24.894863 |
72 | LeBron James | SF | Cleveland Cavaliers | 22.970500 |
255 | Carmelo Anthony | SF | New York Knicks | 22.875000 |
131 | Dwight Howard | C | Houston Rockets | 22.359364 |
... | ... | ... | ... | ... |
200 | Elliot Williams | SG | Memphis Grizzlies | 0.055722 |
324 | Orlando Johnson | SG | Phoenix Suns | 0.055722 |
327 | Cory Jefferson | PF | Phoenix Suns | 0.049709 |
326 | Jordan McRae | SG | Phoenix Suns | 0.049709 |
267 | Thanasis Antetokounmpo | SF | New York Knicks | 0.030888 |
417 rows × 4 columns
Kobe Bryant, in his final season with the Lakers, was the highest paid at a salary of \(\$25\) million. Notice that the MVP Stephen Curry doesn’t appear among the top 10. He is quite a bit further down the list, as we will see later.
1.1. Named Arguments¶
The descending=True
portion of this call expression is called a named argument. When a function or method is called, each argument has both a position and a name. Both are evident from the help text of a function or method.
help(nba.sort_values)
Help on method sort_values in module pandas.core.frame:
sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key: 'ValueKeyFunc' = None) method of pandas.core.frame.DataFrame instance
Sort by the values along either axis.
Parameters
----------
by : str or list of str
Name or list of names to sort by.
- if `axis` is 0 or `'index'` then `by` may contain index
levels and/or column labels.
- if `axis` is 1 or `'columns'` then `by` may contain column
levels and/or index labels.
axis : {0 or 'index', 1 or 'columns'}, default 0
Axis to be sorted.
ascending : bool or list of bool, default True
Sort ascending vs. descending. Specify list for multiple sort
orders. If this is a list of bools, must match the length of
the by.
inplace : bool, default False
If True, perform operation in-place.
kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
Choice of sorting algorithm. See also ndarray.np.sort for more
information. `mergesort` is the only stable algorithm. For
DataFrames, this option is only applied when sorting on a single
column or label.
na_position : {'first', 'last'}, default 'last'
Puts NaNs at the beginning if `first`; `last` puts NaNs at the
end.
ignore_index : bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
.. versionadded:: 1.0.0
key : callable, optional
Apply the key function to the values
before sorting. This is similar to the `key` argument in the
builtin :meth:`sorted` function, with the notable difference that
this `key` function should be *vectorized*. It should expect a
``Series`` and return a Series with the same shape as the input.
It will be applied to each column in `by` independently.
.. versionadded:: 1.1.0
Returns
-------
DataFrame or None
DataFrame with sorted values or None if ``inplace=True``.
See Also
--------
DataFrame.sort_index : Sort a DataFrame by the index.
Series.sort_values : Similar method for a Series.
Examples
--------
>>> df = pd.DataFrame({
... 'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
... 'col2': [2, 1, 9, 8, 7, 4],
... 'col3': [0, 1, 9, 4, 2, 3],
... 'col4': ['a', 'B', 'c', 'D', 'e', 'F']
... })
>>> df
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
3 NaN 8 4 D
4 D 7 2 e
5 C 4 3 F
Sort by col1
>>> df.sort_values(by=['col1'])
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
5 C 4 3 F
4 D 7 2 e
3 NaN 8 4 D
Sort by multiple columns
>>> df.sort_values(by=['col1', 'col2'])
col1 col2 col3 col4
1 A 1 1 B
0 A 2 0 a
2 B 9 9 c
5 C 4 3 F
4 D 7 2 e
3 NaN 8 4 D
Sort Descending
>>> df.sort_values(by='col1', ascending=False)
col1 col2 col3 col4
4 D 7 2 e
5 C 4 3 F
2 B 9 9 c
0 A 2 0 a
1 A 1 1 B
3 NaN 8 4 D
Putting NAs first
>>> df.sort_values(by='col1', ascending=False, na_position='first')
col1 col2 col3 col4
3 NaN 8 4 D
4 D 7 2 e
5 C 4 3 F
2 B 9 9 c
0 A 2 0 a
1 A 1 1 B
Sorting with a key function
>>> df.sort_values(by='col4', key=lambda col: col.str.lower())
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
3 NaN 8 4 D
4 D 7 2 e
5 C 4 3 F
Natural sort with the key argument,
using the `natsort <https://github.com/SethMMorton/natsort>` package.
>>> df = pd.DataFrame({
... "time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
... "value": [10, 20, 30, 40, 50]
... })
>>> df
time value
0 0hr 10
1 128hr 20
2 72hr 30
3 48hr 40
4 96hr 50
>>> from natsort import index_natsorted
>>> df.sort_values(
... by="time",
... key=lambda x: np.argsort(index_natsorted(df["time"]))
... )
time value
0 0hr 10
3 48hr 40
2 72hr 30
4 96hr 50
1 128hr 20
At the very top of this help
text, the signature of the sort_value
method appears:
sort_value(column_or_label, descending=False, distinct=False)
This describes the positions, names, and default values of the three arguments to sort_value
. When calling this method, you can use either positional arguments or named arguments, so the following three calls do exactly the same thing.
sort_value('SALARY', True)
sort_value('SALARY', ascending=False)
sort_value(column_or_label='SALARY', ascending=False)
When an argument is simply True
or False
, it’s a useful convention to include the argument name so that it’s more obvious what the argument value means.