Pandas: Groupby

Pandas: Groupby#

groupby is an amazingly powerful function in pandas. But it is also complicated to use and understand. The point of this lesson is to make you feel confident in using groupby and its cousins, resample and rolling.

These notes are loosely based on the Pandas GroupBy Documentation.

The “split/apply/combine” concept was first introduced in a paper by Hadley Wickham: https://www.jstatsoft.org/article/view/v040i01.

Imports:

import numpy as np
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (12,7)
%matplotlib inline
import pandas as pd

First we read the Earthquake data from our previous assignment:

df = pd.read_csv('https://raw.githubusercontent.com/earth-DS-ML/summer_2025/refs/heads/main/lectures_DS/data/usgs_earthquakes_2025.csv', parse_dates=['time'], index_col='id')
df['country'] = df.place.str.split(', ').str[-1]
df_small = df[df.mag<4]
df = df[df.mag>4]
df.head()

	time	latitude	longitude	depth	mag	magType	nst	gap	dmin	rms	...	place	type	horizontalError	depthError	magError	magNst	status	locationSource	magSource	country
id
us6000qkt1	2025-06-17 12:00:22.773000+00:00	-23.1645	-175.3549	10.000	4.9	mb	22.0	166.0	6.481	0.49	...	206 km SSW of ‘Ohonua, Tonga	earthquake	15.85	1.938	0.093	36.0	reviewed	us	us	Tonga
us6000qksf	2025-06-17 09:16:30.483000+00:00	-23.1181	-174.9129	10.000	4.8	mb	16.0	111.0	6.151	0.48	...	196 km S of ‘Ohonua, Tonga	earthquake	14.94	1.939	0.148	14.0	reviewed	us	us	Tonga
us6000qks2	2025-06-17 08:36:22.986000+00:00	8.2769	126.8035	34.840	5.3	mww	100.0	67.0	1.706	1.23	...	42 km ENE of Barcelona, Philippines	earthquake	7.91	3.994	0.062	25.0	reviewed	us	us	Philippines
us6000qkrz	2025-06-17 08:13:19.239000+00:00	50.6542	156.7597	91.831	4.7	mb	55.0	152.0	2.526	0.64	...	44 km E of Severo-Kuril’sk, Russia	earthquake	10.38	7.726	0.032	287.0	reviewed	us	us	Russia
us6000qks0	2025-06-17 08:12:55.629000+00:00	-32.8812	-13.2948	10.000	5.0	mb	44.0	59.0	4.254	0.91	...	southern Mid-Atlantic Ridge	earthquake	10.55	1.837	0.076	55.0	reviewed	us	us	southern Mid-Atlantic Ridge

5 rows × 22 columns

len(df)

df.mag.plot()

<Axes: xlabel='id'>

../../_images/a59f63d3c0ccc06370708c698b3d20069bd1ae57e2530e58a396d7011dc3e564.png

df.mag.plot.hist()

<Axes: ylabel='Frequency'>

../../_images/568c53a890d24593897e4f8ba459e71b79fd874b14c7af12db5389b7fc95f021.png

An Example#

This is an example of a “one-liner” that you can accomplish with groupby.

df.groupby('country').mag.count().nlargest(20).plot(kind='bar', figsize=(12,6))

<Axes: xlabel='country'>

../../_images/0510944104a9b737b449a64b1b72000019c5c9c23c60b9b71d4873ad888e5fef.png

df_small.groupby('country').mag.count().nlargest(20).plot(kind='bar', figsize=(12,6))

<Axes: xlabel='country'>

../../_images/40acd377422e69b72e0ef6d2dab0302f459c3f0b8b9981854020cef1a0535577.png

What Happened?#

Let’s break apart this operation a bit. The workflow with groubpy can be divided into three general steps:

Split: Partition the data into different groups based on some criterion.
Apply: Do some caclulation within each group. Different types of “apply” steps might be
Aggregation: Get the mean or max within the group.
Transformation: Normalize all the values within a group
Filtration: Eliminate some groups based on a criterion.
Combine: Put the results back together into a single object.

split-apply-combine

The `groupby` method#

Both Series and DataFrame objects have a groupby method. It accepts a variety of arguments, but the simplest way to think about it is that you pass another series, whose unique values are used to split the original object into different groups.

via https://medium.com/analytics-vidhya/split-apply-combine-strategy-for-data-mining-4fd6e2a0cc99

df.head()

	time	latitude	longitude	depth	mag	magType	nst	gap	dmin	rms	...	place	type	horizontalError	depthError	magError	magNst	status	locationSource	magSource	country
id
us6000qkt1	2025-06-17 12:00:22.773000+00:00	-23.1645	-175.3549	10.000	4.9	mb	22.0	166.0	6.481	0.49	...	206 km SSW of ‘Ohonua, Tonga	earthquake	15.85	1.938	0.093	36.0	reviewed	us	us	Tonga
us6000qksf	2025-06-17 09:16:30.483000+00:00	-23.1181	-174.9129	10.000	4.8	mb	16.0	111.0	6.151	0.48	...	196 km S of ‘Ohonua, Tonga	earthquake	14.94	1.939	0.148	14.0	reviewed	us	us	Tonga
us6000qks2	2025-06-17 08:36:22.986000+00:00	8.2769	126.8035	34.840	5.3	mww	100.0	67.0	1.706	1.23	...	42 km ENE of Barcelona, Philippines	earthquake	7.91	3.994	0.062	25.0	reviewed	us	us	Philippines
us6000qkrz	2025-06-17 08:13:19.239000+00:00	50.6542	156.7597	91.831	4.7	mb	55.0	152.0	2.526	0.64	...	44 km E of Severo-Kuril’sk, Russia	earthquake	10.38	7.726	0.032	287.0	reviewed	us	us	Russia
us6000qks0	2025-06-17 08:12:55.629000+00:00	-32.8812	-13.2948	10.000	5.0	mb	44.0	59.0	4.254	0.91	...	southern Mid-Atlantic Ridge	earthquake	10.55	1.837	0.076	55.0	reviewed	us	us	southern Mid-Atlantic Ridge

5 rows × 22 columns

df.country

id
us6000qkt1                            Tonga
us6000qksf                            Tonga
us6000qks2                      Philippines
us6000qkrz                           Russia
us6000qks0      southern Mid-Atlantic Ridge
                          ...              
us6000pll8                         Ethiopia
us6000pj3b                            Libya
us6000pj2q                            Tonga
us6000pll7    south of the Kermadec Islands
us6000plld                         Ethiopia
Name: country, Length: 5923, dtype: object

df.groupby(df.country)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ba4eb7358e0>

There is a shortcut for doing this with dataframes: you just pass the column name:

df.groupby('country')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ba4eb729040>

The `GroubBy` object#

When we call, groupby we get back a GroupBy object:

gb = df.groupby('country')
gb

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ba4eb729b20>

The length tells us how many groups were found:

len(gb)

All of the groups are available as a dictionary via the .groups attribute:

groups = gb.groups
len(groups)

groups.keys()

dict_keys(['2025 Drake Passage Earthquake', 'Afghanistan', 'Alaska', 'Algeria', 'Anguilla', 'Antigua and Barbuda', 'Arctic Ocean', 'Argentina', 'Armenia', 'Ascension Island region', 'Australia', 'Azerbaijan', 'Azores Islands region', 'Bahamas', 'Balleny Islands region', 'Banda Sea', 'Barbados', 'Bolivia', 'Bosnia and Herzegovina', 'Brazil', 'Burma', 'Burma (Myanmar)', 'Burma (Myanmar) Earthquake', 'CA', 'Canada', 'Carlsberg Ridge', 'Cayman Islands', 'Chad', 'Chagos Archipelago region', 'Chile', 'China', 'Colombia', 'Costa Rica', 'Croatia', 'Democratic Republic of the Congo', 'Dominica', 'Dominican Republic', 'Drake Passage', 'Easter Island region', 'Ecuador', 'Ecuador region', 'El Salvador', 'Eritrea', 'Ethiopia', 'Federated States of Micronesia', 'Federated States of Micronesia region', 'Fiji', 'Fiji region', 'Georgia', 'Greece', 'Greenland Sea', 'Guadeloupe', 'Guam', 'Guatemala', 'Hawaii', 'Honduras', 'Iceland', 'Iceland region', 'Idaho', 'India', 'India region', 'Indian Ocean Triple Junction', 'Indonesia', 'Iran', 'Italy', 'Japan', 'Japan region', 'Kazakhstan', 'Kermadec Islands region', 'Kuril Islands', 'Kuwait', 'Kyrgyzstan', 'Libya', 'MX', 'Macquarie Island region', 'Madagascar', 'Malawi', 'Mariana Islands region', 'Martinique', 'Mauritius', 'Mauritius - Reunion region', 'Mexico', 'Micronesia', 'Mid-Indian Ridge', 'Mongolia', 'Montenegro', 'Morocco', 'Nepal', 'New Caledonia', 'New Mexico', 'New Zealand', 'Nicaragua', 'North Atlantic Ocean', 'North Korea', 'Northern Mariana Islands', 'Norway', 'Norwegian Sea', 'Oman', 'Oregon', 'Owen Fracture Zone region', 'Pacific-Antarctic Ridge', 'Pakistan', 'Palau', 'Panama', 'Papua New Guinea', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Prince Edward Islands region', 'Puerto Rico', 'Qatar', 'Revilla Gigedo Islands region', 'Reykjanes Ridge', 'Romania', 'Russia', 'Rwanda', 'Saint Helena', 'Saint Kitts and Nevis', 'Samoa', 'Saudi Arabia', 'Scotia Sea', 'Solomon Islands', 'Somalia', 'South Africa', 'South Atlantic Ocean', 'South Georgia Island region', 'South Indian Ocean', 'South Sandwich Islands region', 'South Shetland Islands', 'Southern Tibetan Plateau', 'Southwest Indian Ridge', 'Svalbard and Jan Mayen', 'Taiwan', 'Tajikistan', 'Tanzania', 'Tennessee', 'Texas', 'Thailand', 'Timor Leste', 'Tonga', 'Tristan da Cunha region', 'Tunisia', 'Turkey', 'U.S. Virgin Islands', 'Uganda', 'Ukraine', 'Uzbekistan', 'Vanuatu', 'Vanuatu region', 'Venezuela', 'Vietnam', 'Wallis and Futuna', 'Washington', 'West Chile Rise', 'Western Caribbean Sea', 'Xizang-Qinghai border region', 'Yemen', 'Zimbabwe', 'central East Pacific Rise', 'central Mid-Atlantic Ridge', 'east central Pacific Ocean', 'east of Severnaya Zemlya', 'east of the Kuril Islands', 'east of the South Sandwich Islands', 'north of Ascension Island', 'north of Franz Josef Land', 'north of Severnaya Zemlya', 'north of Svalbard', 'northern East Pacific Rise', 'northern Mid-Atlantic Ridge', 'northwest of New Zealand', 'northwest of the Kuril Islands', 'off the coast of Central America', 'off the coast of Ecuador', 'off the coast of Oregon', 'south of Africa', 'south of Panama', 'south of Tonga', 'south of the Fiji Islands', 'south of the Kermadec Islands', 'southeast Indian Ridge', 'southeast central Pacific Ocean', 'southeast of Easter Island', 'southeast of the Loyalty Islands', 'southern East Pacific Rise', 'southern Mid-Atlantic Ridge', 'southern Pacific Ocean', 'southwest of Africa', 'west of Macquarie Island', 'west of Vancouver Island', 'west of the Galapagos Islands', 'western Indian-Antarctic Ridge', 'western Xizang'])

list(groups.keys())

['2025 Drake Passage Earthquake',
 'Afghanistan',
 'Alaska',
 'Algeria',
 'Anguilla',
 'Antigua and Barbuda',
 'Arctic Ocean',
 'Argentina',
 'Armenia',
 'Ascension Island region',
 'Australia',
 'Azerbaijan',
 'Azores Islands region',
 'Bahamas',
 'Balleny Islands region',
 'Banda Sea',
 'Barbados',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Brazil',
 'Burma',
 'Burma (Myanmar)',
 'Burma (Myanmar) Earthquake',
 'CA',
 'Canada',
 'Carlsberg Ridge',
 'Cayman Islands',
 'Chad',
 'Chagos Archipelago region',
 'Chile',
 'China',
 'Colombia',
 'Costa Rica',
 'Croatia',
 'Democratic Republic of the Congo',
 'Dominica',
 'Dominican Republic',
 'Drake Passage',
 'Easter Island region',
 'Ecuador',
 'Ecuador region',
 'El Salvador',
 'Eritrea',
 'Ethiopia',
 'Federated States of Micronesia',
 'Federated States of Micronesia region',
 'Fiji',
 'Fiji region',
 'Georgia',
 'Greece',
 'Greenland Sea',
 'Guadeloupe',
 'Guam',
 'Guatemala',
 'Hawaii',
 'Honduras',
 'Iceland',
 'Iceland region',
 'Idaho',
 'India',
 'India region',
 'Indian Ocean Triple Junction',
 'Indonesia',
 'Iran',
 'Italy',
 'Japan',
 'Japan region',
 'Kazakhstan',
 'Kermadec Islands region',
 'Kuril Islands',
 'Kuwait',
 'Kyrgyzstan',
 'Libya',
 'MX',
 'Macquarie Island region',
 'Madagascar',
 'Malawi',
 'Mariana Islands region',
 'Martinique',
 'Mauritius',
 'Mauritius - Reunion region',
 'Mexico',
 'Micronesia',
 'Mid-Indian Ridge',
 'Mongolia',
 'Montenegro',
 'Morocco',
 'Nepal',
 'New Caledonia',
 'New Mexico',
 'New Zealand',
 'Nicaragua',
 'North Atlantic Ocean',
 'North Korea',
 'Northern Mariana Islands',
 'Norway',
 'Norwegian Sea',
 'Oman',
 'Oregon',
 'Owen Fracture Zone region',
 'Pacific-Antarctic Ridge',
 'Pakistan',
 'Palau',
 'Panama',
 'Papua New Guinea',
 'Peru',
 'Philippines',
 'Poland',
 'Portugal',
 'Prince Edward Islands region',
 'Puerto Rico',
 'Qatar',
 'Revilla Gigedo Islands region',
 'Reykjanes Ridge',
 'Romania',
 'Russia',
 'Rwanda',
 'Saint Helena',
 'Saint Kitts and Nevis',
 'Samoa',
 'Saudi Arabia',
 'Scotia Sea',
 'Solomon Islands',
 'Somalia',
 'South Africa',
 'South Atlantic Ocean',
 'South Georgia Island region',
 'South Indian Ocean',
 'South Sandwich Islands region',
 'South Shetland Islands',
 'Southern Tibetan Plateau',
 'Southwest Indian Ridge',
 'Svalbard and Jan Mayen',
 'Taiwan',
 'Tajikistan',
 'Tanzania',
 'Tennessee',
 'Texas',
 'Thailand',
 'Timor Leste',
 'Tonga',
 'Tristan da Cunha region',
 'Tunisia',
 'Turkey',
 'U.S. Virgin Islands',
 'Uganda',
 'Ukraine',
 'Uzbekistan',
 'Vanuatu',
 'Vanuatu region',
 'Venezuela',
 'Vietnam',
 'Wallis and Futuna',
 'Washington',
 'West Chile Rise',
 'Western Caribbean Sea',
 'Xizang-Qinghai border region',
 'Yemen',
 'Zimbabwe',
 'central East Pacific Rise',
 'central Mid-Atlantic Ridge',
 'east central Pacific Ocean',
 'east of Severnaya Zemlya',
 'east of the Kuril Islands',
 'east of the South Sandwich Islands',
 'north of Ascension Island',
 'north of Franz Josef Land',
 'north of Severnaya Zemlya',
 'north of Svalbard',
 'northern East Pacific Rise',
 'northern Mid-Atlantic Ridge',
 'northwest of New Zealand',
 'northwest of the Kuril Islands',
 'off the coast of Central America',
 'off the coast of Ecuador',
 'off the coast of Oregon',
 'south of Africa',
 'south of Panama',
 'south of Tonga',
 'south of the Fiji Islands',
 'south of the Kermadec Islands',
 'southeast Indian Ridge',
 'southeast central Pacific Ocean',
 'southeast of Easter Island',
 'southeast of the Loyalty Islands',
 'southern East Pacific Rise',
 'southern Mid-Atlantic Ridge',
 'southern Pacific Ocean',
 'southwest of Africa',
 'west of Macquarie Island',
 'west of Vancouver Island',
 'west of the Galapagos Islands',
 'western Indian-Antarctic Ridge',
 'western Xizang']

Iterating and selecting groups#

You can loop through the groups if you want.

a = [1,2,3]

[1, 2, 3]

a[0]

b = {'key1': 1, 'key2':2, 'key3':4
    }

{'key1': 1, 'key2': 2, 'key3': 4}

b['key1']

groups['south of Africa']

Index(['us6000qjkh', 'us7000pp10', 'us6000pzrv', 'us6000pzl1', 'us6000pvw6'], dtype='object', name='id')

groups['Afghanistan']

Index(['us6000qkkm', 'us6000qjhm', 'us6000qiif', 'us6000qhyi', 'us6000qh0s',
       'us7000q1by', 'us7000q0z5', 'us7000q0aw', 'us7000q05h', 'us7000pzz5',
       'us7000pyne', 'us7000pydu', 'us7000px78', 'us7000pwle', 'us7000pve9',
       'us7000pvcr', 'us7000pv82', 'us7000pv7x', 'us7000pv02', 'us6000q76g',
       'us6000q6fr', 'us7000pw58', 'us6000q6c3', 'us7000ppcq', 'us7000pni6',
       'us7000pni0', 'us7000pne1', 'us7000pn0e', 'us7000pmym', 'us7000plhv',
       'us7000plec', 'us6000pzkp', 'us6000pxul', 'us6000pwru', 'us6000pxge',
       'us6000pvsk', 'us7000pflg', 'us7000pfcc', 'us7000pfc9', 'us7000pefx',
       'us7000pd3k', 'us7000pchv', 'us7000pb68', 'us7000pb30', 'us6000pmd7',
       'us6000plpp', 'us6000pk88', 'us6000pk23', 'us6000pjtx', 'us6000pjpy',
       'us6000pjlp'],
      dtype='object', name='id')

for key, group in gb:
    display(group.head())
    print(f'The key is "{key}"')
    break

	time	latitude	longitude	depth	mag	magType	nst	gap	dmin	rms	...	place	type	horizontalError	depthError	magError	magNst	status	locationSource	magSource	country
id
us7000pwkn	2025-05-02 12:58:26.014000+00:00	-56.8094	-68.1019	10.0	7.4	mww	284.0	20.0	1.9	0.75	...	2025 Drake Passage Earthquake	earthquake	7.58	1.441	0.038	68.0	reviewed	us	us	2025 Drake Passage Earthquake

1 rows × 22 columns

The key is "2025 Drake Passage Earthquake"

n = 0
for key, group in gb:
    display(group.head())
    print(f'The key is "{key}"')
    n= n+1
    if n>2:
        break

	time	latitude	longitude	depth	mag	magType	nst	gap	dmin	rms	...	place	type	horizontalError	depthError	magError	magNst	status	locationSource	magSource	country
id
us7000pwkn	2025-05-02 12:58:26.014000+00:00	-56.8094	-68.1019	10.0	7.4	mww	284.0	20.0	1.9	0.75	...	2025 Drake Passage Earthquake	earthquake	7.58	1.441	0.038	68.0	reviewed	us	us	2025 Drake Passage Earthquake

1 rows × 22 columns

The key is "2025 Drake Passage Earthquake"

	time	latitude	longitude	depth	mag	magType	nst	gap	dmin	rms	...	place	type	horizontalError	depthError	magError	magNst	status	locationSource	magSource	country
id
us6000qkkm	2025-06-16 10:34:13.071000+00:00	36.5269	70.2915	197.467	4.4	mb	38.0	89.0	2.218	0.85	...	39 km E of Farkhār, Afghanistan	earthquake	7.68	9.861	0.130	17.0	reviewed	us	us	Afghanistan
us6000qjhm	2025-06-11 04:45:23.745000+00:00	36.5328	70.6591	215.471	4.4	mb	72.0	139.0	2.295	0.88	...	40 km SSW of Jurm, Afghanistan	earthquake	9.25	7.628	0.058	85.0	reviewed	us	us	Afghanistan
us6000qiif	2025-06-06 19:26:58.547000+00:00	36.6295	68.0963	37.077	4.3	mb	69.0	124.0	2.192	0.57	...	36 km ESE of Khulm, Afghanistan	earthquake	8.13	9.473	0.068	61.0	reviewed	us	us	Afghanistan
us6000qhyi	2025-06-04 19:29:30.379000+00:00	36.5033	71.1492	229.220	4.2	mb	69.0	79.0	2.096	0.80	...	39 km WSW of Ashkāsham, Afghanistan	earthquake	7.97	7.099	0.083	41.0	reviewed	us	us	Afghanistan
us6000qh0s	2025-05-31 02:45:35.975000+00:00	35.8044	71.1038	83.711	4.3	mb	31.0	98.0	2.063	1.28	...	45 km NNE of Pārūn, Afghanistan	earthquake	8.31	7.823	0.154	12.0	reviewed	us	us	Afghanistan

5 rows × 22 columns

The key is "Afghanistan"

	time	latitude	longitude	depth	mag	magType	nst	gap	dmin	rms	...	place	type	horizontalError	depthError	magError	magNst	status	locationSource	magSource	country
id
ak0257l477bi	2025-06-14 14:32:06.018000+00:00	65.7762	-145.7213	10.800	4.7	ml	NaN	NaN	NaN	0.80	...	47 km WNW of Central, Alaska	earthquake	NaN	0.200	NaN	NaN	reviewed	ak	ak	Alaska
ak0257jjjt29	2025-06-13 19:17:06.365000+00:00	60.1166	-153.4203	159.700	4.2	ml	NaN	NaN	NaN	0.67	...	50 km E of Port Alsworth, Alaska	earthquake	NaN	0.300	NaN	NaN	reviewed	ak	ak	Alaska
us6000qjn2	2025-06-11 20:36:16.744000+00:00	51.3129	-176.2917	27.118	4.7	mb	61.0	177.0	0.761	1.07	...	66 km SSE of Adak, Alaska	earthquake	6.3	6.568	0.039	206.0	reviewed	us	us	Alaska
ak0257egu4bq	2025-06-10 12:28:19.795000+00:00	61.9075	-150.9539	15.200	4.1	ml	NaN	NaN	NaN	0.93	...	25 km ESE of Skwentna, Alaska	earthquake	NaN	0.100	NaN	NaN	reviewed	ak	ak	Alaska
us6000qim5	2025-06-07 04:00:30.558000+00:00	51.6019	-171.8916	35.000	4.3	mb	52.0	180.0	1.550	0.95	...	172 km ESE of Atka, Alaska	earthquake	8.9	1.974	0.047	127.0	reviewed	us	us	Alaska

5 rows × 22 columns

The key is "Alaska"

And you can get a specific group by key.

gb.get_group('Chile').head()

	time	latitude	longitude	depth	mag	magType	nst	gap	dmin	rms	...	place	type	horizontalError	depthError	magError	magNst	status	locationSource	magSource	country
id
us6000qkjt	2025-06-16 08:19:27.151000+00:00	-22.5784	-67.9279	172.967	4.4	mb	41.0	49.0	0.438	0.61	...	46 km NE of San Pedro de Atacama, Chile	earthquake	5.71	7.666	0.099	29.0	reviewed	us	us	Chile
us6000qkhr	2025-06-16 00:51:17.310000+00:00	-20.3278	-70.2485	41.960	4.4	mb	19.0	172.0	1.007	0.56	...	13 km WSW of La Tirana, Chile	earthquake	4.55	10.951	0.309	3.0	reviewed	us	us	Chile
us6000qkfl	2025-06-15 13:30:08.837000+00:00	-22.3399	-68.6815	117.138	4.3	mb	28.0	46.0	0.750	1.13	...	28 km ENE of Calama, Chile	earthquake	4.22	6.238	0.154	14.0	reviewed	us	us	Chile
us6000qjpw	2025-06-11 23:53:41.988000+00:00	-21.6724	-69.1225	98.543	4.1	mb	14.0	88.0	0.166	0.71	...	89 km NNW of Calama, Chile	earthquake	7.11	10.366	0.234	5.0	reviewed	us	us	Chile
us6000qj2v	2025-06-09 00:59:17.293000+00:00	-32.2490	-71.0761	83.965	4.1	mb	28.0	131.0	0.222	0.19	...	26 km NNE of La Ligua, Chile	earthquake	4.23	6.102	0.198	7.0	reviewed	us	us	Chile

5 rows × 22 columns

gb.get_group('Afghanistan').head()

	time	latitude	longitude	depth	mag	magType	nst	gap	dmin	rms	...	place	type	horizontalError	depthError	magError	magNst	status	locationSource	magSource	country
id
us6000qkkm	2025-06-16 10:34:13.071000+00:00	36.5269	70.2915	197.467	4.4	mb	38.0	89.0	2.218	0.85	...	39 km E of Farkhār, Afghanistan	earthquake	7.68	9.861	0.130	17.0	reviewed	us	us	Afghanistan
us6000qjhm	2025-06-11 04:45:23.745000+00:00	36.5328	70.6591	215.471	4.4	mb	72.0	139.0	2.295	0.88	...	40 km SSW of Jurm, Afghanistan	earthquake	9.25	7.628	0.058	85.0	reviewed	us	us	Afghanistan
us6000qiif	2025-06-06 19:26:58.547000+00:00	36.6295	68.0963	37.077	4.3	mb	69.0	124.0	2.192	0.57	...	36 km ESE of Khulm, Afghanistan	earthquake	8.13	9.473	0.068	61.0	reviewed	us	us	Afghanistan
us6000qhyi	2025-06-04 19:29:30.379000+00:00	36.5033	71.1492	229.220	4.2	mb	69.0	79.0	2.096	0.80	...	39 km WSW of Ashkāsham, Afghanistan	earthquake	7.97	7.099	0.083	41.0	reviewed	us	us	Afghanistan
us6000qh0s	2025-05-31 02:45:35.975000+00:00	35.8044	71.1038	83.711	4.3	mb	31.0	98.0	2.063	1.28	...	45 km NNE of Pārūn, Afghanistan	earthquake	8.31	7.823	0.154	12.0	reviewed	us	us	Afghanistan

5 rows × 22 columns

Aggregation#

Now that we know how to create a GroupBy object, let’s learn how to do aggregation on it.

One way us to use the .aggregate method, which accepts another function as its argument. The result is automatically combined into a new dataframe with the group key as the index.

gb.aggregate(np.max).head()

/tmp/ipykernel_4793/2808193076.py:1: FutureWarning: The provided callable <function max at 0x7ba5983a3100> is currently using DataFrameGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  gb.aggregate(np.max).head()

	time	latitude	longitude	depth	mag	magType	nst	gap	dmin	rms	...	updated	place	type	horizontalError	depthError	magError	magNst	status	locationSource	magSource
country
2025 Drake Passage Earthquake	2025-05-02 12:58:26.014000+00:00	-56.8094	-68.1019	10.000	7.4	mww	284.0	20.0	1.900	0.75	...	2025-06-06T12:21:50.736Z	2025 Drake Passage Earthquake	earthquake	7.58	1.441	0.038	68.0	reviewed	us	us
Afghanistan	2025-06-16 10:34:13.071000+00:00	37.4414	72.3551	230.241	5.7	mww	178.0	246.0	2.983	1.28	...	2025-06-16T11:33:01.040Z	80 km SSE of Farkhār, Afghanistan	earthquake	11.79	13.714	0.235	154.0	reviewed	us	us
Alaska	2025-06-14 14:32:06.018000+00:00	67.8223	178.6746	205.536	6.2	mww	287.0	219.0	2.579	1.37	...	2025-06-17T12:05:18.488Z	Rat Islands, Aleutian Islands, Alaska	earthquake	11.37	13.318	0.197	715.0	reviewed	us	us
Algeria	2025-03-18 09:50:26.946000+00:00	36.4349	3.4384	10.000	4.8	mww	167.0	55.0	3.516	0.51	...	2025-05-17T19:20:00.040Z	19 km SW of Lakhdaria, Algeria	earthquake	7.35	1.800	0.083	14.0	reviewed	us	us
Anguilla	2025-05-27 02:55:33.363000+00:00	19.3269	-63.1485	94.968	4.5	mb	75.0	138.0	1.427	0.72	...	2025-06-06T21:12:26.040Z	52 km NNW of Sandy Ground Village, Anguilla	earthquake	6.88	8.063	0.124	44.0	reviewed	us	us

5 rows × 21 columns

By default, the operation is applied to every column. That’s usually not what we want. We can use both . or [] syntax to select a specific column to operate on. Then we get back a series.

gb

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ba4eb729b20>

gb.mag

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7ba4eb746060>

gb.mag.aggregate(np.max)

/tmp/ipykernel_4793/2229133186.py:1: FutureWarning: The provided callable <function max at 0x7ba5983a3100> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  gb.mag.aggregate(np.max)

country
2025 Drake Passage Earthquake     7.4
Afghanistan                       5.7
Alaska                            6.2
Algeria                           4.8
Anguilla                          4.5
                                 ... 
west of Macquarie Island          6.1
west of Vancouver Island          4.2
west of the Galapagos Islands     5.3
western Indian-Antarctic Ridge    6.2
western Xizang                    5.3
Name: mag, Length: 194, dtype: float64

gb.mag.aggregate(np.sum)

/tmp/ipykernel_4793/2970731060.py:1: FutureWarning: The provided callable <function sum at 0x7ba5983a2700> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
  gb.mag.aggregate(np.sum)

country
2025 Drake Passage Earthquake       7.4
Afghanistan                       221.7
Alaska                            583.3
Algeria                             4.8
Anguilla                           13.1
                                  ...  
west of Macquarie Island           75.3
west of Vancouver Island            4.2
west of the Galapagos Islands      46.1
western Indian-Antarctic Ridge    137.5
western Xizang                     40.5
Name: mag, Length: 194, dtype: float64

gb.mag.aggregate(np.mean)

/tmp/ipykernel_4793/480462825.py:1: FutureWarning: The provided callable <function mean at 0x7ba5983a3b00> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  gb.mag.aggregate(np.mean)

country
2025 Drake Passage Earthquake     7.400000
Afghanistan                       4.347059
Alaska                            4.557031
Algeria                           4.800000
Anguilla                          4.366667
                                    ...   
west of Macquarie Island          4.706250
west of Vancouver Island          4.200000
west of the Galapagos Islands     4.610000
western Indian-Antarctic Ridge    4.741379
western Xizang                    4.500000
Name: mag, Length: 194, dtype: float64

gb.mag.aggregate(np.max).head()

/tmp/ipykernel_4793/2287653877.py:1: FutureWarning: The provided callable <function max at 0x7ba5983a3100> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  gb.mag.aggregate(np.max).head()

country
2025 Drake Passage Earthquake    7.4
Afghanistan                      5.7
Alaska                           6.2
Algeria                          4.8
Anguilla                         4.5
Name: mag, dtype: float64

gb.mag.aggregate(np.max).nlargest(10)

/tmp/ipykernel_4793/2071493650.py:1: FutureWarning: The provided callable <function max at 0x7ba5983a3100> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  gb.mag.aggregate(np.max).nlargest(10)

country
Burma (Myanmar) Earthquake       7.7
Cayman Islands                   7.6
2025 Drake Passage Earthquake    7.4
Tonga                            7.0
Papua New Guinea                 6.9
Reykjanes Ridge                  6.9
Japan                            6.8
Macquarie Island region          6.8
Burma (Myanmar)                  6.7
New Zealand                      6.7
Name: mag, dtype: float64

There are shortcuts for common aggregation functions:

gb.mag.max().nlargest(10)

country
Burma (Myanmar) Earthquake       7.7
Cayman Islands                   7.6
2025 Drake Passage Earthquake    7.4
Tonga                            7.0
Papua New Guinea                 6.9
Reykjanes Ridge                  6.9
Japan                            6.8
Macquarie Island region          6.8
Burma (Myanmar)                  6.7
New Zealand                      6.7
Name: mag, dtype: float64

gb.mag.min().nsmallest(10)

country
Puerto Rico            4.02
U.S. Virgin Islands    4.05
CA                     4.06
Dominican Republic     4.06
Afghanistan            4.10
Alaska                 4.10
Argentina              4.10
Australia              4.10
Bahamas                4.10
Bolivia                4.10
Name: mag, dtype: float64

gb.mag.mean().nlargest(10)

country
Burma (Myanmar) Earthquake            7.700000
2025 Drake Passage Earthquake         7.400000
Western Caribbean Sea                 5.400000
north of Franz Josef Land             5.300000
Macquarie Island region               5.233333
southeast central Pacific Ocean       5.200000
Morocco                               5.100000
Pacific-Antarctic Ridge               5.070588
South Atlantic Ocean                  5.050000
east of the South Sandwich Islands    5.000000
Name: mag, dtype: float64

gb.mag.std().nlargest(10)

country
Cayman Islands                    0.978713
U.S. Virgin Islands               0.909340
Macquarie Island region           0.895917
southwest of Africa               0.760263
Revilla Gigedo Islands region     0.726024
Svalbard and Jan Mayen            0.680241
India region                      0.643774
northwest of the Kuril Islands    0.643169
Panama                            0.621435
Reykjanes Ridge                   0.605769
Name: mag, dtype: float64

We can also apply multiple functions at once:

gb.mag.aggregate([np.min, np.max, np.mean]).head()

/tmp/ipykernel_4793/1381609224.py:1: FutureWarning: The provided callable <function min at 0x7ba5983a3240> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
  gb.mag.aggregate([np.min, np.max, np.mean]).head()
/tmp/ipykernel_4793/1381609224.py:1: FutureWarning: The provided callable <function max at 0x7ba5983a3100> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  gb.mag.aggregate([np.min, np.max, np.mean]).head()
/tmp/ipykernel_4793/1381609224.py:1: FutureWarning: The provided callable <function mean at 0x7ba5983a3b00> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  gb.mag.aggregate([np.min, np.max, np.mean]).head()

	min	max	mean
country
2025 Drake Passage Earthquake	7.4	7.4	7.400000
Afghanistan	4.1	5.7	4.347059
Alaska	4.1	6.2	4.557031
Algeria	4.8	4.8	4.800000
Anguilla	4.2	4.5	4.366667

gb.mag.aggregate([np.min, np.max, np.mean]).nlargest(10, 'mean').plot(kind='bar')

/tmp/ipykernel_4793/3179219123.py:1: FutureWarning: The provided callable <function min at 0x7ba5983a3240> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
  gb.mag.aggregate([np.min, np.max, np.mean]).nlargest(10, 'mean').plot(kind='bar')
/tmp/ipykernel_4793/3179219123.py:1: FutureWarning: The provided callable <function max at 0x7ba5983a3100> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  gb.mag.aggregate([np.min, np.max, np.mean]).nlargest(10, 'mean').plot(kind='bar')
/tmp/ipykernel_4793/3179219123.py:1: FutureWarning: The provided callable <function mean at 0x7ba5983a3b00> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  gb.mag.aggregate([np.min, np.max, np.mean]).nlargest(10, 'mean').plot(kind='bar')

<Axes: xlabel='country'>

../../_images/d757e7bd48aa60681dc9481c296ef34d1dec62c7e3c1acd8a065086d7f0b211b.png

Transformation#

The key difference between aggregation and transformation is that aggregation returns a smaller object than the original, indexed by the group keys, while transformation returns an object with the same index (and same size) as the original object. Groupby + transformation is used when applying an operation that requires information about the whole group.

In this example, we standardize the earthquakes in each country so that the distribution has zero mean and unit variance. We do this by first defining a function called standardize and then passing it to the transform method.

I admit that I don’t know why you would want to do this. transform makes more sense to me in the context of time grouping operation. See below for another example.

def standardize(x):
    return (x - x.mean())/x.std()

mag_standardized_by_country = gb.mag.transform(standardize)
mag_standardized_by_country.head()

id
us6000qkt1    0.668464
us6000qksf    0.419407
us6000qks2    2.157996
us6000qkrz    0.642692
us6000qks0    0.909743
Name: mag, dtype: float64

Time Grouping#

We already saw how pandas has a strong built-in understanding of time. This capability is even more powerful in the context of groupby. With datasets indexed by a pandas DateTimeIndex, we can easily group and resample the data using common time units.

To get started, let’s load the timeseries data we already explored in past lessons.

import urllib
import pandas as pd

header_url = 'ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/HEADERS.txt'
with urllib.request.urlopen(header_url) as response:
    data = response.read().decode('utf-8')
lines = data.split('\n')
headers = lines[1].split(' ')

ftp_base = 'ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/'
dframes = []
for year in range(2016, 2019):
    data_url = f'{year}/CRND0103-{year}-NY_Millbrook_3_W.txt'               
    df = pd.read_csv(ftp_base + data_url, parse_dates=[1],
                     names=headers, header=None, sep='\s+',
                     na_values=[-9999.0, -99.0])
    dframes.append(df)

df = pd.concat(dframes)
df = df.set_index('LST_DATE')

<>:15: SyntaxWarning: invalid escape sequence '\s'
<>:15: SyntaxWarning: invalid escape sequence '\s'
/tmp/ipykernel_4793/2092298281.py:15: SyntaxWarning: invalid escape sequence '\s'
  names=headers, header=None, sep='\s+',

df.head()

	WBANNO	CRX_VN	LONGITUDE	LATITUDE	T_DAILY_MAX	T_DAILY_MIN	T_DAILY_MEAN	T_DAILY_AVG	P_DAILY_CALC	SOLARAD_DAILY	...	SOIL_MOISTURE_10_DAILY	SOIL_MOISTURE_20_DAILY	SOIL_MOISTURE_50_DAILY	SOIL_MOISTURE_100_DAILY	SOIL_TEMP_5_DAILY	SOIL_TEMP_10_DAILY	SOIL_TEMP_20_DAILY	SOIL_TEMP_50_DAILY	SOIL_TEMP_100_DAILY
LST_DATE
2016-01-01	64756	2.422	-73.74	41.79	3.4	-0.5	1.5	1.3	0.0	1.69	...	0.233	0.204	0.155	0.147	4.2	4.4	5.1	6.0	7.6	NaN
2016-01-02	64756	2.422	-73.74	41.79	2.9	-3.6	-0.4	-0.3	0.0	6.25	...	0.227	0.199	0.152	0.144	2.8	3.1	4.2	5.7	7.4	NaN
2016-01-03	64756	2.422	-73.74	41.79	5.1	-1.8	1.6	1.1	0.0	5.69	...	0.223	0.196	0.151	0.141	2.6	2.8	3.8	5.2	7.2	NaN
2016-01-04	64756	2.422	-73.74	41.79	0.5	-14.4	-6.9	-7.5	0.0	9.17	...	0.220	0.194	0.148	0.139	1.7	2.1	3.4	4.9	6.9	NaN
2016-01-05	64756	2.422	-73.74	41.79	-5.2	-15.5	-10.3	-11.7	0.0	9.34	...	0.213	0.191	0.148	0.138	0.4	0.9	2.4	4.3	6.6	NaN

5 rows × 28 columns

This timeseries has daily resolution, and the daily plots are somewhat noisy.

df.T_DAILY_MEAN.plot()

<Axes: xlabel='LST_DATE'>

../../_images/ed4383b447bcc4bd3940ef7461bf60523dcb5932b421abac787359230ff75a4e.png

A common way to analyze such data in climate science is to create a “climatology,” which contains the average values in each month or day of the year. We can do this easily with groupby. Recall that df.index is a pandas DateTimeIndex object.

df.index

DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08',
               '2016-01-09', '2016-01-10',
               ...
               '2018-12-22', '2018-12-23', '2018-12-24', '2018-12-25',
               '2018-12-26', '2018-12-27', '2018-12-28', '2018-12-29',
               '2018-12-30', '2018-12-31'],
              dtype='datetime64[ns]', name='LST_DATE', length=1096, freq=None)

df.index.month

Index([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       ...
       12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
      dtype='int32', name='LST_DATE', length=1096)

monthly_climatology = df.select_dtypes(include='number').groupby(df.index.month).mean()
monthly_climatology

	WBANNO	CRX_VN	LONGITUDE	LATITUDE	T_DAILY_MAX	T_DAILY_MIN	T_DAILY_MEAN	T_DAILY_AVG	P_DAILY_CALC	SOLARAD_DAILY	...	SOIL_MOISTURE_10_DAILY	SOIL_MOISTURE_20_DAILY	SOIL_MOISTURE_50_DAILY	SOIL_MOISTURE_100_DAILY	SOIL_TEMP_5_DAILY	SOIL_TEMP_10_DAILY	SOIL_TEMP_20_DAILY	SOIL_TEMP_50_DAILY	SOIL_TEMP_100_DAILY
LST_DATE
1	64756.0	2.488667	-73.74	41.79	2.924731	-7.122581	-2.100000	-1.905376	2.478495	5.812258	...	0.240250	0.200698	0.153645	0.160859	0.150538	0.248387	0.788172	1.766667	3.364516	NaN
2	64756.0	2.487882	-73.74	41.79	6.431765	-5.015294	0.712941	1.022353	4.077647	8.495882	...	0.247714	0.210044	0.159153	0.163889	1.216471	1.169412	1.278824	1.617647	2.442353	NaN
3	64756.0	2.488667	-73.74	41.79	7.953763	-3.035484	2.455914	2.643011	2.788172	13.211290	...	0.228612	0.203634	0.157796	0.160366	3.450538	3.400000	3.372043	3.480645	3.777419	NaN
4	64756.0	2.488667	-73.74	41.79	14.793333	1.816667	8.302222	8.574444	2.396667	15.295889	...	0.214067	0.195844	0.153878	0.158100	9.415556	9.117778	8.455556	7.618889	6.670000	NaN
5	64756.0	2.488667	-73.74	41.79	21.235484	8.460215	14.850538	15.121505	3.015054	17.288602	...	0.204796	0.187097	0.148871	0.155720	16.934409	16.640860	15.612903	14.208602	12.455914	NaN
6	64756.0	2.488667	-73.74	41.79	25.627778	11.837778	18.733333	19.026667	3.053333	21.913333	...	0.136911	0.135211	0.129422	0.152711	22.403333	22.126667	20.956667	19.448889	17.552222	NaN
7	64756.0	2.488667	-73.74	41.79	28.568817	15.536559	22.054839	22.012903	3.865591	21.570645	...	0.105806	0.095204	0.114419	0.150000	25.448387	25.318280	24.163441	22.746237	21.068817	NaN
8	64756.0	2.488667	-73.74	41.79	27.473118	15.351613	21.410753	21.378495	4.480645	18.493333	...	0.156161	0.132333	0.128796	0.158750	24.758065	24.829032	24.116129	23.325806	22.301075	NaN
9	64756.0	2.488667	-73.74	41.79	24.084444	12.032222	18.057778	17.866667	3.730000	13.625667	...	0.136900	0.126422	0.121356	0.154000	21.028889	21.168889	20.921111	20.834444	20.707778	NaN
10	64756.0	2.548882	-73.74	41.79	18.127473	5.757143	11.938462	11.952747	3.228261	9.442527	...	0.155275	0.128462	0.120187	0.144588	14.872527	15.056044	15.379121	16.159341	17.059341	NaN
11	64756.0	2.555333	-73.74	41.79	9.586667	-1.375556	4.097778	4.277778	3.991111	6.350111	...	0.226111	0.212200	0.164778	0.163078	6.777778	6.975556	7.658889	9.048889	10.864444	NaN
12	64756.0	2.555333	-73.74	41.79	3.569892	-5.704301	-1.069892	-0.850538	2.791398	4.708602	...	0.259527	0.221280	0.171817	0.177172	1.831183	2.009677	2.647312	3.910753	5.710753	NaN

12 rows × 27 columns

monthly_climatology.T_DAILY_MAX.plot()

<Axes: xlabel='LST_DATE'>

../../_images/04bb90fec3cbcc4f8f702a38bd4eaa674261cb3e96f005c40d6041cd3c0fae3a.png

Each row in this new dataframe respresents the average values for the months (1=January, 2=February, etc.)

We can apply more customized aggregations, as with any groupby operation. Below we keep the mean of the mean, max of the max, and min of the min for the temperature measurements.

monthly_T_climatology = df.groupby(df.index.month).aggregate({'T_DAILY_MEAN': 'mean',
                                                              'T_DAILY_MAX': 'max',
                                                              'T_DAILY_MIN': 'min'})
monthly_T_climatology

	T_DAILY_MEAN	T_DAILY_MAX	T_DAILY_MIN
LST_DATE
1	-2.100000	16.9	-26.0
2	0.712941	24.9	-24.7
3	2.455914	26.8	-16.5
4	8.302222	30.6	-11.3
5	14.850538	33.4	-1.6
6	18.733333	33.8	3.4
7	22.054839	35.7	8.2
8	21.410753	34.5	6.0
9	18.057778	32.7	0.3
10	11.938462	29.4	-4.1
11	4.097778	22.2	-15.9
12	-1.069892	16.9	-21.8

monthly_T_climatology.plot(marker='o')

<Axes: xlabel='LST_DATE'>

../../_images/033551592df0d741a75e1a16ad93b170bf12822724db5963a1bd59f8cf15a1e9.png

If we want to do it on a finer scale, we can group by day of year.

daily_T_climatology = df.groupby(df.index.dayofyear).aggregate({'T_DAILY_MEAN': 'mean',
                                                            'T_DAILY_MAX': 'max',
                                                            'T_DAILY_MIN': 'min'})
daily_T_climatology.plot(marker='.')

<Axes: xlabel='LST_DATE'>

../../_images/14b0b0394d66744dfd1398ca5176b7714bce165c299de8164f868fb7f1ea3ccb.png

Calculating anomalies#

A common mode of analysis in climate science is to remove the climatology from a signal to focus only on the “anomaly” values. This can be accomplished with transformation.

df

	WBANNO	CRX_VN	LONGITUDE	LATITUDE	T_DAILY_MAX	T_DAILY_MIN	T_DAILY_MEAN	T_DAILY_AVG	P_DAILY_CALC	SOLARAD_DAILY	...	SOIL_MOISTURE_10_DAILY	SOIL_MOISTURE_20_DAILY	SOIL_MOISTURE_50_DAILY	SOIL_MOISTURE_100_DAILY	SOIL_TEMP_5_DAILY	SOIL_TEMP_10_DAILY	SOIL_TEMP_20_DAILY	SOIL_TEMP_50_DAILY	SOIL_TEMP_100_DAILY
LST_DATE
2016-01-01	64756	2.422	-73.74	41.79	3.4	-0.5	1.5	1.3	0.0	1.69	...	0.233	0.204	0.155	0.147	4.2	4.4	5.1	6.0	7.6	NaN
2016-01-02	64756	2.422	-73.74	41.79	2.9	-3.6	-0.4	-0.3	0.0	6.25	...	0.227	0.199	0.152	0.144	2.8	3.1	4.2	5.7	7.4	NaN
2016-01-03	64756	2.422	-73.74	41.79	5.1	-1.8	1.6	1.1	0.0	5.69	...	0.223	0.196	0.151	0.141	2.6	2.8	3.8	5.2	7.2	NaN
2016-01-04	64756	2.422	-73.74	41.79	0.5	-14.4	-6.9	-7.5	0.0	9.17	...	0.220	0.194	0.148	0.139	1.7	2.1	3.4	4.9	6.9	NaN
2016-01-05	64756	2.422	-73.74	41.79	-5.2	-15.5	-10.3	-11.7	0.0	9.34	...	0.213	0.191	0.148	0.138	0.4	0.9	2.4	4.3	6.6	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2018-12-27	64756	2.622	-73.74	41.79	2.5	-2.1	0.2	0.3	0.0	7.50	...	0.275	0.248	0.191	0.192	1.3	1.4	1.9	3.2	4.7	NaN
2018-12-28	64756	2.622	-73.74	41.79	11.6	1.9	6.8	7.6	11.5	0.45	...	0.295	0.261	0.193	0.191	2.9	2.7	2.5	3.1	4.5	NaN
2018-12-29	64756	2.622	-73.74	41.79	11.3	-2.1	4.6	6.3	0.0	4.89	...	0.295	0.270	0.208	0.191	4.5	4.4	4.0	3.8	4.5	NaN
2018-12-30	64756	2.622	-73.74	41.79	0.4	-4.1	-1.9	-1.5	0.2	3.32	...	0.286	0.261	0.204	0.194	2.6	2.8	3.3	4.0	4.7	NaN
2018-12-31	64756	2.622	-73.74	41.79	5.0	-2.8	1.1	2.0	17.4	2.13	...	0.291	0.261	0.200	0.193	2.6	2.6	2.9	3.8	4.8	NaN

1096 rows × 28 columns

monthly_T_climatology

	T_DAILY_MEAN	T_DAILY_MAX	T_DAILY_MIN
LST_DATE
1	-2.100000	16.9	-26.0
2	0.712941	24.9	-24.7
3	2.455914	26.8	-16.5
4	8.302222	30.6	-11.3
5	14.850538	33.4	-1.6
6	18.733333	33.8	3.4
7	22.054839	35.7	8.2
8	21.410753	34.5	6.0
9	18.057778	32.7	0.3
10	11.938462	29.4	-4.1
11	4.097778	22.2	-15.9
12	-1.069892	16.9	-21.8

anomaly

----------------------------------------------------------------
NameError                      Traceback (most recent call last)
Cell In[56], line 1
----> 1 anomaly

NameError: name 'anomaly' is not defined

def standardize(x):
    return x #(x - x.mean())#/x.std()

anomaly = df.select_dtypes(include='number').groupby(df.index.month).transform(standardize)
anomaly.plot(y='T_DAILY_MEAN')

<Axes: xlabel='LST_DATE'>

../../_images/d2114725ebeaa2fb6e3917752448a7a720ced1b94f680bed5313540a70d30fc2.png

def standardize(x):
    return (x - x.mean())#/x.std()

anomaly = df.select_dtypes(include='number').groupby(df.index.month).transform(standardize)
anomaly.plot(y='T_DAILY_MEAN')

<Axes: xlabel='LST_DATE'>

../../_images/d71262382ff4f4a28f4cdac3875947d989f5ab4a0f39b808686a7a9930a6e98e.png

def standardize(x):
    return (x - x.mean())/x.std()

anomaly = df.select_dtypes(include='number').groupby(df.index.month).transform(standardize)
anomaly.plot(y='T_DAILY_MEAN')

<Axes: xlabel='LST_DATE'>

../../_images/7c71d7a7b8b2186a1ceeeb6400fb71024f243f5e3e4e4836ae68cb5cab601682.png

Resampling#

Another common operation is to change the resolution of a dataset by resampling in time. Pandas exposes this through the resample function. The resample periods are specified using pandas offset index syntax.

Below we resample the dataset by taking the mean over each month.

df.select_dtypes(include='number').resample('M').mean().plot(y='T_DAILY_MEAN', marker='o')

/tmp/ipykernel_4793/1287015095.py:1: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
  df.select_dtypes(include='number').resample('M').mean().plot(y='T_DAILY_MEAN', marker='o')

<Axes: xlabel='LST_DATE'>

../../_images/e3ee18c6c9824a2869abac804b39eeeb95309efd8d8e4ad101385f55f34da453.png

df.select_dtypes(include='number').resample('M').std().plot(y='T_DAILY_MEAN', marker='o')

/tmp/ipykernel_4793/2569503248.py:1: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
  df.select_dtypes(include='number').resample('M').std().plot(y='T_DAILY_MEAN', marker='o')

<Axes: xlabel='LST_DATE'>

../../_images/b63e1a104784dcc44d533ef07a62459e0c7e476e6ca3dbd0697c2f58498b8eaf.png

Just like with groupby, we can apply any aggregation function to our resample operation.

df.select_dtypes(include='number').resample('M').max().plot(y='T_DAILY_MAX', marker='o')

/tmp/ipykernel_4793/1412777515.py:1: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
  df.select_dtypes(include='number').resample('M').max().plot(y='T_DAILY_MAX', marker='o')

<Axes: xlabel='LST_DATE'>

../../_images/c7d9312c02ff9167307ab3b152132d95723fe27de378947f694419b988b7f68a.png

Rolling Operations#

The final category of operations applies to “rolling windows”. (See rolling documentation.) We specify a function to apply over a moving window along the index. We specify the size of the window and, optionally, the weights. We also use the keyword centered to tell pandas whether to center the operation around the midpoint of the window.

df.rolling(30, center=True).T_DAILY_MEAN.mean().plot()
df.rolling(30, center=True, win_type='triang').T_DAILY_MEAN.mean().plot()

<Axes: xlabel='LST_DATE'>

../../_images/3fa6ffc28dda225a7a88e49eab4a028a9be9565e0e390c415775c24d2bd50d99.png

df.rolling(30, center=True).T_DAILY_MEAN.max().plot()

<Axes: xlabel='LST_DATE'>

../../_images/be42366b586d7eaf300c9fd7756bd2a1437b7f4aa9b3876a7f54f29ccccc90f8.png

Pandas: Groupby

Contents

Pandas: Groupby#

An Example#

What Happened?#

The groupby method#

The GroubBy object#

Iterating and selecting groups#

Aggregation#

Transformation#

Time Grouping#

Calculating anomalies#

Resampling#

Rolling Operations#

The `groupby` method#

The `GroubBy` object#