Mastering Data Aggregation, Summary Statistics on Groups and Pivot Table

In this article, we'll learn what data aggregation is, calculate summary statistics on a dataset, and work through grouped summary statistics and pivot tables.

What is Data Aggregation

Data aggregation is the process of collecting, combining, and summarizing data from multiple sources into a single, more meaningful representation. This helps to identify patterns, trends, and relationships in large and complex datasets and can be used for data analysis and reporting purposes. Data aggregation includes calculating the summary statistics of data, grouping data based on categories, and creating pivot tables.

Summary Statistics

Summary statistics, as the name suggests, are numbers that summarize our dataset and tell us about it. First, let's import pandas and load the gym_subscribers.csv file, which contains the data we will use in this article. The file is available in PyProDev's GitHub repository.

import pandas as pd
print('Using pandas version:', pd.__version__)

output:
Using pandas version: 1.3.5

#import data from csv file
subscribers = pd.read_csv('gym_subscribers.csv')
print(type(subscribers))
print('size of subscribers dataset:', subscribers.shape)

output:
<class 'pandas.core.frame.DataFrame'>
size of subscribers dataset: (20, 7)

Let's explore the data by displaying all the rows of the subscribers DataFrame.

# display all rows of the subscribers DataFrame (display() is available in Jupyter notebooks)
display(subscribers)

	Name	Gender	Skin Color	Height(cm)	Weight(kg)	Date of Birth	Subscription Type
0	William	Male	White	152	64	1993-09-16	Monthly
1	Jamie	Male	Black	162	91	1999-11-05	Yearly
2	Jamie	Female	Brown	152	47	2003-05-14	Yearly
3	Ethan	Transgender	White	160	41	1998-06-19	Monthly
4	Samantha	Female	Brown	160	41	1991-10-04	Half-yearly
5	Ashley	Female	White	167	50	2000-04-25	Monthly
6	Amelia	Female	White	173	96	1992-12-31	Half-yearly
7	Emma	Female	White	157	60	1991-03-22	Half-yearly
8	Isabella	Female	Brown	166	74	1999-09-13	Monthly
9	Madison	Female	Black	160	113	1991-06-13	Yearly
10	Alexander	Male	Black	147	45	2001-12-29	Half-yearly
11	Michael	Male	Black	175	64	1993-07-29	Yearly
12	Sarah	Female	Black	172	45	1996-10-19	Half-yearly
13	Benjamin	Male	Brown	170	50	1997-06-06	Monthly
14	Sophia	Female	Black	155	69	2000-08-07	Half-yearly
15	James	Male	White	187	118	1995-12-12	Half-yearly
16	Daniel	Male	Brown	160	64	2001-07-10	Half-yearly
17	Joseph	Male	Black	180	132	1999-08-27	Yearly
18	Emily	Female	White	175	104	1992-02-29	Half-yearly
19	David	Male	Brown	193	119	1996-11-24	Monthly

Let's look at the column names, non-null counts, and data types of subscribers using info().

subscribers.info()

output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Name               20 non-null     object
 1   Gender             20 non-null     object
 2   Skin Color         20 non-null     object
 3   Height(cm)         20 non-null     int64 
 4   Weight(kg)         20 non-null     int64 
 5   Date of Birth      20 non-null     object
 6   Subscription Type  20 non-null     object
dtypes: int64(2), object(5)
memory usage: 1.2+ KB

Summarizing Numerical Data

One of the most common summary statistics for numeric data is the mean, which is one way of telling us where the center of our data is. We can calculate the mean of a column by selecting the column with square brackets and calling .mean(). There are lots of other summary statistics that we can compute on columns, like

.median() which returns the median of the column,
.min() which returns the minimum of the column,
.max() which returns the maximum of the column,
.var() which returns the variance of the column and
.std() which returns the standard deviation of the column.

Calculating summary statistics allows us to get a better sense of our data, even if there's a lot of it.

print('Calculating on Weight(kg) column...')
print('Mean:', subscribers['Weight(kg)'].mean())
print('Median:', subscribers['Weight(kg)'].median())
print('Min:', subscribers['Weight(kg)'].min())
print('Max:', subscribers['Weight(kg)'].max())
print('Variance:', subscribers['Weight(kg)'].var())
print('Standard deviation:', subscribers['Weight(kg)'].std())

output:
Calculating on Weight(kg) column...
Mean: 74.35
Median: 64.0
Min: 41
Max: 132
Variance: 883.0815789473684
Standard deviation: 29.716688559584973
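
To make the variance and standard deviation less of a black box, here is a minimal sketch that recomputes them by hand on a few weights copied from the table above. It assumes pandas' documented default of ddof=1 (the sample variance, which divides by n - 1); the variable names are only illustrative.

```python
import pandas as pd

# Six weights copied from the first rows of the subscribers table
weights = pd.Series([64, 91, 47, 41, 41, 50])

mean = weights.mean()
n = len(weights)
squared_deviations = (weights - mean) ** 2

# Sample variance divides by n - 1 (pandas default, ddof=1)
sample_var = squared_deviations.sum() / (n - 1)
# Population variance divides by n (ddof=0)
population_var = squared_deviations.sum() / n

print(sample_var, weights.var())
print(population_var, weights.var(ddof=0))
# The standard deviation is the square root of the variance
print(sample_var ** 0.5, weights.std())
```

Each printed pair should agree up to floating-point rounding, which shows that .var() and .std() use the n - 1 divisor unless ddof=0 is passed.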

Summarizing Dates

Summary statistics can also be calculated on date columns whose values use a standard date format. Some summary statistics, like the mean, are less meaningful on dates, but others are very helpful. For example, we can find the oldest person's date of birth by taking the minimum of the Date of Birth column.

print(subscribers['Date of Birth'].min())

output:
1991-03-22

Similarly, we can take the maximum to find the youngest person in our dataset.

print(subscribers['Date of Birth'].max())

output:
2003-05-14

Super summarizing! Taking the minimum and maximum of a column of dates is handy for figuring out the range of birth dates among our gym subscribers.
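
One caveat: the info() output above lists Date of Birth as object, so min() and max() are comparing text. That works here because YYYY-MM-DD strings sort in chronological order. A safer option, sketched below on three rows copied from the table, is to convert the column with pd.to_datetime first. The sample DataFrame and variable names are illustrative only.

```python
import pandas as pd

# Three rows copied from the subscribers table
sample = pd.DataFrame({
    'Name': ['William', 'Jamie', 'Emma'],
    'Date of Birth': ['1993-09-16', '2003-05-14', '1991-03-22'],
})

# Convert the text column to real datetime values
sample['Date of Birth'] = pd.to_datetime(sample['Date of Birth'])

oldest = sample['Date of Birth'].min()
youngest = sample['Date of Birth'].max()
print('Oldest date of birth:', oldest.date())
print('Youngest date of birth:', youngest.date())

# With datetimes, subtracting gives a Timedelta we can inspect
print('Range in days:', (youngest - oldest).days)
```

Once the column holds datetimes, date arithmetic such as the range in days becomes available, which is not possible with plain strings.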

Aggregate Data Using `agg()` Method

While pandas and NumPy have tons of functions, sometimes we may need a different function to summarize our data. The aggregate method, agg(), allows us to compute custom summary statistics on a DataFrame, as well as apply functions to more than one column at once, making our aggregations super-efficient. Here, we create a function called get_second_largest that computes the second-largest value in a DataFrame column. We then aggregate the weight column using .agg(), passing in our function, get_second_largest. It gives us the second-largest weight among the persons in our dataset.

def get_second_largest(data):
    sorted_list = data.sort_values(ascending=False).tolist()
    return sorted_list[1] # 2nd largest is at the index 1

print(subscribers['Weight(kg)'].agg(get_second_largest))

output:
119
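
As a possible alternative to sorting the whole column, the same statistic can be sketched with Series.nlargest, which keeps only the top values. The helper name second_largest and the six sample weights are illustrative and not part of the original code.

```python
import pandas as pd

# Six weights copied from the first rows of the subscribers table
weights = pd.Series([64, 91, 47, 41, 41, 50], name='Weight(kg)')

def second_largest(data):
    # nlargest(2) returns the two largest values in descending order,
    # so the last of them is the second-largest
    return data.nlargest(2).iloc[-1]

result = weights.agg(second_largest)
print(result)
```

Like the sort-based version, this treats tied values as separate entries, so a column whose two largest values are equal returns that value.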

Aggregating On Multiple Columns

agg() can also be used on more than one column. By selecting the weight and height columns before calling agg(), we get the second-largest value of each column.

print('Second largest values:')
print(subscribers[['Weight(kg)', 'Height(cm)']].agg(get_second_largest))

output:
Second largest values:
Weight(kg)    119
Height(cm)    187
dtype: int64

Multiple Summary Statistics

We can also use agg() to get multiple summary statistics at once. Here's another function called get_third_largest that computes the third largest value from the given data. We can pass a list of functions into agg(), in this case, get_second_largest and get_third_largest, which will return the second and third-largest values from the given data.

def get_third_largest(data):
    sorted_list = data.sort_values(ascending=False).tolist()
    return sorted_list[2] # 3rd largest is at the index 2

print(subscribers['Weight(kg)'].agg([get_second_largest, get_third_largest]))

output:
get_second_largest    119
get_third_largest     118
Name: Weight(kg), dtype: int64

Multiple Summary Statistics On Multiple Columns

The utility of agg() shines when we do multiple summary statistics on multiple columns at the same time. Let's find the second and third largest values from the weight and height columns.

print(subscribers[['Weight(kg)', 'Height(cm)']].agg([get_second_largest, get_third_largest]))

output:
                    Weight(kg)  Height(cm)
get_second_largest         119         187
get_third_largest          118         180

Excellent efficiency! The .agg() method makes it easy to compute multiple statistics on multiple columns, all in just one line of code.
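
agg() is not limited to custom functions. As a small sketch on six rows copied from the table, built-in statistics can be requested by name, and a dict can assign a different statistic to each column. The sample DataFrame and variable names are illustrative.

```python
import pandas as pd

# Weights and heights copied from the first six rows of the table
sample = pd.DataFrame({
    'Weight(kg)': [64, 91, 47, 41, 41, 50],
    'Height(cm)': [152, 162, 152, 160, 160, 167],
})

# A list of names computes every statistic for every column
summary = sample.agg(['min', 'max', 'mean'])
print(summary)

# A dict chooses one statistic per column
per_column = sample.agg({'Weight(kg)': 'mean', 'Height(cm)': 'max'})
print(per_column)
```

The list form returns a DataFrame with one row per statistic, while the dict form returns a Series with one entry per column.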

Cumulative Statistics

Cumulative statistics can also help track summary statistics over time. pandas also has methods for computing cumulative statistics, for example, the cumulative sum, cumsum(). Calling cumsum() on a column returns not just one number, but a number for each row of the DataFrame. The first number returned, or the number at the zero index is the first person's weight. The next number is the sum of the first and second persons' weights. The third number is the sum of the first, second, and third persons' weights, and so on. The last number is the sum of all the persons' weights. Compute the cumulative sum of Weight(kg) and add it as a new column of subscribers called cumsum_Weight(kg).

subscribers['cumsum_Weight(kg)'] = subscribers['Weight(kg)'].cumsum()
display(subscribers)

	Name	Gender	Skin Color	Height(cm)	Weight(kg)	Date of Birth	Subscription Type	cumsum_Weight(kg)
0	William	Male	White	152	64	1993-09-16	Monthly	64
1	Jamie	Male	Black	162	91	1999-11-05	Yearly	155
2	Jamie	Female	Brown	152	47	2003-05-14	Yearly	202
3	Ethan	Transgender	White	160	41	1998-06-19	Monthly	243
4	Samantha	Female	Brown	160	41	1991-10-04	Half-yearly	284
5	Ashley	Female	White	167	50	2000-04-25	Monthly	334
6	Amelia	Female	White	173	96	1992-12-31	Half-yearly	430
7	Emma	Female	White	157	60	1991-03-22	Half-yearly	490
8	Isabella	Female	Brown	166	74	1999-09-13	Monthly	564
9	Madison	Female	Black	160	113	1991-06-13	Yearly	677
10	Alexander	Male	Black	147	45	2001-12-29	Half-yearly	722
11	Michael	Male	Black	175	64	1993-07-29	Yearly	786
12	Sarah	Female	Black	172	45	1996-10-19	Half-yearly	831
13	Benjamin	Male	Brown	170	50	1997-06-06	Monthly	881
14	Sophia	Female	Black	155	69	2000-08-07	Half-yearly	950
15	James	Male	White	187	118	1995-12-12	Half-yearly	1068
16	Daniel	Male	Brown	160	64	2001-07-10	Half-yearly	1132
17	Joseph	Male	Black	180	132	1999-08-27	Yearly	1264
18	Emily	Female	White	175	104	1992-02-29	Half-yearly	1368
19	David	Male	Brown	193	119	1996-11-24	Monthly	1487

pandas also has methods for other cumulative statistics, such as

cumulative maximum, cummax() and
cumulative minimum, cummin()

These all return an entire column of a DataFrame, rather than a single number. Compute the cumulative maximum of Weight(kg), and add it as a column called cummax_Weight(kg).

subscribers['cummax_Weight(kg)'] = subscribers['Weight(kg)'].cummax()
display(subscribers)

	Name	Gender	Skin Color	Height(cm)	Weight(kg)	Date of Birth	Subscription Type	cumsum_Weight(kg)	cummax_Weight(kg)
0	William	Male	White	152	64	1993-09-16	Monthly	64	64
1	Jamie	Male	Black	162	91	1999-11-05	Yearly	155	91
2	Jamie	Female	Brown	152	47	2003-05-14	Yearly	202	91
3	Ethan	Transgender	White	160	41	1998-06-19	Monthly	243	91
4	Samantha	Female	Brown	160	41	1991-10-04	Half-yearly	284	91
5	Ashley	Female	White	167	50	2000-04-25	Monthly	334	91
6	Amelia	Female	White	173	96	1992-12-31	Half-yearly	430	96
7	Emma	Female	White	157	60	1991-03-22	Half-yearly	490	96
8	Isabella	Female	Brown	166	74	1999-09-13	Monthly	564	96
9	Madison	Female	Black	160	113	1991-06-13	Yearly	677	113
10	Alexander	Male	Black	147	45	2001-12-29	Half-yearly	722	113
11	Michael	Male	Black	175	64	1993-07-29	Yearly	786	113
12	Sarah	Female	Black	172	45	1996-10-19	Half-yearly	831	113
13	Benjamin	Male	Brown	170	50	1997-06-06	Monthly	881	113
14	Sophia	Female	Black	155	69	2000-08-07	Half-yearly	950	113
15	James	Male	White	187	118	1995-12-12	Half-yearly	1068	118
16	Daniel	Male	Brown	160	64	2001-07-10	Half-yearly	1132	118
17	Joseph	Male	Black	180	132	1999-08-27	Yearly	1264	132
18	Emily	Female	White	175	104	1992-02-29	Half-yearly	1368	132
19	David	Male	Brown	193	119	1996-11-24	Monthly	1487	132

Print only the Weight(kg), cumsum_Weight(kg) and cummax_Weight(kg) columns.

# See the columns we calculated
print(subscribers[["Weight(kg)", "cumsum_Weight(kg)", "cummax_Weight(kg)"]])

output:
    Weight(kg)  cumsum_Weight(kg)  cummax_Weight(kg)
0           64                 64                 64
1           91                155                 91
2           47                202                 91
3           41                243                 91
4           41                284                 91
5           50                334                 91
6           96                430                 96
7           60                490                 96
8           74                564                 96
9          113                677                113
10          45                722                113
11          64                786                113
12          45                831                113
13          50                881                113
14          69                950                113
15         118               1068                118
16          64               1132                118
17         132               1264                132
18         104               1368                132
19         119               1487                132

Awesome aggregation! Not all functions that calculate on columns return a single number. Some, like the cumulative statistic functions, return a whole column.
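
The cumulative minimum, cummin(), was mentioned above but not shown. Here is a brief sketch that puts the three cumulative statistics side by side for six weights copied from the table; the running DataFrame is an illustrative name.

```python
import pandas as pd

# Six weights copied from the first rows of the subscribers table
weights = pd.Series([64, 91, 47, 41, 41, 50])

running = pd.DataFrame({
    'weight': weights,
    'cumsum': weights.cumsum(),  # running total
    'cummax': weights.cummax(),  # largest value seen so far
    'cummin': weights.cummin(),  # smallest value seen so far
})
print(running)
```

The cummax column can only stay flat or rise, and the cummin column can only stay flat or fall, as we move down the rows.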

Counting

So far, in this article, we've learned how to summarize numeric variables. Next, we'll learn how to summarize categorical data using counting.

Avoid Double Counting

Counting subscribers by hand is impractical when they are moving around the gym or when our dataset is big. It's hard to keep track of who we have and haven't counted! Let's count how many subscribers we have.

Dropping Duplicates

Removing duplicates is an essential skill for getting accurate counts because we don't want to count the same thing multiple times. Let's ask pandas to drop rows with duplicate names from the dataset so that we don't count the same name twice. We can do this using the drop_duplicates() method, which returns a new DataFrame and leaves the original unchanged. It takes an argument, subset, which is the column we want pandas to use when looking for duplicates. In this case, we want all the unique names so that we can find out how many subscribers there are.

subscribers.drop_duplicates(subset='Name')

	Name	Gender	Skin Color	Height(cm)	Weight(kg)	Date of Birth	Subscription Type	cumsum_Weight(kg)	cummax_Weight(kg)
0	William	Male	White	152	64	1993-09-16	Monthly	64	64
1	Jamie	Male	Black	162	91	1999-11-05	Yearly	155	91
3	Ethan	Transgender	White	160	41	1998-06-19	Monthly	243	91
4	Samantha	Female	Brown	160	41	1991-10-04	Half-yearly	284	91
5	Ashley	Female	White	167	50	2000-04-25	Monthly	334	91
6	Amelia	Female	White	173	96	1992-12-31	Half-yearly	430	96
7	Emma	Female	White	157	60	1991-03-22	Half-yearly	490	96
8	Isabella	Female	Brown	166	74	1999-09-13	Monthly	564	96
9	Madison	Female	Black	160	113	1991-06-13	Yearly	677	113
10	Alexander	Male	Black	147	45	2001-12-29	Half-yearly	722	113
11	Michael	Male	Black	175	64	1993-07-29	Yearly	786	113
12	Sarah	Female	Black	172	45	1996-10-19	Half-yearly	831	113
13	Benjamin	Male	Brown	170	50	1997-06-06	Monthly	881	113
14	Sophia	Female	Black	155	69	2000-08-07	Half-yearly	950	113
15	James	Male	White	187	118	1995-12-12	Half-yearly	1068	118
16	Daniel	Male	Brown	160	64	2001-07-10	Half-yearly	1132	118
17	Joseph	Male	Black	180	132	1999-08-27	Yearly	1264	132
18	Emily	Female	White	175	104	1992-02-29	Half-yearly	1368	132
19	David	Male	Brown	193	119	1996-11-24	Monthly	1487	132

Now we have a list in which each name appears once. But where did Jamie (Female) go? If we go back to the top of the article, we will notice that our original dataset has both Jamie (Male) and Jamie (Female). drop_duplicates() kept the first Jamie and dropped Jamie (Female) because we only asked it to compare the Name column, not the Gender column.

Dropping Duplicates With Multiple Column Pairs

Since Jamie (Male) and Jamie (Female) have the same name but different genders, we should treat a row as a duplicate only when both Name and Gender match. To base our duplicate dropping on multiple columns, we can pass a list of column names to the subset argument, in this case, Name and Gender.

unique_persons = subscribers.drop_duplicates(subset=['Name', 'Gender'])

Print the number of persons by counting rows.

print('Number of unique subscribers:', unique_persons.shape[0])

output:
Number of unique subscribers: 20

Now both Jamie(Male) and Jamie(Female) have been included and the total number of unique subscribers is 20 as we expected.
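
Before dropping anything, it can help to look at which rows pandas considers duplicates. The sketch below uses the duplicated() method with keep=False on four rows copied from the table; the sample DataFrame and variable names are illustrative.

```python
import pandas as pd

# Four rows copied from the subscribers table
sample = pd.DataFrame({
    'Name': ['William', 'Jamie', 'Jamie', 'Ethan'],
    'Gender': ['Male', 'Male', 'Female', 'Transgender'],
})

# keep=False flags every row whose Name appears more than once
same_name = sample[sample.duplicated(subset='Name', keep=False)]
print(same_name)

# Compare the row counts for the two choices of subset
by_name = sample.drop_duplicates(subset='Name')
by_name_gender = sample.drop_duplicates(subset=['Name', 'Gender'])
print(len(by_name), len(by_name_gender))
```

Inspecting the flagged rows first makes it easier to decide which columns belong in subset.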

Counting Categorical Variables

Counting is a great way to get an overview of our data and to spot curiosities that we might not notice. Here, we will count the persons of each subscription type. We'll subset the Subscription Type column and use the value_counts() method.

subscribers['Subscription Type'].value_counts()

output:
Half-yearly    9
Monthly        6
Yearly         5
Name: Subscription Type, dtype: int64

Count Proportions

The normalize argument can be used to turn the counts into proportions of the total.

subscribers['Subscription Type'].value_counts(normalize=True)

output:
Half-yearly    0.45
Monthly        0.30
Yearly         0.25
Name: Subscription Type, dtype: float64

It looks like the Half-yearly subscription type is the most popular one, chosen by 45% of all subscribers.
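
If we want counts and proportions in the same view, one option is to combine both results into a small table. This is only a sketch on six subscription types copied from the first rows of the table; the names types and overview are illustrative.

```python
import pandas as pd

# Subscription types copied from the first six rows of the table
types = pd.Series(
    ['Monthly', 'Yearly', 'Yearly', 'Monthly', 'Half-yearly', 'Monthly'],
    name='Subscription Type',
)

counts = types.value_counts()
shares = types.value_counts(normalize=True)

# Both Series share the same index, so they align row by row
overview = pd.DataFrame({'count': counts, 'share': shares})
print(overview)
```

The share column always sums to 1, which is a quick sanity check on the counts.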

Grouped Summary Statistics

So far, we've been calculating summary statistics for all rows of a dataset. While summaries of entire columns are useful, we can gain further insights by comparing summaries of individual groups.

For example, is the average weight of one subscription group larger than the other groups? Which gender is more likely to subscribe to which subscription type? We can answer these questions with what we've learned so far. We can subset the persons into groups based on their subscription types, and take the mean of each.

print(subscribers[subscribers['Subscription Type']=='Yearly']['Weight(kg)'].mean())
print(subscribers[subscribers['Subscription Type']=='Half-yearly']['Weight(kg)'].mean())
print(subscribers[subscribers['Subscription Type']=='Monthly']['Weight(kg)'].mean())

output:
89.4
71.33333333333333
66.33333333333333

But that's a lot of work, and the duplicated code can easily introduce copy-and-paste bugs.

Grouped Summaries Using `groupby()`

That's where the groupby() method comes in. We can group by the Subscription Type variable, select the weight column and take the mean. This will give us the mean weight for each subscriber group. This is just one line of code compared to the three we had to write before to get the same results.

subscribers.groupby('Subscription Type')['Weight(kg)'].mean()

output:
Subscription Type
Half-yearly    71.333333
Monthly        66.333333
Yearly         89.400000
Name: Weight(kg), dtype: float64

Great grouping! We were able to do the same calculation as in the previous section while writing much less code.

Multiple Grouped Summary Statistics

Earlier, we saw that the .agg() method is useful to compute multiple statistics on multiple variables. It also works with grouped data. Here, we pass a list of functions into agg() after grouping by Subscription Type. The following code gives us the minimum and maximum of the persons' weights in different subscription groups.

# Import numpy with the alias np
import numpy as np
print('Using numpy version:', np.__version__)

output:
Using numpy version: 1.21.6

print('Calculating on Weight(kg) column...')
print(subscribers.groupby('Subscription Type')['Weight(kg)'].agg([np.min, np.max]))

output:
Calculating on Weight(kg) column...
                   amin  amax
Subscription Type            
Half-yearly          41   118
Monthly              41   119
Yearly               47   132
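
The amin and amax column names above come from the NumPy functions. As a hedged alternative, the statistics can be passed by name using named aggregation, which also lets us choose the output column names. Some newer pandas releases warn when NumPy functions such as np.min are passed to agg(), so the string form may be the safer habit. The sample rows are copied from the table, and the labels lightest and heaviest are illustrative.

```python
import pandas as pd

# Six rows copied from the subscribers table
sample = pd.DataFrame({
    'Subscription Type': ['Monthly', 'Yearly', 'Yearly',
                          'Monthly', 'Half-yearly', 'Monthly'],
    'Weight(kg)': [64, 91, 47, 41, 41, 50],
})

# keyword=statistic pairs name the output columns directly
weight_range = sample.groupby('Subscription Type')['Weight(kg)'].agg(
    lightest='min',
    heaviest='max',
)
print(weight_range)
```

The result has one row per subscription type and one column per keyword.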

Grouping By Multiple Variables

We can also group by multiple columns and calculate summary statistics. Here, we group by Gender and Subscription Type, select the weight column and take the mean. This gives us the mean weight in each group.

print(subscribers.groupby(['Gender', 'Subscription Type'])['Weight(kg)'].mean())

output:
Gender       Subscription Type
Female       Half-yearly          69.166667
             Monthly              62.000000
             Yearly               80.000000
Male         Half-yearly          75.666667
             Monthly              77.666667
             Yearly               95.666667
Transgender  Monthly              41.000000
Name: Weight(kg), dtype: float64

Summary Statistics on Multiple Columns for Multiple-Column Groups

We can also group by multiple columns and summarize several columns at once. Here, we take the mean of both the weight and height columns for each group.

print('Calculating mean...')
print(subscribers.groupby(['Gender', 'Subscription Type'])[['Weight(kg)', 'Height(cm)']].mean())

output:
Calculating mean...
                               Weight(kg)  Height(cm)
Gender      Subscription Type                        
Female      Half-yearly         69.166667  165.333333
            Monthly             62.000000  166.500000
            Yearly              80.000000  156.000000
Male        Half-yearly         75.666667  164.666667
            Monthly             77.666667  171.666667
            Yearly              95.666667  172.333333
Transgender Monthly             41.000000  160.000000
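
To compute several statistics for several columns within these groups, a list of statistic names can be passed to agg() after selecting the columns. The following is a minimal sketch on six rows copied from the table; the sample DataFrame and the summary variable are illustrative.

```python
import pandas as pd

# Six rows copied from the subscribers table
sample = pd.DataFrame({
    'Gender': ['Female', 'Female', 'Female', 'Female', 'Male', 'Male'],
    'Subscription Type': ['Yearly', 'Monthly', 'Monthly',
                          'Yearly', 'Monthly', 'Yearly'],
    'Height(cm)': [152, 167, 166, 160, 152, 162],
    'Weight(kg)': [47, 50, 74, 113, 64, 91],
})

summary = (
    sample.groupby(['Gender', 'Subscription Type'])[['Weight(kg)', 'Height(cm)']]
    .agg(['mean', 'max'])
)
print(summary)

# Columns are (column, statistic) pairs; rows are (gender, type) pairs
print(summary.loc[('Female', 'Monthly'), ('Weight(kg)', 'mean')])
```

The result has two levels of column labels, so a single value is selected with one tuple for the row and one tuple for the column.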

Pivot tables

Pivot tables are another way of aggregating data and calculating grouped summary statistics. Previously, we learned that we can group the persons by gender and calculate the mean of their weights using the groupby() method.

print(subscribers.groupby('Gender')['Weight(kg)'].mean())

output:
Gender
Female         69.9
Male           83.0
Transgender    41.0
Name: Weight(kg), dtype: float64

Let's see how to create pivot tables in pandas using the .pivot_table() method, an alternative to .groupby(). Here, we'll use .pivot_table() to replicate the calculation we performed above. The values argument is the column that we want to summarize, and the index argument is the column that we want to group by. By default, pivot_table takes the mean value for each group.

print('Calculating mean...')
print(subscribers.pivot_table(index='Gender', values='Weight(kg)'))

output:
Calculating mean...
             Weight(kg)
Gender                 
Female             69.9
Male               83.0
Transgender        41.0

Different Statistics Using `aggfunc` Argument

If we want a different summary statistic, we can use the aggfunc argument and pass it a function. Here, we find the maximum weight in each gender group using NumPy's max function.

print('Calculating max...')
print(subscribers.pivot_table(index='Gender',values='Weight(kg)', aggfunc=np.max))

output:
Calculating max...
             Weight(kg)
Gender                 
Female              113
Male                132
Transgender          41

Multiple Statistics Using `aggfunc` Argument

To get multiple summary statistics at a time, we can pass a list of functions to the aggfunc argument. Here, we find the maximum and minimum weights from each gender group.

print(subscribers.pivot_table(index='Gender', values='Weight(kg)', aggfunc=[np.min, np.max]))

output:
                  amin       amax
            Weight(kg) Weight(kg)
Gender                           
Female              41        113
Male                45        132
Transgender         41         41

Pivot On Two Variables

Next, we compute the mean weight grouped by two variables, gender and subscription type, this time using the pivot_table method. To group by two variables, we pass the second variable name to the columns argument. While the result looks a little different from the groupby() output we had before, it contains the same numbers. There are NaNs, or missing values, because our dataset has no transgender subscribers with a Half-yearly or Yearly subscription; the only one subscribes monthly.

print(subscribers.pivot_table(index='Gender', columns='Subscription Type', values='Weight(kg)'))

output:
Subscription Type  Half-yearly    Monthly     Yearly
Gender                                              
Female               69.166667  62.000000  80.000000
Male                 75.666667  77.666667  95.666667
Transgender                NaN  41.000000        NaN
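
To see that the pivot table really holds the same numbers as the grouped result, the groupby() output can be reshaped with unstack() and compared. This is a sketch on seven rows copied from the table; the sample DataFrame and variable names are illustrative.

```python
import pandas as pd

# Seven rows copied from the subscribers table
sample = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Transgender',
               'Female', 'Female', 'Male'],
    'Subscription Type': ['Monthly', 'Yearly', 'Yearly', 'Monthly',
                          'Half-yearly', 'Monthly', 'Monthly'],
    'Weight(kg)': [64, 91, 47, 41, 41, 50, 119],
})

pivot = sample.pivot_table(index='Gender', columns='Subscription Type',
                           values='Weight(kg)')

# unstack() moves the inner index level into the columns
grouped = (sample.groupby(['Gender', 'Subscription Type'])['Weight(kg)']
           .mean()
           .unstack())

print(pivot)
print(grouped)

# NaN never equals NaN, so fill the gaps before comparing cell by cell
same_numbers = (pivot.fillna(0) == grouped.fillna(0)).all().all()
print(same_numbers)
```

Combinations that do not occur in the data, such as a transgender subscriber with a Yearly subscription, appear as NaN in both tables.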

Filling Missing Values in Pivot Tables

Instead of leaving lots of missing values in our pivot table, we can use the fill_value argument to replace them with a real value, which is known as imputation. What to replace missing values with is a big topic that we will cover in detail in another article, but the simplest thing to do is to substitute a dummy value. Here, all of the NaNs get filled in with zeros.

print(subscribers.pivot_table(index='Gender', columns='Subscription Type', values='Weight(kg)', fill_value = 0))

output:
Subscription Type  Half-yearly    Monthly     Yearly
Gender                                              
Female               69.166667  62.000000  80.000000
Male                 75.666667  77.666667  95.666667
Transgender           0.000000  41.000000   0.000000

Pivot Tables With Margins

The margins argument is a shortcut for when we pivot by two variables but also want to summarize by each of those variables separately. If we set margins to True, the last row and last column of the pivot table contain the mean of all the values in the column or row, not including the missing values that were filled in with zeros.

print(subscribers.pivot_table(index='Gender', columns='Subscription Type', values='Weight(kg)', fill_value = 0, margins=True))

output:
Subscription Type  Half-yearly    Monthly     Yearly    All
Gender                                                     
Female               69.166667  62.000000  80.000000  69.90
Male                 75.666667  77.666667  95.666667  83.00
Transgender           0.000000  41.000000   0.000000  41.00
All                  71.333333  66.333333  89.400000  74.35

For example, in the last row of the Monthly column, we can see that the mean weight of all Monthly subscribers is 66.33kg. In the last column of the Male row, the mean weight of all male persons is 83kg. The value in the bottom right, in the last row and last column, is the mean weight of all the persons in the dataset.
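
The claim that the filled-in zeros do not affect the margins can be checked on a small sample. The sketch below uses seven rows copied from the table; the sample DataFrame and the table variable are illustrative.

```python
import pandas as pd

# Seven rows copied from the subscribers table
sample = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Transgender',
               'Female', 'Female', 'Male'],
    'Subscription Type': ['Monthly', 'Yearly', 'Yearly', 'Monthly',
                          'Half-yearly', 'Monthly', 'Monthly'],
    'Weight(kg)': [64, 91, 47, 41, 41, 50, 119],
})

table = sample.pivot_table(index='Gender', columns='Subscription Type',
                           values='Weight(kg)', fill_value=0, margins=True)
print(table)

# The Transgender row has two filled-in zeros, yet its margin is the
# mean of the single real observation rather than an average with 0s
print(table.loc['Transgender', 'All'])

# The bottom-right cell matches the overall mean of the column
print(table.loc['All', 'All'], sample['Weight(kg)'].mean())
```

The margins are computed from the underlying rows, which is why the zeros placed in empty cells leave them unchanged.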

Using margins=True allows us to see a summary statistic at multiple levels of the dataset:

the entire dataset,
grouped by one variable, and
grouped by two variables

We have now practiced pivot table skills that can help us compute summaries at multiple grouped levels in one line of code.

Conclusion

In this article, we learned about data aggregation, calculated summary statistics on a dataset, and worked through grouped summary statistics and pivot tables.