Mastering Data Aggregation, Summary Statistics on Groups and Pivot Table

Mastering Data Aggregation, Summary Statistics on Groups and Pivot Table

·

20 min read

Play this article

In this article, we’ll learn data aggregation, calculate summary statistics on Dataset, and master grouped summary statistics and pivot tables.

What is Data Aggregation

Data aggregation is the process of collecting, combining, and summarizing data from multiple sources into a single, more meaningful representation. This helps to identify patterns, trends, and relationships in large and complex datasets and can be used for data analysis and reporting purposes. Data aggregation includes calculating the summary statistics of data, grouping data based on categories, and creating pivot tables.

Summary Statistics

Starting with summary statistics, as follows from their name, they are numbers that summarize and tell us about our dataset. First, let's import pandas and the gym_subscribers.csv file which contains the data we will use in this article. This gym_subscribers.csv file is available in PyProDev's GitHub repository.

import pandas as pd
print('Using pandas verson:', pd.__version__)
output:
Using pandas verson: 1.3.5
#import data from csv file
subscribers = pd.read_csv('gym_subscribers.csv')
print(type(subscribers))
print('size of subscribers dataset:', subscribers.shape)
output:
<class 'pandas.core.frame.DataFrame'>
size of subscribers dataset: (20, 7)
  • Explore our DataFrame by displaying the rows of the subscribers DataFrame.
# display the head of the subscribers DataFrame
display(subscribers)
NameGenderSkin ColorHeight(cm)Weight(kg)Date of BirthSubscription Type
0WilliamMaleWhite152641993-09-16Monthly
1JamieMaleBlack162911999-11-05Yearly
2JamieFemaleBrown152472003-05-14Yearly
3EthanTransgenderWhite160411998-06-19Monthly
4SamanthaFemaleBrown160411991-10-04Half-yearly
5AshleyFemaleWhite167502000-04-25Monthly
6AmeliaFemaleWhite173961992-12-31Half-yearly
7EmmaFemaleWhite157601991-03-22Half-yearly
8IsabellaFemaleBrown166741999-09-13Monthly
9MadisonFemaleBlack1601131991-06-13Yearly
10AlexanderMaleBlack147452001-12-29Half-yearly
11MichaelMaleBlack175641993-07-29Yearly
12SarahFemaleBlack172451996-10-19Half-yearly
13BenjaminMaleBrown170501997-06-06Monthly
14SophiaFemaleBlack155692000-08-07Half-yearly
15JamesMaleWhite1871181995-12-12Half-yearly
16DanielMaleBrown160642001-07-10Half-yearly
17JosephMaleBlack1801321999-08-27Yearly
18EmilyFemaleWhite1751041992-02-29Half-yearly
19DavidMaleBrown1931191996-11-24Monthly

Let's see the information of columns in subscribers using info().

subscribers.info()
output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Name               20 non-null     object
 1   Gender             20 non-null     object
 2   Skin Color         20 non-null     object
 3   Height(cm)         20 non-null     int64 
 4   Weight(kg)         20 non-null     int64 
 5   Date of Birth      20 non-null     object
 6   Subscription Type  20 non-null     object
dtypes: int64(2), object(5)
memory usage: 1.2+ KB

Summarizing Numerical Data

One of the most common summary statistics for numeric data is the mean, which is one way of telling us where the center of our data is. We can calculate the mean of a column by selecting the column with square brackets and calling .mean(). There are lots of other summary statistics that we can compute on columns, like

  • .median() which returns the median of the column,

  • .min() which returns the minimum of the column,

  • .max() which returns the maximum of the column,

  • .var() which returns the variance of the column and

  • .std() which returns the standard deviation of the column.

Calculating summary statistics allows us to get a better sense of our data, even if there's a lot of it.

print('Calculating on Weight(kg) column...')
print('Mean:', subscribers['Weight(kg)'].mean())
print('Median:', subscribers['Weight(kg)'].median())
print('Min:', subscribers['Weight(kg)'].min())
print('Max:', subscribers['Weight(kg)'].max())
print('Variance:', subscribers['Weight(kg)'].var())
print('Standard deviation:', subscribers['Weight(kg)'].std())
output:
Calculating on Weight(kg) column...
Mean: 74.35
Median: 64.0
Min: 41
Max: 132
Variance: 883.0815789473684
Standard deviation: 29.716688559584973

Summarizing Dates

Summary statistics can also be calculated on date columns that have values with the standard date format. Some summary statistics like mean don't make sense on dates, but others are super helpful when we can find the oldest person's date of birth by taking the minimum of the date of birth column.

print(subscribers['Date of Birth'].min())
output:
1991-03-22

Similarly, we can take the maximum and find the youngest person in out dataset.

print(subscribers['Date of Birth'].max())
ouput:
2003-05-14

Super summarizing! Taking the minimum and maximum of a column of dates is handy for figuring out what birthday range of our gym subscribers.

Aggregate Data Using agg() Method

While pandas and NumPy have tons of functions, sometimes, we may need a different function to summarize our data. The aggregate, agg(), method allows us to compute our custom summary statistics on a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making our aggregations super-efficient. Here, we create a function called get_second_largest that computes the largest value in a DataFrame column. Now we will aggregate the weight column using .agg(), passing in the name of our function, get_second_largest. It gives us the second-largest weight of the persons in our dataset.

def get_second_largest(data):
    sorted_list = data.sort_values(ascending=False).tolist()
    return sorted_list[1] # 2nd largest is at the index 1
print(subscribers['Weight(kg)'].agg(get_second_largest))
output:
119

Aggregating On Multiple Columns

agg() can also be used on more than one column. By selecting the weight and height columns before calling agg(), we get the thirty percentile for both columns.

print('Second largest values:')
print(subscribers[['Weight(kg)', 'Height(cm)']].agg(get_second_largest))
output:
Second largest values:
Weight(kg)    119
Height(cm)    187
dtype: int64

Multiple Summary Statistics

We can also use agg() to get multiple summary statistics at once. Here's another function called get_third_largest that computes the third largest value from the given data. We can pass a list of functions into agg(), in this case, get_second_largest and get_third_largest, which will return the second and third-largest values from the given data.

def get_third_largest(data):
    sorted_list = data.sort_values(ascending=False).tolist()
    return sorted_list[2] # 3rd largest is at the index 2
print(subscribers['Weight(kg)'].agg([get_second_largest, get_third_largest]))
output:
get_second_largest    119
get_third_largest     118
Name: Weight(kg), dtype: int64

Multiple Summary Statistics On Multiple Columns

The utility of agg() shines when we do multiple summary statistics on multiple columns at the same time. Let's find the second and third largest values from the weight and height columns.

print(subscribers[['Weight(kg)', 'Height(cm)']].agg([get_second_largest, get_third_largest]))
                    Weight(kg)  Height(cm)
get_second_largest         119         187
get_third_largest          118         180

Excellent efficiency! The .agg() method makes it easy to compute multiple statistics on multiple columns, all in just one line of code.

Cumulative Statistics

Cumulative statistics can also help track summary statistics over time. pandas also has methods for computing cumulative statistics, for example, the cumulative sum, cumsum(). Calling cumsum() on a column returns not just one number, but a number for each row of the DataFrame. The first number returned, or the number at the zero index is the first person's weight. The next number is the sum of the first and second persons' weights. The third number is the sum of the first, second, and third persons' weights, and so on. The last number is the sum of all the persons' weights. Compute the cumulative sum of Weight(kg) and add it as a new column of subscribers called cumsum_Weight(kg).

subscribers['cumsum_Weight(kg)'] = subscribers['Weight(kg)'].cumsum()
display(subscribers)
NameGenderSkin ColorHeight(cm)Weight(kg)Date of BirthSubscription Typecumsum_Weight(kg)
0WilliamMaleWhite152641993-09-16Monthly64
1JamieMaleBlack162911999-11-05Yearly155
2JamieFemaleBrown152472003-05-14Yearly202
3EthanTransgenderWhite160411998-06-19Monthly243
4SamanthaFemaleBrown160411991-10-04Half-yearly284
5AshleyFemaleWhite167502000-04-25Monthly334
6AmeliaFemaleWhite173961992-12-31Half-yearly430
7EmmaFemaleWhite157601991-03-22Half-yearly490
8IsabellaFemaleBrown166741999-09-13Monthly564
9MadisonFemaleBlack1601131991-06-13Yearly677
10AlexanderMaleBlack147452001-12-29Half-yearly722
11MichaelMaleBlack175641993-07-29Yearly786
12SarahFemaleBlack172451996-10-19Half-yearly831
13BenjaminMaleBrown170501997-06-06Monthly881
14SophiaFemaleBlack155692000-08-07Half-yearly950
15JamesMaleWhite1871181995-12-12Half-yearly1068
16DanielMaleBrown160642001-07-10Half-yearly1132
17JosephMaleBlack1801321999-08-27Yearly1264
18EmilyFemaleWhite1751041992-02-29Half-yearly1368
19DavidMaleBrown1931191996-11-24Monthly1487

pandas also has methods for other cumulative statistics, such as

  • cumulative maximum, cummax() and

  • cumulative minimum, cummin()

These all return an entire column of a DataFrame, rather than a single number. Compute the cumulative maximum of Weight(kg), and add it as a column called cummax_Weight(kg).

subscribers['cummax_Weight(kg)'] = subscribers['Weight(kg)'].cummax()
display(subscribers)
NameGenderSkin ColorHeight(cm)Weight(kg)Date of BirthSubscription Typecumsum_Weight(kg)cummax_Weight(kg)
0WilliamMaleWhite152641993-09-16Monthly6464
1JamieMaleBlack162911999-11-05Yearly15591
2JamieFemaleBrown152472003-05-14Yearly20291
3EthanTransgenderWhite160411998-06-19Monthly24391
4SamanthaFemaleBrown160411991-10-04Half-yearly28491
5AshleyFemaleWhite167502000-04-25Monthly33491
6AmeliaFemaleWhite173961992-12-31Half-yearly43096
7EmmaFemaleWhite157601991-03-22Half-yearly49096
8IsabellaFemaleBrown166741999-09-13Monthly56496
9MadisonFemaleBlack1601131991-06-13Yearly677113
10AlexanderMaleBlack147452001-12-29Half-yearly722113
11MichaelMaleBlack175641993-07-29Yearly786113
12SarahFemaleBlack172451996-10-19Half-yearly831113
13BenjaminMaleBrown170501997-06-06Monthly881113
14SophiaFemaleBlack155692000-08-07Half-yearly950113
15JamesMaleWhite1871181995-12-12Half-yearly1068118
16DanielMaleBrown160642001-07-10Half-yearly1132118
17JosephMaleBlack1801321999-08-27Yearly1264132
18EmilyFemaleWhite1751041992-02-29Half-yearly1368132
19DavidMaleBrown1931191996-11-24Monthly1487132

Print only the Weight(kg), cumsum_Weight(kg) and cummax_Weight(kg) columns.

# See the columns we calculated
print(subscribers[["Weight(kg)", "cumsum_Weight(kg)", "cummax_Weight(kg)"]])
output:
    Weight(kg)  cumsum_Weight(kg)  cummax_Weight(kg)
0           64                 64                 64
1           91                155                 91
2           47                202                 91
3           41                243                 91
4           41                284                 91
5           50                334                 91
6           96                430                 96
7           60                490                 96
8           74                564                 96
9          113                677                113
10          45                722                113
11          64                786                113
12          45                831                113
13          50                881                113
14          69                950                113
15         118               1068                118
16          64               1132                118
17         132               1264                132
18         104               1368                132
19         119               1487                132

Awesome aggregation! Not all functions that calculate on columns return a single number. Some, like the cumulative statistic functions, return a whole column.

Counting

So far, in this article, we've learned how to summarize numeric variables. Next, we'll learn how to summarize categorical data using counting.

Avoid Double Counting

Counting subscribers manually is impossible when they're going around the gym or when our dataset is big. It's hard to keep track of who we have and haven't counted! Let's count how many subscribers we have.

Dropping Duplicates

Removing duplicates is an essential skill to get accurate counts because we don't want to count the same thing multiple times. Let's ask pandas to drop rows with duplicate names from the dataset as we don't want to count the same name twice. We can do this using the drop_duplicates() method. It takes an argument, subset, which is the column we want pandas to find our duplicates based on. In this case, we want all the unique names to find how many subscribers are there.

subscribers.drop_duplicates(subset='Name')
NameGenderSkin ColorHeight(cm)Weight(kg)Date of BirthSubscription Typecumsum_Weight(kg)cummax_Weight(kg)
0WilliamMaleWhite152641993-09-16Monthly6464
1JamieMaleBlack162911999-11-05Yearly15591
3EthanTransgenderWhite160411998-06-19Monthly24391
4SamanthaFemaleBrown160411991-10-04Half-yearly28491
5AshleyFemaleWhite167502000-04-25Monthly33491
6AmeliaFemaleWhite173961992-12-31Half-yearly43096
7EmmaFemaleWhite157601991-03-22Half-yearly49096
8IsabellaFemaleBrown166741999-09-13Monthly56496
9MadisonFemaleBlack1601131991-06-13Yearly677113
10AlexanderMaleBlack147452001-12-29Half-yearly722113
11MichaelMaleBlack175641993-07-29Yearly786113
12SarahFemaleBlack172451996-10-19Half-yearly831113
13BenjaminMaleBrown170501997-06-06Monthly881113
14SophiaFemaleBlack155692000-08-07Half-yearly950113
15JamesMaleWhite1871181995-12-12Half-yearly1068118
16DanielMaleBrown160642001-07-10Half-yearly1132118
17JosephMaleBlack1801321999-08-27Yearly1264132
18EmilyFemaleWhite1751041992-02-29Half-yearly1368132
19DavidMaleBrown1931191996-11-24Monthly1487132

Now we have a list of persons where each one appears once. But where did Jamie(Female) go? If we go back to the top of the article, we will notice that our original dataset has Jamie(Male) and Jamie(Female). drop_duplicates() method dropped Jamie(Female) as we didn't mention to consider the gender column also when validating duplicates.

Dropping Duplicates With Multiple Column Pairs

Since Jamie(Male) and Jamie(Female) have the same name but different genders, we have to drop the rows with pairs of both Name and Gender. To base our duplicate dropping on multiple columns, we can pass a list of column names to the subset argument, in this case, Name and Gender.

unique_persons = subscribers.drop_duplicates(subset=['Name', 'Gender'])

Print the number of persons by counting rows.

print('Number of unique subscribers:', unique_persons.shape[0])
output:
Number of unique subscribers: 20

Now both Jamie(Male) and Jamie(Female) have been included and the total number of unique subscribers is 20 as we expected.

Counting Categorical Variables

Counting is a great way to get an overview of our data and to spot curiosities that we might not notice. Here, we will count the persons of each subscription type. We'll subset the Subscription Type column and use the value_counts() method.

subscribers['Subscription Type'].value_counts()
output:
Half-yearly    9
Monthly        6
Yearly         5
Name: Subscription Type, dtype: int64

Count Proportions

The normalize argument can be used to turn the counts into proportions of the total.

subscribers['Subscription Type'].value_counts(normalize=True)
output:
Half-yearly    0.45
Monthly        0.30
Yearly         0.25
Name: Subscription Type, dtype: float64

Looks like the Half-yearly subscription type is the most popular one and subscribed by 45% of the total.

Grouped Summary Statistics

So far, we've been calculating summary statistics for all rows of a dataset, but summary statistics can be useful to compare different groups. While computing summary statistics of entire columns may be useful, we can also gain many insights from summaries of individual groups.

For example, is the average weight of one subscription group larger than the other groups? Which gender is more likely to subscribe to which subscription type? We can answer these questions with what we've learned so far. We can subset the persons into groups based on their subscription types, and take the mean of each.

print(subscribers[subscribers['Subscription Type']=='Yearly']['Weight(kg)'].mean())
print(subscribers[subscribers['Subscription Type']=='Half-yearly']['Weight(kg)'].mean())
print(subscribers[subscribers['Subscription Type']=='Monthly']['Weight(kg)'].mean())
output:
89.4
71.33333333333333
66.33333333333333

But that's a lot of work, and the duplicated code can easily introduce copy-and-paste bugs.

Grouped Summaries Using groupby()

That's where the groupby() method comes in. We can group by the Subscription Type variable, select the weight column and take the mean. This will give us the mean weight for each subscriber group. This is just one line of code compared to the three we had to write before to get the same results.

subscribers.groupby('Subscription Type')['Weight(kg)'].mean()
output:
Subscription Type
Half-yearly    71.333333
Monthly        66.333333
Yearly         89.400000
Name: Weight(kg), dtype: float64

Great grouping! We were able to do the same calculation as in the previous one while writing much less code.

Multiple Grouped Summary Statistics

Earlier, we saw that the .agg() method is useful to compute multiple statistics on multiple variables. It also works with grouped data. Here, we pass a list of functions into agg() after grouping by Subscription Type. The following code gives us the minimum and maximum of the persons' weights in different subscription groups.

# Import numpy with the alias np
import numpy as np
print('Using numpy version:', np.__version__)
output:
Using numpy version: 1.21.6
print('Calculating on Weight(kg) column...')
print(subscribers.groupby('Subscription Type')['Weight(kg)'].agg([np.min, np.max]))
output:
Calculating on Weight(kg) column...
                   amin  amax
Subscription Type            
Half-yearly          41   118
Monthly              41   119
Yearly               47   132

Grouping By Multiple Variables

We can also group by multiple columns and calculate summary statistics. Here, we group by Gender and Subscription Type, select the weight column and take the mean. This gives us the mean weight in each group.

print(subscribers.groupby(['Gender', 'Subscription Type'])['Weight(kg)'].mean())
output:
Gender       Subscription Type
Female       Half-yearly          69.166667
             Monthly              62.000000
             Yearly               80.000000
Male         Half-yearly          75.666667
             Monthly              77.666667
             Yearly               95.666667
Transgender  Monthly              41.000000
Name: Weight(kg), dtype: float64

Multiple Summary Statistics on Multiple-Column Groups

We can also group by multiple columns and aggregate multiple statistics on these groups.

print('Calculating mean...')
print(subscribers.groupby(['Gender', 'Subscription Type'])[['Weight(kg)', 'Height(cm)']].mean())
output:
Calculating mean...
                               Weight(kg)  Height(cm)
Gender      Subscription Type                        
Female      Half-yearly         69.166667  165.333333
            Monthly             62.000000  166.500000
            Yearly              80.000000  156.000000
Male        Half-yearly         75.666667  164.666667
            Monthly             77.666667  171.666667
            Yearly              95.666667  172.333333
Transgender Monthly             41.000000  160.000000

Pivot tables

Pivot tables are another way of aggregating data, and calculating grouped summary statistics. Previously, we learned we can group the persons by gender and calculated their weights' mean using groupby() method.

print(subscribers.groupby('Gender')['Weight(kg)'].mean())
output:
Gender
Female         69.9
Male           83.0
Transgender    41.0
Name: Weight(kg), dtype: float64

Let's see how to create pivot tables in pandas using .pivot_table() method, an alternative to .groupby(). Here, we'll use .pivot_table() to replicate the calculations we performed above. The values argument is the column that we want to summarize and the index column is the column that we want to group by. By default, pivot_table takes the mean value for each group.

print('Calculating mean...')
print(subscribers.pivot_table(index='Gender', values='Weight(kg)'))
output:
Calculating mean...
             Weight(kg)
Gender                 
Female             69.9
Male               83.0
Transgender        41.0

Different Statistics Using aggfunc Argument

If we want a different summary statistic, we can use the aggfunc argument and pass it a function. Here, we find the maximum weight in each gender group using NumPy's max function.

print('Calculating max...')
print(subscribers.pivot_table(index='Gender',values='Weight(kg)', aggfunc=np.max))
output:
Calculating max...
             Weight(kg)
Gender                 
Female              113
Male                132
Transgender          41

Multiple Statistics Using aggfunc Argument

To get multiple summary statistics at a time, we can pass a list of functions to the aggfunc argument. Here, we find the maximum and minimum weights from each gender group.

print(subscribers.pivot_table(index='Gender', values='Weight(kg)', aggfunc=[np.min, np.max]))
output:
                  amin       amax
            Weight(kg) Weight(kg)
Gender                           
Female              41        113
Male                45        132
Transgender         41         41

Pivot On Two Variables

We can compute the maximum weight grouped by two variables, gender and subscription type. But we will do this using the pivot_table method. To group by two variables, we can pass a second variable name into the columns argument. While the result looks a little different than what we had before, it contains the same numbers. There are NaNs, or missing values, because there is only one transgender who subscribes monthly in our dataset.

print(subscribers.pivot_table(index='Gender', columns='Subscription Type', values='Weight(kg)'))
output:
Subscription Type  Half-yearly    Monthly     Yearly
Gender                                              
Female               69.166667  62.000000  80.000000
Male                 75.666667  77.666667  95.666667
Transgender                NaN  41.000000        NaN

Filling Missing Values in Pivot Tables

Instead of having lots of missing values in our pivot table, fill_value replaces missing values with a real value which is known as imputation. What to replace missing values with is a big topic and we will write its detail in another article, but the simplest thing to do is to substitute a dummy value. Here, all of the NaNs get filled in with zeros.

print(subscribers.pivot_table(index='Gender', columns='Subscription Type', values='Weight(kg)', fill_value = 0))
output:
Subscription Type  Half-yearly    Monthly     Yearly
Gender                                              
Female               69.166667  62.000000  80.000000
Male                 75.666667  77.666667  95.666667
Transgender           0.000000  41.000000   0.000000

Pivot Tables With Margins

margins is a shortcut for when we pivoted by two variables but also wanted to pivot by each of those variables separately. If we set the margins argument to True, the last row and last column of the pivot table contain the mean of all the values in the column or row, not including the missing values that were filled in with zeros.

print(subscribers.pivot_table(index='Gender', columns='Subscription Type', values='Weight(kg)', fill_value = 0, margins=True))
output:
Subscription Type  Half-yearly    Monthly     Yearly    All
Gender                                                     
Female               69.166667  62.000000  80.000000  69.90
Male                 75.666667  77.666667  95.666667  83.00
Transgender           0.000000  41.000000   0.000000  41.00
All                  71.333333  66.333333  89.400000  74.35

For example, in the last row of the Monthly column, we can see that the mean weight of all Monthly subscribers is 66.33kg. In the last column of the Male row, the mean weight of all male persons is 83kg. The value in the bottom right, in the last row and last column, is the mean weight of all the persons in the dataset.

Using margins=True allows us to see a summary statistic for multiple levels of the dataset as in

  • the entire dataset,

  • grouped by one variable, and

  • grouped by two variables

We now mastered pivot table skills that can help us compute summaries at multiple grouped levels in one line of code.

Conclusion

In this article, we learned data aggregation, and summary statistics on Dataset, and mastered grouped summary statistics and pivot tables.