Mastering Data Aggregation, Summary Statistics on Groups and Pivot Table
Table of contents
In this article, we’ll learn data aggregation, calculate summary statistics on Dataset, and master grouped summary statistics and pivot tables.
What is Data Aggregation
Data aggregation is the process of collecting, combining, and summarizing data from multiple sources into a single, more meaningful representation. This helps to identify patterns, trends, and relationships in large and complex datasets and can be used for data analysis and reporting purposes. Data aggregation includes calculating the summary statistics of data, grouping data based on categories, and creating pivot tables.
Summary Statistics
Starting with summary statistics, as follows from their name, they are numbers that summarize and tell us about our dataset. First, let's import pandas
and the gym_subscribers.csv
file which contains the data we will use in this article. This gym_subscribers.csv
file is available in PyProDev's GitHub repository.
import pandas as pd
print('Using pandas verson:', pd.__version__)
output:
Using pandas verson: 1.3.5
#import data from csv file
subscribers = pd.read_csv('gym_subscribers.csv')
print(type(subscribers))
print('size of subscribers dataset:', subscribers.shape)
output:
<class 'pandas.core.frame.DataFrame'>
size of subscribers dataset: (20, 7)
- Explore our DataFrame by displaying the rows of the
subscribers
DataFrame.
# display the head of the subscribers DataFrame
display(subscribers)
Name | Gender | Skin Color | Height(cm) | Weight(kg) | Date of Birth | Subscription Type | |
0 | William | Male | White | 152 | 64 | 1993-09-16 | Monthly |
1 | Jamie | Male | Black | 162 | 91 | 1999-11-05 | Yearly |
2 | Jamie | Female | Brown | 152 | 47 | 2003-05-14 | Yearly |
3 | Ethan | Transgender | White | 160 | 41 | 1998-06-19 | Monthly |
4 | Samantha | Female | Brown | 160 | 41 | 1991-10-04 | Half-yearly |
5 | Ashley | Female | White | 167 | 50 | 2000-04-25 | Monthly |
6 | Amelia | Female | White | 173 | 96 | 1992-12-31 | Half-yearly |
7 | Emma | Female | White | 157 | 60 | 1991-03-22 | Half-yearly |
8 | Isabella | Female | Brown | 166 | 74 | 1999-09-13 | Monthly |
9 | Madison | Female | Black | 160 | 113 | 1991-06-13 | Yearly |
10 | Alexander | Male | Black | 147 | 45 | 2001-12-29 | Half-yearly |
11 | Michael | Male | Black | 175 | 64 | 1993-07-29 | Yearly |
12 | Sarah | Female | Black | 172 | 45 | 1996-10-19 | Half-yearly |
13 | Benjamin | Male | Brown | 170 | 50 | 1997-06-06 | Monthly |
14 | Sophia | Female | Black | 155 | 69 | 2000-08-07 | Half-yearly |
15 | James | Male | White | 187 | 118 | 1995-12-12 | Half-yearly |
16 | Daniel | Male | Brown | 160 | 64 | 2001-07-10 | Half-yearly |
17 | Joseph | Male | Black | 180 | 132 | 1999-08-27 | Yearly |
18 | Emily | Female | White | 175 | 104 | 1992-02-29 | Half-yearly |
19 | David | Male | Brown | 193 | 119 | 1996-11-24 | Monthly |
Let's see the information of columns in subscribers
using info()
.
subscribers.info()
output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 20 non-null object
1 Gender 20 non-null object
2 Skin Color 20 non-null object
3 Height(cm) 20 non-null int64
4 Weight(kg) 20 non-null int64
5 Date of Birth 20 non-null object
6 Subscription Type 20 non-null object
dtypes: int64(2), object(5)
memory usage: 1.2+ KB
Summarizing Numerical Data
One of the most common summary statistics for numeric data is the mean, which is one way of telling us where the center of our data is. We can calculate the mean of a column by selecting the column with square brackets and calling .mean()
. There are lots of other summary statistics that we can compute on columns, like
.median()
which returns the median of the column,.min()
which returns the minimum of the column,.max()
which returns the maximum of the column,.var()
which returns the variance of the column and.std()
which returns the standard deviation of the column.
Calculating summary statistics allows us to get a better sense of our data, even if there's a lot of it.
print('Calculating on Weight(kg) column...')
print('Mean:', subscribers['Weight(kg)'].mean())
print('Median:', subscribers['Weight(kg)'].median())
print('Min:', subscribers['Weight(kg)'].min())
print('Max:', subscribers['Weight(kg)'].max())
print('Variance:', subscribers['Weight(kg)'].var())
print('Standard deviation:', subscribers['Weight(kg)'].std())
output:
Calculating on Weight(kg) column...
Mean: 74.35
Median: 64.0
Min: 41
Max: 132
Variance: 883.0815789473684
Standard deviation: 29.716688559584973
Summarizing Dates
Summary statistics can also be calculated on date columns that have values with the standard date format. Some summary statistics like mean don't make sense on dates, but others are super helpful when we can find the oldest person's date of birth by taking the minimum of the date of birth column.
print(subscribers['Date of Birth'].min())
output:
1991-03-22
Similarly, we can take the maximum and find the youngest person in out dataset.
print(subscribers['Date of Birth'].max())
ouput:
2003-05-14
Super summarizing! Taking the minimum and maximum of a column of dates is handy for figuring out what birthday range of our gym subscribers.
Aggregate Data Using agg()
Method
While pandas and NumPy have tons of functions, sometimes, we may need a different function to summarize our data. The aggregate, agg()
, method allows us to compute our custom summary statistics on a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making our aggregations super-efficient. Here, we create a function called get_second_largest
that computes the largest value in a DataFrame column. Now we will aggregate the weight column using .agg()
, passing in the name of our function, get_second_largest
. It gives us the second-largest weight of the persons in our dataset.
def get_second_largest(data):
sorted_list = data.sort_values(ascending=False).tolist()
return sorted_list[1] # 2nd largest is at the index 1
print(subscribers['Weight(kg)'].agg(get_second_largest))
output:
119
Aggregating On Multiple Columns
agg()
can also be used on more than one column. By selecting the weight and height columns before calling agg()
, we get the thirty percentile for both columns.
print('Second largest values:')
print(subscribers[['Weight(kg)', 'Height(cm)']].agg(get_second_largest))
output:
Second largest values:
Weight(kg) 119
Height(cm) 187
dtype: int64
Multiple Summary Statistics
We can also use agg()
to get multiple summary statistics at once. Here's another function called get_third_largest
that computes the third largest value from the given data. We can pass a list of functions into agg()
, in this case, get_second_largest
and get_third_largest
, which will return the second and third-largest values from the given data.
def get_third_largest(data):
sorted_list = data.sort_values(ascending=False).tolist()
return sorted_list[2] # 3rd largest is at the index 2
print(subscribers['Weight(kg)'].agg([get_second_largest, get_third_largest]))
output:
get_second_largest 119
get_third_largest 118
Name: Weight(kg), dtype: int64
Multiple Summary Statistics On Multiple Columns
The utility of agg()
shines when we do multiple summary statistics on multiple columns at the same time. Let's find the second and third largest values from the weight and height columns.
print(subscribers[['Weight(kg)', 'Height(cm)']].agg([get_second_largest, get_third_largest]))
Weight(kg) Height(cm)
get_second_largest 119 187
get_third_largest 118 180
Excellent efficiency! The .agg()
method makes it easy to compute multiple statistics on multiple columns, all in just one line of code.
Cumulative Statistics
Cumulative statistics can also help track summary statistics over time. pandas
also has methods for computing cumulative statistics, for example, the cumulative sum, cumsum()
. Calling cumsum()
on a column returns not just one number, but a number for each row of the DataFrame. The first number returned, or the number at the zero index is the first person's weight. The next number is the sum of the first and second persons' weights. The third number is the sum of the first, second, and third persons' weights, and so on. The last number is the sum of all the persons' weights. Compute the cumulative sum of Weight(kg)
and add it as a new column of subscribers
called cumsum_Weight(kg)
.
subscribers['cumsum_Weight(kg)'] = subscribers['Weight(kg)'].cumsum()
display(subscribers)
Name | Gender | Skin Color | Height(cm) | Weight(kg) | Date of Birth | Subscription Type | cumsum_Weight(kg) | |
0 | William | Male | White | 152 | 64 | 1993-09-16 | Monthly | 64 |
1 | Jamie | Male | Black | 162 | 91 | 1999-11-05 | Yearly | 155 |
2 | Jamie | Female | Brown | 152 | 47 | 2003-05-14 | Yearly | 202 |
3 | Ethan | Transgender | White | 160 | 41 | 1998-06-19 | Monthly | 243 |
4 | Samantha | Female | Brown | 160 | 41 | 1991-10-04 | Half-yearly | 284 |
5 | Ashley | Female | White | 167 | 50 | 2000-04-25 | Monthly | 334 |
6 | Amelia | Female | White | 173 | 96 | 1992-12-31 | Half-yearly | 430 |
7 | Emma | Female | White | 157 | 60 | 1991-03-22 | Half-yearly | 490 |
8 | Isabella | Female | Brown | 166 | 74 | 1999-09-13 | Monthly | 564 |
9 | Madison | Female | Black | 160 | 113 | 1991-06-13 | Yearly | 677 |
10 | Alexander | Male | Black | 147 | 45 | 2001-12-29 | Half-yearly | 722 |
11 | Michael | Male | Black | 175 | 64 | 1993-07-29 | Yearly | 786 |
12 | Sarah | Female | Black | 172 | 45 | 1996-10-19 | Half-yearly | 831 |
13 | Benjamin | Male | Brown | 170 | 50 | 1997-06-06 | Monthly | 881 |
14 | Sophia | Female | Black | 155 | 69 | 2000-08-07 | Half-yearly | 950 |
15 | James | Male | White | 187 | 118 | 1995-12-12 | Half-yearly | 1068 |
16 | Daniel | Male | Brown | 160 | 64 | 2001-07-10 | Half-yearly | 1132 |
17 | Joseph | Male | Black | 180 | 132 | 1999-08-27 | Yearly | 1264 |
18 | Emily | Female | White | 175 | 104 | 1992-02-29 | Half-yearly | 1368 |
19 | David | Male | Brown | 193 | 119 | 1996-11-24 | Monthly | 1487 |
pandas
also has methods for other cumulative statistics, such as
cumulative maximum,
cummax()
andcumulative minimum,
cummin()
These all return an entire column of a DataFrame, rather than a single number. Compute the cumulative maximum of Weight(kg)
, and add it as a column called cummax_Weight(kg)
.
subscribers['cummax_Weight(kg)'] = subscribers['Weight(kg)'].cummax()
display(subscribers)
Name | Gender | Skin Color | Height(cm) | Weight(kg) | Date of Birth | Subscription Type | cumsum_Weight(kg) | cummax_Weight(kg) | |
0 | William | Male | White | 152 | 64 | 1993-09-16 | Monthly | 64 | 64 |
1 | Jamie | Male | Black | 162 | 91 | 1999-11-05 | Yearly | 155 | 91 |
2 | Jamie | Female | Brown | 152 | 47 | 2003-05-14 | Yearly | 202 | 91 |
3 | Ethan | Transgender | White | 160 | 41 | 1998-06-19 | Monthly | 243 | 91 |
4 | Samantha | Female | Brown | 160 | 41 | 1991-10-04 | Half-yearly | 284 | 91 |
5 | Ashley | Female | White | 167 | 50 | 2000-04-25 | Monthly | 334 | 91 |
6 | Amelia | Female | White | 173 | 96 | 1992-12-31 | Half-yearly | 430 | 96 |
7 | Emma | Female | White | 157 | 60 | 1991-03-22 | Half-yearly | 490 | 96 |
8 | Isabella | Female | Brown | 166 | 74 | 1999-09-13 | Monthly | 564 | 96 |
9 | Madison | Female | Black | 160 | 113 | 1991-06-13 | Yearly | 677 | 113 |
10 | Alexander | Male | Black | 147 | 45 | 2001-12-29 | Half-yearly | 722 | 113 |
11 | Michael | Male | Black | 175 | 64 | 1993-07-29 | Yearly | 786 | 113 |
12 | Sarah | Female | Black | 172 | 45 | 1996-10-19 | Half-yearly | 831 | 113 |
13 | Benjamin | Male | Brown | 170 | 50 | 1997-06-06 | Monthly | 881 | 113 |
14 | Sophia | Female | Black | 155 | 69 | 2000-08-07 | Half-yearly | 950 | 113 |
15 | James | Male | White | 187 | 118 | 1995-12-12 | Half-yearly | 1068 | 118 |
16 | Daniel | Male | Brown | 160 | 64 | 2001-07-10 | Half-yearly | 1132 | 118 |
17 | Joseph | Male | Black | 180 | 132 | 1999-08-27 | Yearly | 1264 | 132 |
18 | Emily | Female | White | 175 | 104 | 1992-02-29 | Half-yearly | 1368 | 132 |
19 | David | Male | Brown | 193 | 119 | 1996-11-24 | Monthly | 1487 | 132 |
Print only the Weight(kg)
, cumsum_Weight(kg)
and cummax_Weight(kg)
columns.
# See the columns we calculated
print(subscribers[["Weight(kg)", "cumsum_Weight(kg)", "cummax_Weight(kg)"]])
output:
Weight(kg) cumsum_Weight(kg) cummax_Weight(kg)
0 64 64 64
1 91 155 91
2 47 202 91
3 41 243 91
4 41 284 91
5 50 334 91
6 96 430 96
7 60 490 96
8 74 564 96
9 113 677 113
10 45 722 113
11 64 786 113
12 45 831 113
13 50 881 113
14 69 950 113
15 118 1068 118
16 64 1132 118
17 132 1264 132
18 104 1368 132
19 119 1487 132
Awesome aggregation! Not all functions that calculate on columns return a single number. Some, like the cumulative statistic functions, return a whole column.
Counting
So far, in this article, we've learned how to summarize numeric variables. Next, we'll learn how to summarize categorical data using counting.
Avoid Double Counting
Counting subscribers manually is impossible when they're going around the gym or when our dataset is big. It's hard to keep track of who we have and haven't counted! Let's count how many subscribers we have.
Dropping Duplicates
Removing duplicates is an essential skill to get accurate counts because we don't want to count the same thing multiple times. Let's ask pandas
to drop rows with duplicate names from the dataset as we don't want to count the same name twice. We can do this using the drop_duplicates()
method. It takes an argument, subset
, which is the column we want pandas
to find our duplicates based on. In this case, we want all the unique names to find how many subscribers are there.
subscribers.drop_duplicates(subset='Name')
Name | Gender | Skin Color | Height(cm) | Weight(kg) | Date of Birth | Subscription Type | cumsum_Weight(kg) | cummax_Weight(kg) | |
0 | William | Male | White | 152 | 64 | 1993-09-16 | Monthly | 64 | 64 |
1 | Jamie | Male | Black | 162 | 91 | 1999-11-05 | Yearly | 155 | 91 |
3 | Ethan | Transgender | White | 160 | 41 | 1998-06-19 | Monthly | 243 | 91 |
4 | Samantha | Female | Brown | 160 | 41 | 1991-10-04 | Half-yearly | 284 | 91 |
5 | Ashley | Female | White | 167 | 50 | 2000-04-25 | Monthly | 334 | 91 |
6 | Amelia | Female | White | 173 | 96 | 1992-12-31 | Half-yearly | 430 | 96 |
7 | Emma | Female | White | 157 | 60 | 1991-03-22 | Half-yearly | 490 | 96 |
8 | Isabella | Female | Brown | 166 | 74 | 1999-09-13 | Monthly | 564 | 96 |
9 | Madison | Female | Black | 160 | 113 | 1991-06-13 | Yearly | 677 | 113 |
10 | Alexander | Male | Black | 147 | 45 | 2001-12-29 | Half-yearly | 722 | 113 |
11 | Michael | Male | Black | 175 | 64 | 1993-07-29 | Yearly | 786 | 113 |
12 | Sarah | Female | Black | 172 | 45 | 1996-10-19 | Half-yearly | 831 | 113 |
13 | Benjamin | Male | Brown | 170 | 50 | 1997-06-06 | Monthly | 881 | 113 |
14 | Sophia | Female | Black | 155 | 69 | 2000-08-07 | Half-yearly | 950 | 113 |
15 | James | Male | White | 187 | 118 | 1995-12-12 | Half-yearly | 1068 | 118 |
16 | Daniel | Male | Brown | 160 | 64 | 2001-07-10 | Half-yearly | 1132 | 118 |
17 | Joseph | Male | Black | 180 | 132 | 1999-08-27 | Yearly | 1264 | 132 |
18 | Emily | Female | White | 175 | 104 | 1992-02-29 | Half-yearly | 1368 | 132 |
19 | David | Male | Brown | 193 | 119 | 1996-11-24 | Monthly | 1487 | 132 |
Now we have a list of persons where each one appears once. But where did Jamie(Female) go? If we go back to the top of the article, we will notice that our original dataset has Jamie(Male) and Jamie(Female). drop_duplicates()
method dropped Jamie(Female) as we didn't mention to consider the gender column also when validating duplicates.
Dropping Duplicates With Multiple Column Pairs
Since Jamie(Male) and Jamie(Female) have the same name but different genders, we have to drop the rows with pairs of both Name
and Gender
. To base our duplicate dropping on multiple columns, we can pass a list of column names to the subset
argument, in this case, Name
and Gender
.
unique_persons = subscribers.drop_duplicates(subset=['Name', 'Gender'])
Print the number of persons by counting rows.
print('Number of unique subscribers:', unique_persons.shape[0])
output:
Number of unique subscribers: 20
Now both Jamie(Male) and Jamie(Female) have been included and the total number of unique subscribers is 20 as we expected.
Counting Categorical Variables
Counting is a great way to get an overview of our data and to spot curiosities that we might not notice. Here, we will count the persons of each subscription type. We'll subset the Subscription Type
column and use the value_counts()
method.
subscribers['Subscription Type'].value_counts()
output:
Half-yearly 9
Monthly 6
Yearly 5
Name: Subscription Type, dtype: int64
Count Proportions
The normalize
argument can be used to turn the counts into proportions of the total.
subscribers['Subscription Type'].value_counts(normalize=True)
output:
Half-yearly 0.45
Monthly 0.30
Yearly 0.25
Name: Subscription Type, dtype: float64
Looks like the Half-yearly subscription type is the most popular one and subscribed by 45% of the total.
Grouped Summary Statistics
So far, we've been calculating summary statistics for all rows of a dataset, but summary statistics can be useful to compare different groups. While computing summary statistics of entire columns may be useful, we can also gain many insights from summaries of individual groups.
For example, is the average weight of one subscription group larger than the other groups? Which gender is more likely to subscribe to which subscription type? We can answer these questions with what we've learned so far. We can subset the persons into groups based on their subscription types, and take the mean of each.
print(subscribers[subscribers['Subscription Type']=='Yearly']['Weight(kg)'].mean())
print(subscribers[subscribers['Subscription Type']=='Half-yearly']['Weight(kg)'].mean())
print(subscribers[subscribers['Subscription Type']=='Monthly']['Weight(kg)'].mean())
output:
89.4
71.33333333333333
66.33333333333333
But that's a lot of work, and the duplicated code can easily introduce copy-and-paste bugs.
Grouped Summaries Using groupby()
That's where the groupby()
method comes in. We can group by the Subscription Type
variable, select the weight column and take the mean. This will give us the mean weight for each subscriber group. This is just one line of code compared to the three we had to write before to get the same results.
subscribers.groupby('Subscription Type')['Weight(kg)'].mean()
output:
Subscription Type
Half-yearly 71.333333
Monthly 66.333333
Yearly 89.400000
Name: Weight(kg), dtype: float64
Great grouping! We were able to do the same calculation as in the previous one while writing much less code.
Multiple Grouped Summary Statistics
Earlier, we saw that the .agg()
method is useful to compute multiple statistics on multiple variables. It also works with grouped data. Here, we pass a list of functions into agg()
after grouping by Subscription Type
. The following code gives us the minimum and maximum of the persons' weights in different subscription groups.
# Import numpy with the alias np
import numpy as np
print('Using numpy version:', np.__version__)
output:
Using numpy version: 1.21.6
print('Calculating on Weight(kg) column...')
print(subscribers.groupby('Subscription Type')['Weight(kg)'].agg([np.min, np.max]))
output:
Calculating on Weight(kg) column...
amin amax
Subscription Type
Half-yearly 41 118
Monthly 41 119
Yearly 47 132
Grouping By Multiple Variables
We can also group by multiple columns and calculate summary statistics. Here, we group by Gender
and Subscription Type
, select the weight column and take the mean. This gives us the mean weight in each group.
print(subscribers.groupby(['Gender', 'Subscription Type'])['Weight(kg)'].mean())
output:
Gender Subscription Type
Female Half-yearly 69.166667
Monthly 62.000000
Yearly 80.000000
Male Half-yearly 75.666667
Monthly 77.666667
Yearly 95.666667
Transgender Monthly 41.000000
Name: Weight(kg), dtype: float64
Multiple Summary Statistics on Multiple-Column Groups
We can also group by multiple columns and aggregate multiple statistics on these groups.
print('Calculating mean...')
print(subscribers.groupby(['Gender', 'Subscription Type'])[['Weight(kg)', 'Height(cm)']].mean())
output:
Calculating mean...
Weight(kg) Height(cm)
Gender Subscription Type
Female Half-yearly 69.166667 165.333333
Monthly 62.000000 166.500000
Yearly 80.000000 156.000000
Male Half-yearly 75.666667 164.666667
Monthly 77.666667 171.666667
Yearly 95.666667 172.333333
Transgender Monthly 41.000000 160.000000
Pivot tables
Pivot tables are another way of aggregating data, and calculating grouped summary statistics. Previously, we learned we can group the persons by gender and calculated their weights' mean using groupby()
method.
print(subscribers.groupby('Gender')['Weight(kg)'].mean())
output:
Gender
Female 69.9
Male 83.0
Transgender 41.0
Name: Weight(kg), dtype: float64
Let's see how to create pivot tables in pandas using .pivot_table()
method, an alternative to .groupby()
. Here, we'll use .pivot_table()
to replicate the calculations we performed above. The values
argument is the column that we want to summarize and the index
column is the column that we want to group by. By default, pivot_table
takes the mean value for each group.
print('Calculating mean...')
print(subscribers.pivot_table(index='Gender', values='Weight(kg)'))
output:
Calculating mean...
Weight(kg)
Gender
Female 69.9
Male 83.0
Transgender 41.0
Different Statistics Using aggfunc
Argument
If we want a different summary statistic, we can use the aggfunc
argument and pass it a function. Here, we find the maximum weight in each gender group using NumPy's max
function.
print('Calculating max...')
print(subscribers.pivot_table(index='Gender',values='Weight(kg)', aggfunc=np.max))
output:
Calculating max...
Weight(kg)
Gender
Female 113
Male 132
Transgender 41
Multiple Statistics Using aggfunc
Argument
To get multiple summary statistics at a time, we can pass a list of functions to the aggfunc
argument. Here, we find the maximum and minimum weights from each gender group.
print(subscribers.pivot_table(index='Gender', values='Weight(kg)', aggfunc=[np.min, np.max]))
output:
amin amax
Weight(kg) Weight(kg)
Gender
Female 41 113
Male 45 132
Transgender 41 41
Pivot On Two Variables
We can compute the maximum weight grouped by two variables, gender and subscription type. But we will do this using the pivot_table
method. To group by two variables, we can pass a second variable name into the columns
argument. While the result looks a little different than what we had before, it contains the same numbers. There are NaN
s, or missing values, because there is only one transgender who subscribes monthly in our dataset.
print(subscribers.pivot_table(index='Gender', columns='Subscription Type', values='Weight(kg)'))
output:
Subscription Type Half-yearly Monthly Yearly
Gender
Female 69.166667 62.000000 80.000000
Male 75.666667 77.666667 95.666667
Transgender NaN 41.000000 NaN
Filling Missing Values in Pivot Tables
Instead of having lots of missing values in our pivot table, fill_value
replaces missing values with a real value which is known as imputation. What to replace missing values with is a big topic and we will write its detail in another article, but the simplest thing to do is to substitute a dummy value. Here, all of the NaN
s get filled in with zeros.
print(subscribers.pivot_table(index='Gender', columns='Subscription Type', values='Weight(kg)', fill_value = 0))
output:
Subscription Type Half-yearly Monthly Yearly
Gender
Female 69.166667 62.000000 80.000000
Male 75.666667 77.666667 95.666667
Transgender 0.000000 41.000000 0.000000
Pivot Tables With Margins
margins
is a shortcut for when we pivoted by two variables but also wanted to pivot by each of those variables separately. If we set the margins
argument to True
, the last row and last column of the pivot table contain the mean of all the values in the column or row, not including the missing values that were filled in with zeros.
print(subscribers.pivot_table(index='Gender', columns='Subscription Type', values='Weight(kg)', fill_value = 0, margins=True))
output:
Subscription Type Half-yearly Monthly Yearly All
Gender
Female 69.166667 62.000000 80.000000 69.90
Male 75.666667 77.666667 95.666667 83.00
Transgender 0.000000 41.000000 0.000000 41.00
All 71.333333 66.333333 89.400000 74.35
For example, in the last row of the Monthly column, we can see that the mean weight of all Monthly subscribers is 66.33kg. In the last column of the Male row, the mean weight of all male persons is 83kg. The value in the bottom right, in the last row and last column, is the mean weight of all the persons in the dataset.
Using margins=True
allows us to see a summary statistic for multiple levels of the dataset as in
the entire dataset,
grouped by one variable, and
grouped by two variables
We now mastered pivot table skills that can help us compute summaries at multiple grouped levels in one line of code.
Conclusion
In this article, we learned data aggregation, and summary statistics on Dataset, and mastered grouped summary statistics and pivot tables.
Connect & Discuss with us on LinkedIn