In this article, we’ll learn data aggregation, calculate summary statistics on Dataset, and master grouped summary statistics and pivot tables.

## What is Data Aggregation

Data aggregation is the process of collecting, combining, and summarizing data from multiple sources into a single, more meaningful representation. This helps to identify patterns, trends, and relationships in large and complex datasets and can be used for data analysis and reporting purposes. Data aggregation includes calculating the summary statistics of data, grouping data based on categories, and creating pivot tables.

## Summary Statistics

Starting with summary statistics, as follows from their name, they are numbers that summarize and tell us about our dataset. First, let's import `pandas`

and the `gym_subscribers.csv`

file which contains the data we will use in this article. This `gym_subscribers.csv`

file is available in PyProDev's GitHub repository.

```
import pandas as pd
print('Using pandas verson:', pd.__version__)
```

```
output:
Using pandas verson: 1.3.5
```

```
#import data from csv file
subscribers = pd.read_csv('gym_subscribers.csv')
print(type(subscribers))
print('size of subscribers dataset:', subscribers.shape)
```

```
output:
<class 'pandas.core.frame.DataFrame'>
size of subscribers dataset: (20, 7)
```

- Explore our DataFrame by displaying the rows of the
`subscribers`

DataFrame.

```
# display the head of the subscribers DataFrame
display(subscribers)
```

Name | Gender | Skin Color | Height(cm) | Weight(kg) | Date of Birth | Subscription Type | |

0 | William | Male | White | 152 | 64 | 1993-09-16 | Monthly |

1 | Jamie | Male | Black | 162 | 91 | 1999-11-05 | Yearly |

2 | Jamie | Female | Brown | 152 | 47 | 2003-05-14 | Yearly |

3 | Ethan | Transgender | White | 160 | 41 | 1998-06-19 | Monthly |

4 | Samantha | Female | Brown | 160 | 41 | 1991-10-04 | Half-yearly |

5 | Ashley | Female | White | 167 | 50 | 2000-04-25 | Monthly |

6 | Amelia | Female | White | 173 | 96 | 1992-12-31 | Half-yearly |

7 | Emma | Female | White | 157 | 60 | 1991-03-22 | Half-yearly |

8 | Isabella | Female | Brown | 166 | 74 | 1999-09-13 | Monthly |

9 | Madison | Female | Black | 160 | 113 | 1991-06-13 | Yearly |

10 | Alexander | Male | Black | 147 | 45 | 2001-12-29 | Half-yearly |

11 | Michael | Male | Black | 175 | 64 | 1993-07-29 | Yearly |

12 | Sarah | Female | Black | 172 | 45 | 1996-10-19 | Half-yearly |

13 | Benjamin | Male | Brown | 170 | 50 | 1997-06-06 | Monthly |

14 | Sophia | Female | Black | 155 | 69 | 2000-08-07 | Half-yearly |

15 | James | Male | White | 187 | 118 | 1995-12-12 | Half-yearly |

16 | Daniel | Male | Brown | 160 | 64 | 2001-07-10 | Half-yearly |

17 | Joseph | Male | Black | 180 | 132 | 1999-08-27 | Yearly |

18 | Emily | Female | White | 175 | 104 | 1992-02-29 | Half-yearly |

19 | David | Male | Brown | 193 | 119 | 1996-11-24 | Monthly |

Let's see the information of columns in `subscribers`

using `info()`

.

```
subscribers.info()
```

```
output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 20 non-null object
1 Gender 20 non-null object
2 Skin Color 20 non-null object
3 Height(cm) 20 non-null int64
4 Weight(kg) 20 non-null int64
5 Date of Birth 20 non-null object
6 Subscription Type 20 non-null object
dtypes: int64(2), object(5)
memory usage: 1.2+ KB
```

### Summarizing Numerical Data

One of the most common summary statistics for numeric data is the mean, which is one way of telling us where the center of our data is. We can calculate the mean of a column by selecting the column with square brackets and calling `.mean()`

. There are lots of other summary statistics that we can compute on columns, like

`.median()`

which returns the median of the column,`.min()`

which returns the minimum of the column,`.max()`

which returns the maximum of the column,`.var()`

which returns the variance of the column and`.std()`

which returns the standard deviation of the column.

Calculating summary statistics allows us to get a better sense of our data, even if there's a lot of it.

```
print('Calculating on Weight(kg) column...')
print('Mean:', subscribers['Weight(kg)'].mean())
print('Median:', subscribers['Weight(kg)'].median())
print('Min:', subscribers['Weight(kg)'].min())
print('Max:', subscribers['Weight(kg)'].max())
print('Variance:', subscribers['Weight(kg)'].var())
print('Standard deviation:', subscribers['Weight(kg)'].std())
```

```
output:
Calculating on Weight(kg) column...
Mean: 74.35
Median: 64.0
Min: 41
Max: 132
Variance: 883.0815789473684
Standard deviation: 29.716688559584973
```

### Summarizing Dates

Summary statistics can also be calculated on date columns that have values with the standard date format. Some summary statistics like mean don't make sense on dates, but others are super helpful when we can find the oldest person's date of birth by taking the minimum of the date of birth column.

```
print(subscribers['Date of Birth'].min())
```

```
output:
1991-03-22
```

Similarly, we can take the maximum and find the youngest person in out dataset.

```
print(subscribers['Date of Birth'].max())
```

```
ouput:
2003-05-14
```

Super summarizing! Taking the minimum and maximum of a column of dates is handy for figuring out what birthday range of our gym subscribers.

## Aggregate Data Using `agg()`

Method

While pandas and NumPy have tons of functions, sometimes, we may need a different function to summarize our data. The aggregate, `agg()`

, method allows us to compute our custom summary statistics on a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making our aggregations super-efficient. Here, we create a function called `get_second_largest`

that computes the largest value in a DataFrame column. Now we will aggregate the weight column using `.agg()`

, passing in the name of our function, `get_second_largest`

. It gives us the second-largest weight of the persons in our dataset.

```
def get_second_largest(data):
sorted_list = data.sort_values(ascending=False).tolist()
return sorted_list[1] # 2nd largest is at the index 1
```

```
print(subscribers['Weight(kg)'].agg(get_second_largest))
```

```
output:
119
```

### Aggregating On Multiple Columns

`agg()`

can also be used on more than one column. By selecting the weight and height columns before calling `agg()`

, we get the thirty percentile for both columns.

```
print('Second largest values:')
print(subscribers[['Weight(kg)', 'Height(cm)']].agg(get_second_largest))
```

```
output:
Second largest values:
Weight(kg) 119
Height(cm) 187
dtype: int64
```

### Multiple Summary Statistics

We can also use `agg()`

to get multiple summary statistics at once. Here's another function called `get_third_largest`

that computes the third largest value from the given data. We can pass a list of functions into `agg()`

, in this case, `get_second_largest`

and `get_third_largest`

, which will return the second and third-largest values from the given data.

```
def get_third_largest(data):
sorted_list = data.sort_values(ascending=False).tolist()
return sorted_list[2] # 3rd largest is at the index 2
```

```
print(subscribers['Weight(kg)'].agg([get_second_largest, get_third_largest]))
```

```
output:
get_second_largest 119
get_third_largest 118
Name: Weight(kg), dtype: int64
```

### Multiple Summary Statistics On Multiple Columns

The utility of `agg()`

shines when we do multiple summary statistics on multiple columns at the same time. Let's find the second and third largest values from the weight and height columns.

```
print(subscribers[['Weight(kg)', 'Height(cm)']].agg([get_second_largest, get_third_largest]))
```

```
Weight(kg) Height(cm)
get_second_largest 119 187
get_third_largest 118 180
```

Excellent efficiency! The `.agg()`

method makes it easy to compute multiple statistics on multiple columns, all in just one line of code.

## Cumulative Statistics

Cumulative statistics can also help track summary statistics over time. `pandas`

also has methods for computing cumulative statistics, for example, the cumulative sum, `cumsum()`

. Calling `cumsum()`

on a column returns not just one number, but a number for each row of the DataFrame. The first number returned, or the number at the zero index is the first person's weight. The next number is the sum of the first and second persons' weights. The third number is the sum of the first, second, and third persons' weights, and so on. The last number is the sum of all the persons' weights. Compute the cumulative sum of `Weight(kg)`

and add it as a new column of `subscribers`

called `cumsum_Weight(kg)`

.

```
subscribers['cumsum_Weight(kg)'] = subscribers['Weight(kg)'].cumsum()
display(subscribers)
```

Name | Gender | Skin Color | Height(cm) | Weight(kg) | Date of Birth | Subscription Type | cumsum_Weight(kg) | |

0 | William | Male | White | 152 | 64 | 1993-09-16 | Monthly | 64 |

1 | Jamie | Male | Black | 162 | 91 | 1999-11-05 | Yearly | 155 |

2 | Jamie | Female | Brown | 152 | 47 | 2003-05-14 | Yearly | 202 |

3 | Ethan | Transgender | White | 160 | 41 | 1998-06-19 | Monthly | 243 |

4 | Samantha | Female | Brown | 160 | 41 | 1991-10-04 | Half-yearly | 284 |

5 | Ashley | Female | White | 167 | 50 | 2000-04-25 | Monthly | 334 |

6 | Amelia | Female | White | 173 | 96 | 1992-12-31 | Half-yearly | 430 |

7 | Emma | Female | White | 157 | 60 | 1991-03-22 | Half-yearly | 490 |

8 | Isabella | Female | Brown | 166 | 74 | 1999-09-13 | Monthly | 564 |

9 | Madison | Female | Black | 160 | 113 | 1991-06-13 | Yearly | 677 |

10 | Alexander | Male | Black | 147 | 45 | 2001-12-29 | Half-yearly | 722 |

11 | Michael | Male | Black | 175 | 64 | 1993-07-29 | Yearly | 786 |

12 | Sarah | Female | Black | 172 | 45 | 1996-10-19 | Half-yearly | 831 |

13 | Benjamin | Male | Brown | 170 | 50 | 1997-06-06 | Monthly | 881 |

14 | Sophia | Female | Black | 155 | 69 | 2000-08-07 | Half-yearly | 950 |

15 | James | Male | White | 187 | 118 | 1995-12-12 | Half-yearly | 1068 |

16 | Daniel | Male | Brown | 160 | 64 | 2001-07-10 | Half-yearly | 1132 |

17 | Joseph | Male | Black | 180 | 132 | 1999-08-27 | Yearly | 1264 |

18 | Emily | Female | White | 175 | 104 | 1992-02-29 | Half-yearly | 1368 |

19 | David | Male | Brown | 193 | 119 | 1996-11-24 | Monthly | 1487 |

`pandas`

also has methods for other cumulative statistics, such as

cumulative maximum,

`cummax()`

andcumulative minimum,

`cummin()`

These all return an entire column of a DataFrame, rather than a single number. Compute the cumulative maximum of `Weight(kg)`

, and add it as a column called `cummax_Weight(kg)`

.

```
subscribers['cummax_Weight(kg)'] = subscribers['Weight(kg)'].cummax()
display(subscribers)
```

Name | Gender | Skin Color | Height(cm) | Weight(kg) | Date of Birth | Subscription Type | cumsum_Weight(kg) | cummax_Weight(kg) | |

0 | William | Male | White | 152 | 64 | 1993-09-16 | Monthly | 64 | 64 |

1 | Jamie | Male | Black | 162 | 91 | 1999-11-05 | Yearly | 155 | 91 |

2 | Jamie | Female | Brown | 152 | 47 | 2003-05-14 | Yearly | 202 | 91 |

3 | Ethan | Transgender | White | 160 | 41 | 1998-06-19 | Monthly | 243 | 91 |

4 | Samantha | Female | Brown | 160 | 41 | 1991-10-04 | Half-yearly | 284 | 91 |

5 | Ashley | Female | White | 167 | 50 | 2000-04-25 | Monthly | 334 | 91 |

6 | Amelia | Female | White | 173 | 96 | 1992-12-31 | Half-yearly | 430 | 96 |

7 | Emma | Female | White | 157 | 60 | 1991-03-22 | Half-yearly | 490 | 96 |

8 | Isabella | Female | Brown | 166 | 74 | 1999-09-13 | Monthly | 564 | 96 |

9 | Madison | Female | Black | 160 | 113 | 1991-06-13 | Yearly | 677 | 113 |

10 | Alexander | Male | Black | 147 | 45 | 2001-12-29 | Half-yearly | 722 | 113 |

11 | Michael | Male | Black | 175 | 64 | 1993-07-29 | Yearly | 786 | 113 |

12 | Sarah | Female | Black | 172 | 45 | 1996-10-19 | Half-yearly | 831 | 113 |

13 | Benjamin | Male | Brown | 170 | 50 | 1997-06-06 | Monthly | 881 | 113 |

14 | Sophia | Female | Black | 155 | 69 | 2000-08-07 | Half-yearly | 950 | 113 |

15 | James | Male | White | 187 | 118 | 1995-12-12 | Half-yearly | 1068 | 118 |

16 | Daniel | Male | Brown | 160 | 64 | 2001-07-10 | Half-yearly | 1132 | 118 |

17 | Joseph | Male | Black | 180 | 132 | 1999-08-27 | Yearly | 1264 | 132 |

18 | Emily | Female | White | 175 | 104 | 1992-02-29 | Half-yearly | 1368 | 132 |

19 | David | Male | Brown | 193 | 119 | 1996-11-24 | Monthly | 1487 | 132 |

Print only the `Weight(kg)`

, `cumsum_Weight(kg)`

and `cummax_Weight(kg)`

columns.

```
# See the columns we calculated
print(subscribers[["Weight(kg)", "cumsum_Weight(kg)", "cummax_Weight(kg)"]])
```

```
output:
Weight(kg) cumsum_Weight(kg) cummax_Weight(kg)
0 64 64 64
1 91 155 91
2 47 202 91
3 41 243 91
4 41 284 91
5 50 334 91
6 96 430 96
7 60 490 96
8 74 564 96
9 113 677 113
10 45 722 113
11 64 786 113
12 45 831 113
13 50 881 113
14 69 950 113
15 118 1068 118
16 64 1132 118
17 132 1264 132
18 104 1368 132
19 119 1487 132
```

Awesome aggregation! Not all functions that calculate on columns return a single number. Some, like the cumulative statistic functions, return a whole column.

## Counting

So far, in this article, we've learned how to summarize numeric variables. Next, we'll learn how to summarize categorical data using counting.

### Avoid Double Counting

Counting subscribers manually is impossible when they're going around the gym or when our dataset is big. It's hard to keep track of who we have and haven't counted! Let's count how many subscribers we have.

### Dropping Duplicates

Removing duplicates is an essential skill to get accurate counts because we don't want to count the same thing multiple times. Let's ask `pandas`

to drop rows with duplicate names from the dataset as we don't want to count the same name twice. We can do this using the `drop_duplicates()`

method. It takes an argument, `subset`

, which is the column we want `pandas`

to find our duplicates based on. In this case, we want all the unique names to find how many subscribers are there.

```
subscribers.drop_duplicates(subset='Name')
```

Name | Gender | Skin Color | Height(cm) | Weight(kg) | Date of Birth | Subscription Type | cumsum_Weight(kg) | cummax_Weight(kg) | |

0 | William | Male | White | 152 | 64 | 1993-09-16 | Monthly | 64 | 64 |

1 | Jamie | Male | Black | 162 | 91 | 1999-11-05 | Yearly | 155 | 91 |

3 | Ethan | Transgender | White | 160 | 41 | 1998-06-19 | Monthly | 243 | 91 |

4 | Samantha | Female | Brown | 160 | 41 | 1991-10-04 | Half-yearly | 284 | 91 |

5 | Ashley | Female | White | 167 | 50 | 2000-04-25 | Monthly | 334 | 91 |

6 | Amelia | Female | White | 173 | 96 | 1992-12-31 | Half-yearly | 430 | 96 |

7 | Emma | Female | White | 157 | 60 | 1991-03-22 | Half-yearly | 490 | 96 |

8 | Isabella | Female | Brown | 166 | 74 | 1999-09-13 | Monthly | 564 | 96 |

9 | Madison | Female | Black | 160 | 113 | 1991-06-13 | Yearly | 677 | 113 |

10 | Alexander | Male | Black | 147 | 45 | 2001-12-29 | Half-yearly | 722 | 113 |

11 | Michael | Male | Black | 175 | 64 | 1993-07-29 | Yearly | 786 | 113 |

12 | Sarah | Female | Black | 172 | 45 | 1996-10-19 | Half-yearly | 831 | 113 |

13 | Benjamin | Male | Brown | 170 | 50 | 1997-06-06 | Monthly | 881 | 113 |

14 | Sophia | Female | Black | 155 | 69 | 2000-08-07 | Half-yearly | 950 | 113 |

15 | James | Male | White | 187 | 118 | 1995-12-12 | Half-yearly | 1068 | 118 |

16 | Daniel | Male | Brown | 160 | 64 | 2001-07-10 | Half-yearly | 1132 | 118 |

17 | Joseph | Male | Black | 180 | 132 | 1999-08-27 | Yearly | 1264 | 132 |

18 | Emily | Female | White | 175 | 104 | 1992-02-29 | Half-yearly | 1368 | 132 |

19 | David | Male | Brown | 193 | 119 | 1996-11-24 | Monthly | 1487 | 132 |

Now we have a list of persons where each one appears once. But where did Jamie(Female) go? If we go back to the top of the article, we will notice that our original dataset has Jamie(Male) and Jamie(Female). `drop_duplicates()`

method dropped Jamie(Female) as we didn't mention to consider the gender column also when validating duplicates.

### Dropping Duplicates With Multiple Column Pairs

Since Jamie(Male) and Jamie(Female) have the same name but different genders, we have to drop the rows with pairs of both `Name`

and `Gender`

. To base our duplicate dropping on multiple columns, we can pass a list of column names to the `subset`

argument, in this case, `Name`

and `Gender`

.

```
unique_persons = subscribers.drop_duplicates(subset=['Name', 'Gender'])
```

Print the number of persons by counting rows.

```
print('Number of unique subscribers:', unique_persons.shape[0])
```

```
output:
Number of unique subscribers: 20
```

Now both Jamie(Male) and Jamie(Female) have been included and the total number of unique subscribers is 20 as we expected.

### Counting Categorical Variables

Counting is a great way to get an overview of our data and to spot curiosities that we might not notice. Here, we will count the persons of each subscription type. We'll subset the `Subscription Type`

column and use the `value_counts()`

method.

```
subscribers['Subscription Type'].value_counts()
```

```
output:
Half-yearly 9
Monthly 6
Yearly 5
Name: Subscription Type, dtype: int64
```

### Count Proportions

The `normalize`

argument can be used to turn the counts into proportions of the total.

```
subscribers['Subscription Type'].value_counts(normalize=True)
```

```
output:
Half-yearly 0.45
Monthly 0.30
Yearly 0.25
Name: Subscription Type, dtype: float64
```

Looks like the Half-yearly subscription type is the most popular one and subscribed by 45% of the total.

## Grouped Summary Statistics

So far, we've been calculating summary statistics for all rows of a dataset, but summary statistics can be useful to compare different groups. While computing summary statistics of entire columns may be useful, we can also gain many insights from summaries of individual groups.

For example, is the average weight of one subscription group larger than the other groups? Which gender is more likely to subscribe to which subscription type? We can answer these questions with what we've learned so far. We can subset the persons into groups based on their subscription types, and take the mean of each.

```
print(subscribers[subscribers['Subscription Type']=='Yearly']['Weight(kg)'].mean())
print(subscribers[subscribers['Subscription Type']=='Half-yearly']['Weight(kg)'].mean())
print(subscribers[subscribers['Subscription Type']=='Monthly']['Weight(kg)'].mean())
```

```
output:
89.4
71.33333333333333
66.33333333333333
```

But that's a lot of work, and the duplicated code can easily introduce copy-and-paste bugs.

### Grouped Summaries Using `groupby()`

That's where the `groupby()`

method comes in. We can group by the `Subscription Type`

variable, select the weight column and take the mean. This will give us the mean weight for each subscriber group. This is just one line of code compared to the three we had to write before to get the same results.

```
subscribers.groupby('Subscription Type')['Weight(kg)'].mean()
```

```
output:
Subscription Type
Half-yearly 71.333333
Monthly 66.333333
Yearly 89.400000
Name: Weight(kg), dtype: float64
```

Great grouping! We were able to do the same calculation as in the previous one while writing much less code.

### Multiple Grouped Summary Statistics

Earlier, we saw that the `.agg()`

method is useful to compute multiple statistics on multiple variables. It also works with grouped data. Here, we pass a list of functions into `agg()`

after grouping by `Subscription Type`

. The following code gives us the minimum and maximum of the persons' weights in different subscription groups.

```
# Import numpy with the alias np
import numpy as np
print('Using numpy version:', np.__version__)
```

```
output:
Using numpy version: 1.21.6
```

```
print('Calculating on Weight(kg) column...')
print(subscribers.groupby('Subscription Type')['Weight(kg)'].agg([np.min, np.max]))
```

```
output:
Calculating on Weight(kg) column...
amin amax
Subscription Type
Half-yearly 41 118
Monthly 41 119
Yearly 47 132
```

### Grouping By Multiple Variables

We can also group by multiple columns and calculate summary statistics. Here, we group by `Gender`

and `Subscription Type`

, select the weight column and take the mean. This gives us the mean weight in each group.

```
print(subscribers.groupby(['Gender', 'Subscription Type'])['Weight(kg)'].mean())
```

```
output:
Gender Subscription Type
Female Half-yearly 69.166667
Monthly 62.000000
Yearly 80.000000
Male Half-yearly 75.666667
Monthly 77.666667
Yearly 95.666667
Transgender Monthly 41.000000
Name: Weight(kg), dtype: float64
```

### Multiple Summary Statistics on Multiple-Column Groups

We can also group by multiple columns and aggregate multiple statistics on these groups.

```
print('Calculating mean...')
print(subscribers.groupby(['Gender', 'Subscription Type'])[['Weight(kg)', 'Height(cm)']].mean())
```

```
output:
Calculating mean...
Weight(kg) Height(cm)
Gender Subscription Type
Female Half-yearly 69.166667 165.333333
Monthly 62.000000 166.500000
Yearly 80.000000 156.000000
Male Half-yearly 75.666667 164.666667
Monthly 77.666667 171.666667
Yearly 95.666667 172.333333
Transgender Monthly 41.000000 160.000000
```

## Pivot tables

Pivot tables are another way of aggregating data, and calculating grouped summary statistics. Previously, we learned we can group the persons by gender and calculated their weights' mean using `groupby()`

method.

```
print(subscribers.groupby('Gender')['Weight(kg)'].mean())
```

```
output:
Gender
Female 69.9
Male 83.0
Transgender 41.0
Name: Weight(kg), dtype: float64
```

Let's see how to create pivot tables in pandas using `.pivot_table()`

method, an alternative to `.groupby()`

. Here, we'll use `.pivot_table()`

to replicate the calculations we performed above. The `values`

argument is the column that we want to summarize and the `index`

column is the column that we want to group by. By default, `pivot_table`

takes the mean value for each group.

```
print('Calculating mean...')
print(subscribers.pivot_table(index='Gender', values='Weight(kg)'))
```

```
output:
Calculating mean...
Weight(kg)
Gender
Female 69.9
Male 83.0
Transgender 41.0
```

### Different Statistics Using `aggfunc`

Argument

If we want a different summary statistic, we can use the `aggfunc`

argument and pass it a function. Here, we find the maximum weight in each gender group using NumPy's `max`

function.

```
print('Calculating max...')
print(subscribers.pivot_table(index='Gender',values='Weight(kg)', aggfunc=np.max))
```

```
output:
Calculating max...
Weight(kg)
Gender
Female 113
Male 132
Transgender 41
```

### Multiple Statistics Using `aggfunc`

Argument

To get multiple summary statistics at a time, we can pass a list of functions to the `aggfunc`

argument. Here, we find the maximum and minimum weights from each gender group.

```
print(subscribers.pivot_table(index='Gender', values='Weight(kg)', aggfunc=[np.min, np.max]))
```

```
output:
amin amax
Weight(kg) Weight(kg)
Gender
Female 41 113
Male 45 132
Transgender 41 41
```

### Pivot On Two Variables

We can compute the maximum weight grouped by two variables, gender and subscription type. But we will do this using the `pivot_table`

method. To group by two variables, we can pass a second variable name into the `columns`

argument. While the result looks a little different than what we had before, it contains the same numbers. There are `NaN`

s, or missing values, because there is only one transgender who subscribes monthly in our dataset.

```
print(subscribers.pivot_table(index='Gender', columns='Subscription Type', values='Weight(kg)'))
```

```
output:
Subscription Type Half-yearly Monthly Yearly
Gender
Female 69.166667 62.000000 80.000000
Male 75.666667 77.666667 95.666667
Transgender NaN 41.000000 NaN
```

### Filling Missing Values in Pivot Tables

Instead of having lots of missing values in our pivot table, `fill_value`

replaces missing values with a real value which is known as imputation. What to replace missing values with is a big topic and we will write its detail in another article, but the simplest thing to do is to substitute a dummy value. Here, all of the `NaN`

s get filled in with zeros.

```
print(subscribers.pivot_table(index='Gender', columns='Subscription Type', values='Weight(kg)', fill_value = 0))
```

```
output:
Subscription Type Half-yearly Monthly Yearly
Gender
Female 69.166667 62.000000 80.000000
Male 75.666667 77.666667 95.666667
Transgender 0.000000 41.000000 0.000000
```

### Pivot Tables With Margins

`margins`

is a shortcut for when we pivoted by two variables but also wanted to pivot by each of those variables separately. If we set the `margins`

argument to `True`

, the last row and last column of the pivot table contain the mean of all the values in the column or row, not including the missing values that were filled in with zeros.

```
print(subscribers.pivot_table(index='Gender', columns='Subscription Type', values='Weight(kg)', fill_value = 0, margins=True))
```

```
output:
Subscription Type Half-yearly Monthly Yearly All
Gender
Female 69.166667 62.000000 80.000000 69.90
Male 75.666667 77.666667 95.666667 83.00
Transgender 0.000000 41.000000 0.000000 41.00
All 71.333333 66.333333 89.400000 74.35
```

For example, in the last row of the Monthly column, we can see that the mean weight of all Monthly subscribers is 66.33kg. In the last column of the Male row, the mean weight of all male persons is 83kg. The value in the bottom right, in the last row and last column, is the mean weight of all the persons in the dataset.

Using `margins=True`

allows us to see a summary statistic for multiple levels of the dataset as in

the entire dataset,

grouped by one variable, and

grouped by two variables

We now mastered pivot table skills that can help us compute summaries at multiple grouped levels in one line of code.

## Conclusion

In this article, we learned data aggregation, and summary statistics on Dataset, and mastered grouped summary statistics and pivot tables.