Pandas Groupby Apply

Pandas is one of the most popular Python libraries for data manipulation and analysis. Data analysis frequently requires dividing the data into groups so that an operation can be performed on each group. The pandas groupby() function follows the split-apply-combine strategy: it splits an object into groups, applies a function to each group, and combines the results. One of the most commonly used methods for data preprocessing is apply(), which makes it simple to apply a function to every item of a pandas Series or to every row or column of a DataFrame. This article covers the apply() method together with the groupby() function and uses them to manipulate grouped data in a variety of ways.

How to Use the Apply() Function on Grouped Data

We can use the apply() function to apply a function along the rows or columns of a DataFrame. The objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). The method returns a Series or DataFrame holding the result of applying the function along the specified axis. It is useful when we want to transform a particular column without changing any other columns. The syntax of the DataFrame.apply() method is given below.

Syntax: DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)

Where:

func: The function that will be applied to each row or column.

axis: Specifies the direction along which the function is applied: 0 or “index” applies the function to each column, while 1 or “columns” applies the function to each row.

result_type: Accepts “expand”, “reduce”, “broadcast”, or None. The default value is None.

These only work with axis=1 (columns):

expand: List-like results will be turned into columns.

reduce: In contrast to “expand”, this returns a Series whenever possible rather than expanding list-like results.

broadcast: The results will be broadcast to the original shape of the DataFrame, and the original index and columns will be retained.
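The parameters above can be illustrated with a minimal sketch. The DataFrame and its column names (a, b) are made up for this illustration; only the apply() arguments come from the syntax shown above.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the axis and result_type parameters
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: the function receives each column as a Series
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)
print(col_range)

# axis=1: the function receives each row as a Series
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)
print(row_sum)

# result_type="expand": list-like results are turned into columns
expanded = df.apply(
    lambda row: [row["a"], row["b"] * 2], axis=1, result_type="expand"
)
print(expanded)
```

With the default result_type=None, the last call would return a Series whose elements are lists instead of a two-column DataFrame.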

Let’s also have a look at the syntax of the groupby() function to group the data:

Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)

by: mapping, function, label, or list of labels. It is used to determine the groups for the groupby. If a dict or Series is passed, its values are used to determine the groups. If a list or ndarray with a length equal to the selected axis is passed, the values are used as-is to determine the groups. A single label or a list of labels may be passed to group by the DataFrame’s own columns. Keep in mind that a tuple is interpreted as a single key.

axis: {‘index’ or 0, ‘columns’ or 1}, 0 by default. Split along rows (0) or columns (1).

level: int, level name, or a sequence of such; None by default. If the axis is a MultiIndex, group by a particular level or levels.

as_index: bool, True by default. For aggregated output, return an object with the group labels as the index. Applicable only to DataFrame input. Setting as_index=False effectively gives “SQL-style” grouped output.

sort: bool, True by default. Sort the group keys. Turning this off can improve performance.

group_keys: bool, True by default. When calling apply, add the group keys to the index to identify the pieces.

squeeze: bool, False by default. Reduce the dimensionality of the return type if possible; otherwise return a consistent type. Note that this parameter was deprecated in pandas 1.1 and removed in pandas 2.0.

observed: bool, False by default. This only applies if one or more of the groupers are categorical. If True, only the observed values of categorical groupers are shown; if False, all values of categorical groupers are shown.

dropna: bool, True by default. If True and the group keys contain NA values, the NA values together with the corresponding row or column are dropped. If False, NA values are also treated as a key in the groups.
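A short sketch of how a few of these parameters change the output may help. The DataFrame below (columns team and points, with one missing team) is hypothetical and chosen only to make the differences visible.

```python
import pandas as pd

# Hypothetical data; the None in "team" is a missing group key
df = pd.DataFrame({
    "team": ["b", "a", "b", None, "a"],
    "points": [1, 2, 3, 4, 5],
})

# Defaults: keys sorted, keys used as the index, missing keys dropped
default = df.groupby("team")["points"].sum()
print(default)

# as_index=False: "SQL-style" output with the key as a regular column
flat = df.groupby("team", as_index=False).sum()
print(flat)

# sort=False: groups appear in order of first occurrence
unsorted = df.groupby("team", sort=False)["points"].sum()
print(unsorted)

# dropna=False: the missing key forms its own group
with_na = df.groupby("team", dropna=False)["points"].sum()
print(with_na)
```

The dropna parameter requires pandas 1.1 or later.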

The following examples show how to use the groupby() and apply() functions together in pandas.

Example # 1: Determine the Frequency of Values in a Dataframe Column

To find the frequency of values in a dataframe column, we first need a dataframe, which we create with the pd.DataFrame() function.

With the dataframe created, let’s find the relative frequency of each distinct value in the column ‘group’. We first group the data with the groupby() method and then pass a function to apply() that computes the frequency.
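A minimal sketch of this example follows. The original data is not reproduced in the text, so the rows below are an assumption: seven hypothetical rows, four in group X and three in group Y, and a hypothetical column named value.

```python
import pandas as pd

# Hypothetical data: 4 rows belong to group X and 3 rows to group Y
df = pd.DataFrame({
    "group": ["X", "X", "Y", "X", "Y", "Y", "X"],
    "value": [10, 12, 7, 9, 15, 11, 8],
})

# Relative frequency: rows in each group divided by the total number of rows
freq = df.groupby("group")["value"].apply(lambda s: s.count() / df.shape[0])
print(freq)
```

Selecting the value column before calling apply() keeps the lambda working on a plain Series, which behaves the same way across pandas versions.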

The data is now grouped by the column ‘group’. To find the frequency, we used count() and the shape attribute: the number of rows in each group is divided by the total number of rows in the dataframe. Inside apply(), a lambda function carries out this expression. The resulting frequency is 0.57 for category X and 0.42 for category Y; that is, group X accounts for about 57% of the rows and group Y for about 42%.

Example # 2: Determine the Maximum Value in a Dataframe Column

We can use groupby() along with apply() to determine the maximum value within each group. Again, we create a dataframe first so that we can find the maximum value after grouping the data by a column.

First, we group the data by the column ‘team’ to create the categories. Then, inside apply(), we use the aggregation function max() to find the maximum value in the column ‘points’ for each category.
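The steps above can be sketched as follows. The rows are hypothetical, since the original data is not shown in the text; only the column names team and points and the three team names are taken from the article.

```python
import pandas as pd

# Hypothetical data with three teams
df = pd.DataFrame({
    "team": ["ace", "ace", "beta", "beta", "champ", "champ"],
    "points": [14, 9, 15, 11, 17, 13],
})

# Group by team, then take the maximum of "points" within each group
max_points = df.groupby("team")["points"].apply(lambda s: s.max())
print(max_points)
```

For a built-in aggregation such as this one, df.groupby("team")["points"].max() gives the same result without apply().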

The column ‘team’ yields three categories, and apply() has determined the maximum value of the column ‘points’ for each of them. The category ‘ace’ has a maximum of 14, whereas 15 and 17 are the maximum values for the groups ‘beta’ and ‘champ’, respectively.

Example # 3: Performing Custom Calculations by Using Apply() Function After Grouping the Data

Instead of using only the built-in aggregation functions of pandas, we can also define a custom function or write an expression inside apply() to perform custom calculations. Let’s create a dataframe, group its data by a specific column, and then perform a calculation on each group.

Let’s calculate the average difference between the values of two columns for each group.
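One way to sketch this calculation is shown below. The marks are hypothetical, since the original data is not shown in the text; only the column names student, total_marks, and obtained_marks and the student names come from the article.

```python
import pandas as pd

# Hypothetical data: three rows for Billy, two each for Jim and Mandy
df = pd.DataFrame({
    "student": ["Billy", "Billy", "Billy", "Jim", "Jim", "Mandy", "Mandy"],
    "total_marks": [20, 20, 20, 20, 20, 20, 20],
    "obtained_marks": [17, 16, 16, 18, 16, 15, 17],
})

# Custom expression: mean of (total_marks - obtained_marks) within each group
mean_diff = (
    df.groupby("student")[["total_marks", "obtained_marks"]]
    .apply(lambda g: (g["total_marks"] - g["obtained_marks"]).mean())
)
print(mean_diff)
```

Because the expression needs two columns, apply() receives each group as a DataFrame here rather than as a single Series.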

The data is grouped by the column ‘student’ into three categories: ‘Billy’, ‘Jim’, and ‘Mandy’. The average difference is obtained by subtracting the values of the ‘obtained_marks’ column from those of the ‘total_marks’ column and applying the mean() function to the result. The mean difference for the category ‘Billy’ is 3.66, whereas the mean differences for ‘Jim’ and ‘Mandy’ are 3 and 4, respectively.

Conclusion

In this tutorial, we discussed how to use the groupby() and apply() functions together in pandas. We looked at the syntax of both functions along with their parameters to understand how they work. We then worked through a few examples that show how to use groupby() and apply() with built-in functions as well as with custom functions.

About the author

Aqsa Yasin

I am a self-motivated information technology professional with a passion for writing. I am a technical writer and love to write for all Linux flavors and Windows.