
Pandas Bins

When working with continuous numeric data (such as sales, profits, or ages), it is often useful to group the values into bins. Sorting the data into distinct buckets makes it easier to see how the values are distributed and to compare groups. Binning goes by several other names, including quantization, discrete binning, and discretization.

In this tutorial, you will learn about cut() and qcut(), the two pandas functions for binning data. You can bin data into equal-sized or custom-sized bins. Custom bins let you group values into categories that are meaningful for your problem, while equal-sized bins make it simple to understand the distribution.

How To Bin Data in Pandas?

In pandas, binning is performed with the cut() and qcut() functions. Use cut() when you need to sort and segment data values into bins. It converts a continuous variable into a categorical one. The function works only on one-dimensional array-like objects, such as a list, a NumPy array, or a Series. It is useful for reducing a large set of numeric values to a small number of groups that can then be counted or summarized.

The qcut() function is a “quantile-based discretization” function. It divides the underlying data into bins based on sample quantiles, so each bin holds roughly the same number of observations. The bins are therefore equal in frequency rather than equal in width.

Syntax of cut() function:
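The original listing is not available here, so the signature below is the one documented for recent pandas releases; check the defaults against your installed version. The sample values are made up for illustration.

```python
import pandas as pd

# pandas.cut(x, bins, right=True, labels=None, retbins=False,
#            precision=3, include_lowest=False, duplicates="raise",
#            ordered=True)

values = [2, 7, 12, 18]                     # hypothetical sample data
result = pd.cut(values, bins=[0, 10, 20])   # two bins: (0, 10] and (10, 20]
print(result)
```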


Syntax of qcut() function:
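As with cut(), the signature below is the one documented for recent pandas releases and should be checked against your installed version. The sample values are made up for illustration.

```python
import pandas as pd

# pandas.qcut(x, q, labels=None, retbins=False, precision=3,
#             duplicates="raise")

values = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical sample data
result = pd.qcut(values, q=4)       # four quantile-based bins
print(result)
```

With eight evenly spaced values and q=4, each bin receives two values.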


Parameters:

x: One-dimensional array-like; the data we want to bin

q: (qcut() only) Number of quantiles, or an array of quantiles

bins: (cut() only) Either an integer number of equal-width bins or a sequence of bin edges that defines the segmentation

right: (cut() only) bool, True by default. It indicates whether each bin includes its right edge

labels: Array or False, optional. Labels for the returned bins. The length must match the number of resulting bins. If False, only the integer bin indicators are returned

retbins: bool, False by default. Whether the bin edges are returned as well. It is useful when bins is supplied as a scalar

Now we have seen the syntax of both functions. In the following examples, we will see how these functions work for binning the data:

Example # 1: Segmenting the Data Into Bins Using the cut() Function

First, we will create a DataFrame with at least one numeric column, so we can use the cut() function to divide the data inside that column into bins. Before creating the DataFrame, we will import the NumPy and pandas libraries to use their functionalities.


After importing the libraries, we used the pd.DataFrame() function to create our DataFrame. Inside the DataFrame() function, we passed a dictionary with the key “numbers”. We used the np.random.randint() function to generate an array of random integers as the values for that key. We passed 1, 20, and 10 to np.random.randint(), which means it will generate 10 values from 1 up to, but not including, 20. To view the generated column in the DataFrame, we will use the print() function.
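A minimal sketch of this step is shown below. The seed is an assumption added only so that repeated runs produce the same numbers; the values you get will differ from those discussed in the text.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed for reproducibility

# 10 random integers from 1 to 19 (the upper bound 20 is exclusive)
df = pd.DataFrame({"numbers": np.random.randint(1, 20, 10)})
print(df)
```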


As you can see, a column ‘numbers’ with 10 random values is generated in our DataFrame.
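The binning step might look like the sketch below. The column holds fixed sample values (an assumption standing in for the random column) so that the result is reproducible.

```python
import pandas as pd

# Hypothetical fixed values standing in for the random 'numbers' column
df = pd.DataFrame({"numbers": [6, 17, 3, 12, 9, 20, 5, 14, 2, 11]})

# Bin edges 1, 5, 10, 15, 20 give the intervals (1, 5], (5, 10], (10, 15], (15, 20]
df["bins"] = pd.cut(df["numbers"], bins=[1, 5, 10, 15, 20])
print(df)
```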


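In the previous code, the pd.cut() function segments the values in the ‘numbers’ column of the ‘df’ DataFrame. The bins parameter is specified as a list of values [1, 5, 10, 15, 20]. These values act as bin edges, so each value in the ‘numbers’ column is assigned to one of the intervals (1, 5], (5, 10], (10, 15], or (15, 20]. Because the left edge of the first interval is open by default, a value of exactly 1 falls outside every bin and becomes NaN unless include_lowest=True is passed. We have assigned these bins to a new column, ‘bins’, in the ‘df’ DataFrame.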


As seen, a bin is assigned to each value of the ‘numbers’ column. In the previous DataFrame, a square bracket, “[” or “]”, shows that the edge value is part of the range. A standard parenthesis, “(” or “)”, means the edge value is not part of the range. The new column ‘bins’ shows the range within which each number in the ‘numbers’ column lies. Note that in the row at index 0, (5, 10] just means that 5 < 6 <= 10. We can also check which bins occur in the column using the pandas unique() function, which extracts the distinct values from a Series. To get the frequency of each bin, use value_counts() instead.
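The sketch below lists the distinct bins and, for comparison, counts how many values fall in each one. The column values are the same hypothetical fixed numbers as before, not the random data from the text.

```python
import pandas as pd

# Hypothetical fixed values standing in for the random 'numbers' column
df = pd.DataFrame({"numbers": [6, 17, 3, 12, 9, 20, 5, 14, 2, 11]})
df["bins"] = pd.cut(df["numbers"], bins=[1, 5, 10, 15, 20])

# Distinct bins, in order of first appearance
print(df["bins"].unique())

# Frequency of each bin, ordered from the lowest interval to the highest
counts = df["bins"].value_counts().sort_index()
print(counts)
```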


As shown by the unique() function, there are four unique bins in the ‘bins’ column of our DataFrame: (5, 10], (15, 20], (1, 5], and (10, 15]. You can use the parameters described previously in the syntax to modify the output generated by the cut() function.

Example # 2: Labeling the Bins by Using the cut() Function

We can also label our bins by specifying the ‘labels’ parameter in the cut() function. We will use the DataFrame created in the previous example to label the bins.


Let’s add the labels to the bins in the ‘bins’ column of our ‘df’ DataFrame.
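One way to write this step is sketched below. The bin edges [1, 10, 20] and right=False are assumptions chosen so that 10 itself lands in “10 or greater”; the original listing may have used different edges. The column holds hypothetical fixed values so that the result is reproducible.

```python
import pandas as pd

# Hypothetical fixed values standing in for the random 'numbers' column
df = pd.DataFrame({"numbers": [6, 17, 3, 12, 9, 19, 5, 14, 1, 10]})

# right=False makes the intervals [1, 10) and [10, 20)
df["bins"] = pd.cut(
    df["numbers"],
    bins=[1, 10, 20],
    right=False,
    labels=["less than 10", "10 or greater"],
)
print(df)
```

With right=False the last edge is open, so a value of exactly 20 would not be binned. That is not an issue for data from np.random.randint(1, 20, 10), which never produces 20.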


In the pd.cut() function, we have defined the edges of the bins, and in labels, we specified one label for each bin. If a value is between 1 and 9, the label “less than 10” is shown in the ‘bins’ column, and if a value is between 10 and 20, the ‘bins’ column shows the label “10 or greater”.


As you can see, our bins are now labeled according to the specified range.

Example # 3: Equal-Sized Binning Using qcut() Function

Let’s create a sample DataFrame on which we will use the qcut() function. There will be only two columns in our DataFrame: a ‘Name’ column and a ‘Score’ column. We will use the pd.DataFrame.from_dict() function to load the data into our ‘df’ DataFrame:
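The original data is not available here, so the names and scores below are hypothetical placeholders that only illustrate the shape of the DataFrame.

```python
import pandas as pd

# Hypothetical sample data: eight names with one score each
data = {
    "Name": ["Ali", "Bea", "Cal", "Dee", "Eli", "Fay", "Gus", "Hal"],
    "Score": [5, 12, 9, 15, 7, 20, 11, 18],
}
df = pd.DataFrame.from_dict(data)
print(df)
```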


Now, we will use qcut() to divide the data of the ‘Score’ column into bins that each hold about the same number of rows. The function requires only two inputs: the column to bin and the number of quantiles to produce. The function returns a Series, which can be used to create a new column containing the bins.
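A sketch of the qcut() call follows, using hypothetical scores. Because the data is made up, the bin edges it produces will not match the ones quoted in the text.

```python
import pandas as pd

# Hypothetical sample data standing in for the original DataFrame
df = pd.DataFrame.from_dict({
    "Name": ["Ali", "Bea", "Cal", "Dee", "Eli", "Fay", "Gus", "Hal"],
    "Score": [5, 12, 9, 15, 7, 20, 11, 18],
})

# Split 'Score' into 4 quantile-based bins (quartiles)
df["bins"] = pd.qcut(df["Score"], q=4)
print(df)

# Each quartile should contain the same number of rows here
print(df["bins"].value_counts().sort_index())
```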


As seen in the previous DataFrame, a square bracket, “[” or “]”, shows that the edge value is part of the range. A standard parenthesis, “(” or “)”, means the edge value is not part of the range. For example, at index 1, we can see that the value 12 of the ‘Score’ column lies in the range (10.8, 12.0], where 10.8 is not part of the range and 12 is. We can also label our bins by specifying the ‘labels’ parameter in the qcut() function, as we did in Example 2 using the cut() function. We can also modify the results of the qcut() function using the parameters described in the previous syntax.
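Labeling quantile bins could look like the sketch below. The scores and the label names “Q1” to “Q4” are hypothetical.

```python
import pandas as pd

# Hypothetical sample data standing in for the original DataFrame
df = pd.DataFrame.from_dict({
    "Name": ["Ali", "Bea", "Cal", "Dee", "Eli", "Fay", "Gus", "Hal"],
    "Score": [5, 12, 9, 15, 7, 20, 11, 18],
})

# One label per quantile bin, from the lowest scores to the highest
df["bins"] = pd.qcut(df["Score"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(df)
```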

Conclusion

In this tutorial, we have discussed the cut() and qcut() functions for binning data in pandas. We have seen the syntax of both functions and described their parameters to help you use them. In the examples, we showed how to segment data into bins and label the bins with cut(), and how to create quantile-based bins with qcut(). You should now be able to bin data on your own using these functions.

About the author

Aqsa Yasin

I am a self-motivated information technology professional with a passion for writing. I am a technical writer and love to write for all Linux flavors and Windows.