
Pandas Cut() Function

Numerical data is ubiquitous in data analysis. You will often encounter continuous values that span a very wide range or are heavily skewed. In such cases, it is usually preferable to group the data into discrete intervals. Once the data is divided into meaningful groups, descriptive statistics can be computed more effectively.

Converting numerical data into categorical groups is straightforward with Pandas’ built-in cut() function. The cut() function works only on one-dimensional, array-like input. It is handy when we have a set of numerical values and need to run a statistical analysis on grouped data.

Let’s imagine, for illustration, that we have values ranging from 5 to 15. We divide these numbers into two categories, which are called bins. Bin 1 covers 5 to 10 and bin 2 covers 10 to 15. With both bins in place, we can tell which numbers are smaller and which are larger: the values in the 10 to 15 bin are larger than those in the 5 to 10 bin. This leads to the labels “Lows” and “Highs”, which refer to the lower values and the larger ones, respectively.
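The 5 to 15 illustration above can be sketched in a few lines. The sample values are hypothetical, and include_lowest=True is an assumption added here so that the value 5 itself lands in the first bin:

```python
import pandas as pd

# Hypothetical sample values between 5 and 15 (not taken from the article).
values = pd.Series([5, 6, 8, 10, 11, 13, 15])

# Two bins: 5 to 10 and 10 to 15, labeled "Lows" and "Highs".
# include_lowest=True keeps the lowest edge (5) inside the first bin.
binned = pd.cut(
    values,
    bins=[5, 10, 15],
    labels=["Lows", "Highs"],
    include_lowest=True,
)

print(pd.DataFrame({"value": values, "bin": binned}))
```

Because the right edge of each bin is included by default, the value 10 is labeled “Lows” in this sketch.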

This approach is known as binning (or labeling) the data with Pandas’ cut() function. Use cut() whenever you need to divide data into segments and sort the values into bins. The function is also useful for converting a continuous variable into a categorical one.

Pandas Cut() Method Syntax

The “x” parameter is the one-dimensional array to be binned. The “bins” parameter defines the bin boundaries used for classification. The “right” parameter specifies whether each bin includes its right edge; the default is True. The “labels” parameter names the returned bins, for example “Lows” and “Highs”; it must have the same length as the number of resulting bins. It accepts an array of labels or the Boolean False, in which case integer bin indicators are returned instead. The “retbins” parameter determines whether the bin edges are returned along with the result. The “precision” parameter sets the number of decimal places used when storing and displaying the bin labels. The “include_lowest” parameter determines whether the first interval is left-inclusive. When the bin edges are not unique, “duplicates” specifies whether to raise a ValueError (“raise”) or drop the non-unique edges (“drop”).
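As a rough sketch of how these parameters fit together, the call below spells out each keyword. Every keyword is left at its default except retbins, which is set to True here so that the computed edges are returned. The input list is hypothetical, and bins=3 asks Pandas to compute three equal-width bins rather than taking explicit edges:

```python
import pandas as pd

# Hypothetical one-dimensional input.
data = [1, 7, 5, 4, 6, 3]

# retbins=True makes cut() return a (result, edges) pair.
result, edges = pd.cut(
    x=data,
    bins=3,                # an integer requests equal-width bins
    right=True,            # each bin includes its right edge
    labels=None,           # None keeps the interval notation as labels
    retbins=True,          # also return the computed bin edges
    precision=3,           # decimal places used for the bin labels
    include_lowest=False,  # first interval is not forced left-inclusive
    duplicates="raise",    # raise ValueError if edges are not unique
)

print(result)
print(edges)
```

Passing labels=False instead would return the integer position of each bin rather than an interval.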

Example 1: Segmenting Values into Bins

We start the practical demonstration of the Pandas cut() function with a simple example: sorting the values of a DataFrame column into bins.

Before working on the main code, import the necessary Python libraries. In this illustration, we import two libraries: “Pandas” and “NumPy”.

The Pandas library provides the Pandas functions, including the cut() function that is the topic of this article. NumPy is among the most widely used Python tools for numerical computation. We use NumPy to generate the random integers that fill the DataFrame object.

Now, we begin with the main code.

Here, we create a variable named “new_df” that stores a DataFrame of randomly generated numbers. “pd.DataFrame” is invoked to generate the DataFrame. It is given the column title “value” together with the values produced by the “np.random.randint” function, which generates the random integers. That function takes three parameters: the minimum value, the maximum value, and the size of the array. We set the minimum to 5, the maximum to 50, and the size to 10. So, it generates 10 random integers from 5 up to, but not including, 50. Then, we use the “print()” function to print the DataFrame “new_df”.
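A minimal sketch of this step might look as follows. Passing the column as a dictionary and seeding the random generator are assumptions made here for a reproducible illustration; they are not confirmed by the article:

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed so that repeated runs give the same numbers.
np.random.seed(0)

# 10 random integers; the upper bound of randint is exclusive,
# so the values lie between 5 and 49.
new_df = pd.DataFrame({"value": np.random.randint(5, 50, 10)})

print(new_df)
```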

Here, you can see a DataFrame with the column “value” holding 10 values.

Now, we create another column named “value_bins” inside the existing DataFrame, new_df, by calling the Pandas cut() function. The “x” parameter is assigned the one-dimensional data that we need to bin. In our example, it is new_df[“value”], where “value” is the name of the column on which cut() is applied. The second parameter that we use is “bins”, which defines the edges of the bins. Here, we want to divide the data into 4 bins: (5, 20], (20, 30], (30, 40], and (40, 50].

In the last print statement, we call the “unique()” function, which returns an array of the unique bins that appear in the new column.
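The binning step described above could be written roughly as below. The DataFrame is rebuilt so that the snippet runs on its own, and the seed is an assumption for reproducibility:

```python
import numpy as np
import pandas as pd

# Rebuild the example DataFrame (the seed is an assumption).
np.random.seed(0)
new_df = pd.DataFrame({"value": np.random.randint(5, 50, 10)})

# Five edges define the four bins (5, 20], (20, 30], (30, 40], (40, 50].
new_df["value_bins"] = pd.cut(x=new_df["value"], bins=[5, 20, 30, 40, 50])

print(new_df)

# Show which bins actually occur in the data.
print(new_df["value_bins"].unique())
```

Note that with these edges a value of exactly 5 falls outside the first bin and becomes NaN, because the left edge is excluded unless include_lowest=True is passed.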

The output image shows the DataFrame with the bin column. You may notice that a value of exactly 20 is placed in the (5, 20] bin. This is a result of the default inclusion of each bin’s right edge. If we don’t want that, we call the cut() method with the right=False option.
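To see the effect of the “right” option on boundary values, the sketch below bins a few hand-picked edge values both ways. The values are hypothetical and chosen only because they sit on bin edges:

```python
import pandas as pd

# Hypothetical values that sit exactly on bin edges, plus one interior value.
edge_values = pd.Series([5, 20, 30, 49])
bins = [5, 20, 30, 40, 50]

# Default: intervals are closed on the right, e.g. (5, 20].
right_closed = pd.cut(edge_values, bins=bins)

# right=False: intervals are closed on the left, e.g. [5, 20).
left_closed = pd.cut(edge_values, bins=bins, right=False)

print(pd.DataFrame({
    "value": edge_values,
    "right=True": right_closed,
    "right=False": left_closed,
}))
```

With the default setting, 20 belongs to (5, 20] and 5 is left unbinned; with right=False, 20 moves to [20, 30) and 5 belongs to [5, 20).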

Example 2: Labeling the Bins

We can add labels to the bins with the Pandas cut() function.

For illustration purposes, we create a DataFrame with the Pandas DataFrame function, as in the previous example. This DataFrame contains a column “number” which stores an array of size 10 with randomly generated values from 11 to 32. Then, we create another column in the same DataFrame and name it “numbers_labels”. We invoke the Pandas cut() function and pass it the column of our DataFrame to which it applies. Since we need to segment the data into 2 bins, we provide three bin edges: 11, 22, and 32.

The next step is to define the labels of the bins. In the “labels” argument, we pass the two strings “Lows” and “Highs”.

We use the same procedure as before, but in addition to dividing the results into bins, we now label the bins as highs and lows.

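The numerical values are sorted into bins, so we can observe which numbers are larger and which are smaller. In the cut() function invocation, we set right=False so that each bin includes its left edge rather than its right edge. The bins become [11, 22) and [22, 32), and the boundary value between them (22 in this example) belongs to Highs.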
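A sketch of this labeling example is shown below. The variable name df and the seed are hypothetical, and the call to randint assumes 32 is the exclusive upper bound, so the generated values run from 11 to 31:

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed for a reproducible illustration.
np.random.seed(1)

# Hypothetical DataFrame name; 10 random integers from 11 to 31.
df = pd.DataFrame({"number": np.random.randint(11, 32, 10)})

# Two left-closed bins, [11, 22) and [22, 32), labeled "Lows" and "Highs".
df["numbers_labels"] = pd.cut(
    x=df["number"],
    bins=[11, 22, 32],
    labels=["Lows", "Highs"],
    right=False,
)

print(df)
```

With right=False, a value of exactly 22 is labeled “Highs”, and the lowest possible value, 11, is still covered by the first bin.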

The output image shows the bins with the “Lows” and “Highs” labels. The smaller values are labeled as Lows and the larger values as Highs.

Conclusion

This article covered the Pandas cut() function. It introduced the function and explained when it is needed. We described each parameter in easy-to-understand terms and walked through practical code examples, implemented in Spyder, that you can use for practice. In a similar way, you can practice the other parameters of the cut() function. We hope that these exercises help you learn the concept.

About the author

Aqsa Yasin

I am a self-motivated information technology professional with a passion for writing. I am a technical writer and love to write for all Linux flavors and Windows.