
Pandas – Convert Categorical Values to Int Values

Datasets used for machine learning often include both numerical and categorical variables. Categorical variables are typically stored as strings, which humans easily comprehend. Most machine learning algorithms, on the other hand, cannot work with categorical inputs directly. Consequently, categorical content must be transformed into numerical values that they can interpret.

In this tutorial, we will discuss three different ways to convert categorical values to numeric values in a Pandas DataFrame.

Approach 1: Using replace()

In Pandas, replace() is used to change the given values to new values that we specify. To replace multiple values, we pass two lists as parameters. The first list holds the values to be replaced, and the second list holds the new values, matched to the first list by position.

Syntax

DataFrame_object['column']=DataFrame_object['column'].replace([value1,value2,...],[new_value1,new_value2,...])

Here, column is the name of the column in which we are replacing the values.
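
As a minimal sketch of the same idea, replace() also accepts a dictionary in which each key is an old value and each value is its replacement, which keeps every pair side by side. The small DataFrame df below is hypothetical and used only for illustration, and the astype(int) call is an optional extra step that makes the integer dtype explicit.

```python
import pandas as pd

# Hypothetical three-row DataFrame, used only for illustration
df = pd.DataFrame({'gender': ['M', 'F', 'M']})

# Dictionary form: each key is replaced by its value.
# astype(int) makes the resulting integer dtype explicit.
df['gender'] = df['gender'].replace({'M': 1, 'F': 2}).astype(int)

print(df['gender'].tolist())  # [1, 2, 1]
```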

Consider the DataFrame

Let’s create a Pandas DataFrame named sets that holds 4 columns and 10 rows.

# Import the DataFrame from the pandas module

from pandas import DataFrame

 

# Create a DataFrame with 10 rows that hold 4 columns

sets = DataFrame({'code': [1,2,3,4,5,6,7,8,9,0],

'priority':['high','low','low','high','medium','high','medium','low','high','medium'],

'gender':['M','M','M','F','M','M','F','F','M','F'],

'age':[12,23,21,34,21,23,21,34,56,32]})

 

# Actual DataFrame

print(sets)

# Display data types of each column

print(sets.dtypes)

Output

code priority gender age

0 1 high M 12

1 2 low M 23

2 3 low M 21

3 4 high F 34

4 5 medium M 21

5 6 high M 23

6 7 medium F 21

7 8 low F 34

8 9 high M 56

9 0 medium F 32

code int64

priority object

gender object

age int64

dtype: object

We see two columns with the categorical type (object), i.e., priority and gender.

So we need to convert these to numeric/integer values.

Example 1

Let’s replace the values in the gender column using the replace() method.

# Import the DataFrame from the pandas module

from pandas import DataFrame

 

# Create a DataFrame with 10 rows that hold 4 columns

sets = DataFrame({'code': [1,2,3,4,5,6,7,8,9,0],

'priority':['high','low','low','high','medium','high','medium','low','high','medium'],

'gender':['M','M','M','F','M','M','F','F','M','F'],

'age':[12,23,21,34,21,23,21,34,56,32]})

# Convert categorical values to Numeric values in the gender column

sets['gender']=sets['gender'].replace(['M', 'F'],[1, 2])

print(sets)

Output

code priority gender age

0 1 high 1 12

1 2 low 1 23

2 3 low 1 21

3 4 high 2 34

4 5 medium 1 21

5 6 high 1 23

6 7 medium 2 21

7 8 low 2 34

8 9 high 1 56

9 0 medium 2 32

Explanation

We are replacing ‘M’ with 1 and ‘F’ with 2 and storing the column values again in the gender column. We can see that the gender column now holds only the values 1 and 2.

Example 2

Let’s replace the values in the priority column using the replace() method.

# Import the DataFrame from the pandas module

from pandas import DataFrame

 

# Create a DataFrame with 10 rows that hold 4 columns

sets = DataFrame({'code': [1,2,3,4,5,6,7,8,9,0],

'priority':['high','low','low','high','medium','high','medium','low','high','medium'],

'gender':['M','M','M','F','M','M','F','F','M','F'],

'age':[12,23,21,34,21,23,21,34,56,32]})

# Convert categorical values to Numeric values in the priority column

sets['priority']=sets['priority'].replace(['low', 'medium','high'],[0,1,2])

print(sets)

Output

code priority gender age

0 1 2 M 12

1 2 0 M 23

2 3 0 M 21

3 4 2 F 34

4 5 1 M 21

5 6 2 M 23

6 7 1 F 21

7 8 0 F 34

8 9 2 M 56

9 0 1 F 32

Explanation

There are three categories in the priority column. They are ‘low’, ‘high’, and ‘medium’. We are replacing ‘low’ with 0, ‘medium’ with 1, and ‘high’ with 2 and storing the column values again in the priority column.
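
One thing worth checking with this approach: replace() leaves any value that is not in the first list unchanged, so an unexpected label stays in the column as a string. The sketch below assumes a hypothetical extra label 'urgent' to show how such leftovers can be detected. The names encoded and leftovers are illustrative.

```python
import pandas as pd

# 'urgent' is a hypothetical label that is missing from the replacement list
priority = pd.Series(['high', 'low', 'urgent'])

encoded = priority.replace(['low', 'medium', 'high'], [0, 1, 2])

# Values that are still strings were not matched by replace()
leftovers = [value for value in encoded.tolist() if isinstance(value, str)]
print(leftovers)  # ['urgent']
```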

Approach 2: Using apply() with factorize()

Another option Pandas provides is the ‘pandas.factorize()’ function, which can be combined with ‘DataFrame.apply()’ to convert the categorical values of several columns into integers.

To convert multiple categorical columns into integers, we follow this technique:

  1. Select all the columns that contain the object datatype by employing ‘DataFrame.select_dtypes(['object']).columns’.
  2. Convert these columns to integers by using ‘DataFrame.apply()’ with the ‘pandas.factorize()’ function.

The factorize() function takes the values of a column with the ‘object’ data type and returns a pair: an array of integer codes and the unique values. Only the codes are needed here, which is why the syntax below selects element [0].

If you want to convert only a particular column’s categorical values to integers, apply() is not needed.

Syntax for Single Column

DataFrame_object['column']=pandas.factorize(DataFrame_object['column'])[0]

Syntax for All Columns

DataFrame_object[DataFrame_object.select_dtypes(['object']).columns]= DataFrame_object[DataFrame_object.select_dtypes(['object']).columns].apply(lambda x: pandas.factorize(x)[0])

Note: The codes start from 0 and are assigned in the order in which each category first appears in the column.
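
To make the note concrete, here is a small sketch of what factorize() returns. The Series named priority is a shortened, illustrative version of the column used in this tutorial. The optional sort=True argument assigns the codes in sorted order of the categories instead of order of appearance.

```python
import pandas as pd

priority = pd.Series(['high', 'low', 'low', 'high', 'medium'])

# factorize() returns a tuple: (integer codes, unique values)
codes, uniques = pd.factorize(priority)
print(codes.tolist())   # [0, 1, 1, 0, 2]
print(list(uniques))    # ['high', 'low', 'medium']

# With sort=True, the codes follow the sorted order of the categories
sorted_codes, sorted_uniques = pd.factorize(pd.Series(['low', 'high']), sort=True)
print(sorted_codes.tolist())   # [1, 0]
print(list(sorted_uniques))    # ['high', 'low']
```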

Example 1: Replace Single Column Categorical Values

Let’s replace the values in the gender column using the factorize() method.

# Import the DataFrame from the pandas module

from pandas import DataFrame

import pandas

# Create a DataFrame with 10 rows that hold 4 columns

sets = DataFrame({'code': [1,2,3,4,5,6,7,8,9,0],

'priority':['high','low','low','high','medium','high','medium','low','high','medium'],

'gender':['M','M','M','F','M','M','F','F','M','F'],

'age':[12,23,21,34,21,23,21,34,56,32]})

# Convert categorical values in the gender column to integers using the factorize() method

sets['gender'] = pandas.factorize(sets['gender'])[0]

# Converted DataFrame

print(sets)

Output

code priority gender age

0 1 high 0 12

1 2 low 0 23

2 3 low 0 21

3 4 high 1 34

4 5 medium 0 21

5 6 high 0 23

6 7 medium 1 21

7 8 low 1 34

8 9 high 0 56

9 0 medium 1 32

Explanation

We are replacing ‘M’ with 0 and ‘F’ with 1 and storing the column values again in the gender column. Now, we can see that the gender column holds only the values 0 and 1.

Example 2: Replace All Column Categorical Values

Let’s replace the values in all the categorical columns using apply() with factorize().

# Import the DataFrame from the pandas module

from pandas import DataFrame

import pandas

# Create a DataFrame with 10 rows that hold 4 columns

sets = DataFrame({'code': [1,2,3,4,5,6,7,8,9,0],

'priority':['high','low','low','high','medium','high','medium','low','high','medium'],

'gender':['M','M','M','F','M','M','F','F','M','F'],

'age':[12,23,21,34,21,23,21,34,56,32]})

 

# Replace all column categorical values

sets[sets.select_dtypes(['object']).columns]= sets[sets.select_dtypes(['object']).columns].apply(lambda x: pandas.factorize(x)[0])

print(sets)

Output

code priority gender age

0 1 0 0 12

1 2 1 0 23

2 3 1 0 21

3 4 0 1 34

4 5 2 0 21

5 6 0 0 23

6 7 2 1 21

7 8 1 1 34

8 9 0 0 56

9 0 2 1 32

Explanation

We can see the following:

  1. In the gender column, ‘M’ is replaced with 0, and ‘F’ is replaced with 1.
  2. In the priority column, ‘high’ is replaced with 0, ‘low’ is replaced with 1, and ‘medium’ is replaced with 2.
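
Because the codes depend on the order of appearance, it can help to record which label received which code. The following is a minimal sketch rather than part of the original example: the dictionary name mappings is hypothetical, the DataFrame is a shortened version of sets, and the column names are listed explicitly.

```python
import pandas as pd

# Shortened, illustrative version of the tutorial's DataFrame
sets = pd.DataFrame({
    'priority': ['high', 'low', 'low', 'high', 'medium'],
    'gender': ['M', 'M', 'M', 'F', 'M'],
})

# Hypothetical helper dictionary: column name -> {label: code}
mappings = {}
for col in ['priority', 'gender']:
    codes, uniques = pd.factorize(sets[col])
    sets[col] = codes
    mappings[col] = {label: code for code, label in enumerate(uniques)}

print(mappings)
# {'priority': {'high': 0, 'low': 1, 'medium': 2}, 'gender': {'M': 0, 'F': 1}}
```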

So far, we have seen how to replace categorical values with integer values in a single column or in all columns. With replace(), the categorical values must be known in advance. Suppose there is a huge dataset with more than one lakh (100,000) records. How do we replace the categorical values without listing them?

The solution is Label Encoding.

Let’s discuss this approach.

Approach 3: Using LabelEncoding

LabelEncoder is a class in the sklearn.preprocessing module (scikit-learn) that converts the categorical values of a particular column to integers. We don’t need to specify the categorical values.

Its fit_transform() method learns the categories present in the column and returns the encoded integer values in a single step.

In this technique, the codes start from 0 and are assigned in alphabetical (sorted) order of the categorical values.

Syntax

DataFrame_object['column']=LabelEncoder().fit_transform(DataFrame_object['column'])

Here, column is the name of the column in which we are replacing the values.
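
As a sketch of how to inspect the result, a fitted LabelEncoder stores the sorted categories in its classes_ attribute, and inverse_transform() maps the codes back to the original labels. The variable names below are illustrative, and the Series is a shortened version of the priority column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

priority = pd.Series(['high', 'low', 'low', 'high', 'medium'])

# Keep a reference to the encoder so the mapping can be inspected later
encoder = LabelEncoder()
codes = encoder.fit_transform(priority)

labels = encoder.classes_.tolist()                    # sorted categories
restored = encoder.inverse_transform(codes).tolist()  # codes back to labels

print(codes.tolist())  # [0, 1, 1, 0, 2]
print(labels)          # ['high', 'low', 'medium']
```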

Example

Let’s replace the values in the gender and priority columns using the following approach:

# Import the DataFrame from the pandas module

from pandas import DataFrame

import pandas

# Import LabelEncoder from sklearn module

from sklearn.preprocessing import LabelEncoder

# Create a DataFrame with 10 rows that hold 4 columns

sets = DataFrame({'code': [1,2,3,4,5,6,7,8,9,0],

'priority':['high','low','low','high','medium','high','medium','low','high','medium'],

'gender':['M','M','M','F','M','M','F','F','M','F'],

'age':[12,23,21,34,21,23,21,34,56,32]})

# Convert categorical values of gender column to numeric

sets['gender']=LabelEncoder().fit_transform(sets['gender'])

# Convert categorical values of priority column to numeric

sets['priority']=LabelEncoder().fit_transform(sets['priority'])

print(sets)

Output

code priority gender age

0 1 0 1 12

1 2 1 1 23

2 3 1 1 21

3 4 0 0 34

4 5 2 1 21

5 6 0 1 23

6 7 2 0 21

7 8 1 0 34

8 9 0 1 56

9 0 2 0 32

Explanation

  1. ‘F’ is replaced with 0 and ‘M’ is replaced with 1, because the codes follow alphabetical order. Now, we can see that the gender column holds only the values 0 and 1.
  2. There are three categories in the priority column. They are ‘high’, ‘low’, and ‘medium’. In alphabetical order, ‘high’ is replaced with 0, ‘low’ with 1, and ‘medium’ with 2, and the column values are stored again in the priority column.

Conclusion

Our guide revolves around converting categorical values into numerical values so that machine learning algorithms, which cannot process the object datatype directly, can work with them. We have introduced you to three approaches: replace() and factorize() from the “Pandas” library, and LabelEncoder from scikit-learn. Remember, the LabelEncoding approach is a good choice when you don’t know how many categories are present in a column of the Pandas DataFrame.

About the author

Gottumukkala Sravan Kumar

B tech-hon's in Information Technology; Known programming languages - Python, R , PHP MySQL; Published 500+ articles on computer science domain