Machine learning datasets often contain both numerical and categorical variables. Categorical variables are typically stored as strings, which humans read easily but most machine learning algorithms cannot use directly. Consequently, categorical values must be converted to numbers before they are passed to a model.
In this tutorial, we will discuss three different ways to convert categorical values to numeric values in a Pandas DataFrame.
Approach 1: Using replace()
In Pandas, replace() changes the given values to new values that we specify. To replace multiple values, we pass two lists as parameters: the first list holds the values to be replaced, and the second list holds the values that replace them, matched by position.
Syntax
Here, column is the name of the column in which we are replacing the values.
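The general call shape described above can be sketched as follows. This is a minimal illustration rather than the original syntax listing; the toy DataFrame df and the lists old_values and new_values are hypothetical names chosen for the example.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the call shape
df = pd.DataFrame({'column': ['a', 'b', 'a', 'c']})

old_values = ['a', 'b', 'c']   # values currently in the column
new_values = [10, 20, 30]      # replacements, matched by position

# General form: DataFrame['column'].replace(old_values, new_values)
df['column'] = df['column'].replace(old_values, new_values)

print(df['column'].tolist())   # [10, 20, 10, 30]
```

Both lists must have the same length, since each value is paired with the replacement at the same position. Depending on the Pandas version, the resulting column may be reported as int64 or as object, so check sets.dtypes after the replacement if the dtype matters.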
Consider the DataFrame
Let’s create a Pandas DataFrame named sets with 4 columns and 10 rows.
from pandas import DataFrame
# Create a DataFrame with 10 rows that hold 4 columns
sets = DataFrame({'code': [1,2,3,4,5,6,7,8,9,0],
'priority':['high','low','low','high','medium','high','medium','low','high','medium'],
'gender':['M','M','M','F','M','M','F','F','M','F'],
'age':[12,23,21,34,21,23,21,34,56,32]})
# Actual DataFrame
print(sets)
# Display data types of each column
print(sets.dtypes)
Output
  code priority gender age
0 1 high M 12
1 2 low M 23
2 3 low M 21
3 4 high F 34
4 5 medium M 21
5 6 high M 23
6 7 medium F 21
7 8 low F 34
8 9 high M 56
9 0 medium F 32
code int64
priority object
gender object
age int64
dtype: object
Two columns, priority and gender, have the object datatype and hold categorical values.
These are the columns we need to convert to numeric (integer) values.
Example 1
Let’s replace the values in the gender column using the replace() method.
from pandas import DataFrame
# Create a DataFrame with 10 rows that hold 4 columns
sets = DataFrame({'code': [1,2,3,4,5,6,7,8,9,0],
'priority':['high','low','low','high','medium','high','medium','low','high','medium'],
'gender':['M','M','M','F','M','M','F','F','M','F'],
'age':[12,23,21,34,21,23,21,34,56,32]})
# Convert categorical values to Numeric values in the gender column
sets['gender']=sets['gender'].replace(['M', 'F'],[1, 2])
print(sets)
Output
  code priority gender age
0 1 high 1 12
1 2 low 1 23
2 3 low 1 21
3 4 high 2 34
4 5 medium 1 21
5 6 high 1 23
6 7 medium 2 21
7 8 low 2 34
8 9 high 1 56
9 0 medium 2 32
Explanation
We replace ‘M’ with 1 and ‘F’ with 2 and store the result back in the gender column. The gender column now holds only the values 1 and 2.
Example 2
Let’s replace the values in the priority column using the replace() method.
from pandas import DataFrame
# Create a DataFrame with 10 rows that hold 4 columns
sets = DataFrame({'code': [1,2,3,4,5,6,7,8,9,0],
'priority':['high','low','low','high','medium','high','medium','low','high','medium'],
'gender':['M','M','M','F','M','M','F','F','M','F'],
'age':[12,23,21,34,21,23,21,34,56,32]})
# Convert categorical values to Numeric values in the priority column
sets['priority']=sets['priority'].replace(['low', 'medium','high'],[0,1,2])
print(sets)
Output
  code priority gender age
0 1 2 M 12
1 2 0 M 23
2 3 0 M 21
3 4 2 F 34
4 5 1 M 21
5 6 2 M 23
6 7 1 F 21
7 8 0 F 34
8 9 2 M 56
9 0 1 F 32
Explanation
There are three categories in the priority column. They are ‘low’, ‘high’, and ‘medium’. We are replacing ‘low’ with 0, ‘medium’ with 1, and ‘high’ with 2 and storing the column values again in the priority column.
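When the categories are known, the same explicit mapping can also be written as a dictionary. The sketch below uses Series.map(), which is an alternative not covered in the examples above; the names priority and priority_codes are illustrative only.

```python
import pandas as pd

priority = pd.Series(['high', 'low', 'medium', 'low'])

# Explicit mapping from category to number (same codes as in Example 2)
priority_codes = {'low': 0, 'medium': 1, 'high': 2}

encoded = priority.map(priority_codes)
print(encoded.tolist())               # [2, 0, 1, 0]

# A value missing from the dictionary is not left unchanged; it becomes NaN
with_unknown = pd.Series(['high', 'urgent']).map(priority_codes)
print(with_unknown.isna().tolist())   # [False, True]
```

The difference matters for unexpected values: replace() leaves a value it was not told about untouched, while map() turns it into NaN, which makes missing categories easier to spot.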
Approach 2: Using apply(factorize())
Pandas also provides the factorize() function, which can be combined with ‘DataFrame.apply()’ to convert all categorical columns into integers at once.
To convert multiple categorical columns into integers, follow these steps:
- Select all the columns that have the object datatype by using ‘DataFrame.select_dtypes().columns’.
- Convert these columns to integers by using ‘DataFrame.apply()’ together with ‘pandas.factorize()’.
The factorize() function takes a column of values and returns a pair: an array of integer codes and the unique values those codes stand for. Codes are assigned in order of first appearance, and we keep only the first element of the pair.
If you want to convert only a particular column’s categorical values to integers, apply() is not needed; call factorize() on that column directly.
Syntax for Single Column
Syntax for All Columns
Note: Replacement will start from 0.
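The two call shapes named above can be sketched as follows. This is an illustration under assumed names, not the original syntax listing: the toy DataFrame df and its columns size, color, and qty are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with two object columns and one numeric column
df = pd.DataFrame({'size': ['S', 'L', 'S'],
                   'color': ['red', 'red', 'blue'],
                   'qty': [1, 2, 3]})

# Single column: factorize() returns (codes, uniques); [0] keeps the codes
single = pd.factorize(df['size'])[0]
print(single.tolist())        # [0, 1, 0]

# All object columns: pick them with select_dtypes(), then factorize each one
object_cols = df.select_dtypes(['object']).columns
df[object_cols] = df[object_cols].apply(lambda col: pd.factorize(col)[0])

print(df['color'].tolist())   # [0, 0, 1]
print(df['qty'].tolist())     # [1, 2, 3]  (numeric column is untouched)
```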
Example 1: Replace Single Column Categorical Values
Let’s replace the values in the gender column using the factorize() method.
from pandas import DataFrame
import pandas
# Create a DataFrame with 10 rows that hold 4 columns
sets = DataFrame({'code': [1,2,3,4,5,6,7,8,9,0],
'priority':['high','low','low','high','medium','high','medium','low','high','medium'],
'gender':['M','M','M','F','M','M','F','F','M','F'],
'age':[12,23,21,34,21,23,21,34,56,32]})
# Convert categorical values in the gender column to integers using factorize()
sets['gender'] = pandas.factorize(sets['gender'])[0]
# Modified DataFrame
print(sets)
Output
  code priority gender age
0 1 high 0 12
1 2 low 0 23
2 3 low 0 21
3 4 high 1 34
4 5 medium 0 21
5 6 high 0 23
6 7 medium 1 21
7 8 low 1 34
8 9 high 0 56
9 0 medium 1 32
Explanation
We replace ‘M’ with 0 and ‘F’ with 1 and store the result back in the gender column. ‘M’ receives 0 because it appears first in the column. The gender column now holds only the values 0 and 1.
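To check which integer stands for which category, the second element returned by factorize() can be inspected. The sketch below uses a short, hypothetical gender Series rather than the full DataFrame.

```python
import pandas as pd

gender = pd.Series(['M', 'M', 'F', 'M'])

# factorize() returns the integer codes and the unique values they stand for
codes, uniques = pd.factorize(gender)
print(codes.tolist())     # [0, 0, 1, 0]
print(list(uniques))      # ['M', 'F']

# Code i corresponds to uniques[i], so the lookup table is:
mapping = {category: code for code, category in enumerate(uniques)}
print(mapping)            # {'M': 0, 'F': 1}

# With sort=True, codes follow sorted order instead of order of appearance
sorted_codes, sorted_uniques = pd.factorize(gender, sort=True)
print(sorted_codes.tolist())   # [1, 1, 0, 1]
```

Because the default codes depend on the order in which values appear, the same category can receive a different code in another dataset; sort=True is one way to make the assignment independent of row order.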
Example 2: Replace All Column Categorical Values
Let’s replace the values in all categorical columns using apply() together with factorize().
from pandas import DataFrame
import pandas
# Create a DataFrame with 10 rows that hold 4 columns
sets = DataFrame({'code': [1,2,3,4,5,6,7,8,9,0],
'priority':['high','low','low','high','medium','high','medium','low','high','medium'],
'gender':['M','M','M','F','M','M','F','F','M','F'],
'age':[12,23,21,34,21,23,21,34,56,32]})
# Replace all column categorical values
object_columns = sets.select_dtypes(['object']).columns
sets[object_columns] = sets[object_columns].apply(lambda x: pandas.factorize(x)[0])
print(sets)
Output
  code priority gender age
0 1 0 0 12
1 2 1 0 23
2 3 1 0 21
3 4 0 1 34
4 5 2 0 21
5 6 0 0 23
6 7 2 1 21
7 8 1 1 34
8 9 0 0 56
9 0 2 1 32
Explanation
We can see the following:
- In the gender column, ‘M’ is replaced with 0, and ‘F’ is replaced with 1.
- In the priority column, ‘high’ is replaced with 0, ‘low’ is replaced with 1, and ‘medium’ is replaced with 2.
So far, we have replaced categorical values with integers in a single column or in all columns. With replace(), we had to list every categorical value ourselves. Suppose there is a huge dataset with more than 100,000 records whose categories we do not know in advance. How do we replace the categorical values?
One solution is Label Encoding.
Let’s discuss this approach.
Approach 3: Using LabelEncoding
LabelEncoder is a class in the sklearn.preprocessing module (scikit-learn) that converts the categorical values of a particular column to integers. We don’t need to specify the categorical values.
Its fit_transform() method learns the distinct categories in the column and returns the encoded values in one step.
In this technique, the encoded values start from 0 and are assigned in sorted (alphabetical) order of the categorical values.
Syntax
Here, column is the name of the column in which we are replacing the values.
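The call shape described above can be sketched as follows. This is an illustration rather than the original syntax listing; the toy DataFrame df and its fruit values are hypothetical, and the example assumes scikit-learn is installed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, used only to illustrate the call shape
df = pd.DataFrame({'column': ['pear', 'apple', 'pear', 'fig']})

# General form: DataFrame['column'] = LabelEncoder().fit_transform(DataFrame['column'])
encoder = LabelEncoder()
df['column'] = encoder.fit_transform(df['column'])

print(df['column'].tolist())    # [2, 0, 2, 1]
print(list(encoder.classes_))   # ['apple', 'fig', 'pear']
```

Keeping the encoder in a variable, instead of calling LabelEncoder() inline, gives access to classes_, where the position of each category is its assigned integer.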
Example
Let’s replace the values in the gender and priority columns using the following approach:
from pandas import DataFrame
import pandas
# Import LabelEncoder from sklearn module
from sklearn.preprocessing import LabelEncoder
# Create a DataFrame with 10 rows that hold 4 columns
sets = DataFrame({'code': [1,2,3,4,5,6,7,8,9,0],
'priority':['high','low','low','high','medium','high','medium','low','high','medium'],
'gender':['M','M','M','F','M','M','F','F','M','F'],
'age':[12,23,21,34,21,23,21,34,56,32]})
# Convert categorical values of gender column to numeric
sets['gender']=LabelEncoder().fit_transform(sets['gender'])
# Convert categorical values of priority column to numeric
sets['priority']=LabelEncoder().fit_transform(sets['priority'])
print(sets)
Output
  code priority gender age
0 1 0 1 12
1 2 1 1 23
2 3 1 1 21
3 4 0 0 34
4 5 2 1 21
5 6 0 1 23
6 7 2 0 21
7 8 1 0 34
8 9 0 1 56
9 0 2 0 32
Explanation
- In the gender column, ‘F’ is replaced with 0 and ‘M’ is replaced with 1, because ‘F’ comes before ‘M’ in sorted order. The gender column now holds only the values 0 and 1.
- There are three categories in the priority column: ‘high’, ‘low’, and ‘medium’. In alphabetical order, ‘high’ is replaced with 0, ‘low’ with 1, and ‘medium’ with 2, and the encoded values are stored back in the priority column.
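A fitted encoder can also be reused, so that data arriving later receives the same integers and encoded values can be turned back into labels. The sketch below is an assumption-labeled illustration: the Series train and new_rows are hypothetical, and the example assumes scikit-learn is installed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.Series(['high', 'low', 'medium', 'low'])
new_rows = pd.Series(['medium', 'high'])   # hypothetical later data

# Fit once so the category-to-integer assignment is fixed
encoder = LabelEncoder()
encoder.fit(train)

# Encode later data with the same assignment
encoded_new = encoder.transform(new_rows)
print(encoded_new.tolist())   # [2, 0]

# Recover the original labels from the integers
decoded = encoder.inverse_transform(encoded_new)
print(list(decoded))          # ['medium', 'high']
```

Note that transform() raises a ValueError for a label that was not present when the encoder was fitted, so unseen categories need to be handled before encoding.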
Conclusion
This guide covered converting categorical values into numerical values so that machine learning algorithms, which cannot process the object datatype directly, can use them. We introduced three approaches: replace() and factorize() from the “Pandas” library, and LabelEncoder from scikit-learn. Prefer the LabelEncoding approach when you don’t know in advance which categories are present in a column of the Pandas DataFrame.