Normalization of Data in Python

Normalization rescales data to a smaller, common range, which can help produce results faster because the machine has a smaller range of values to process. It is not a trivial step, because your results depend on the normalization method you choose. If you pick a method that does not suit your data, you may get results that differ from what you expect.

The choice of method also depends on the type of data, such as images, text, or numeric values, and each data type has its own normalization methods. This article focuses on numeric data.

Method 1: Using sklearn

The normalize function in sklearn's preprocessing module is a widely used way to normalize data.

In cell number [83]: We import the required libraries, NumPy and sklearn. Note that we import preprocessing from sklearn itself, which is why this is called the sklearn normalization method.

In cell number [84]: We create a NumPy array of integers with differing values.

In cell number [85]: We call the normalize method from preprocessing and pass it numpy_array, the array we just created, as a parameter.

In cell number [86]: The results show that all of our integer data is now normalized to values between 0 and 1. By default, normalize rescales the data to unit L2 norm, so non-negative inputs always end up in this range.
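
The steps above can be sketched as follows. The array values are invented for illustration, and normalized_array is an assumed variable name; only numpy_array comes from the description.

```python
import numpy as np
from sklearn import preprocessing

# Illustrative values (an assumption; any array of differing integers works).
numpy_array = np.array([2, 3, 5, 6, 7, 4, 8, 7, 6])

# normalize() expects a 2D input, so the 1D array is wrapped in a list
# to form a single row. With the defaults (norm="l2", axis=1) that row
# is rescaled to unit L2 norm.
normalized_array = preprocessing.normalize([numpy_array])

print(normalized_array)
```

Because the result has unit L2 norm, the squares of the printed values sum to 1.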

Method 2: Normalize a particular column in a dataset using sklearn

We can also normalize a single column of a dataset. This method walks through how.


In cell number [87]: We import the pandas and sklearn libraries.

In cell number [88]: We load a dummy CSV file that we created earlier, using the pandas read_csv function.

In cell number [89]: We print the contents of the CSV file we just loaded.

In cell number [90]: We read a particular column of the CSV file with np.array and store the result in value_array.

In cell number [92]: We call the normalize method from preprocessing and pass it value_array as a parameter.
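
A minimal sketch of this column-wise workflow is shown here. The CSV file and its column names are not given in the text, so the dataframe, the names height_cm and weight_kg, and the variable normalized_column are all hypothetical stand-ins; value_array is the name used in the description.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for the dummy CSV file.
# With a real file you would call pd.read_csv("your_file.csv") instead.
data = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
})
print(data)

# Pull one column out as a NumPy array.
value_array = np.array(data["height_cm"])

# Wrap the 1D array in a list so normalize() sees a single row,
# then rescale that row to unit L2 norm (the default).
normalized_column = preprocessing.normalize([value_array])
print(normalized_column)
```

Only the selected column is rescaled; the other columns of the dataframe are left untouched.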

Method 3: Normalize the whole dataset without converting columns to arrays (using sklearn)

In Method 2, we saw how to normalize a particular column of a CSV file. Sometimes we need to normalize the whole dataset instead. The method below does this column-wise (axis=0). If we pass axis=1, normalize works row-wise instead; axis=1 is the default value.


In cell number [93]: We import the pandas and sklearn libraries.

In cell number [94]: We load a dummy CSV file (demo_numeric.csv) that we created earlier, using the pandas read_csv function.

In cell number [95]: We print the contents of the CSV file we just loaded.

In cell number [96]: We pass the whole dataset to normalize along with one extra parameter, axis=0, which tells the library to normalize the dataset column-wise.

In cell number [97]: We print the result, which is the normalized data with values between 0 and 1.
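
As a rough sketch of the whole-dataset case: the contents of demo_numeric.csv are not shown in the text, so the dataframe below is a made-up substitute, and the names normalized and normalized_data are assumptions.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for demo_numeric.csv.
# With the real file you would call pd.read_csv("demo_numeric.csv") instead.
data = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
})
print(data)

# axis=0 normalizes each column independently; the default axis=1
# would normalize each row instead.
normalized = preprocessing.normalize(data, axis=0)

# normalize() returns a NumPy array, so rebuild a dataframe
# to keep the original column names.
normalized_data = pd.DataFrame(normalized, columns=data.columns)
print(normalized_data)
```

With axis=0, each column of the output has unit L2 norm, so values from different columns are scaled independently of one another.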

Method 4: Using MinMaxScaler()

sklearn also provides another normalization method, MinMaxScaler. By default, it rescales each column linearly so that the column's minimum maps to 0 and its maximum maps to 1. It is also very popular because it is easy to use.


In cell number [98]: We import all the required packages.

In cell number [99]: We load a dummy CSV file (demo_numeric.csv) that we created earlier, using the pandas read_csv function.

In cell number [100]: We print the contents of the CSV file we just loaded.

In cell number [101]: We create a MinMaxScaler object from the preprocessing module and name it min_max_Scalar. We do not pass any parameters because we want to normalize the data between 0 and 1, which is the default. If you want a different range, you can supply your own values, as the next method shows.

In cell number [102]: We first store the column names so that we can label the results later. Then we call fit_transform on the min_max_Scalar object and pass it the data loaded from the CSV file.

In cell number [103]: We get the normalized results, which are between 0 and 1.
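
The MinMaxScaler steps could look roughly like this. The dataframe is a hypothetical substitute for demo_numeric.csv, and col_names, scaled, and scaled_data are assumed names; min_max_Scalar is the object name used in the description.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for demo_numeric.csv.
data = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
})
print(data)

# No arguments: the default feature_range is (0, 1).
min_max_Scalar = preprocessing.MinMaxScaler()

# Keep the column names so the result can be labeled again.
col_names = data.columns

# fit_transform() learns each column's min and max, then rescales the column.
scaled = min_max_Scalar.fit_transform(data)

scaled_data = pd.DataFrame(scaled, columns=col_names)
print(scaled_data)
```

Unlike normalize, which scales to unit norm, MinMaxScaler maps the smallest value in each column to 0 and the largest to 1.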

Method 5: Using MinMaxScaler(feature_range=(x,y))

sklearn also lets you change the range of the normalized values. By default, MinMaxScaler normalizes values to between 0 and 1, but its feature_range parameter lets us set the target range according to our requirements.

In cell number [104]: We import all the required packages.

In cell number [105]: We load a dummy CSV file (demo_numeric.csv) that we created earlier, using the pandas read_csv function.

In cell number [106]: We print the contents of the CSV file we just loaded.

In cell number [107]: We again create a MinMaxScaler object (min_max_Scalar) from the preprocessing module, but this time we also pass the feature_range parameter, set to (0, 2). MinMaxScaler will now normalize the data values to between 0 and 2.

In cell number [108]: We first store the column names so that we can label the results later. Then we call fit_transform on the min_max_Scalar object and pass it the data loaded from the CSV file.

In cell number [109]: We get the normalized results, which are between 0 and 2.
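
A short sketch of the same workflow with a custom range follows. As before, the dataframe is an invented substitute for demo_numeric.csv, and col_names, scaled, and scaled_data are assumed names.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for demo_numeric.csv.
data = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
})

# feature_range=(0, 2) maps each column's minimum to 0 and maximum to 2.
min_max_Scalar = preprocessing.MinMaxScaler(feature_range=(0, 2))

col_names = data.columns
scaled = min_max_Scalar.fit_transform(data)

scaled_data = pd.DataFrame(scaled, columns=col_names)
print(scaled_data)
```

Any (min, max) pair with min smaller than max can be passed as feature_range, for example (-1, 1).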

Method 6: Using the maximum absolute scaling

We can also normalize data using pandas alone, which is another popular approach. Maximum absolute scaling divides each column by its maximum absolute value, so the values end up between -1 and 1 (between 0 and 1 when the data has no negative values). We apply .abs() and .max() as described below:

In cell number [110]: We import the pandas library.

In cell number [111]: We create a dummy dataframe and print it.

In cell number [113]: We take each column in turn and divide its values by the column's maximum absolute value, computed with .abs() and .max().

In cell number [114]: We print the result and confirm that our data is normalized between 0 and 1.
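
One way to write the maximum absolute scaling steps is sketched here. The dataframe contents and the names df and scaled_df are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical dummy dataframe with non-negative values.
df = pd.DataFrame({
    "A": [10.0, 20.0, 40.0, 30.0],
    "B": [2.0, 8.0, 4.0, 6.0],
})
print(df)

# Work on a copy so the original data is preserved.
scaled_df = df.copy()

# Divide every column by its own maximum absolute value.
for column in scaled_df.columns:
    scaled_df[column] = scaled_df[column] / scaled_df[column].abs().max()

print(scaled_df)
```

In this sample, column A becomes 0.25, 0.5, 1.0, 0.75, since each value is divided by 40.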

Method 7: Using the z-score method

The next method we discuss is the z-score method. It rescales the data to a distribution with a mean of 0 and a standard deviation of 1. For each column, it calculates the mean, subtracts that mean from every value in the column, and finally divides by the column's standard deviation. The results are not confined to a fixed range: values within one standard deviation of the mean fall between -1 and 1, and values farther from the mean fall outside it.

In cell number [115]: We create a dummy dataframe and print it.

In cell number [117]: We calculate each column's mean and subtract it from the column. Then we divide the column values by the standard deviation.

In cell number [118]: We print the normalized data. Each column is now centered on 0 and measured in standard deviations.
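
The z-score calculation can be sketched with pandas as follows. The dataframe contents and the names df and z_df are assumptions. Note that pandas .std() computes the sample standard deviation (ddof=1) by default, which may differ slightly from tools that use the population standard deviation.

```python
import pandas as pd

# Hypothetical dummy dataframe.
df = pd.DataFrame({
    "A": [10.0, 20.0, 40.0, 30.0],
    "B": [2.0, 8.0, 4.0, 6.0],
})
print(df)

# Work on a copy so the original data is preserved.
z_df = df.copy()

# For each column: subtract the mean, then divide by the standard deviation.
for column in z_df.columns:
    z_df[column] = (z_df[column] - z_df[column].mean()) / z_df[column].std()

print(z_df)
```

After this step every column has a mean of 0 and a standard deviation of 1. With these sample values, the smallest and largest entries of each column lie slightly outside the range -1 to 1.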

Conclusion: We have seen several normalization methods. Among them, sklearn is very popular because of its support for machine learning, but the right choice depends on the user's requirements. Sometimes the pandas features are sufficient to normalize data. The methods above are not the only ones: there are numerous other ways to normalize data, and the choice also depends on your data type, such as images, numeric data, or text. This article focused on numeric data in Python.

About the author

Shekhar Pandey