Filter NaN Pandas

How often do you come across NaN or null values while working with datasets? In large datasets, it is very common for some cells to be empty. Pandas marks such missing values as NaN, which stands for ‘Not a Number’.

The question, then, is how to deal with NaN values while working with Pandas in Python. How does Pandas see NaN values, and how does it handle them alongside other values? This article covers managing NaN values with Pandas in Python.

Pandas in Python

Luckily, Pandas lets you filter out or exclude rows that contain NaN values using DataFrame functions. The dataframe.notnull() and dataframe.dropna() functions work on columns of any type, including datetime, float, and string.

Let us explain how to filter out rows from the dataset that contain NaN values using Pandas DataFrame in Python. Moreover, we will explain the use of dataframe.notnull() and dataframe.dropna() functions with the help of simple and easy examples. So, let us begin.

What are NaN Values?

NaN stands for ‘Not a Number’, and because data comes in various forms and shapes, many real-world datasets contain NaN values. Pandas represents missing or blank values as NaN, which is a special floating-point value. There are other ways to represent missing values as well: Pandas also treats Python’s None and, for datetime data, NaT (‘Not a Time’) as missing, and such values are often described as ‘NA’, ‘not available’, or ‘missing’.
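As a small illustration of the points above, the sketch below checks that NaN is a float that never compares equal to itself, and that pd.isna() flags NaN, None, and NaT alike. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

nan = float("nan")

# NaN is an ordinary float value, but it never compares equal to anything,
# not even to itself, so "== NaN" cannot be used to find missing values.
print(type(nan))     # <class 'float'>
print(nan == nan)    # False

# pd.isna() is the reliable test; it recognises all three missing markers.
markers = [np.nan, None, pd.NaT]
flags = [bool(pd.isna(m)) for m in markers]
print(flags)         # [True, True, True]
```

This is why the filtering functions in this article are used instead of a plain equality comparison.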

How to Filter Out NaN Values from a Dataset Using Pandas DataFrame in Python?

Filtering out NaN values from a dataset using a Pandas DataFrame is simple. The examples below follow these steps:

  1. Create a dataset containing NaN values.
  2. Use the dataframe.notnull() function to find the non-null values. It returns a Boolean DataFrame of the same shape, with True for non-null values and False for missing or null values.
  3. Call dataframe.dropna() to filter out the rows containing NaN or missing values.
  4. Alternatively, use pd.isnull() or Series.notna() to filter out the rows that contain NaN in a specific column of a DataFrame. pd.isnull() flags the missing values, while notna() and notnull() flag the non-missing ones. Series.notnull() is an alias for Series.notna().
  5. Optionally, set a threshold for dropping rows with the ‘thresh’ argument of dataframe.dropna(), which gives the minimum number of non-missing values a row must have to be kept.
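The threshold in the last step is not used in the examples that follow, so here is a minimal sketch of it. It assumes the same sample values as the examples below; the choice of thresh=3 is arbitrary and only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[0, 11, 22, 33],
                   [None, 55, None, pd.NaT],
                   [88, None, 10, None],
                   [111, 121, 131, pd.NaT]], columns=list('WXYZ'))

# thresh=3 keeps only the rows that have at least three non-missing values.
# Rows 0 and 3 have four and three such values; rows 1 and 2 have fewer.
kept = df.dropna(thresh=3)
print(kept.index.tolist())   # [0, 3]
```

A lower threshold is more lenient: with thresh=2, the third row (two non-missing values) would be kept as well.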

Now, let’s proceed with the examples to learn how to drop NaN or missing values from a dataset using Pandas in Python.

Example 1:

Following the steps given above, we first create a DataFrame that contains some null values. The code below imports the modules and then creates the DataFrame. You can see that the DataFrame contains integer values as well as null values.

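import pandas as pd
import numpy as np

# Build a 4x4 DataFrame; None and pd.NaT mark the missing cells
df = pd.DataFrame([[0, 11, 22, 33],
                   [None, 55, None, pd.NaT],
                   [88, None, 10, None],
                   [111, 121, 131, pd.NaT]], columns=list('WXYZ'))

# Show the original DataFrame
print(df)

# Drop every row that contains at least one missing value
print(df.dropna())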

The code first shows the dataset that you have just created.

As you can see, every row except the first contains at least one null value. So when we drop the rows with NaN values, all of those rows should be filtered out and only the first row should remain. The df.dropna() call at the end of the code does this.

Note that every row with a missing value is dropped and only the first row is left in the dataset.
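To double-check which rows df.dropna() removes, one option is to count the missing values per row first. The sketch below uses the same sample data; the names missing_per_row and cleaned are only for illustration. It also shows that dropna() returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame([[0, 11, 22, 33],
                   [None, 55, None, pd.NaT],
                   [88, None, 10, None],
                   [111, 121, 131, pd.NaT]], columns=list('WXYZ'))

# Count the missing cells in each row; only rows with a count of 0 survive.
missing_per_row = df.isna().sum(axis=1)
print(missing_per_row.tolist())   # [0, 3, 2, 1]

# dropna() returns a new DataFrame; df itself still has all four rows.
cleaned = df.dropna()
print(cleaned.index.tolist())     # [0]
print(len(df))                    # 4
```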

Example 2:

In the previous example, we dropped every row that contains a NaN value. What if you only want to drop the rows that have NaN in a specific column? As discussed above, there are ways to do this instead of eliminating every row that contains a NaN value anywhere.

This example demonstrates the ‘subset’ argument of df.dropna(), which drops only the rows that contain NaN in the listed columns. The initial steps are the same as in the previous example: create a DataFrame with NaN values. Let us see the code below:

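import pandas as pd
import numpy as np

# Build a 4x4 DataFrame; None and pd.NaT mark the missing cells
df = pd.DataFrame([[0, 11, 22, 33],
                   [None, 55, None, pd.NaT],
                   [88, None, 10, None],
                   [111, 121, 131, pd.NaT]], columns=list('WXYZ'))

# Show the original DataFrame
print(df)

# Drop only the rows that have a missing value in column 'Y'
print(df.dropna(subset=['Y']))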

If you observe the output, it contains all the rows that were originally present in the dataset except the second row, because that row has a NaN value in column ‘Y’. The ‘subset’ argument restricts the check to column ‘Y’, so only the rows with NaN in that column are dropped. This is how you can eliminate specific rows containing a NaN value while keeping all the other rows.
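The ‘subset’ argument takes a list, so it can name more than one column. The following sketch goes slightly beyond the example above and assumes you want to check columns ‘X’ and ‘Y’ together; it also shows how the ‘how’ argument of dropna() changes the rule from "any listed column is missing" to "all listed columns are missing".

```python
import pandas as pd

df = pd.DataFrame([[0, 11, 22, 33],
                   [None, 55, None, pd.NaT],
                   [88, None, 10, None],
                   [111, 121, 131, pd.NaT]], columns=list('WXYZ'))

# Default how='any': drop a row if X or Y is missing.
any_missing = df.dropna(subset=['X', 'Y'])
print(any_missing.index.tolist())   # [0, 3]

# how='all': drop a row only if X and Y are both missing (no such row here).
all_missing = df.dropna(subset=['X', 'Y'], how='all')
print(all_missing.index.tolist())   # [0, 1, 2, 3]
```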

Example 3:

In this example, we will use the df.notnull() function to eliminate the NaN values from the dataset. Unlike df.dropna(), it does not remove anything by itself: it returns a Boolean mask, which we reduce to one value per row with all(axis=1) and then use to select the rows that have no missing values. See the code below to check the working of the df.notnull() function.

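import pandas as pd
import numpy as np

# Build a 4x4 DataFrame; None and pd.NaT mark the missing cells
df = pd.DataFrame([[0, 11, 22, 33],
                   [None, 55, None, pd.NaT],
                   [88, None, 10, None],
                   [111, 121, 131, pd.NaT]], columns=list('WXYZ'))

# Show the original DataFrame
print(df)

# Keep only the rows in which every column is non-null
print(df[df.notnull().all(axis=1)])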

As you can see, the result is the same as in the first example, because selecting the rows where df.notnull() is True in every column is equivalent to calling df.dropna().
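To make the intermediate step visible, the sketch below prints the row mask before it is applied and then compares the filtered rows with the output of df.dropna(). The names mask and filtered are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[0, 11, 22, 33],
                   [None, 55, None, pd.NaT],
                   [88, None, 10, None],
                   [111, 121, 131, pd.NaT]], columns=list('WXYZ'))

# One Boolean per row: True only when every column in that row is non-null.
mask = df.notnull().all(axis=1)
print(mask.tolist())              # [True, False, False, False]

# Boolean indexing with the mask keeps the same rows as df.dropna().
filtered = df[mask]
print(filtered.index.tolist())    # [0]
print(df.dropna().index.tolist()) # [0]
```

The mask approach is more verbose than dropna(), but it is handy when the same condition has to be combined with other filters.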

Example 4:

Now, let’s apply the same idea to a single column to drop the rows with NaN in a specific position while keeping all other rows. In this example, we call notnull() on column ‘Y’ only, which plays the same role as the ‘subset’ argument in example 2. See the code below to learn the working and syntax of the notnull() function.

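import pandas as pd
import numpy as np

# Build a 4x4 DataFrame; None and pd.NaT mark the missing cells
df = pd.DataFrame([[0, 11, 22, 33],
                   [None, 55, None, pd.NaT],
                   [88, None, 10, None],
                   [111, 121, 131, pd.NaT]], columns=list('WXYZ'))

# Show the original DataFrame
print(df)

# Keep only the rows in which column 'Y' is non-null
print(df[df['Y'].notnull()])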

As you can see, we have used the same column ‘Y’ as in example 2, and the result is the same. The notnull() function has filtered out the row where NaN is present in column ‘Y’, while the other rows remain unchanged.
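The alternatives listed in step 4 can be tried on the same column. The sketch below is one possible way to write them: notna() is used as an alias of notnull(), and pd.isnull() is inverted with the ~ operator because it flags the missing values rather than the present ones. The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[0, 11, 22, 33],
                   [None, 55, None, pd.NaT],
                   [88, None, 10, None],
                   [111, 121, 131, pd.NaT]], columns=list('WXYZ'))

by_notnull = df[df['Y'].notnull()]      # the form used in example 4
by_notna = df[df['Y'].notna()]          # alias of notnull()
by_isnull = df[~pd.isnull(df['Y'])]     # invert the "is missing" mask
by_dropna = df.dropna(subset=['Y'])     # the form used in example 2

# All four approaches keep the same rows.
for result in (by_notnull, by_notna, by_isnull, by_dropna):
    print(result.index.tolist())        # [0, 2, 3]
```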

Conclusion

This article covered removing NaN or null values from a dataset using Pandas in Python. We demonstrated different DataFrame functions for removing NaN values from a dataset. All four examples can be run in any Python environment that has Pandas installed.

About the author

Kalsoom Bibi

Hello, I am a freelance writer and usually write about Linux and other technology-related topics.