
PySpark isnull() and isNull() Functions

Data preprocessing is an important step in machine learning and data science: if the data contains null/None values, model performance degrades and the results become inaccurate. So, before anything else, we need to check whether each column of a PySpark DataFrame holds null values. In this guide, we see how to check for null values using the isnull() and isNull() functions. Both return the same result, but they are used differently. Let’s quickly look into them.

Table of Contents:

    1. Check for NULL Data Using isNull()
    2. Filter the NULL Data with isNull()
    3. Check for NULL Data Using isnull()
    4. Filter NULL Data with isnull()
    5. Return the NULL Data Using Spark SQL

Data:

In this entire guide, we will use the following PySpark DataFrame (5 rows and 6 columns). Make sure to run this code in your environment before trying the examples. The null values are created using the Python “None” keyword.

import pyspark
from pyspark.sql import SparkSession, Row

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# create the DataFrame that stores the farming details
farming_df = linuxhint_spark_app.createDataFrame([
    Row(Area='Urban', Land=None, Soil='Black', Field_count=100, Crop_Name='Paddy', Quantity=34000),
    Row(Area='Urban', Land='Rich', Soil='Red', Field_count=20, Crop_Name='Wheat', Quantity=200000),
    Row(Area='Rural', Land='Poor', Soil=None, Field_count=None, Crop_Name='Paddy', Quantity=24000),
    Row(Area=None, Land='Poor', Soil='Black', Field_count=400, Crop_Name='Paddy', Quantity=None),
    Row(Area=None, Land='Poor', Soil='Red', Field_count=None, Crop_Name=None, Quantity=None)
])

farming_df.show()

Output:
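Assuming Spark 3.x, which preserves the keyword order of Row, farming_df.show() prints a table like the following (newer Spark versions print NULL in uppercase):

+-----+----+-----+-----------+---------+--------+
| Area|Land| Soil|Field_count|Crop_Name|Quantity|
+-----+----+-----+-----------+---------+--------+
|Urban|null|Black|        100|    Paddy|   34000|
|Urban|Rich|  Red|         20|    Wheat|  200000|
|Rural|Poor| null|       null|    Paddy|   24000|
| null|Poor|Black|        400|    Paddy|    null|
| null|Poor|  Red|       null|     null|    null|
+-----+----+-----+-----------+---------+--------+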

pyspark.sql.Column.isNull()

The isNull() function in PySpark checks whether a column’s value is null. It returns true if the row value is null and false otherwise. It is applied directly on a column of the PySpark DataFrame and takes no parameters.

Syntax:

Column.isNull()

Check for NULL Data Using isNull()

In PySpark, select() is the simple method that displays the specified columns of the DataFrame. We select the column we want and apply the isNull() function on it to check for null values.

Example 1: Check on a Single Column

Check for null values in the “Area” column.

farming_df.select(farming_df['Area'].isNull()).show()

Output:
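PySpark labels the resulting Boolean column “(Area IS NULL)”:

+--------------+
|(Area IS NULL)|
+--------------+
|         false|
|         false|
|         false|
|          true|
|          true|
+--------------+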


The last two rows in the “Area” column are null. Therefore, true is returned. The first three rows are not null, so false is returned.

Example 2: Check on Multiple Columns

Check for null values in the “Area” and “Field_count” columns.

farming_df.select(farming_df['Area'].isNull(), farming_df['Field_count'].isNull()).show()

Output:
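The result now contains two Boolean columns:

+--------------+---------------------+
|(Area IS NULL)|(Field_count IS NULL)|
+--------------+---------------------+
|         false|                false|
|         false|                false|
|         false|                 true|
|          true|                false|
|          true|                 true|
+--------------+---------------------+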


In the “Area” column, the last two rows are null, so true is returned for them. In the “Field_count” column, the third and fifth rows are null. All the remaining values return false.

Filter the NULL Data with isNull()

If you want to get the entire rows that have null values in one or more columns, use isNull() with filter() or where().
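Note that where() is simply an alias of filter(), so either one works. For example, this sketch returns the same rows as the filter() version used below:

# where() is an alias of filter(); both keep the rows where the condition is true
farming_df.where(farming_df['Area'].isNull()).show()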

Example 1: Filter Using the Conditions

    1. Return the records where the “Area” column is null.
    2. Return the records where both the “Area” and “Field_count” columns are null.
    3. Return the records where the “Area”, “Field_count”, and “Soil” columns are all null.

# isNull() with filter()

# Single Condition
farming_df.filter(farming_df['Area'].isNull()).show()

# Multiple Conditions

farming_df.filter(farming_df['Area'].isNull() & farming_df['Field_count'].isNull()).show()
farming_df.filter(farming_df['Area'].isNull() & farming_df['Field_count'].isNull() & farming_df['Soil'].isNull()).show()

Output:
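The first filter, for example, prints:

+----+----+-----+-----------+---------+--------+
|Area|Land| Soil|Field_count|Crop_Name|Quantity|
+----+----+-----+-----------+---------+--------+
|null|Poor|Black|        400|    Paddy|    null|
|null|Poor|  Red|       null|     null|    null|
+----+----+-----+-----------+---------+--------+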

    1. Only two rows have a null “Area”, so those two records are returned.
    2. Only one row has null values in both the “Area” and “Field_count” columns.
    3. No row has null values in all of the “Area”, “Field_count”, and “Soil” columns, so an empty DataFrame is returned.

Example 2: Filter Using the Expressions

    1. Return the records where the “Quantity” column is null.
    2. Return the records where both the “Quantity” and “Crop_Name” columns are null.

# isNull() with filter() by specifying single expression
farming_df.filter("Quantity is NULL").show()

# Multiple expressions
farming_df.filter("Quantity is NULL and Crop_Name is NULL").show()

Output:
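The first expression, for example, prints:

+----+----+-----+-----------+---------+--------+
|Area|Land| Soil|Field_count|Crop_Name|Quantity|
+----+----+-----+-----------+---------+--------+
|null|Poor|Black|        400|    Paddy|    null|
|null|Poor|  Red|       null|     null|    null|
+----+----+-----+-----------+---------+--------+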

    1. There are only two rows where the “Quantity” column is null.
    2. There is only one row where both the “Quantity” and “Crop_Name” columns are null.

pyspark.sql.functions.isnull()

The isnull() function in PySpark also checks for null values in a column. It is similar to isNull(), but it is available in the pyspark.sql.functions module, so we need to import it from there. Also, we pass the column to be checked for null values as an argument.

Syntax:

pyspark.sql.functions.isnull(Column)


Check for NULL Data Using isnull()

Similarly, we can use the select() method to select the specified column. We check for null values in a particular column by passing that column to the isnull() function.

Example:

Check for null values in the “Quantity” column.

from pyspark.sql.functions import isnull
farming_df.select(isnull(farming_df.Quantity)).show()

Output:
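The resulting Boolean column is labeled “(Quantity IS NULL)”:

+------------------+
|(Quantity IS NULL)|
+------------------+
|             false|
|             false|
|             false|
|              true|
|              true|
+------------------+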


The last two rows in the “Quantity” column are null, so true is returned. The first three rows are not null, so false is returned.

Filter the NULL Data with isnull()

Like isNull(), we can use isnull() with the filter() or where() functions to filter the rows with null values. Let’s use the where() function in the following example.

Example:

    1. Return the records where the “Soil” column is null.
    2. Return the records where both the “Soil” and “Crop_Name” columns are null.

from pyspark.sql.functions import isnull
# isnull() with where()

# Single Condition
farming_df.where(isnull(farming_df['Soil'])).show()

# Multiple Conditions
farming_df.where(isnull(farming_df['Soil']) & isnull(farming_df['Crop_Name'])).show()

Output:
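The single-condition query, for example, prints:

+-----+----+----+-----------+---------+--------+
| Area|Land|Soil|Field_count|Crop_Name|Quantity|
+-----+----+----+-----------+---------+--------+
|Rural|Poor|null|       null|    Paddy|   24000|
+-----+----+----+-----------+---------+--------+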


There is only one row with a null “Soil” value, so that record is returned. No record has both “Soil” and “Crop_Name” null, so an empty DataFrame is returned.
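Because isnull() returns an ordinary Boolean Column, it also combines well with when() and count(). The following sketch is our own addition (not part of the original examples) and counts the null values in every column at once:

from pyspark.sql.functions import isnull, when, count

# Count the nulls per column: when() yields null wherever the condition is
# false, and count() ignores nulls, so each count equals the number of nulls.
farming_df.select([count(when(isnull(c), 1)).alias(c) for c in farming_df.columns]).show()

# Expected counts for this DataFrame:
# Area=2, Land=1, Soil=1, Field_count=2, Crop_Name=1, Quantity=2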

Return the NULL Data Using Spark SQL

As we know, Spark also lets us write SQL queries against DataFrames. Here, we create a temporary view on our DataFrame and select the records based on null values in single or multiple columns. To select the records, we use a SELECT statement that checks for null with the “IS NULL” expression in the WHERE clause.

Example:

First, create a view named “AGRI” and:

    1. Select all records where “Area” is null.
    2. Select all records where both “Area” and “Land” are null.

# Create a VIEW for the above DataFrame
farming_df.createOrReplaceTempView("AGRI")

# Single condition
linuxhint_spark_app.sql("SELECT * FROM AGRI WHERE Area IS NULL").show()

# Multiple conditions
linuxhint_spark_app.sql("SELECT * FROM AGRI WHERE Area IS NULL AND Land IS NULL").show()

Output:
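The first query returns the two rows where “Area” is null:

+----+----+-----+-----------+---------+--------+
|Area|Land| Soil|Field_count|Crop_Name|Quantity|
+----+----+-----+-----------+---------+--------+
|null|Poor|Black|        400|    Paddy|    null|
|null|Poor|  Red|       null|     null|    null|
+----+----+-----+-----------+---------+--------+

The second query returns an empty DataFrame because no row has both “Area” and “Land” null.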

Conclusion

We can check for null values in PySpark using two different functions. The isNull() function is applied directly on the column to check for null values, while the isnull() function must be imported from the pyspark.sql.functions module. We learned how to filter the rows in the PySpark DataFrame using the filter() and where() functions in both scenarios. Lastly, we learned how to return the records based on null values using the “IS NULL” expression in the WHERE clause through spark.sql().

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology; known programming languages: Python, R, PHP, MySQL; published 500+ articles in the computer science domain