PySpark Filter DataFrame Using Values from a List

Filtering the DataFrame through the values/elements that are specified in a list is one of the simplest approaches because of the isin() operator. In this guide, we will see how to return the rows from the PySpark DataFrame. The isin() operator in Python is used to check whether the elements are present in any Python data structure like list/tuple. If yes, the related rows are returned based on the values that are present in the list (Here, we use the list data structure).

Scenario 1: Filter the DataFrame Using the Values from a List with the Isin() Operator

Scenario 2: Filter the DataFrame Using the Values from a List with Not and Isin() Operators

Let’s dive into the examples by considering all scenarios. Make sure that you installed the PySpark in your environment. The command is:

pip install pyspark

Utilize the DataFrame

In this entire guide, we will use this PySpark DataFrame for all the examples. This DataFrame (agri_df) holds 5 rows with 5 columns. The columns in this DataFrame are [“Soil_Type”, “Irrigation_availability”, “Acres”, “Soil_Status”, and “Country”].

import pyspark

from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# farming data with 5 rows and 5 columns

agri =[{'Soil_Type':'Black','Irrigation_availability':'No','Acres':2500,'Soil_status':'Dry',
'Country':'USA'},

{'Soil_Type':'Black','Irrigation_availability':'Yes','Acres':3500,'Soil_status':'Wet',
'Country':'India'},

{'Soil_Type':'Red','Irrigation_availability':'Yes','Acres':210,'Soil_status':'Dry',
'Country':'UK'},

{'Soil_Type':'Other','Irrigation_availability':'No','Acres':1000,'Soil_status':'Wet',
'Country':'USA'},

{'Soil_Type':'Sand','Irrigation_availability':'No','Acres':500,'Soil_status':'Dry',
'Country':'India'}]

# create the dataframe from the above data

agri_df = linuxhint_spark_app.createDataFrame(agri)

agri_df.show()

Output:

Scenario 1: Filter the DataFrame Using the Values from a List with the Isin() Operator

The isin() operator in Python is used to check whether the elements are present in the list. If yes, the related rows are returned based on the values that are present in the list. We specify this operator inside the filter() function. We can display the filtered DataFrame using the show() method.

Syntax:

pyspark_DataFrame_object.filter((pyspark_DataFrame_object["column"]).isin(List1))

Here, the pyspark_DataFrame_object is the PySpark DataFrame input and the List1 holds the elements.

Example 1:

Filter the previous agri_df – DataFrame on the “Soil_Type” column based on the elements in List1.

# Create a List with the elements - 'Black', 'Red'.

List1=[ 'Black', 'Red']

# Filter the above agri_df-DataFrame on the ‘Soil_Type’ column based on the elements in the above list (List1).

agri_df.filter((agri_df["Soil_Type"]).isin(List1)).show()

Output:

There are three records in the DataFrame with “Soil_Type” which are Black and Red.

Example 2:

Filter the previous agri_df – DataFrame on the “Country” column based on the elements in List1. The elements in the list are “USA” and “JAPAN”.

# Create a List with the elements - 'USA','JAPAN'.

List1=['USA','JAPAN']

# Filter the above agri_df-DataFrame on the "Country" column based on the elements in the above list (List1).

agri_df.filter((agri_df["Country"]).isin(List1)).show()

Output:

There are only two rows with the “USA” country. There is no country with “JAPAN” in the “Country” column DataFrame.

Example 3:

Filter the previous agri_df-DataFrame on the “Acres” column based on the values in the list which are [2500, 3500, 2000].

# Create List with the values - 2500,3500,2000.

List1=[2500,3500,2000]

# Filter the above agri_df-DataFrame on the "Acres" column based on the values in the above list (List1).

agri_df.filter((agri_df["Acres"]).isin(List1)).show()

Output:

The rows exist for the values of 2500 and 3500 in the “Acres” column. So, the corresponding records are returned.

Example 4:

Let’s specify the multiple conditions in this example. Using the “and(&)” operator, we filter the “No” from the Irrigation_availabilty column as well as the Soil_Type column which is “Black” by specifying “No” and “Black” in two different lists.

agri_df.show()

# Multiple conditions

agri_df.filter((agri_df["Irrigation_availability"]).isin(['No']) & (agri_df["Soil_Type"]).isin(['Black'])).show()

Output:

There is only one record with “No” in Irrigation_availability and “Black” in Soil_Type.