Apache Spark

PySpark Filter DataFrame Using Values from a List

Filtering the DataFrame through the values/elements that are specified in a list is one of the simplest approaches because of the isin() operator. In this guide, we will see how to return the rows from the PySpark DataFrame. The isin() operator in Python is used to check whether the elements are present in any Python data structure like list/tuple. If yes, the related rows are returned based on the values that are present in the list (Here, we use the list data structure).

Scenario 1: Filter the DataFrame Using the Values from a List with the Isin() Operator

Scenario 2: Filter the DataFrame Using the Values from a List with Not and Isin() Operators

Let’s dive into the examples by considering all scenarios. Make sure that you installed the PySpark in your environment. The command is:

pip install pyspark

Utilize the DataFrame

In this entire guide, we will use this PySpark DataFrame for all the examples. This DataFrame (agri_df) holds 5 rows with 5 columns. The columns in this DataFrame are [“Soil_Type”, “Irrigation_availability”, “Acres”, “Soil_Status”, and “Country”].

import pyspark

from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# farming data with 5 rows and 5 columns

agri =[{'Soil_Type':'Black','Irrigation_availability':'No','Acres':2500,'Soil_status':'Dry',
'Country':'USA'},

{'Soil_Type':'Black','Irrigation_availability':'Yes','Acres':3500,'Soil_status':'Wet',
'Country':'India'},

{'Soil_Type':'Red','Irrigation_availability':'Yes','Acres':210,'Soil_status':'Dry',
'Country':'UK'},

{'Soil_Type':'Other','Irrigation_availability':'No','Acres':1000,'Soil_status':'Wet',
'Country':'USA'},

{'Soil_Type':'Sand','Irrigation_availability':'No','Acres':500,'Soil_status':'Dry',
'Country':'India'}]

 

# create the dataframe from the above data

agri_df = linuxhint_spark_app.createDataFrame(agri)

agri_df.show()

Output:

Scenario 1: Filter the DataFrame Using the Values from a List with the Isin() Operator

The isin() operator in Python is used to check whether the elements are present in the list. If yes, the related rows are returned based on the values that are present in the list. We specify this operator inside the filter() function. We can display the filtered DataFrame using the show() method.

Syntax:

pyspark_DataFrame_object.filter((pyspark_DataFrame_object["column"]).isin(List1))

Here, the pyspark_DataFrame_object is the PySpark DataFrame input and the List1 holds the elements.

Example 1:

Filter the previous agri_df – DataFrame on the “Soil_Type” column based on the elements in List1.

# Create a List with the elements - 'Black', 'Red'.

List1=[ 'Black', 'Red']

# Filter the above agri_df-DataFrame on the ‘Soil_Type’ column based on the elements in the above list (List1).

agri_df.filter((agri_df["Soil_Type"]).isin(List1)).show()

Output:

There are three records in the DataFrame with “Soil_Type” which are Black and Red.

Example 2:

Filter the previous agri_df – DataFrame on the “Country” column based on the elements in List1. The elements in the list are “USA” and “JAPAN”.

# Create a List with the elements - 'USA','JAPAN'.

List1=['USA','JAPAN']

# Filter the above agri_df-DataFrame on the "Country" column based on the elements in the above list (List1).

agri_df.filter((agri_df["Country"]).isin(List1)).show()

Output:

There are only two rows with the “USA” country. There is no country with “JAPAN” in the “Country” column DataFrame.

Example 3:

Filter the previous agri_df-DataFrame on the “Acres” column based on the values in the list which are [2500, 3500, 2000].

# Create List with the values - 2500,3500,2000.

List1=[2500,3500,2000]

# Filter the above agri_df-DataFrame on the "Acres" column based on the values in the above list (List1).

agri_df.filter((agri_df["Acres"]).isin(List1)).show()

Output:

The rows exist for the values of 2500 and 3500 in the “Acres” column. So, the corresponding records are returned.

Example 4:

Let’s specify the multiple conditions in this example. Using the “and(&)” operator, we filter the “No” from the Irrigation_availabilty column as well as the Soil_Type column which is “Black” by specifying “No” and “Black” in two different lists.

agri_df.show()

# Multiple conditions

agri_df.filter((agri_df["Irrigation_availability"]).isin(['No']) & (agri_df["Soil_Type"]).isin(['Black'])).show()

Output:

There is only one record with “No” in Irrigation_availability and “Black” in Soil_Type.

Scenario 2: Filter the DataFrame Using the Values from a List with the Not and Isin() Operators

If we specify the “not” operator (~) along with the isin() operator, the records that do not meet the criteria are returned.

Syntax:

pyspark_DataFrame_object.filter(~(pyspark_DataFrame_object["column"]).isin(List1))

Here, the pyspark_DataFrame_object is the PySpark DataFrame input and the List1 holds the elements.

Example 1:

Filter the previous agri_df – DataFrame on the “Irrigation_availabilty” column based on the element in List1.

Here, the list holds only one element which is “No”. Each row in the “Irrigation_availabilty” is compared with “No”. If the element in the list matches with “No” in this column, it will be ignored.

# Filter the above agri_df-DataFrame on "Irrigation_availability" column

agri_df.filter(~(agri_df["Irrigation_availability"]).isin(['No'])).show()

Output:

Example 2:

Filter the previous agri_df – DataFrame on the “Acres” column based on the element in List1.

Here, the list holds three elements [2500, 3500, 2000]. Each row in the “Acres” column is compared with each value. If the element in the list matches with any of these values in this column, it will be ignored.

# Create List with the values - 2500,3500,2000.

List1=[2500,3500,2000]

# Filter the above agri_df-DataFrame on the "Acres" column that are not in List1.

agri_df.filter(~(agri_df["Acres"]).isin(List1)).show()

Output:

Example 3:

Now, filter the DataFrame where the “Irrigation_availability” is not equal to “No” and the “Soil_Type” is not equal to “Black”.

# Multiple conditions

agri_df.filter(~(agri_df["Irrigation_availability"]).isin(['No']) & ~(agri_df["Soil_Type"]).isin(['Black'])).show()

Output:

Conclusion

In Python, the isin() operator is used to check whether the elements are present in any Python Data structure like list/tuple etc. Here, we utilized this operator to filter the records in the DataFrame based on the specified values in the list. We also learned the different examples by passing multiple conditions inside the filter() function.

About the author

Gottumukkala Sravan Kumar

B tech-hon's in Information Technology; Known programming languages - Python, R , PHP MySQL; Published 500+ articles on computer science domain