Scenario 1: Filter the DataFrame Using the Values from a List with the Isin() Operator
Scenario 2: Filter the DataFrame Using the Values from a List with Not and Isin() Operators
Let’s dive into the examples by considering all scenarios. Make sure that you installed the PySpark in your environment. The command is:
Utilize the DataFrame
In this entire guide, we will use this PySpark DataFrame for all the examples. This DataFrame (agri_df) holds 5 rows with 5 columns. The columns in this DataFrame are [“Soil_Type”, “Irrigation_availability”, “Acres”, “Soil_Status”, and “Country”].
from pyspark.sql import SparkSession
linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()
# farming data with 5 rows and 5 columns
agri =[{'Soil_Type':'Black','Irrigation_availability':'No','Acres':2500,'Soil_status':'Dry',
'Country':'USA'},
{'Soil_Type':'Black','Irrigation_availability':'Yes','Acres':3500,'Soil_status':'Wet',
'Country':'India'},
{'Soil_Type':'Red','Irrigation_availability':'Yes','Acres':210,'Soil_status':'Dry',
'Country':'UK'},
{'Soil_Type':'Other','Irrigation_availability':'No','Acres':1000,'Soil_status':'Wet',
'Country':'USA'},
{'Soil_Type':'Sand','Irrigation_availability':'No','Acres':500,'Soil_status':'Dry',
'Country':'India'}]
# create the dataframe from the above data
agri_df = linuxhint_spark_app.createDataFrame(agri)
agri_df.show()
Output:
Scenario 1: Filter the DataFrame Using the Values from a List with the Isin() Operator
The isin() operator in Python is used to check whether the elements are present in the list. If yes, the related rows are returned based on the values that are present in the list. We specify this operator inside the filter() function. We can display the filtered DataFrame using the show() method.
Syntax:
Here, the pyspark_DataFrame_object is the PySpark DataFrame input and the List1 holds the elements.
Example 1:
Filter the previous agri_df – DataFrame on the “Soil_Type” column based on the elements in List1.
List1=[ 'Black', 'Red']
# Filter the above agri_df-DataFrame on the ‘Soil_Type’ column based on the elements in the above list (List1).
agri_df.filter((agri_df["Soil_Type"]).isin(List1)).show()
Output:
There are three records in the DataFrame with “Soil_Type” which are Black and Red.
Example 2:
Filter the previous agri_df – DataFrame on the “Country” column based on the elements in List1. The elements in the list are “USA” and “JAPAN”.
List1=['USA','JAPAN']
# Filter the above agri_df-DataFrame on the "Country" column based on the elements in the above list (List1).
agri_df.filter((agri_df["Country"]).isin(List1)).show()
Output:
There are only two rows with the “USA” country. There is no country with “JAPAN” in the “Country” column DataFrame.
Example 3:
Filter the previous agri_df-DataFrame on the “Acres” column based on the values in the list which are [2500, 3500, 2000].
List1=[2500,3500,2000]
# Filter the above agri_df-DataFrame on the "Acres" column based on the values in the above list (List1).
agri_df.filter((agri_df["Acres"]).isin(List1)).show()
Output:
The rows exist for the values of 2500 and 3500 in the “Acres” column. So, the corresponding records are returned.
Example 4:
Let’s specify the multiple conditions in this example. Using the “and(&)” operator, we filter the “No” from the Irrigation_availabilty column as well as the Soil_Type column which is “Black” by specifying “No” and “Black” in two different lists.
# Multiple conditions
agri_df.filter((agri_df["Irrigation_availability"]).isin(['No']) & (agri_df["Soil_Type"]).isin(['Black'])).show()
Output:
There is only one record with “No” in Irrigation_availability and “Black” in Soil_Type.
Scenario 2: Filter the DataFrame Using the Values from a List with the Not and Isin() Operators
If we specify the “not” operator (~) along with the isin() operator, the records that do not meet the criteria are returned.
Syntax:
Here, the pyspark_DataFrame_object is the PySpark DataFrame input and the List1 holds the elements.
Example 1:
Filter the previous agri_df – DataFrame on the “Irrigation_availabilty” column based on the element in List1.
Here, the list holds only one element which is “No”. Each row in the “Irrigation_availabilty” is compared with “No”. If the element in the list matches with “No” in this column, it will be ignored.
agri_df.filter(~(agri_df["Irrigation_availability"]).isin(['No'])).show()
Output:
Example 2:
Filter the previous agri_df – DataFrame on the “Acres” column based on the element in List1.
Here, the list holds three elements [2500, 3500, 2000]. Each row in the “Acres” column is compared with each value. If the element in the list matches with any of these values in this column, it will be ignored.
List1=[2500,3500,2000]
# Filter the above agri_df-DataFrame on the "Acres" column that are not in List1.
agri_df.filter(~(agri_df["Acres"]).isin(List1)).show()
Output:
Example 3:
Now, filter the DataFrame where the “Irrigation_availability” is not equal to “No” and the “Soil_Type” is not equal to “Black”.
agri_df.filter(~(agri_df["Irrigation_availability"]).isin(['No']) & ~(agri_df["Soil_Type"]).isin(['Black'])).show()
Output:
Conclusion
In Python, the isin() operator is used to check whether the elements are present in any Python Data structure like list/tuple etc. Here, we utilized this operator to filter the records in the DataFrame based on the specified values in the list. We also learned the different examples by passing multiple conditions inside the filter() function.