The contains() method can be used inside either filter() or where(); the two are aliases and behave identically. We will look at both, one example at a time.
Syntax
dataframe_object.where(dataframe_obj.column.contains(value/string))
Where,
dataframe_object is the PySpark DataFrame.
Parameter:
The contains() function takes one parameter.
It is the value or string that contains() looks for in the specified DataFrame column.
Return:
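contains() evaluates to a Boolean condition for each row. where() or filter() then returns the entire row wherever the column value contains the given value or string.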
First, we will create the PySpark DataFrame with 10 rows and 5 columns.
from pyspark.sql import SparkSession
# Create (or reuse) the Spark session used by all of the examples.
spark_app = SparkSession.builder.appName('_').getOrCreate()
students =[(4,'sravan',23,'PHP','Testing'),
(2,'sravan',23,'Oracle','Testing'),
(46,'mounika',22,'.NET','HTML'),
(12,'deepika',21,'Oracle','HTML'),
(46,'mounika',22,'Oracle','Testing'),
(12,'chandrika',23,'Hadoop','C#'),
(12,'chandrika',22,'Oracle','Testing'),
(45,'sravan',23,'Oracle','C#'),
(4,'deepika',21,'PHP','C#'),
(46,'mounika',22,'.NET','Testing')
]
dataframe_obj = spark_app.createDataFrame(students, ['subject_id', 'name', 'age', 'technology1', 'technology2'])
dataframe_obj.show()
Output:
Now, let’s apply the contains() function on the PySpark DataFrame to return the results.
Example 1
We will pass the string 'sravan' to the contains() method on the name column and return all rows whose name contains this string.
print("--------Using where() clause--------")
dataframe_obj.where(dataframe_obj.name.contains('sravan')).show()
# Check for the string 'sravan' in the name column and return the rows whose name contains it.
print("--------Using filter() clause--------")
dataframe_obj.filter(dataframe_obj.name.contains('sravan')).show()
Output:
Explanation
The string sravan is found in three rows of the name column, and those rows are returned.
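Because contains() is a substring test rather than an equality test, a shorter fragment also matches. The following is a minimal plain-Python sketch of that matching rule, applied to the name values from the DataFrame above. It does not call Spark, and the helper name names_containing is hypothetical, introduced only for illustration.

```python
# Name values copied from the DataFrame created above, in row order.
names = ['sravan', 'sravan', 'mounika', 'deepika', 'mounika',
         'chandrika', 'chandrika', 'sravan', 'deepika', 'mounika']

def names_containing(values, fragment):
    """Hypothetical helper: keep the values that contain fragment
    as a substring (case-sensitive)."""
    return [value for value in values if fragment in value]

print(len(names_containing(names, 'sravan')))  # the full name
print(len(names_containing(names, 'ika')))     # a fragment shared by several names
print(len(names_containing(names, 'Sravan')))  # same letters, different case
```

Under the same rule, dataframe_obj.name.contains('ika') would be expected to keep the mounika, deepika, and chandrika rows, while 'Sravan' with a capital S would match nothing, since the comparison is case-sensitive by default.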
Example 2
We will pass the string 'PHP' to the contains() method on the technology1 column and return all rows whose technology1 value contains this string.
print("--------Using where() clause--------")
dataframe_obj.where(dataframe_obj.technology1.contains('PHP')).show()
# Check for the string 'PHP' in the technology1 column and return the rows whose technology1 contains it.
print("--------Using filter() clause--------")
dataframe_obj.filter(dataframe_obj.technology1.contains('PHP')).show()
Output:
Explanation
The string PHP is found in two rows of the technology1 column, and those rows are returned.
Example 3
We will pass the value 46 to the contains() method on the subject_id column and return all rows whose subject_id contains this value.
print("--------Using where() clause--------")
dataframe_obj.where(dataframe_obj.subject_id.contains(46)).show()
# Check for the value 46 in the subject_id column and return the rows whose subject_id contains 46.
print("--------Using filter() clause--------")
dataframe_obj.filter(dataframe_obj.subject_id.contains(46)).show()
Output:
Explanation
The value 46 is found in three rows of the subject_id column, and those rows are returned.
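The match on subject_id is still a substring match, which matters for numeric identifiers. Below is a rough plain-Python model of the difference from an equality test. It assumes Spark compares the integer column as text, which is what the example above relies on and is not verified here for every Spark configuration. The helper names ids_containing and ids_equal_to are hypothetical.

```python
# subject_id values copied from the DataFrame created above, in row order.
subject_ids = [4, 2, 46, 12, 46, 12, 12, 45, 4, 46]

def ids_containing(ids, value):
    """Hypothetical model of contains(): substring test on the
    text form of each id (assumes the column is compared as a string)."""
    return [i for i in ids if str(value) in str(i)]

def ids_equal_to(ids, value):
    """Hypothetical model of an equality filter such as
    dataframe_obj.subject_id == value."""
    return [i for i in ids if i == value]

print(ids_containing(subject_ids, 46))  # only the 46 values
print(ids_containing(subject_ids, 4))   # also picks up 46 and 45
print(ids_equal_to(subject_ids, 4))     # exact matches only
```

When an exact match is wanted, an equality condition such as dataframe_obj.where(dataframe_obj.subject_id == 4) is the safer choice.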
Example 4
We will pass the value 1000 to the contains() method on the subject_id column and return all rows whose subject_id contains this value.
print("--------Using where() clause--------")
dataframe_obj.where(dataframe_obj.subject_id.contains(1000)).show()
# Check for the value 1000 in the subject_id column and return the rows whose subject_id contains 1000.
print("--------Using filter() clause--------")
dataframe_obj.filter(dataframe_obj.subject_id.contains(1000)).show()
Output:
Explanation
The value 1000 is not found in the subject_id column, so no rows are returned and the result is an empty DataFrame.
Conclusion
This PySpark tutorial showed how to filter the rows of a DataFrame using the contains() method. We worked through four examples, covering both string and numeric columns as well as a value that matches nothing. The method can be used inside either where() or filter(), with the same result.