PySpark contains() Function

The contains() method in PySpark checks whether the values in a DataFrame column contain a specified string. Used as a filter condition, it returns the rows whose column value contains that string.

It can be used with either the filter() or the where() method. We will look at both with several examples.

Syntax

dataframe_obj.filter(dataframe_obj.column.contains(value))
dataframe_obj.where(dataframe_obj.column.contains(value))

Where,
dataframe_obj is the PySpark DataFrame and column is the name of the column to check.

Parameter:
The contains() function takes one parameter.

It is the value or string to look for in the DataFrame column.

Return:
contains() returns a boolean column. When it is used inside filter() or where(), every row for which the column value contains the given value is returned in full.

First, we will create the PySpark DataFrame with 10 rows and 5 columns.

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session.
spark_app = SparkSession.builder.appName('_').getOrCreate()

# Student records: subject_id, name, age, technology1, technology2.
students = [(4, 'sravan', 23, 'PHP', 'Testing'),
            (2, 'sravan', 23, 'Oracle', 'Testing'),
            (46, 'mounika', 22, '.NET', 'HTML'),
            (12, 'deepika', 21, 'Oracle', 'HTML'),
            (46, 'mounika', 22, 'Oracle', 'Testing'),
            (12, 'chandrika', 23, 'Hadoop', 'C#'),
            (12, 'chandrika', 22, 'Oracle', 'Testing'),
            (45, 'sravan', 23, 'Oracle', 'C#'),
            (4, 'deepika', 21, 'PHP', 'C#'),
            (46, 'mounika', 22, '.NET', 'Testing')]

# Build the DataFrame and name its columns.
dataframe_obj = spark_app.createDataFrame(students, ['subject_id', 'name', 'age', 'technology1', 'technology2'])

# Display the DataFrame.
dataframe_obj.show()

Output:

Now, let’s apply the contains() function on the PySpark DataFrame to return the results.

Example 1
We pass the string ‘sravan’ to contains() on the name column and return all rows whose name contains this string.

# Return the rows whose name column contains the string 'sravan'.
print("--------Using where() clause--------")
dataframe_obj.where(dataframe_obj.name.contains('sravan')).show()

# Same check, this time with filter().
print("--------Using filter() clause--------")
dataframe_obj.filter(dataframe_obj.name.contains('sravan')).show()

Output:

Explanation
The string sravan is found in the name column of three rows, and those rows are returned.

Example 2
We pass the string ‘PHP’ to contains() on the technology1 column and return all rows whose technology1 value contains this string.

# Return the rows whose technology1 column contains the string 'PHP'.
print("--------Using where() clause--------")
dataframe_obj.where(dataframe_obj.technology1.contains('PHP')).show()

# Same check, this time with filter().
print("--------Using filter() clause--------")
dataframe_obj.filter(dataframe_obj.technology1.contains('PHP')).show()

Output:

Explanation
The string PHP is found in the technology1 column of two rows, and those rows are returned.

Example 3
We pass the value 46 to contains() on the subject_id column and return all rows whose subject_id matches this value.

# Return the rows whose subject_id column contains the value 46.
print("--------Using where() clause--------")
dataframe_obj.where(dataframe_obj.subject_id.contains(46)).show()

# Same check, this time with filter().
print("--------Using filter() clause--------")
dataframe_obj.filter(dataframe_obj.subject_id.contains(46)).show()

Output:

Explanation
The value 46 is found in the subject_id column of three rows, and those rows are returned.

Example 4
We pass the value 1000 to contains() on the subject_id column and return all rows whose subject_id matches this value.

# Return the rows whose subject_id column contains the value 1000.
print("--------Using where() clause--------")
dataframe_obj.where(dataframe_obj.subject_id.contains(1000)).show()

# Same check, this time with filter().
print("--------Using filter() clause--------")
dataframe_obj.filter(dataframe_obj.subject_id.contains(1000)).show()

Output:

Explanation
The value 1000 is not found in the subject_id column, so no rows are returned and an empty DataFrame is displayed.

Conclusion

This PySpark tutorial showed how to filter the rows of a DataFrame using the contains() method. We worked through four examples, covering both string and numeric columns as well as a value with no match. The method can be used with both the where() and filter() functions.

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology; known programming languages: Python, R, PHP, MySQL; published 500+ articles in the computer science domain