Apache Spark

PySpark RDD – lookup(), collectAsMap()

PySpark is the Python API for Apache Spark, providing Spark's processing capabilities in Python. RDD stands for Resilient Distributed Dataset, a fundamental data structure in Apache Spark. A pair RDD stores its elements as key-value pairs, in the format (key, value).

In this article, we will discuss the PySpark RDD methods lookup() and collectAsMap(), both of which are actions that return values from a pair RDD.

To demonstrate these methods, we first create a SparkSession and then build an RDD with the parallelize() method of its SparkContext.

Syntax:

spark_app.sparkContext.parallelize(data)

Here, data can be one-dimensional (linear) data or two-dimensional (row-column) data.
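For instance, a minimal sketch showing both forms (assuming a SparkSession named spark_app, as created in the examples below):

#one-dimensional (linear) data
nums = spark_app.sparkContext.parallelize([1, 2, 3, 4])

#two-dimensional (row-column) data as (key, value) tuples - a pair RDD
pairs = spark_app.sparkContext.parallelize([('python', 4), ('linux', 5)])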

PySpark RDD – lookup()

lookup() is an action on a pair RDD that returns all the values associated with a given key as a list. It is performed on a single pair RDD and takes the key as a parameter.

Syntax:

RDD_data.lookup(key)

Parameter:

key refers to the key present in the pair RDD.

Example:

In this example, we will look up the values for the keys 'python', 'javascript', and 'linux'.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

# import RDD from pyspark.rdd
from pyspark.rdd import RDD

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create 6 subject-rating pairs
subjects_rating = spark_app.sparkContext.parallelize([('python', 4), ('javascript', 2), ('linux', 5), ('C#', 4),
('javascript', 4), ('python', 3)])

#actual pair RDD
print("pair RDD: ",subjects_rating.collect())

#get lookup for the key-python
print("lookup for the python: ",subjects_rating.lookup('python'))

#get lookup for the key-javascript
print("lookup for the javascript: ",subjects_rating.lookup('javascript'))

#get lookup for the key-linux
print("lookup for the linux: ",subjects_rating.lookup('linux'))

Output:

pair RDD: [('python', 4), ('javascript', 2), ('linux', 5), ('C#', 4), ('javascript', 4), ('python', 3)]
lookup for the python: [4, 3]
lookup for the javascript: [2, 4]
lookup for the linux: [5]

From the above output, we can see that there are two values associated with the key 'python', so it returned 4 and 3. There are two values associated with the key 'javascript', so it returned 2 and 4. There is only one value associated with the key 'linux', so it returned 5.
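One behavior worth noting (not shown above): if the key is not present in the pair RDD, lookup() simply returns an empty list rather than raising an error. A minimal sketch, reusing the subjects_rating RDD from this example:

#lookup for a key that is not present returns an empty list
print("lookup for the java: ", subjects_rating.lookup('java'))
# lookup for the java:  []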

PySpark RDD – collectAsMap()

collectAsMap() is an action on a pair RDD that returns all the elements as a map of key:value pairs (a Python dictionary), which can then be used for fast lookups on the driver. It takes no parameters.

Syntax:

RDD_data.collectAsMap()

Example:

In this example, we will return the values from the RDD as a map using collectAsMap().

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

# import RDD from pyspark.rdd
from pyspark.rdd import RDD

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create 4 subject-rating pairs
subjects_rating = spark_app.sparkContext.parallelize([('linux', 5), ('C#', 4),
('javascript', 4), ('python', 53)])

#apply collectAsMap() to return the RDD as a dictionary
print(subjects_rating.collectAsMap())

Output:

{'linux': 5, 'C#': 4, 'javascript': 4, 'python': 53}

We can see that the RDD is returned in the form of key:value pairs.
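Because collectAsMap() returns an ordinary Python dictionary on the driver, the result can be indexed like any dict. A small usage sketch based on the output above:

#store the returned dictionary and index it by key
ratings = subjects_rating.collectAsMap()
print(ratings['python'])
# 53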

Note: if the same key occurs multiple times with different values, collectAsMap() keeps only the last value encountered for that key.

Example:

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

# import RDD from pyspark.rdd
from pyspark.rdd import RDD

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create 6 subject-rating pairs
subjects_rating = spark_app.sparkContext.parallelize([('linux', 5), ('C#', 4), ('javascript', 4),
('python', 53), ('linux', 45), ('C#', 44)])

#apply collectAsMap() to return the RDD as a dictionary
print(subjects_rating.collectAsMap())

Output:

{'linux': 45, 'C#': 44, 'javascript': 4, 'python': 53}

We can see that the keys 'linux' and 'C#' occur twice, with the second occurrences carrying the values 45 and 44. Hence, collectAsMap() returns these later values.
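If the goal is to keep all values for a duplicated key instead of only the last one, a common alternative is to group the values per key before collecting. A minimal sketch using the standard pair RDD transformations groupByKey() and mapValues() (this is a workaround, not part of collectAsMap() itself):

#group all values per key, convert each group to a list, then collect as a map
print(subjects_rating.groupByKey().mapValues(list).collectAsMap())
# e.g. {'linux': [5, 45], 'C#': [4, 44], 'javascript': [4], 'python': [53]}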

Conclusion

In this PySpark RDD tutorial, we saw how to apply the lookup() and collectAsMap() actions on a pair RDD. lookup() takes a key as a parameter and returns the values associated with that key as a list, while collectAsMap() returns the entire pair RDD as a map.
