Apache Spark

PySpark RDD – Actions

In Python, PySpark is the Spark module that provides the same kind of processing as Spark, using Python.

RDD stands for Resilient Distributed Dataset, the fundamental data structure in Apache Spark.

The RDD class can be imported from the pyspark.rdd module.

In PySpark to create an RDD, we can use the parallelize() method.

Syntax:

spark_app.sparkContext.parallelize(data)

Where:

data can be one-dimensional (linear) data or two-dimensional (row-column) data.

RDD Actions:

An action is an operation applied to an RDD that returns a result to the driver program rather than another RDD. In other words, an action triggers the computation and produces a value from the given RDD.

Let’s go through the actions that can be performed on an RDD, one by one.

For all actions, we considered the students RDD as shown below:

[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

1. collect()

collect() action in RDD is used to return all the elements of the given RDD as a list.
Syntax:

RDD_data.collect()

Where RDD_data is the RDD.

Example:

In this example, we will see how to perform collect() action on the students RDD.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

# import RDD from pyspark.rdd
from pyspark.rdd import RDD

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = spark_app.sparkContext.parallelize([{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}])

#perform the collect action
print(students.collect())

Output:

[{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
{'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
{'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
{'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
{'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

You can notice that all the data is returned with the collect() method.

2. count()

count() action in RDD is used to return the total number of elements/values from the given RDD.

Syntax:

RDD_data.count()

Where RDD_data is the RDD.

Example:

In this example, we will see how to perform count() action on the students RDD:

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

# import RDD from pyspark.rdd
from pyspark.rdd import RDD

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = spark_app.sparkContext.parallelize([{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}])

#perform count action
print(students.count())

Output:

5

You can notice that the total number of elements is returned with the count() method.

3. first()

first() action in RDD is used to return the first element/value from the given RDD.

Syntax:

RDD_data.first()

Where RDD_data is the RDD.

Example:

In this example, we will see how to perform first() action on the students RDD.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

# import RDD from pyspark.rdd
from pyspark.rdd import RDD

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = spark_app.sparkContext.parallelize([{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}])

#Apply first() action
print(students.first())

Output:

{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'}

You can notice that the first element is returned with the first() method.

4. take()

take() action in RDD is used to return the first n values from the given RDD. It takes one parameter, n, an integer that specifies the number of elements to return.

Syntax:

RDD_data.take(n)

Parameter:

n – An integer that specifies the number of elements to return from the RDD.

Example:

In this example, we will see how to perform take() action on the students RDD by returning only 2 values.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

# import RDD from pyspark.rdd
from pyspark.rdd import RDD

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = spark_app.sparkContext.parallelize([{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}])

#perform take action to return only top 2 rows
print(students.take(2))

Output:

[{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
{'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'}]

You can notice that the first 2 elements are returned with the take() method.

5. saveAsTextFile()

saveAsTextFile() action is used to store the RDD data as text. It takes a path as its parameter; Spark creates a directory with that name, containing one part file per partition of the RDD.

Syntax:

RDD_data.saveAsTextFile('file_name.txt')

Parameter:

file_name – The path under which the RDD is saved; Spark creates it as a directory of part files.

Example:

In this example, we will see how to perform the saveAsTextFile() action on the students RDD by saving it to disk.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

# import RDD from pyspark.rdd
from pyspark.rdd import RDD

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = spark_app.sparkContext.parallelize([{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}])

#perform saveAsTextFile() action to save RDD into text file.
students.saveAsTextFile('students_file.txt')

Output:

After the action runs, you can see a directory named students_file.txt in the working directory, containing part files with the RDD data.

Conclusion

In this PySpark tutorial, you saw what an RDD is and how to perform the different actions available on it: count() to return the total number of elements in the RDD, collect() to return all the values present in the RDD, first() and take() to return the first element(s), and saveAsTextFile() to save the RDD as text files.

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology; known programming languages: Python, R, PHP, MySQL; published 500+ articles in the computer science domain.