Apache Spark

PySpark – Row Class

In Python, PySpark is a Spark module that provides Spark-like processing capabilities using DataFrames.

The Row class in PySpark is used to create rows for a PySpark DataFrame. We can create a Row by calling the Row() constructor.

It is available in the pyspark.sql module, so we have to import Row from there.

Syntax:

Row(column_name='value', ……)

Where,

  1. column_name is the name of a column in the PySpark DataFrame
  2. value is the row value for that column

We can specify any number of columns in the Row class.
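For instance, here is a minimal sketch that creates a single Row and reads its fields back (the column names and values are just illustrative):

#import Row from the pyspark.sql module
from pyspark.sql import Row

#create a single Row with two columns
person = Row(name='sravan', age=23)

#fields can be read by attribute or by position
print(person.name)   # sravan
print(person[1])     # 23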

If we want to create several rows, we place multiple Row objects in a list, separated by commas.

Syntax:

[Row(column_name='value', ……), Row(column_name='value', ……), ……]

To create a PySpark DataFrame from these rows, we simply pass the Row list to the createDataFrame() method.

If we want to display the PySpark DataFrame in Row format, we have to use the collect() method. This method returns the data row by row.

Syntax:

Dataframe.collect()

Where Dataframe is the input PySpark DataFrame.

Example:

This example creates 5 rows with 6 columns using the Row class and displays the DataFrame using the collect() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session and Row for creating rows
from pyspark.sql import SparkSession, Row

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

#create rows
row_data=[Row(rollno='001', name='sravan', age=23, height=5.79, weight=67, address='guntur'),
 Row(rollno='002', name='ojaswi', age=16, height=3.79, weight=34, address='hyd'),
 Row(rollno='003', name='gnanesh chowdary', age=7, height=2.79, weight=17, address='patna'),
 Row(rollno='004', name='rohith', age=9, height=3.69, weight=28, address='hyd'),
 Row(rollno='005', name='sridevi', age=37, height=5.59, weight=54, address='hyd')]
 
#create the dataframe from row_data
df = spark_app.createDataFrame(row_data)

#display the dataframe row by row
df.collect()

Output:

[Row(rollno='001', name='sravan', age=23, height=5.79, weight=67, address='guntur'),
Row(rollno='002', name='ojaswi', age=16, height=3.79, weight=34, address='hyd'),
Row(rollno='003', name='gnanesh chowdary', age=7, height=2.79, weight=17, address='patna'),
Row(rollno='004', name='rohith', age=9, height=3.69, weight=28, address='hyd'),
Row(rollno='005', name='sridevi', age=37, height=5.59, weight=54, address='hyd')]
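Each element returned by collect() is a Row object. As a quick sketch (reusing the df from the example above), a Row can also be converted to a plain Python dictionary with its asDict() method:

#take the first collected Row and convert it to a dictionary
first_row = df.collect()[0]
print(first_row.asDict())
# {'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'}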

We can also define the columns first and then pass the values to the rows.

This is done by creating a named Row: we define a Row that holds only the column names, then call that named Row to supply the values for each row.

Syntax:

Row_Name = Row("column_name1", "column_name2", ……, "column_name_n")

[Row_Name(value1, value2, ……, value_n), ……, Row_Name(value1, value2, ……, value_n)]

Example:

In this example, we define a Row named students with 6 columns ("rollno", "name", "age", "height", "weight", and "address") and then add 5 rows of values using this students Row.

#import the pyspark module
import pyspark
#import SparkSession for creating a session and Row for creating rows
from pyspark.sql import SparkSession, Row

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

#create a Row with 6 columns
students = Row("rollno", "name", "age", "height", "weight", "address")

#create values for the rows
row_data=[students('001','sravan',23,5.79,67,'guntur'),
          students('002','ojaswi',16,3.79,34,'hyd'),
          students('003','gnanesh chowdary',7,2.79,17,'patna'),
          students('004','rohith',9,3.69,28,'hyd'),
          students('005','sridevi',37,5.59,54,'hyd')]
 
#create the dataframe from row_data
df = spark_app.createDataFrame(row_data)

#display the dataframe row by row
df.collect()

Output:

[Row(rollno='001', name='sravan', age=23, height=5.79, weight=67, address='guntur'),
Row(rollno='002', name='ojaswi', age=16, height=3.79, weight=34, address='hyd'),
Row(rollno='003', name='gnanesh chowdary', age=7, height=2.79, weight=17, address='patna'),
Row(rollno='004', name='rohith', age=9, height=3.69, weight=28, address='hyd'),
Row(rollno='005', name='sridevi', age=37, height=5.59, weight=54, address='hyd')]

Creating a Nested Row

A Row inside a Row is known as a nested Row. Creating a nested Row inside another Row is similar to normal Row creation.

Syntax:

[Row(column_name=Row(column_name='value', ……), ……),
Row(column_name=Row(column_name='value', ……), ……),
……]

Example:

In this example, we create a DataFrame similar to the one above, but we add a column named subjects to each Row, holding java and php values in a nested Row.

#import the pyspark module
import pyspark
#import SparkSession for creating a session and Row for creating rows
from pyspark.sql import SparkSession, Row

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

#create rows
row_data=[Row(rollno='001', name='sravan', age=23, height=5.79, weight=67, address='guntur',subjects=Row(subject1='java',subject2='php')),
 Row(rollno='002', name='ojaswi', age=16, height=3.79, weight=34, address='hyd',subjects=Row(subject1='java',subject2='php')),
 Row(rollno='003', name='gnanesh chowdary', age=7, height=2.79, weight=17, address='patna',subjects=Row(subject1='java',subject2='php')),
 Row(rollno='004', name='rohith', age=9, height=3.69, weight=28, address='hyd',subjects=Row(subject1='java',subject2='php')),
 Row(rollno='005', name='sridevi', age=37, height=5.59, weight=54, address='hyd',subjects=Row(subject1='java',subject2='php'))]
 
#create the dataframe from row_data
df = spark_app.createDataFrame(row_data)

#display the dataframe row by row
df.collect()

Output:

[Row(rollno='001', name='sravan', age=23, height=5.79, weight=67, address='guntur', subjects=Row(subject1='java', subject2='php')),
Row(rollno='002', name='ojaswi', age=16, height=3.79, weight=34, address='hyd', subjects=Row(subject1='java', subject2='php')),
Row(rollno='003', name='gnanesh chowdary', age=7, height=2.79, weight=17, address='patna', subjects=Row(subject1='java', subject2='php')),
Row(rollno='004', name='rohith', age=9, height=3.69, weight=28, address='hyd', subjects=Row(subject1='java', subject2='php')),
Row(rollno='005', name='sridevi', age=37, height=5.59, weight=54, address='hyd', subjects=Row(subject1='java', subject2='php'))]
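A nested Row becomes a struct column in the DataFrame, so its inner fields can be reached with dot notation in select(), or by attribute access on the collected Row objects. A short sketch, reusing the df from the example above:

#select an inner field of the nested subjects column
df.select("subjects.subject1").collect()
# [Row(subject1='java'), Row(subject1='java'), ...]

#or read it from a collected Row
print(df.collect()[0].subjects.subject2)   # php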

Conclusion:

This article discussed the Row class and how to create a PySpark DataFrame using it. Finally, we discussed the nested Row class.
