The Row class in PySpark is used to create a row for a PySpark DataFrame. We can create a Row by using the Row() function.
It is available in the pyspark.sql module, so we have to import Row from that module.
Syntax:
Row(column_name='value', column_name='value', …)
Where,
- column_name is the column for the PySpark DataFrame
- value is the row value for that column
We can specify any number of columns in the Row class.
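For a quick feel of this, here is a minimal sketch of a single Row and how its fields can be read; the column names and values are only for illustration:
from pyspark.sql import Row

# a single Row with two columns (illustrative values)
person = Row(name='sravan', age=23)

# fields can be read by attribute or by column name
print(person.name)     # sravan
print(person['age'])   # 23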
If we want to create several rows, we have to specify the Row objects inside a list, separated by commas.
Syntax:
[Row(column_name='value', …), Row(column_name='value', …), …]
To create a PySpark DataFrame from these rows, we simply pass the Row list to the createDataFrame() method.
If we want to display the PySpark DataFrame in row format, we have to use the collect() method.
This method returns the data row by row, as a list of Row objects.
Syntax:
Dataframe.collect()
Where Dataframe is the input PySpark DataFrame.
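Since collect() returns a list of Row objects, we can also loop over it and read individual columns; a small sketch, assuming a DataFrame df with the columns used in the examples below:
# collect() gives a list of Row objects, so we can loop over them
for row in df.collect():
    print(row.rollno, row.name)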
Example:
This example creates 5 rows with 6 columns using the Row class and displays the DataFrame using the collect() method.
import pyspark
#import SparkSession for creating a session and Row
from pyspark.sql import SparkSession, Row
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
#create rows
row_data = [Row(rollno='001', name='sravan', age=23, height=5.79, weight=67, address='guntur'),
            Row(rollno='002', name='ojaswi', age=16, height=3.79, weight=34, address='hyd'),
            Row(rollno='003', name='gnanesh chowdary', age=7, height=2.79, weight=17, address='patna'),
            Row(rollno='004', name='rohith', age=9, height=3.69, weight=28, address='hyd'),
            Row(rollno='005', name='sridevi', age=37, height=5.59, weight=54, address='hyd')]
#create the dataframe from row_data
df = spark_app.createDataFrame(row_data)
#display the dataframe row by row
df.collect()
Output:
[Row(rollno='001', name='sravan', age=23, height=5.79, weight=67, address='guntur'),
Row(rollno='002', name='ojaswi', age=16, height=3.79, weight=34, address='hyd'),
Row(rollno='003', name='gnanesh chowdary', age=7, height=2.79, weight=17, address='patna'),
Row(rollno='004', name='rohith', age=9, height=3.69, weight=28, address='hyd'),
Row(rollno='005', name='sridevi', age=37, height=5.59, weight=54, address='hyd')]
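If a plain Python dictionary is more convenient than a Row, each collected Row can be converted with its asDict() method; a short sketch based on the DataFrame above:
# convert the first collected Row into a plain dictionary
first = df.collect()[0]
print(first.asDict())
# {'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'}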
We can also define the columns first and then pass the values to the rows.
This is done by using a named Row: we define the columns in a Row object, and then call that object to fill in the values for each row.
Syntax:
Row_Name = Row("column_name1", "column_name2", …)
[Row_Name(value1, value2, …, valuen), …, Row_Name(value1, value2, …, valuen)]
Example:
In this example, we create a Row named students with the 6 columns "rollno", "name", "age", "height", "weight", and "address", and then add 5 rows of values to it.
import pyspark
#import SparkSession for creating a session and Row
from pyspark.sql import SparkSession, Row
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
#create a Row with 6 columns
students = Row("rollno", "name", "age", "height", "weight", "address")
#create values for the rows
row_data = [students('001', 'sravan', 23, 5.79, 67, 'guntur'),
            students('002', 'ojaswi', 16, 3.79, 34, 'hyd'),
            students('003', 'gnanesh chowdary', 7, 2.79, 17, 'patna'),
            students('004', 'rohith', 9, 3.69, 28, 'hyd'),
            students('005', 'sridevi', 37, 5.59, 54, 'hyd')]
#create the dataframe from row_data
df = spark_app.createDataFrame(row_data)
#display the dataframe row by row
df.collect()
Output:
[Row(rollno='001', name='sravan', age=23, height=5.79, weight=67, address='guntur'),
Row(rollno='002', name='ojaswi', age=16, height=3.79, weight=34, address='hyd'),
Row(rollno='003', name='gnanesh chowdary', age=7, height=2.79, weight=17, address='patna'),
Row(rollno='004', name='rohith', age=9, height=3.69, weight=28, address='hyd'),
Row(rollno='005', name='sridevi', age=37, height=5.59, weight=54, address='hyd')]
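Note that the named Row object acts as a reusable factory: each positional call produces the same kind of Row as the keyword form in the first example. A minimal sketch:
from pyspark.sql import Row

# define the columns once
students = Row("rollno", "name")

# each call fills the columns positionally and returns a new Row
row = students('001', 'sravan')
print(row)        # Row(rollno='001', name='sravan')
print(row.name)   # sravan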
Creating a Nested Row
A Row inside a Row is known as a nested Row. Creating a nested Row inside a Row is similar to normal Row creation.
Syntax:
[Row(column_name=Row(column_name='value', …), …), …]
Example:
In this example, we create a DataFrame similar to the one above, but we add a column named subjects to each row and fill it with java and php values using a nested Row.
import pyspark
#import SparkSession for creating a session and Row
from pyspark.sql import SparkSession, Row
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
#create rows with a nested subjects Row in each
row_data = [Row(rollno='001', name='sravan', age=23, height=5.79, weight=67, address='guntur', subjects=Row(subject1='java', subject2='php')),
            Row(rollno='002', name='ojaswi', age=16, height=3.79, weight=34, address='hyd', subjects=Row(subject1='java', subject2='php')),
            Row(rollno='003', name='gnanesh chowdary', age=7, height=2.79, weight=17, address='patna', subjects=Row(subject1='java', subject2='php')),
            Row(rollno='004', name='rohith', age=9, height=3.69, weight=28, address='hyd', subjects=Row(subject1='java', subject2='php')),
            Row(rollno='005', name='sridevi', age=37, height=5.59, weight=54, address='hyd', subjects=Row(subject1='java', subject2='php'))]
#create the dataframe from row_data
df = spark_app.createDataFrame(row_data)
#display the dataframe row by row
df.collect()
Output:
[Row(rollno='001', name='sravan', age=23, height=5.79, weight=67, address='guntur', subjects=Row(subject1='java', subject2='php')),
Row(rollno='002', name='ojaswi', age=16, height=3.79, weight=34, address='hyd', subjects=Row(subject1='java', subject2='php')),
Row(rollno='003', name='gnanesh chowdary', age=7, height=2.79, weight=17, address='patna', subjects=Row(subject1='java', subject2='php')),
Row(rollno='004', name='rohith', age=9, height=3.69, weight=28, address='hyd', subjects=Row(subject1='java', subject2='php')),
Row(rollno='005', name='sridevi', age=37, height=5.59, weight=54, address='hyd', subjects=Row(subject1='java', subject2='php'))]
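Nested values can be read from the collected Row objects, or selected with dot notation on the struct column; a short sketch based on the DataFrame above:
# read a nested value from the first collected Row
first = df.collect()[0]
print(first.subjects.subject1)   # java

# or select the nested columns with dot notation
df.select("subjects.subject1", "subjects.subject2").show()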
Conclusion:
This article discussed the Row class and how to create a PySpark DataFrame using it. Finally, we covered the nested Row class.