Method 1: Using Dictionary
A dictionary is a data structure that stores data as key-value pairs.
Each key acts as a column name and each value acts as a row value in the PySpark DataFrame. The dictionaries have to be passed inside a list.
Structure:
[{'key' : value}]
We can also provide multiple dictionaries.
Structure:
[{'key' : value}, {'key' : value}, ..., {'key' : value}]
Example:
Here, we create a PySpark DataFrame with 5 rows and 6 columns from a list of dictionaries. Finally, we display the DataFrame using the show() method.
import pyspark
# import SparkSession for creating a session
from pyspark.sql import SparkSession

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23,
             'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16,
             'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7,
             'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9,
             'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37,
             'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe from the list of dictionaries
df = spark_app.createDataFrame(students)

# display the dataframe
df.show()
Output:
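Since the dictionary keys become the column names, we can confirm this by printing the schema that PySpark inferred; a minimal check that reuses the df created above:

# print the column names and the types PySpark inferred from the data
df.printSchema()

# list just the column names
print(df.columns)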
Method 2: Using list of tuples
A tuple is a data structure that stores data in parentheses ().
We can pass the rows as comma-separated tuples inside a list.
Structure:
[(value1, value2, ..., valuen)]
We can also provide multiple tuples in a list.
Structure:
[(value1, value2, ..., valuen), (value1, value2, ..., valuen), ..., (value1, value2, ..., valuen)]
We need to provide the column names through a list while creating the DataFrame.
Syntax:
spark_app.createDataFrame(list_of_tuples, column_names)
Example:
Here, we create a PySpark DataFrame with 5 rows and 6 columns from a list of tuples. Finally, we display the DataFrame using the show() method.
import pyspark
# import SparkSession for creating a session
from pyspark.sql import SparkSession

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [('001', 'sravan', 23, 5.79, 67, 'guntur'),
            ('002', 'ojaswi', 16, 3.79, 34, 'hyd'),
            ('003', 'gnanesh chowdary', 7, 2.79, 17, 'patna'),
            ('004', 'rohith', 9, 3.69, 28, 'hyd'),
            ('005', 'sridevi', 37, 5.59, 54, 'hyd')]

# assign the column names
column_names = ['rollno', 'name', 'age', 'height', 'weight', 'address']

# create the dataframe from the list of tuples
df = spark_app.createDataFrame(students, column_names)

# display the dataframe
df.show()
Output:
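Because the column names are matched to the tuple values by position, it is worth double-checking the mapping; a small sketch that reuses the df created above:

# confirm that the names were assigned in the order given
print(df.columns)

# select a couple of columns to verify the values landed where expected
df.select('name', 'age').show()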
Method 3: Using tuple of lists
A list is a data structure that stores data in square brackets [].
We can pass the rows as comma-separated lists inside a tuple.
Structure:
([value1, value2, ..., valuen],)
We can also provide multiple lists in a tuple.
Structure:
([value1, value2, ..., valuen], [value1, value2, ..., valuen], ..., [value1, value2, ..., valuen])
We need to provide the column names through a list while creating the DataFrame.
Syntax:
spark_app.createDataFrame(tuple_of_lists, column_names)
Example:
Here, we create a PySpark DataFrame with 5 rows and 6 columns from a tuple of lists. Finally, we display the DataFrame using the show() method.
import pyspark
# import SparkSession for creating a session
from pyspark.sql import SparkSession

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = (['001', 'sravan', 23, 5.79, 67, 'guntur'],
            ['002', 'ojaswi', 16, 3.79, 34, 'hyd'],
            ['003', 'gnanesh chowdary', 7, 2.79, 17, 'patna'],
            ['004', 'rohith', 9, 3.69, 28, 'hyd'],
            ['005', 'sridevi', 37, 5.59, 54, 'hyd'])

# assign the column names
column_names = ['rollno', 'name', 'age', 'height', 'weight', 'address']

# create the dataframe from the tuple of lists
df = spark_app.createDataFrame(students, column_names)

# display the dataframe
df.show()
Output:
Method 4: Using nested list
A list is a data structure that stores data in square brackets [].
So, we can pass the rows as comma-separated lists inside another list.
Structure:
[[value1, value2, ..., valuen]]
We can also provide multiple lists in a list.
Structure:
[[value1, value2, ..., valuen], [value1, value2, ..., valuen], ..., [value1, value2, ..., valuen]]
We need to provide the column names through a list while creating the DataFrame.
Syntax:
spark_app.createDataFrame(nested_list, column_names)
Example:
Here, we create a PySpark DataFrame with 5 rows and 6 columns from a nested list. Finally, we display the DataFrame using the show() method.
import pyspark
# import SparkSession for creating a session
from pyspark.sql import SparkSession

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [['001', 'sravan', 23, 5.79, 67, 'guntur'],
            ['002', 'ojaswi', 16, 3.79, 34, 'hyd'],
            ['003', 'gnanesh chowdary', 7, 2.79, 17, 'patna'],
            ['004', 'rohith', 9, 3.69, 28, 'hyd'],
            ['005', 'sridevi', 37, 5.59, 54, 'hyd']]

# assign the column names
column_names = ['rollno', 'name', 'age', 'height', 'weight', 'address']

# create the dataframe from the nested list
df = spark_app.createDataFrame(students, column_names)

# display the dataframe
df.show()
Output:
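Each method in this tutorial builds the same 5-row, 6-column DataFrame, and we can verify that shape programmatically; a minimal check that reuses the df created above:

# count the rows and columns to confirm the 5 x 6 shape
print(df.count())        # number of rows
print(len(df.columns))   # number of columns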
Method 5: Using nested tuple
A tuple is a data structure that stores data in parentheses (). We can pass the rows as comma-separated tuples inside another tuple.
Structure:
((value1, value2, ..., valuen),)
We can also provide multiple tuples in a tuple.
Structure:
((value1, value2, ..., valuen), (value1, value2, ..., valuen), ..., (value1, value2, ..., valuen))
We need to provide the column names through a list while creating the DataFrame.
Syntax:
spark_app.createDataFrame(nested_tuple, column_names)
Example:
Here, we create a PySpark DataFrame with 5 rows and 6 columns from a nested tuple. Finally, we display the DataFrame using the show() method.
import pyspark
# import SparkSession for creating a session
from pyspark.sql import SparkSession

# create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = (('001', 'sravan', 23, 5.79, 67, 'guntur'),
            ('002', 'ojaswi', 16, 3.79, 34, 'hyd'),
            ('003', 'gnanesh chowdary', 7, 2.79, 17, 'patna'),
            ('004', 'rohith', 9, 3.69, 28, 'hyd'),
            ('005', 'sridevi', 37, 5.59, 54, 'hyd'))

# assign the column names
column_names = ['rollno', 'name', 'age', 'height', 'weight', 'address']

# create the dataframe from the nested tuple
df = spark_app.createDataFrame(students, column_names)

# display the dataframe
df.show()
Output:
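In all the examples above, PySpark infers the column types from the data. If you need explicit types, createDataFrame() also accepts a DDL-formatted schema string in place of the column names list; a minimal sketch, assuming the students nested tuple from the example above:

# define the column names with explicit types as a DDL schema string
schema = 'rollno string, name string, age int, height double, weight int, address string'

# create the dataframe with the explicit schema instead of a plain names list
df = spark_app.createDataFrame(students, schema)
df.printSchema()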
Conclusion
In this tutorial, we discussed five methods to create a PySpark DataFrame: from a dictionary, a list of tuples, a tuple of lists, a nested list, and a nested tuple, providing the column names through a separate list where needed. There is no need to provide the column names list when creating the PySpark DataFrame from a dictionary, because the dictionary keys act as the column names.