
Different Ways to Create PySpark DataFrame

In Python, PySpark is the Spark module that provides Spark-like data processing through the DataFrame API. In this article, we will discuss several ways to create a PySpark DataFrame.

Method 1: Using Dictionary

A dictionary is a data structure that stores data as key-value pairs.

Each key acts as a column name and each value acts as the row data in the PySpark DataFrame. The dictionaries have to be passed inside a list.

Structure:

[{'key' : value}]

We can also provide multiple dictionaries.

Structure:

[{'key' : value}, {'key' : value}, …, {'key' : value}]

Example:

Here, we are going to create a PySpark DataFrame with 5 rows and 6 columns from a list of dictionaries, and then display it using the show() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23,
             'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16,
             'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7,
             'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9,
             'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37,
             'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

#display the dataframe
df.show()

Output:
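When the DataFrame is built from dictionaries, PySpark infers the schema from the keys and typically orders the columns alphabetically (recent versions also warn that inferring the schema from a dict is deprecated in favor of pyspark.sql.Row). The output of show() should resemble the following sketch; the exact column order can vary by Spark version:

+-------+---+------+----------------+------+------+
|address|age|height|            name|rollno|weight|
+-------+---+------+----------------+------+------+
| guntur| 23|  5.79|          sravan|   001|    67|
|    hyd| 16|  3.79|          ojaswi|   002|    34|
|  patna|  7|  2.79|gnanesh chowdary|   003|    17|
|    hyd|  9|  3.69|          rohith|   004|    28|
|    hyd| 37|  5.59|         sridevi|   005|    54|
+-------+---+------+----------------+------+------+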

Method 2: Using List of Tuples

A tuple is a data structure that stores data inside parentheses ().

We can pass each row as a tuple of comma-separated values, wrapped in a list.

Structure:

[(value1, value2, …, valuen)]

We can also provide multiple tuples in a list.

Structure:

[(value1, value2, …, valuen), (value1, value2, …, valuen), …, (value1, value2, …, valuen)]

We need to provide the column names through a list while creating the DataFrame.

Syntax:

column_names = ['column1', 'column2', …, 'columnN']

spark_app.createDataFrame(list_of_tuples, column_names)

Example:

Here, we are going to create a PySpark DataFrame with 5 rows and 6 columns from a list of tuples, and then display it using the show() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [('001', 'sravan', 23, 5.79, 67, 'guntur'),
            ('002', 'ojaswi', 16, 3.79, 34, 'hyd'),
            ('003', 'gnanesh chowdary', 7, 2.79, 17, 'patna'),
            ('004', 'rohith', 9, 3.69, 28, 'hyd'),
            ('005', 'sridevi', 37, 5.59, 54, 'hyd')]

#assign the column names
column_names = ['rollno','name','age','height','weight','address']

# create the dataframe
df = spark_app.createDataFrame(students, column_names)

#display the dataframe
df.show()

Output:
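Here the column order follows the column_names list, so show() should produce a table resembling:

+------+----------------+---+------+------+-------+
|rollno|            name|age|height|weight|address|
+------+----------------+---+------+------+-------+
|   001|          sravan| 23|  5.79|    67| guntur|
|   002|          ojaswi| 16|  3.79|    34|    hyd|
|   003|gnanesh chowdary|  7|  2.79|    17|  patna|
|   004|          rohith|  9|  3.69|    28|    hyd|
|   005|         sridevi| 37|  5.59|    54|    hyd|
+------+----------------+---+------+------+-------+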

Method 3: Using Tuple of Lists

A list is a data structure that stores data inside square brackets [].

We can pass each row as a list of comma-separated values, wrapped in a tuple.

Structure:

([value1, value2, …, valuen],)

(The trailing comma makes this a one-element tuple.)

We can also provide multiple lists in a tuple.

Structure:

([value1, value2, …, valuen], [value1, value2, …, valuen], …, [value1, value2, …, valuen])

We need to provide the column names through a list while creating the DataFrame.

Syntax:

column_names = ['column1', 'column2', …, 'columnN']

spark_app.createDataFrame(tuple_of_lists, column_names)

Example:

Here, we are going to create a PySpark DataFrame with 5 rows and 6 columns from a tuple of lists, and then display it using the show() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = (['001', 'sravan', 23, 5.79, 67, 'guntur'],
            ['002', 'ojaswi', 16, 3.79, 34, 'hyd'],
            ['003', 'gnanesh chowdary', 7, 2.79, 17, 'patna'],
            ['004', 'rohith', 9, 3.69, 28, 'hyd'],
            ['005', 'sridevi', 37, 5.59, 54, 'hyd'])

#assign the column names
column_names = ['rollno','name','age','height','weight','address']

# create the dataframe
df = spark_app.createDataFrame(students, column_names)

#display the dataframe
df.show()

Output:
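Since the rows and column names are the same as in Method 2, show() should display the same table as before.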

Method 4: Using Nested List

A list is a data structure that stores data inside square brackets [].

We can pass each row as a list of comma-separated values, wrapped in another list.

Structure:

[[value1, value2, …, valuen]]

We can also provide multiple lists in a list.

Structure:

[[value1, value2, …, valuen], [value1, value2, …, valuen], …, [value1, value2, …, valuen]]

We need to provide the column names through a list while creating the DataFrame.

Syntax:

column_names = ['column1', 'column2', …, 'columnN']

spark_app.createDataFrame(nested_list, column_names)

Example:

Here, we are going to create a PySpark DataFrame with 5 rows and 6 columns from a nested list, and then display it using the show() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [['001', 'sravan', 23, 5.79, 67, 'guntur'],
            ['002', 'ojaswi', 16, 3.79, 34, 'hyd'],
            ['003', 'gnanesh chowdary', 7, 2.79, 17, 'patna'],
            ['004', 'rohith', 9, 3.69, 28, 'hyd'],
            ['005', 'sridevi', 37, 5.59, 54, 'hyd']]

#assign the column names
column_names = ['rollno','name','age','height','weight','address']

# create the dataframe
df = spark_app.createDataFrame(students, column_names)

#display the dataframe
df.show()

Output:
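Again, the displayed table should match the one from Methods 2 and 3, as only the container type of the input has changed.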

Method 5: Using Nested Tuple

A tuple is a data structure that stores data inside parentheses ().

We can pass each row as a tuple of comma-separated values, wrapped in another tuple.

Structure:

((value1, value2, …, valuen),)

(The trailing comma makes this a one-element tuple.)

We can also provide multiple tuples in a tuple.

Structure:

((value1, value2, …, valuen), (value1, value2, …, valuen), …, (value1, value2, …, valuen))

We need to provide the column names through a list while creating the DataFrame.

Syntax:

column_names = ['column1', 'column2', …, 'columnN']

spark_app.createDataFrame(nested_tuple, column_names)

Example:

Here, we are going to create a PySpark DataFrame with 5 rows and 6 columns from a nested tuple, and then display it using the show() method.

#import the pyspark module
import pyspark

#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = (('001', 'sravan', 23, 5.79, 67, 'guntur'),
            ('002', 'ojaswi', 16, 3.79, 34, 'hyd'),
            ('003', 'gnanesh chowdary', 7, 2.79, 17, 'patna'),
            ('004', 'rohith', 9, 3.69, 28, 'hyd'),
            ('005', 'sridevi', 37, 5.59, 54, 'hyd'))

#assign the column names
column_names = ['rollno','name','age','height','weight','address']

# create the dataframe
df = spark_app.createDataFrame(students, column_names)

#display the dataframe
df.show()

Output:
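As before, show() should display the same table as in Methods 2 through 4, since the data and column names are unchanged.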

Conclusion

In this tutorial, we discussed five ways to create a PySpark DataFrame: from a list of dictionaries, a list of tuples, a tuple of lists, a nested list, and a nested tuple. For all methods except the dictionary, the column names must be supplied through a separate list; with dictionaries, the column names are taken from the keys, so no column list is needed.

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology. Known programming languages: Python, R, PHP, MySQL. Has published 500+ articles in the computer science domain.