Spark is a powerful data processing tool used to process large datasets effectively and efficiently. It was developed by the Apache Software Foundation and is also known as Apache Spark.
Spark represents data in a tabular format, and the data structure used for this is the DataFrame. Spark supports the Java, Scala, and Python programming languages. Here, we will use Spark through Python, via the module known as PySpark. In Python, PySpark is a Spark module that provides this same kind of processing using DataFrames.
Installation
All we need is to install PySpark on our system. To install any module in Python, we use the pip command, as follows.

Syntax:

pip install pyspark
Before using PySpark, we have to import the module into our program, and our data will require a Spark app. So let’s import the module and create an app.
We can create an app using SparkSession by importing this class from the pyspark.sql module. This will create a session for our app. Then we create the Spark app from this session using the getOrCreate() method.

Syntax:

spark_app = SparkSession.builder.appName('app_name').getOrCreate()
It’s time to create the data structure known as a DataFrame, which stores the given data in row and column format.
In PySpark, we can create a DataFrame from the Spark app with the createDataFrame() method.

Syntax:

spark_app.createDataFrame(input_data, columns)
Where input_data may be a dictionary or a list used to create the DataFrame. If input_data is a list of dictionaries, the column names need not be provided; if it is a nested list, we have to provide the column names.
Let’s create the PySpark DataFrame.
Code:
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
# dataframe
df.show()
Output
In the above code, we created a list of dictionaries with 5 rows and 6 columns and passed it to the createDataFrame() method to generate the DataFrame. Finally, we display the DataFrame with the show() method, which renders it in a tabular format.
Let’s display the columns in the PySpark DataFrame. We can get the column names in a list format using the columns attribute.

Syntax:

dataframe.columns
Example 2:
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
# display the dataframe columns
print(df.columns)
Output:
Conclusion
In this article, we discussed how to install PySpark, how to create a PySpark DataFrame, and how to get the columns of the DataFrame. We also used the show() method to display the DataFrame in a tabular format.