Apache Spark

PySpark Introduction

The volume of data grows every day, and storing and processing it requires a large amount of memory in a way that is efficient and easy to manage. Big data technology addresses this need, and Spark is one of its key tools.

Spark is a powerful data processing engine used to store and process data effectively and efficiently. It is maintained as an Apache Software Foundation project and is formally known as Apache Spark.

Spark represents data in a tabular format, and the data structure it uses for this is the DataFrame. Spark supports the Java, Scala, and Python programming languages; in this article, we will use Spark from Python.

The Python API for Spark is called PySpark. PySpark is the Spark module for Python that provides the same DataFrame-based processing.

Installation

All we need to do is install PySpark on our system. To install a Python module, we use the pip command, with the following syntax.

Syntax:

pip install pyspark

Before using PySpark, we have to import the module into our program, and our data will require a Spark application. So let's import the module and create an app.

We create an app using the SparkSession class, which we import from the pyspark.sql module. Calling the getOrCreate() method on the session builder returns the existing session for our app or creates a new one.

Syntax:

spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

It's time to create a DataFrame, the data structure that stores the given data in row and column format.

In PySpark, we create a DataFrame from the Spark app with the createDataFrame() method.

Syntax:

spark_app.createDataFrame(input_data, columns)

Here, input_data may be a dictionary or a list from which to create the dataframe. If input_data is a list of dictionaries, the column names do not need to be provided; if it is a nested list, we have to provide them.

Let's create a PySpark DataFrame.

Example 1:

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
               {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
               {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
               {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
               {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# dataframe
df.show()

Output:

+-------+---+------+----------------+------+------+
|address|age|height|            name|rollno|weight|
+-------+---+------+----------------+------+------+
| guntur| 23|  5.79|          sravan|   001|    67|
|    hyd| 16|  3.79|          ojaswi|   002|    34|
|  patna|  7|  2.79|gnanesh chowdary|   003|    17|
|    hyd|  9|  3.69|          rohith|   004|    28|
|    hyd| 37|  5.59|         sridevi|   005|    54|
+-------+---+------+----------------+------+------+

In the above code, we created a list of dictionaries with 5 rows and 6 columns and passed it to the createDataFrame() method to generate the dataframe. Note that the columns are derived from the dictionary keys and appear in alphabetical order. Finally, we displayed the dataframe with the show() method, which prints it in a tabular format.

Let’s display the columns in PySpark DataFrame.

We can get the column names as a Python list using the columns attribute.

Syntax:

dataframe.columns

Example 2:

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
               {'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
               {'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
               {'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
               {'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students)

# display the dataframe columns
print(df.columns)

Output:

['address', 'age', 'height', 'name', 'rollno', 'weight']

Conclusion

In this article, we discussed how to install PySpark, how to create a PySpark DataFrame, and how to get the columns of the dataframe. We also used the show() method to display the dataframe in tabular format.

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology. Known programming languages: Python, R, PHP, MySQL. Published 500+ articles in the computer science domain.