
PySpark Read JSON

To process JSON data with PySpark, it first needs to be loaded into a PySpark DataFrame. Once it is in a DataFrame, we can apply the various operations and methods to the data. Converting JSON to a PySpark DataFrame also has many advantages, since it is simple and lets us transform and partition the data more easily.

Table of Contents:

Reading JSON into the PySpark DataFrame Using pandas.read_json()

Reading JSON into the PySpark DataFrame Using spark.read.json()

Reading JSON into the PySpark DataFrame Using PySpark SQL

In this tutorial, we will look at how to read JSON into the PySpark DataFrame using pandas.read_json(), spark.read.json(), and spark.sql(). In each scenario, we will look at different examples that cover different JSON formats.

Install the PySpark library before implementing the following examples.

pip install pyspark


Reading JSON into the PySpark DataFrame Using pandas.read_json()

In PySpark, the createDataFrame() method creates a DataFrame directly. Here, we just need to pass the result of pandas.read_json() to it. The read_json() method takes the file name/path of the JSON file and is available in the Pandas module, which is why we need to import and use Pandas as well.

Syntax:

spark_app.createDataFrame(pandas.read_json('file_name.json'))

Example:

Let’s create a JSON file named “student_skill.json” that holds 2 records. Here, the keys/columns are “Student 1” and “Student 2”. The rows are name, age, skill1, and skill2.
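As a sketch, a file matching that description could look like the following; the names, ages, and skills are placeholder values:

{
  "Student 1": {"name": "Sravan", "age": 23, "skill1": "Python", "skill2": "SQL"},
  "Student 2": {"name": "Kumar", "age": 22, "skill1": "R", "skill2": "PHP"}
}

With this column-oriented layout, pandas.read_json() treats "Student 1" and "Student 2" as the columns and name, age, skill1, and skill2 as the row index.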

import pyspark
import pandas
from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# Using pandas.read_json()
candidate_skills = linuxhint_spark_app.createDataFrame(pandas.read_json('student_skill.json'))

candidate_skills.show()

Output:

We can see that the JSON data is converted to a PySpark DataFrame with the specified columns and rows.

Reading JSON into the PySpark DataFrame Using spark.read.json()

The read.json() method is similar to read_json() in Pandas. Here, read.json() takes the path to a JSON file (or the file itself) and loads it straight into the PySpark DataFrame; there is no need to use the createDataFrame() method in this scenario. To read multiple JSON files at a time, pass a comma-separated list of JSON file names. All the JSON records are then stored in a single DataFrame.

Syntax:

Single File - spark_app.read.json('file_name.json')

Multiple Files - spark_app.read.json(['file1.json','file2.json',...])

Scenario 1: Read JSON Having Single Line

If your JSON file keeps each record on its own line (record1, record2, record3, and so on, one per line), we call it single-line JSON, also known as JSON Lines. Spark processes these records and stores them in the PySpark DataFrame as rows; each record becomes one row.

Let’s create a JSON file named “candidate_skills.json” that holds 3 records. Read this JSON into the PySpark DataFrame.
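As a sketch, a single-line JSON file with three records could look like the following, with one complete record per line; the field names and values are placeholders:

{"name": "Sravan", "age": 23, "skills": "Python"}
{"name": "Mohan", "age": 21, "skills": "SQL"}
{"name": "Divya", "age": 20, "skills": "HTML"}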

import pyspark
from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# Read candidate_skills.json into the PySpark DataFrame
candidate_skills = linuxhint_spark_app.read.json('candidate_skills.json')

candidate_skills.show()

Output:

We can see that the JSON data is converted to a PySpark DataFrame with the specified records and column names.

Scenario 2: Read JSON Having Multiple Lines

If your JSON file spans multiple lines, chain the option() method before json() and set the multiline option to "true". This allows us to load JSON that spans multiple lines into the PySpark DataFrame.

read.option("multiline","true").json("file_name.json")

Let’s create a JSON file named “multi.json” that holds 3 records. Read this JSON into the PySpark DataFrame.
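As a sketch, a multiline JSON file with three records could be a pretty-printed JSON array like the following; the values are placeholders:

[
  {"name": "Sravan", "age": 23},
  {"name": "Mohan", "age": 21},
  {"name": "Divya", "age": 20}
]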

import pyspark
from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# Read multi.json (having multiple lines) into the PySpark DataFrame
candidate_skills = linuxhint_spark_app.read.option("multiline","true").json("multi.json")

candidate_skills.show()

Output:

The three records from the multiline JSON are loaded into the PySpark DataFrame as rows.

Scenario 3: Read Multiple JSON Files

We already discussed reading multiple JSON files at the beginning of this tutorial. If you want to read multiple JSON files at a time and store them in a single PySpark DataFrame, pass a list of file names to the read.json() method.

Let’s create two JSON files named “candidate_skills.json” and “candidate_skills2.json” and load them into the PySpark DataFrame.

The “candidate_skills.json” file holds three records.

The “candidate_skills2.json” file holds only a single record.
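As a sketch, its single record could look like this (placeholder values):

{"name": "Akash", "age": 25, "skills": "Java"}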

import pyspark
from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# Read candidate_skills.json and candidate_skills2.json at the same time into the PySpark DataFrame
candidate_skills = linuxhint_spark_app.read.json(['candidate_skills.json','candidate_skills2.json'])

candidate_skills.show()

Output:

Finally, the DataFrame holds four records. The first three records belong to the first JSON file and the last record belongs to the second one. If the files had different columns, Spark would merge their schemas and fill in the missing values with null.
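Beyond an explicit list of names, spark.read.json() also accepts a directory path or a glob pattern, which is convenient when all the files live in one folder. A minimal sketch, assuming a hypothetical folder named json_files/ that contains both files:

# Read every JSON file in the json_files/ directory into one DataFrame
candidate_skills = linuxhint_spark_app.read.json('json_files/*.json')
candidate_skills.show()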

Reading JSON into the PySpark DataFrame Using PySpark SQL

It is also possible to create a temporary view of our JSON data using PySpark SQL. We can provide the JSON file directly at the time of creating the temporary view; look at the following syntax. After that, we can use the SELECT command to display the PySpark DataFrame.

Syntax:

spark_app.sql("CREATE TEMPORARY VIEW VIEW_NAME USING json OPTIONS (path 'file_name.json')")

Here, “VIEW_NAME” is the name of the view over the JSON data and “file_name” is the name of the JSON file.

Example 1:

Consider the JSON file used in the previous examples, “candidate_skills.json”. Select all the rows from the DataFrame using SELECT with the “*” operator; here, * selects all the columns from the PySpark DataFrame.

import pyspark
from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# Using spark.sql to create a VIEW from the JSON
linuxhint_spark_app.sql("CREATE TEMPORARY VIEW Candidate_data USING json OPTIONS (path 'candidate_skills.json')")

# Use a SELECT query to select all records from Candidate_data.
linuxhint_spark_app.sql("SELECT * from Candidate_data").show()

Output:

The total number of records in the PySpark DataFrame (read from the JSON) is 3.

Example 2:

Now, filter the records in the PySpark DataFrame based on the age column. Use the “greater than” operator on the age column to get the rows whose age is greater than 22.

# Use a SELECT query to select records with age > 22.
linuxhint_spark_app.sql("SELECT * from Candidate_data where age>22").show()

Output:

There is only one record in the PySpark DataFrame with an age greater than 22.
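Since the view behaves like any SQL table, other clauses work as well. A minimal sketch, assuming the Candidate_data view also has a name column:

# Sort the candidates by age in descending order
linuxhint_spark_app.sql("SELECT name, age FROM Candidate_data ORDER BY age DESC").show()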

Conclusion

We learned three different ways to read JSON into the PySpark DataFrame. First, we learned how to use the read_json() method from the Pandas module together with createDataFrame(). Next, we learned how to read single-line and multiline JSON files using the spark.read.json() method with option(); to read multiple JSON files at a time, we pass a list of file names to this method. Finally, using PySpark SQL, we loaded the JSON file into a temporary view and displayed the DataFrame using a SELECT query.

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology. Known programming languages: Python, R, PHP, MySQL. Published 500+ articles in the computer science domain.