
Convert PySpark Pandas DataFrame to Different Formats

In Python, PySpark is a Spark module that provides pandas-like processing on top of Spark, using DataFrames that store data in row and column format.

A PySpark pandas DataFrame behaves like a pandas DataFrame, but it holds a PySpark DataFrame internally.

The pandas DataFrame data structure is supported through a pandas module that is imported from the pyspark package.

Before that, you have to install the pyspark module.

Command

pip install pyspark

Syntax to import:

from pyspark import pandas

After that, we can create or use a dataframe from the pandas module.

Syntax to create pandas DataFrame:

pyspark.pandas.DataFrame()

We can pass a dictionary or list of lists with values.
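The pandas-on-Spark constructor mirrors the plain pandas constructor, so both a dictionary and a list of lists work. A minimal sketch of the list-of-lists form, illustrated here with plain pandas (the pandas-on-Spark constructor accepts the same arguments):

```python
import pandas as pd

# from a list of lists, naming the columns explicitly
rows = [[90, 100, 91],
        [56, 67, 92],
        [78, 96, 98]]
df = pd.DataFrame(rows, columns=['mark1', 'mark2', 'mark3'])
print(df.shape)  # (3, 3)
```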

Let’s create a pandas DataFrame through pyspark with three columns and five rows.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

print(pyspark_pandas)

Output:

   mark1  mark2  mark3
0     90    100     91
1     56     67     92
2     78     96     98
3     54     89     97
4     67     32     87

Now, we will get into our tutorial and look at the different formats to which the pyspark pandas dataframe created above can be converted.

pyspark.pandas.DataFrame.to_html()

PySpark pandas dataframe is converted to html format such that column names are placed inside <th> tags and column values inside <td> tags.

Syntax:

pyspark_pandas.to_html()

Where pyspark_pandas is the pyspark pandas dataframe.

Example: 1

In this example, we will convert the above pyspark pandas dataframe to html format.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

#convert pyspark_pandas to html

print(pyspark_pandas.to_html())

Output:

You can see that column names are placed inside <th> tags and values are placed inside <td> tags.

pyspark.pandas.DataFrame.to_json()

PySpark pandas dataframe is converted to json format such that column names will act as keys and column values will be values.

Syntax:

pyspark_pandas.to_json()

Where pyspark_pandas is the pyspark pandas dataframe.

Example: 2

In this example, we will convert the above pyspark pandas dataframe to json format.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

#convert pyspark_pandas to json

print(pyspark_pandas.to_json())

Output:

[{"mark1":90,"mark2":100,"mark3":91},{"mark1":56,"mark2":67,"mark3":92},{"mark1":78,"mark2":96,"mark3":98},{"mark1":54,"mark2":89,"mark3":97},{"mark1":67,"mark2":32,"mark3":87}]

You can see that column names are keys.
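Because the output is one JSON object per row, it can be loaded back with the standard json module. A quick sketch using the string printed above:

```python
import json

# the record-per-row JSON string produced above
json_str = ('[{"mark1":90,"mark2":100,"mark3":91},'
            '{"mark1":56,"mark2":67,"mark3":92},'
            '{"mark1":78,"mark2":96,"mark3":98},'
            '{"mark1":54,"mark2":89,"mark3":97},'
            '{"mark1":67,"mark2":32,"mark3":87}]')
records = json.loads(json_str)
print(len(records))         # 5 rows
print(records[0]['mark1'])  # 90
```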

pyspark.pandas.DataFrame.to_numpy()

PySpark pandas dataframe is converted to array format using the to_numpy() method.

Syntax:

pyspark_pandas.to_numpy()

Where pyspark_pandas is the pyspark pandas dataframe.

Example: 3

In this example, we will convert the above pyspark pandas dataframe to array format.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

 

#convert to numpy array

print(pyspark_pandas.to_numpy())

Output:

[[ 90 100  91]
 [ 56  67  92]
 [ 78  96  98]
 [ 54  89  97]
 [ 67  32  87]]

You can see that values are stored in the form of a 2-D array with five rows and three columns.
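Since to_numpy() returns a regular numpy.ndarray, the usual array attributes and indexing apply. A small sketch with the values shown above:

```python
import numpy as np

# the 2-D array returned above: five rows, three columns
arr = np.array([[90, 100, 91],
                [56, 67, 92],
                [78, 96, 98],
                [54, 89, 97],
                [67, 32, 87]])
print(arr.shape)   # (5, 3)
print(arr[:, 0])   # the mark1 column: [90 56 78 54 67]
```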

pyspark.pandas.DataFrame.to_pandas()

PySpark pandas dataframe is converted to pandas dataframe using the to_pandas() method.

Syntax:

pyspark_pandas.to_pandas()

Where pyspark_pandas is the pyspark pandas dataframe.

Example: 4

In this example, we will convert the above pyspark pandas dataframe to a pandas dataframe.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

 

#convert into pandas

print(pyspark_pandas.to_pandas())

Output:

   mark1  mark2  mark3
0     90    100     91
1     56     67     92
2     78     96     98
3     54     89     97
4     67     32     87

You can see that values are stored in the form of a pandas dataframe with five rows and three columns.
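The object returned by to_pandas() is a plain pandas.DataFrame, so all local pandas operations apply to it. A sketch with the same data in plain pandas:

```python
import pandas as pd

# the same data as a plain pandas DataFrame (what to_pandas() returns)
local_df = pd.DataFrame({'mark1': [90, 56, 78, 54, 67],
                         'mark2': [100, 67, 96, 89, 32],
                         'mark3': [91, 92, 98, 97, 87]})
print(type(local_df).__name__)    # DataFrame
print(local_df['mark1'].mean())   # 69.0
```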

pyspark.pandas.DataFrame.to_markdown()

PySpark pandas dataframe is converted to markdown using the to_markdown() method.

Syntax:

pyspark_pandas.to_markdown()

Where pyspark_pandas is the pyspark pandas dataframe.

Example: 5

In this example, we will convert the above pyspark pandas dataframe to markdown format.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

 

#display in markdown format

print(pyspark_pandas.to_markdown())

Output:

You can see that the pyspark pandas dataframe is converted to markdown format.

pyspark.pandas.DataFrame.to_dict()

PySpark pandas dataframe is converted to a dictionary using the to_dict() method. Column names will be keys.

Syntax:

pyspark_pandas.to_dict()

Where pyspark_pandas is the pyspark pandas dataframe.

Example: 6

In this example, we will convert the above pyspark pandas dataframe to a dictionary using the to_dict() method.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

 

#convert into dictionary

print(pyspark_pandas.to_dict())

Output:

{'mark1': {0: 90, 1: 56, 2: 78, 3: 54, 4: 67}, 'mark2': {0: 100, 1: 67, 2: 96, 3: 89, 4: 32}, 'mark3': {0: 91, 1: 92, 2: 98, 3: 97, 4: 87}}

You can see that the pyspark pandas dataframe is converted to a dictionary with keys as column names.
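to_dict() also accepts an orient parameter, mirroring plain pandas; for example, orient='list' maps each column name to a list of its values instead of an index-to-value dictionary. A sketch with plain pandas (pandas-on-Spark accepts the same parameter):

```python
import pandas as pd

df = pd.DataFrame({'mark1': [90, 56, 78, 54, 67],
                   'mark2': [100, 67, 96, 89, 32],
                   'mark3': [91, 92, 98, 97, 87]})

# default orient: column name -> {index: value}
print(df.to_dict()['mark1'])               # {0: 90, 1: 56, 2: 78, 3: 54, 4: 67}

# orient='list': column name -> list of values
print(df.to_dict(orient='list')['mark1'])  # [90, 56, 78, 54, 67]
```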

pyspark.pandas.DataFrame.to_records()

PySpark pandas dataframe is converted to records using the to_records() method. Here, each row's record is prefixed with its index, starting from 0.

Syntax:

pyspark_pandas.to_records()

Where pyspark_pandas is the pyspark pandas dataframe.

Example: 7

In this example, we will convert the above pyspark pandas dataframe to a record using the to_records() method.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

 

#convert to records

print(pyspark_pandas.to_records())

Output:

[(0, 90, 100, 91) (1, 56, 67, 92) (2, 78, 96, 98) (3, 54, 89, 97)
 (4, 67, 32, 87)]
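The leading index field can be dropped by passing index=False, as in plain pandas. A sketch with plain pandas (pandas-on-Spark's to_records() takes the same parameter):

```python
import pandas as pd

df = pd.DataFrame({'mark1': [90, 56], 'mark2': [100, 67], 'mark3': [91, 92]})

# index=False omits the leading index field from each record
rec = df.to_records(index=False)
print(tuple(rec[0]))  # (90, 100, 91)
```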

pyspark.pandas.DataFrame.to_latex()

PySpark pandas dataframe is converted to latex format using the to_latex() method.

Syntax:

pyspark_pandas.to_latex()

Where pyspark_pandas is the pyspark pandas dataframe.

Example: 8

In this example, we will convert the above pyspark pandas dataframe to latex format.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

 

#convert to latex

print(pyspark_pandas.to_latex())

Output:

We can see that the pyspark pandas dataframe is converted to latex format.

pyspark.pandas.DataFrame.to_spark()

PySpark pandas dataframe is converted to a spark dataframe using the to_spark() method. The resulting spark dataframe can then be displayed in tabular format with the show() method.

Syntax:

pyspark_pandas.to_spark()

Where pyspark_pandas is the pyspark pandas dataframe.

Example: 9

In this example, we will convert the above pyspark pandas dataframe to a spark dataframe.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

 

#convert to spark

pyspark_pandas.to_spark().show()

Output:

We can see that the pyspark pandas dataframe is converted to a spark dataframe.

pyspark.pandas.DataFrame.to_string()

PySpark pandas dataframe is converted to a string using the to_string() method. The string displays the data in a tabular format.

Syntax:

pyspark_pandas.to_string()

Where pyspark_pandas is the pyspark pandas dataframe.

Example: 10

In this example, we will convert the above pyspark pandas dataframe to a string.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

 

#convert to string format

print(pyspark_pandas.to_string())

Output:

   mark1  mark2  mark3
0     90    100     91
1     56     67     92
2     78     96     98
3     54     89     97
4     67     32     87

We can see that the pyspark pandas dataframe is converted to a string with a tabular format.

Conclusion

In this tutorial, we saw the different formats to which a pyspark pandas dataframe can be converted.

to_html() converts the pyspark pandas dataframe into html format. If you want to convert it into a numpy array, choose the to_numpy() method; if you want a pandas dataframe, choose the to_pandas() method.

to_latex() formats the pyspark pandas dataframe as latex, and to_markdown() formats it as markdown. If you want the column names to be keys, prefer to_dict() or to_json().

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology; known programming languages: Python, R, PHP, MySQL; published 500+ articles in the computer science domain.