PySpark – Pandas DataFrame: add_prefix() and add_suffix()

PySpark is the Python API for Apache Spark. It processes data held in DataFrames, which store the given data in row and column format. The PySpark pandas DataFrame behaves like a pandas DataFrame, but it holds a PySpark DataFrame internally. This pandas module is imported from the pyspark module.

In this tutorial, we will show the pandas DataFrame add_prefix() and add_suffix() methods, which add a prefix or suffix to all column names of a DataFrame or to the row labels of a single column.

Syntax to import:

Here is the syntax to import pandas from pyspark:

from pyspark import pandas

After that, we can create or use the dataframe from the pandas module.

Syntax to create pandas DataFrame:

pyspark.pandas.DataFrame()

We can pass a dictionary or list of lists with values.
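The two input shapes describe the same table from different directions: a dictionary maps each column name to its values, while a list of lists holds one inner list per row. The plain-Python sketch below converts one shape into the other and needs no Spark session. The names rows, columns and as_dict are illustrative only, and the constructor calls in the closing comment assume the usual pandas-style columns= keyword.

```python
# Row-oriented data: one inner list per row (three rows shown for brevity).
rows = [
    ['manasa', 90, 100, 91],
    ['trisha', 56, 67, 92],
    ['lehara', 78, 96, 98],
]
columns = ['student_lastname', 'mark1', 'mark2', 'mark3']

# Transpose the rows and pair each resulting column with its name.
as_dict = {name: list(values) for name, values in zip(columns, zip(*rows))}

print(as_dict['mark1'])

# Either shape could then be handed to the constructor, for example:
#   pandas.DataFrame(as_dict)
#   pandas.DataFrame(rows, columns=columns)
```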

Let’s create a pandas DataFrame through pyspark with four columns and five rows.

# import pandas from the pyspark module
from pyspark import pandas

# create a pyspark pandas dataframe from a dictionary of columns
pyspark_pandas = pandas.DataFrame({
    'student_lastname': ['manasa', 'trisha', 'lehara', 'kapila', 'hyna'],
    'mark1': [90, 56, 78, 54, 67],
    'mark2': [100, 67, 96, 89, 32],
    'mark3': [91, 92, 98, 97, 87],
})

print(pyspark_pandas)

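The printed DataFrame contains five rows and the four columns student_lastname, mark1, mark2 and mark3.

Now, let's move on to the two methods.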

The add_prefix() and add_suffix() methods add a prefix or suffix to all column names of a DataFrame, or to the row labels of a single column. Let’s discuss them one by one.

pyspark.pandas.DataFrame.add_prefix()

add_prefix() adds a prefix string to the beginning of every column name in the pyspark pandas dataframe. It can also be called on a single column selected by name. In this scenario, the prefix is added to the row labels of that column, not to the column name.

Syntax:

For entire dataframe  –

pyspark_pandas.add_prefix(string)

For particular column –

pyspark_pandas.column.add_prefix(string)

Where, pyspark_pandas is the pyspark pandas dataframe.

Parameter:

string is the prefix added to the beginning of each label.

Example 1

In this example, we add the prefix “Linux_Hint” to all the columns of the pyspark pandas dataframe created above.

# import pandas from the pyspark module
from pyspark import pandas

# create a pyspark pandas dataframe from a dictionary of columns
pyspark_pandas = pandas.DataFrame({
    'student_lastname': ['manasa', 'trisha', 'lehara', 'kapila', 'hyna'],
    'mark1': [90, 56, 78, 54, 67],
    'mark2': [100, 67, 96, 89, 32],
    'mark3': [91, 92, 98, 97, 87],
})

# add the prefix 'Linux_Hint' to every column name of the dataframe
print(pyspark_pandas.add_prefix('Linux_Hint'))

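In the output, the prefix is added to every column name, so the columns become Linux_Hintstudent_lastname, Linux_Hintmark1, Linux_Hintmark2 and Linux_Hintmark3. The row labels and the values are unchanged.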
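The renaming rule for column names can be modelled in plain Python, which may help when checking what to expect without starting Spark. This is only a sketch of the labelling behaviour, not the library's implementation, and with_prefix is a hypothetical helper name.

```python
def with_prefix(labels, prefix):
    """Model of add_prefix(): prepend `prefix` to every label."""
    return [prefix + str(label) for label in labels]

columns = ['student_lastname', 'mark1', 'mark2', 'mark3']
new_columns = with_prefix(columns, 'Linux_Hint')

# Show each old column name next to its prefixed form.
for old, new in zip(columns, new_columns):
    print(old, '->', new)
```

Note that nothing is inserted between the prefix and the original name, so a separator such as an underscore has to be part of the prefix string itself.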

Example 2

Add the prefix to the row labels of the mark1 column.

# import pandas from the pyspark module
from pyspark import pandas

# create a pyspark pandas dataframe from a dictionary of columns
pyspark_pandas = pandas.DataFrame({
    'student_lastname': ['manasa', 'trisha', 'lehara', 'kapila', 'hyna'],
    'mark1': [90, 56, 78, 54, 67],
    'mark2': [100, 67, 96, 89, 32],
    'mark3': [91, 92, 98, 97, 87],
})

# add the prefix 'Linux_Hint' to the row labels of the mark1 column
print(pyspark_pandas.mark1.add_prefix('Linux_Hint'))

Output:

Linux_Hint0    90
Linux_Hint1    56
Linux_Hint2    78
Linux_Hint3    54
Linux_Hint4    67
Name: mark1, dtype: int64

We can see that the prefix is added to every row label of the mark1 column, while the values themselves are unchanged.
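One consequence of the output above is that the row labels are no longer the integers 0 to 4 but strings such as Linux_Hint1. The sketch below models the mark1 column as plain (label, value) pairs to make that visible; it is an illustration of the behaviour under that simplification, not how the library stores a column.

```python
# Model the mark1 column as (row label, value) pairs with the default 0..4 labels.
mark1 = list(zip(range(5), [90, 56, 78, 54, 67]))

# Prefixing rewrites the labels and leaves the values alone.
prefixed = [('Linux_Hint' + str(label), value) for label, value in mark1]

lookup = dict(prefixed)
print(lookup['Linux_Hint1'])          # value stored under the second row label
print(type(prefixed[0][0]).__name__)  # the labels are now strings
```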

pyspark.pandas.DataFrame.add_suffix()

add_suffix() adds a suffix string to the end of every column name in the pyspark pandas dataframe. It can also be called on a single column selected by name. In this scenario, the suffix is added to the row labels of that column, not to the column name.

Syntax:

For entire dataframe  –

pyspark_pandas.add_suffix(string)

For particular column –

pyspark_pandas.column.add_suffix(string)

Where, pyspark_pandas is the pyspark pandas dataframe.

Parameter:

string is the suffix added to the end of each label.

Example 1

In this example, we add the suffix “Linux_Hint” to all the columns of the pyspark pandas dataframe created above.

# import pandas from the pyspark module
from pyspark import pandas

# create a pyspark pandas dataframe from a dictionary of columns
pyspark_pandas = pandas.DataFrame({
    'student_lastname': ['manasa', 'trisha', 'lehara', 'kapila', 'hyna'],
    'mark1': [90, 56, 78, 54, 67],
    'mark2': [100, 67, 96, 89, 32],
    'mark3': [91, 92, 98, 97, 87],
})

# add the suffix 'Linux_Hint' to every column name of the dataframe
print(pyspark_pandas.add_suffix('Linux_Hint'))

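In the output, the suffix is added to every column name, so the columns become student_lastnameLinux_Hint, mark1Linux_Hint, mark2Linux_Hint and mark3Linux_Hint. The row labels and the values are unchanged.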
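The suffix is appended exactly as given, with no separator, which is why the original name and the suffix run together. The plain-Python sketch below models that rule and compares it with a suffix that carries its own underscore; with_suffix is a hypothetical helper, not a library function.

```python
def with_suffix(labels, suffix):
    """Model of add_suffix(): append `suffix` to every label as given."""
    return [str(label) + suffix for label in labels]

columns = ['student_lastname', 'mark1', 'mark2', 'mark3']

plain = with_suffix(columns, 'Linux_Hint')       # the string used in Example 1
separated = with_suffix(columns, '_Linux_Hint')  # separator supplied by the caller

print(plain[1])
print(separated[1])
```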

Example 2

Add the suffix to the row labels of the mark1 column.

# import pandas from the pyspark module
from pyspark import pandas

# create a pyspark pandas dataframe from a dictionary of columns
pyspark_pandas = pandas.DataFrame({
    'student_lastname': ['manasa', 'trisha', 'lehara', 'kapila', 'hyna'],
    'mark1': [90, 56, 78, 54, 67],
    'mark2': [100, 67, 96, 89, 32],
    'mark3': [91, 92, 98, 97, 87],
})

# add the suffix 'Linux_Hint' to the row labels of the mark1 column
print(pyspark_pandas.mark1.add_suffix('Linux_Hint'))

Output:

0Linux_Hint    90
1Linux_Hint    56
2Linux_Hint    78
3Linux_Hint    54
4Linux_Hint    67
Name: mark1, dtype: int64

We can see that the suffix is added to every row label of the mark1 column, while the values themselves are unchanged.
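Calling either method on a single column changes row labels rather than the column header. If the aim is to rename just one column header, one possible approach, which this tutorial does not cover, is to build an old-to-new name mapping and pass it to the DataFrame's rename(columns=...) method. The sketch below only builds the mapping in plain Python; targets is a hypothetical choice of columns.

```python
columns = ['student_lastname', 'mark1', 'mark2', 'mark3']
targets = {'mark1'}      # hypothetical selection of columns to rename
prefix = 'Linux_Hint'

# Map only the selected column names to their prefixed form.
mapping = {name: prefix + name for name in columns if name in targets}

# Preview the resulting header: unselected names pass through unchanged.
renamed = [mapping.get(name, name) for name in columns]

print(mapping)
print(renamed)

# With a DataFrame, the mapping would be applied as:
#   pyspark_pandas.rename(columns=mapping)
```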

Conclusion

In this pyspark pandas tutorial, we saw how to add a prefix using add_prefix() and a suffix using add_suffix() to the pyspark pandas dataframe. When a method is called on the entire dataframe, the string is added to the column names. When it is called on a particular column, the prefix or suffix is added to the row labels of that column.

About the author

Gottumukkala Sravan Kumar

B tech-hon's in Information Technology; Known programming languages - Python, R , PHP MySQL; Published 500+ articles on computer science domain