
PySpark array_position() & array_repeat()

PySpark is the Python module for Apache Spark; it provides Spark-like processing of structured data through the DataFrame API.

In this article, we will discuss the array_position() and array_repeat() methods, which are used to operate on array columns.

Before we get to those, we need to discuss the StructType() and StructField() methods, which are used to define the columns of a PySpark DataFrame. Using these methods, we can define the column names and the data types of particular columns. We also need the ArrayType() method, which is used to define an array column.

Let’s discuss them one by one.

StructType()

This method is used to define the structure of the PySpark DataFrame. It accepts a list of fields that specify the column names and data types for the given DataFrame. This is known as the schema of the DataFrame; it stores a collection of fields.

StructField()

This method is used inside the StructType() method of the PySpark DataFrame. It accepts a column name along with its data type.

ArrayType()

This method is used to define the array structure of a PySpark DataFrame column. It accepts the data type of the array’s elements; for example, ArrayType(StringType()) declares an array of strings.

In this article, we will create a DataFrame that contains an array. Let’s create a DataFrame with 2 columns. The first column, Student_category, is an integer field that stores student IDs, and the second column, Student_full_name, stores string values in an array created using ArrayType().

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import struct types and other data types
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,ArrayType

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

#consider data with 5 rows; each row holds an id and an array of strings
my_array_data = [(1, ['A']), (2, ['B','L','B']), (3, ['K','A','K']), (4, ['K']), (3, ['B','P'])]

#define the StructType and StructFields
#for the above data
schema = StructType([StructField("Student_category", IntegerType()), StructField("Student_full_name", ArrayType(StringType()))])

#create the dataframe and apply the schema to it
df = spark_app.createDataFrame(my_array_data, schema=schema)

#display the dataframe
df.show()

Output:
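Given the data above, df.show() should print a table along these lines:

+----------------+-----------------+
|Student_category|Student_full_name|
+----------------+-----------------+
|               1|              [A]|
|               2|        [B, L, B]|
|               3|        [K, A, K]|
|               4|              [K]|
|               3|           [B, P]|
+----------------+-----------------+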

array_position()

array_position() returns the position of a given value within the array in each row of an array-type column. It takes two parameters: the first parameter is the array column, and the second parameter is the value to look for. Positions start from 1.

If the value occurs multiple times in the same array, it returns the position of the first occurrence.

Syntax

array_position(array_column,"value")

Parameters

  1. array_column is the array column that holds the arrays of values
  2. value is the value whose position is searched for in each array

The array_position() function is used with the select() method to perform the operation.

Return
If the value is present in an array, it will return the position; otherwise, it will return 0.

Example 1
In this example, we will get the position of the value “K” in the Student_full_name column of the DataFrame created above.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import struct types and other data types
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,ArrayType
#import the array_position function
from pyspark.sql.functions import array_position

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

#consider data with 5 rows; each row holds an id and an array of strings
my_array_data = [(1, ['A']), (2, ['B','L','B']), (3, ['K','A','K']), (4, ['K']), (3, ['B','P'])]

#define the StructType and StructFields
#for the above data
schema = StructType([StructField("Student_category", IntegerType()), StructField("Student_full_name", ArrayType(StringType()))])

#create the dataframe and apply the schema to it
df = spark_app.createDataFrame(my_array_data, schema=schema)

#return the position of the value 'K' in each row of the Student_full_name column
df.select("Student_full_name", array_position("Student_full_name", 'K')).show()

Output
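Given the data above, the output should look along these lines:

+-----------------+------------------------------------+
|Student_full_name|array_position(Student_full_name, K)|
+-----------------+------------------------------------+
|              [A]|                                   0|
|        [B, L, B]|                                   0|
|        [K, A, K]|                                   1|
|              [K]|                                   1|
|           [B, P]|                                   0|
+-----------------+------------------------------------+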

In the second column, you can see the position of the value within the array for each row:

  1. In the first row, K doesn’t exist – hence it returned 0
  2. In the second row, K doesn’t exist – hence it returned 0
  3. In the third row, K exists at two positions, the 1st and the 3rd – it takes only the first occurrence, so it returned 1
  4. In the fourth row, K exists at the first position – hence it returned 1
  5. In the fifth row, K doesn’t exist – hence it returned 0
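As a side note, since array_position() returns 0 when the value is absent, the result can also be used with the filter() method to keep only the rows whose array contains the value. A minimal sketch, reusing the df created above:

#keep only the rows whose Student_full_name array contains 'K'
#(a position greater than 0 means the value was found)
df.filter(array_position("Student_full_name", 'K') > 0).show()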

Example 2
In this example, we will get the position of the value “A” in the Student_full_name column of the DataFrame created above.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import struct types and other data types
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,ArrayType
#import the array_position function
from pyspark.sql.functions import array_position

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

#consider data with 5 rows; each row holds an id and an array of strings
my_array_data = [(1, ['A']), (2, ['B','L','B']), (3, ['K','A','K']), (4, ['K']), (3, ['B','P'])]

#define the StructType and StructFields
#for the above data
schema = StructType([StructField("Student_category", IntegerType()), StructField("Student_full_name", ArrayType(StringType()))])

#create the dataframe and apply the schema to it
df = spark_app.createDataFrame(my_array_data, schema=schema)

#return the position of the value 'A' in each row of the Student_full_name column
df.select("Student_full_name", array_position("Student_full_name", 'A')).show()

Output
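Given the data above, the output should look along these lines:

+-----------------+------------------------------------+
|Student_full_name|array_position(Student_full_name, A)|
+-----------------+------------------------------------+
|              [A]|                                   1|
|        [B, L, B]|                                   0|
|        [K, A, K]|                                   2|
|              [K]|                                   0|
|           [B, P]|                                   0|
+-----------------+------------------------------------+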

In the second column, you can see the position of the value within the array for each row:

  1. In the first row, A exists at the first position – hence it returned 1
  2. In the second row, A doesn’t exist – hence it returned 0
  3. In the third row, A exists at the second position – hence it returned 2
  4. In the fourth row, A doesn’t exist – hence it returned 0
  5. In the fifth row, A doesn’t exist – hence it returned 0
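array_position() also works with the withColumn() method if you want to append the position as a new column instead of selecting it. A minimal sketch, reusing the df created above; the column name A_position is just an illustrative choice:

#append the position of 'A' as a new column named A_position
df.withColumn("A_position", array_position("Student_full_name", 'A')).show()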

array_repeat()

array_repeat() is used to repeat an array n times. In other words, it duplicates the array n times across all rows of the array-type column. It takes two parameters: the first parameter is the array-type column name, and the second parameter is repeat, an integer value specifying the number of times to repeat the array.

Syntax

array_repeat(array_column,repeat)

Parameters

  1. array_column is the array column that holds the arrays of values
  2. repeat is an integer value that specifies how many times to repeat the array in each row

The array_repeat() function is used with the select() method to perform the operation.

Return
It returns the repeated arrays wrapped in a nested array.

Example 1
In this example, we will repeat the array 2 times in all rows of the Student_full_name column using array_repeat() and display the result using the collect() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import struct types and other data types
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,ArrayType
#import the array_repeat function
from pyspark.sql.functions import array_repeat

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

#consider data with 5 rows; each row holds an id and an array of strings
my_array_data = [(1, ['A']), (2, ['B','L','B']), (3, ['K','A','K']), (4, ['K']), (3, ['B','P'])]

#define the StructType and StructFields
#for the above data
schema = StructType([StructField("Student_category", IntegerType()), StructField("Student_full_name", ArrayType(StringType()))])

#create the dataframe and apply the schema to it
df = spark_app.createDataFrame(my_array_data, schema=schema)

#repeat the array 2 times
df.select("Student_full_name", array_repeat("Student_full_name", 2)).collect()

Output

[Row(Student_full_name=['A'], array_repeat(Student_full_name, 2)=[['A'], ['A']]),
 Row(Student_full_name=['B', 'L', 'B'], array_repeat(Student_full_name, 2)=[['B', 'L', 'B'], ['B', 'L', 'B']]),
 Row(Student_full_name=['K', 'A', 'K'], array_repeat(Student_full_name, 2)=[['K', 'A', 'K'], ['K', 'A', 'K']]),
 Row(Student_full_name=['K'], array_repeat(Student_full_name, 2)=[['K'], ['K']]),
 Row(Student_full_name=['B', 'P'], array_repeat(Student_full_name, 2)=[['B', 'P'], ['B', 'P']])]

We can see that the array is repeated 2 times in each row of the Student_full_name column, nested inside an outer array.
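If a flat array is preferred over the nested result, the output of array_repeat() can be passed to the flatten() function (available since Spark 2.4), which merges an array of arrays into a single array. A minimal sketch, reusing the df created above:

#import the flatten function
from pyspark.sql.functions import flatten

#repeat the array 2 times and flatten the nested result into a single array
df.select(flatten(array_repeat("Student_full_name", 2))).collect()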

Example 2
In this example, we will repeat the array 4 times in all rows of the Student_full_name column using array_repeat() and display the result using the collect() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import struct types and other data types
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,ArrayType
#import the array_repeat function
from pyspark.sql.functions import array_repeat

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

#consider data with 5 rows; each row holds an id and an array of strings
my_array_data = [(1, ['A']), (2, ['B','L','B']), (3, ['K','A','K']), (4, ['K']), (3, ['B','P'])]

#define the StructType and StructFields
#for the above data
schema = StructType([StructField("Student_category", IntegerType()), StructField("Student_full_name", ArrayType(StringType()))])

#create the dataframe and apply the schema to it
df = spark_app.createDataFrame(my_array_data, schema=schema)

#repeat the array 4 times
df.select("Student_full_name", array_repeat("Student_full_name", 4)).collect()

Output

[Row(Student_full_name=['A'], array_repeat(Student_full_name, 4)=[['A'], ['A'], ['A'], ['A']]),
 Row(Student_full_name=['B', 'L', 'B'], array_repeat(Student_full_name, 4)=[['B', 'L', 'B'], ['B', 'L', 'B'], ['B', 'L', 'B'], ['B', 'L', 'B']]),
 Row(Student_full_name=['K', 'A', 'K'], array_repeat(Student_full_name, 4)=[['K', 'A', 'K'], ['K', 'A', 'K'], ['K', 'A', 'K'], ['K', 'A', 'K']]),
 Row(Student_full_name=['K'], array_repeat(Student_full_name, 4)=[['K'], ['K'], ['K'], ['K']]),
 Row(Student_full_name=['B', 'P'], array_repeat(Student_full_name, 4)=[['B', 'P'], ['B', 'P'], ['B', 'P'], ['B', 'P']])]

We can see that the array is repeated 4 times in each row of the Student_full_name column, nested inside an outer array.
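Note that array_repeat() is not limited to array columns: applied to a scalar column, it builds a flat array by repeating that scalar value, without the extra level of nesting. A minimal sketch, reusing the df created above:

#repeat the scalar Student_category value 3 times, producing a flat array per row
#(e.g. 1 becomes [1, 1, 1])
df.select("Student_category", array_repeat("Student_category", 3)).collect()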

Conclusion

In this PySpark article, we saw two different array functions. array_position() is used to return the position of the specified value in an array. We noticed that it returns the position of the first occurrence if the value appears multiple times in an array. Next, we discussed the array_repeat() method, which duplicates the array n times across all rows. The repeated arrays are stored in a nested array. Both functions are used with the select() method to perform their operations.
