Different Ways to Create PySpark DataFrame

July 17, 2022 | by Arround The Web

PySpark is the Python API for Apache Spark. It provides Spark-style distributed data processing through DataFrames. In this article, we discuss several ways to create a PySpark DataFrame.

Method 1: Using Dictionary

A dictionary is a data structure that stores data as key-value pairs.

Each key becomes a column name, and each value becomes the value of that column in a row of the PySpark DataFrame. The dictionaries have to be passed inside a list.

Structure:

[{'key': value}]

We can also provide multiple dictionaries.

Structure:

[{'key': value}, {'key': value}, ..., {'key': value}]
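
As a minimal sketch of how such a list might be assembled, the snippet below builds one dictionary per row from a list of column names and a list of row values. The helper names build_records and to_dataframe and the sample values are illustrative assumptions, not part of the PySpark API; only createDataFrame is a real SparkSession method.

```python
# Hypothetical helper: turn row tuples plus column names into a list of dictionaries.
def build_records(column_names, rows):
    # zip pairs each column name with the value at the same position in the row
    return [dict(zip(column_names, row)) for row in rows]


# Hypothetical helper: hand the records to an already created SparkSession.
def to_dataframe(spark, records):
    return spark.createDataFrame(records)


# Made-up sample data for illustration
column_names = ['rollno', 'name', 'age']
rows = [('001', 'sravan', 23), ('002', 'ojaswi', 16)]

records = build_records(column_names, rows)
print(records)
```

With an active session, calling to_dataframe(spark_app, records) would pass the list to createDataFrame in the same way as the example that follows.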

Example:

Here, we create a PySpark DataFrame with 5 rows and 6 columns from a list of dictionaries. Finally, we display the DataFrame using the show() method.

#import the pyspark module

import pyspark

#import SparkSession for creating a session

from pyspark.sql import SparkSession

#create an app named linuxhint

spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes

students =[{'rollno':'001','name':'sravan','age':23,'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe

df = spark_app.createDataFrame( students)

#display the dataframe

df.show()

Method 2: Using list of tuples

A tuple is an ordered, immutable sequence of values written inside parentheses ().

Each row is written as a tuple of comma-separated values, and the tuples are placed inside a list.

Structure:

[(value1, value2, ..., valuen)]

We can also provide multiple tuples in a list.

Structure:

[(value1, value2, ..., valuen), (value1, value2, ..., valuen), ..., (value1, value2, ..., valuen)]

We provide the column names through a list while creating the DataFrame. If the names are omitted, Spark falls back to default column names such as _1 and _2.

Syntax:

column_names = ['column1', 'column2', ..., 'columnN']

spark_app.createDataFrame( list_of_tuple,column_names)
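
Because the column names are matched to the tuple values by position, every tuple should hold as many values as there are names. The following is a small sketch of a pre-check that could be run before calling createDataFrame. Here check_row_widths is a hypothetical helper, not a PySpark function, and the sample data is made up for illustration.

```python
def check_row_widths(rows, column_names):
    """Return the indexes of rows whose length differs from the number of columns."""
    expected = len(column_names)
    return [i for i, row in enumerate(rows) if len(row) != expected]


column_names = ['rollno', 'name', 'age']
list_of_tuple = [('001', 'sravan', 23),
                 ('002', 'ojaswi'),               # missing the age value
                 ('003', 'gnanesh chowdary', 7)]

bad_rows = check_row_widths(list_of_tuple, column_names)
print(bad_rows)  # [1]
```

If the returned list is empty, the data lines up with the column names and can be passed on as shown in the syntax above.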

Example:

Here, we create a PySpark DataFrame with 5 rows and 6 columns from a list of tuples. Finally, we display the DataFrame using the show() method.

#import the pyspark module

import pyspark

#import SparkSession for creating a session

from pyspark.sql import SparkSession

#create an app named linuxhint

spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes

students =[('001','sravan',23,5.79,67,'guntur'),
('002','ojaswi',16,3.79,34,'hyd'),
('003','gnanesh chowdary',7,2.79,17,'patna'),
('004','rohith',9,3.69,28,'hyd'),
('005','sridevi',37,5.59,54,'hyd')]

#assign the column names

column_names = ['rollno','name','age','height','weight','address']

# create the dataframe

df = spark_app.createDataFrame( students,column_names)

#display the dataframe

df.show()

Method 3: Using tuple of lists

A list is an ordered, mutable sequence of values written inside square brackets [].

Each row is written as a list of comma-separated values, and the lists are placed inside a tuple.

Structure:

([value1, value2, ..., valuen],)

We can also provide multiple lists in a tuple.

Structure:

([value1, value2, ..., valuen], [value1, value2, ..., valuen], ..., [value1, value2, ..., valuen])

We need to provide the column names through a list while creating the DataFrame.

Syntax:

column_names = ['column1', 'column2', ..., 'columnN']

spark_app.createDataFrame( tuple_of_list,column_names)
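
One Python detail is worth checking when the outer container is a tuple: parentheses around a single item do not create a tuple unless a trailing comma is present. The sketch below uses only built-in Python and made-up values to show the difference.

```python
# Parentheses alone only group the expression, so this is still a plain list
one_row_no_comma = (['001', 'sravan', 23])

# The trailing comma makes this a tuple that contains one list (one row)
one_row_with_comma = (['001', 'sravan', 23],)

print(type(one_row_no_comma).__name__)    # list
print(type(one_row_with_comma).__name__)  # tuple
print(len(one_row_with_comma))            # 1
```

With two or more rows, the commas between the lists already make the outer object a tuple, as in the example that follows.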

Example:

Here, we create a PySpark DataFrame with 5 rows and 6 columns from a tuple of lists. Finally, we display the DataFrame using the show() method.

#import the pyspark module

import pyspark

#import SparkSession for creating a session

from pyspark.sql import SparkSession

#create an app named linuxhint

spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes

students =(['001','sravan',23,5.79,67,'guntur'],
['002','ojaswi',16,3.79,34,'hyd'],
['003','gnanesh chowdary',7,2.79,17,'patna'],
['004','rohith',9,3.69,28,'hyd'],
['005','sridevi',37,5.59,54,'hyd'])

#assign the column names

column_names = ['rollno','name','age','height','weight','address']

# create the dataframe

df = spark_app.createDataFrame( students,column_names)

#display the dataframe

df.show()

Method 4: Using nested list

A list is a data structure that stores values inside square brackets [].

So, each row is written as a list of comma-separated values, and these row lists are placed inside an outer list.

Structure:

[[value1, value2, ..., valuen]]

We can also provide multiple lists in a list.

Structure:

[[value1, value2, ..., valuen], [value1, value2, ..., valuen], ..., [value1, value2, ..., valuen]]

We need to provide the column names through a list while creating the DataFrame.

Syntax:

column_names = ['column1', 'column2', ..., 'columnN']

spark_app.createDataFrame( nested_list,column_names)
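
When no schema is given, Spark infers each column's type from the values, so mixing Python types in one position (for example an int in a column of floats) can cause the inference to fail or give unexpected results, depending on the Spark version. As a rough sketch, the pure-Python check below lists the positions that hold more than one type. Here column_types is a hypothetical helper and the sample rows are made up.

```python
def column_types(rows):
    """Map each column position to the set of Python type names found there."""
    types = {}
    for row in rows:
        for position, value in enumerate(row):
            types.setdefault(position, set()).add(type(value).__name__)
    return types


nested_list = [['001', 'sravan', 23, 5.79],
               ['002', 'ojaswi', 16, 4]]   # 4 is an int in a column of floats

# Keep only the positions where more than one type was seen
mixed = {pos: sorted(names)
         for pos, names in column_types(nested_list).items()
         if len(names) > 1}
print(mixed)  # {3: ['float', 'int']}
```

Writing the value as 4.0 instead of 4 would make the fourth column consistent.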

Example:

Here, we create a PySpark DataFrame with 5 rows and 6 columns from a nested list. Finally, we display the DataFrame using the show() method.

#import the pyspark module

import pyspark

#import SparkSession for creating a session

from pyspark.sql import SparkSession

#create an app named linuxhint

spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes

students =[['001','sravan',23,5.79,67,'guntur'],
['002','ojaswi',16,3.79,34,'hyd'],
['003','gnanesh chowdary',7,2.79,17,'patna'],
['004','rohith',9,3.69,28,'hyd'],
['005','sridevi',37,5.59,54,'hyd']]

#assign the column names

column_names = ['rollno','name','age','height','weight','address']

# create the dataframe

df = spark_app.createDataFrame( students,column_names)

#display the dataframe

df.show()

Method 5: Using nested tuple

Structure:

((value1, value2, ..., valuen),)

We can also provide multiple tuples in a tuple.

Structure:

((value1, value2, ..., valuen), (value1, value2, ..., valuen), ..., (value1, value2, ..., valuen))

We need to provide the column names through a list while creating the DataFrame.

Syntax:

column_names = ['column1', 'column2', ..., 'columnN']

spark_app.createDataFrame( nested_tuple,column_names)
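
The four positional layouts (list of tuples, tuple of lists, nested list, and nested tuple) differ only in their container types. As a rough illustration in pure Python, each layout can be reduced to the same list of row tuples. Here as_row_tuples is a hypothetical helper and the sample data is shortened.

```python
def as_row_tuples(data):
    """Convert any list/tuple of list/tuple rows into a list of tuples."""
    return [tuple(row) for row in data]


list_of_tuples = [('001', 'sravan'), ('002', 'ojaswi')]
tuple_of_lists = (['001', 'sravan'], ['002', 'ojaswi'])
nested_list = [['001', 'sravan'], ['002', 'ojaswi']]
nested_tuple = (('001', 'sravan'), ('002', 'ojaswi'))

layouts = [list_of_tuples, tuple_of_lists, nested_list, nested_tuple]

# Every layout carries the same rows once the containers are normalized
same = all(as_row_tuples(layout) == list_of_tuples for layout in layouts)
print(same)  # True
```

This is why the same column_names list can be used with each of these layouts.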

Example:

Here, we create a PySpark DataFrame with 5 rows and 6 columns from a nested tuple. Finally, we display the DataFrame using the show() method.

#import the pyspark module

import pyspark

#import SparkSession for creating a session

from pyspark.sql import SparkSession

#create an app named linuxhint

spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes

students =(('001','sravan',23,5.79,67,'guntur'),
('002','ojaswi',16,3.79,34,'hyd'),
('003','gnanesh chowdary',7,2.79,17,'patna'),
('004','rohith',9,3.69,28,'hyd'),
('005','sridevi',37,5.59,54,'hyd'))

#assign the column names

column_names = ['rollno','name','age','height','weight','address']

# create the dataframe

df = spark_app.createDataFrame( students,column_names)

#display the dataframe

df.show()

Conclusion

In this tutorial, we discussed five methods to create a PySpark DataFrame: a list of dictionaries, a list of tuples, a tuple of lists, a nested list, and a nested tuple. For the last four methods, the column names are provided through a separate list. There is no need to provide a column names list when creating a PySpark DataFrame from dictionaries, because the keys act as the column names.

Source: linuxhint.com