This article shows several ways to create an empty PySpark DataFrame or RDD, with or without a defined schema (column names and data types). It also covers a common scenario in which an empty DataFrame is needed.
When processing files, an expected file sometimes does not arrive, yet downstream code still needs a DataFrame with the expected schema. Without it, operations and transformations such as unions can fail because they reference columns that are not present in the DataFrame.
It is therefore important to create a DataFrame with the same schema whether or not the file exists, so that column names and data types stay consistent. The methods below achieve this in different ways; choose whichever best fits your code.
To create an empty RDD in PySpark, use the emptyRDD() method of the SparkContext object:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Netflixsub.com').getOrCreate()
# Creates an empty RDD
emptyRDD = spark.sparkContext.emptyRDD()
print(emptyRDD)

Alternatively, you can use the parallelize() method with an empty list to create an empty RDD:
rdd2 = spark.sparkContext.parallelize([])
print(rdd2)

Note that some actions on an empty RDD, such as first(), raise a ValueError ("RDD is empty"), while others, such as count() and collect(), simply return 0 and an empty list.
To create an empty PySpark DataFrame with a schema (i.e., column names and data types), first define the schema using the StructType and StructField classes:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True),
])
You can then pass an empty RDD and the schema to the createDataFrame() method of the SparkSession object:
df = spark.createDataFrame(emptyRDD, schema)
df.printSchema()
This will create an empty DataFrame with the specified schema.

You can also convert an empty RDD to a DataFrame with the toDF() method, passing the schema:
df1 = emptyRDD.toDF(schema)
df1.printSchema()

To create an empty DataFrame with a schema without using an RDD, pass an empty list and the schema to the createDataFrame() method:
df2 = spark.createDataFrame([], schema)
df2.printSchema()

Finally, to create an empty DataFrame without a schema (i.e., with no columns), pass an empty StructType to the createDataFrame() method:
df3 = spark.createDataFrame([], StructType([]))
df3.printSchema()
