How to Replace null values in PySpark dataframe column

Why replace null values in PySpark DataFrame?

replace null values with 0 pyspark example

Large datasets can have thousands of rows and hundreds of columns, and some of those columns may contain null (None in Python) in one or more cells. A null in PySpark simply means that a row has no value for that column, whatever the column datatype (String, Integer, and so on).

Handle null values in pyspark dataframe

Working with null or None values can be tedious, but they need to be handled with care. If left unhandled, they can produce erroneous results, which is why the data should be cleaned first.

Remove null values from dataframe pyspark

The blunt way of dealing with records that contain null is to remove them. How can we delete rows with null values in a PySpark dataframe? df.dropna() does this by removing every row that contains a null. It may look like the easiest approach to handling null/None, but the drawback is that you can lose records that hold valuable information in their other columns.

df.dropna()
The whole row is removed where TaskID is null (PySpark example)

How to replace null values in pyspark dataframe column?

Replace null with 0 in pyspark column

In most cases, the safer approach is to replace the null values, most commonly with 0, and occasionally with the mean value (if the column is numeric) or a fixed string value. You can, however, replace None/null with nearly any value you like.

PySpark lets you do all of the above. df.fillna() and df.na.fill() are two PySpark functions that replace null values.

df.fillna()
df.na.fill()
Replacing null TaskID with a fixed value in PySpark

When using df.fillna() or df.na.fill(), you need to take the schema into consideration. The datatype of the replacement value must match that of the target column. If it does not, the column is skipped and its null values are left unchanged. So, putting 0 where there is null within a string column won’t make any change.

df.fillna() and df.na.fill() can be used interchangeably. They are aliases of each other, so they take the same arguments and behave the same way.

Replacing None/null can also be accomplished with a PySpark udf built from a one-line Lambda function. That’s what you’re going to explore next.

pyspark withcolumn udf lambda example to replace null values

A PySpark udf built from a Lambda function can also be used to replace null values. Because the Lambda runs ordinary Python on each value, it can replace null irrespective of the column datatype. Keep in mind that F.udf() returns a string column unless you pass a return type, and that a Python udf is generally slower than built-in functions such as fillna().

Replace/Remove null values from dataframe PySpark Example Code

By now you are familiar with the different ways of handling null values. To reinforce that, it helps to work through some hands-on examples. Practicing with the PySpark code below shows how things actually work. So let’s dive into some PySpark programs.

Code for pyspark dataframe used in this Example

#creating pyspark dataframe from python list containing some null 
employee=[('Tim','Parker','Data Analyst','tid.0678308'),
          ('Stephen','Brown','Data Analyst',None),
          ('Steve','Jobs','Data Engineer','tid.5647382'),
          ('Jack','Downey','Platform Engineer','tid.0025637'),
          ('Adam','Jones','Data Scientist', None),
          ('Gwen','Willams','Data Engineer','tid.9875523')]

#defining column names
columns = ["FirstName","LastName","Title","TaskID"]

#importing SparkSession
from pyspark.sql import SparkSession

#creating a new appName 'Company' for our example
spark = SparkSession.builder.appName('Company').getOrCreate()

#assigning employee list to data and columns list for schema of our dataframe
df = spark.createDataFrame(data=employee, schema = columns)

#displaying dataframe
df.show()
null values present in TaskID column (PySpark dataframe example)

Code for replacing null values with ‘N/A’ in ‘TaskID’ column
#replacing null values from String type column 'TaskID' with 'N/A'
df = df.fillna(value="N/A", subset=['TaskID'])
df.show()
#alternate syntax for replacing null
df = df.na.fill(value="N/A", subset=['TaskID'])
df.show()
Replace null values using pyspark udf and Lambda function

The following code can replace null values with any value; in this example we put 0 there. You can equally use a string or a float value as the replacement.

#importing functions as F
from pyspark.sql import functions as F

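#run this on the original df (before the 'N/A' fill), otherwise no null is left to replace
#this code will replace null with 0 in 'TaskID' column
#F.udf defaults to a string return type, so the 0 is stored as the string '0'
df = df.withColumn('TaskID', F.udf(lambda x: 0 if x is None else x)(F.col('TaskID')))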
df.show()
Replacing null values with zero using a Lambda function in PySpark

Replace null values with 0 in pyspark dataframe integer column

The dataset and example below are separate from the one above. Here the column we operate on contains only int values.

#creating a pyspark dataframe with a column holding integer value
nums=[('Tim',8),
      ('Stephen',1),
      ('Steve',None),
      ('Jack',7),
      ('Adam',3),
      ('Gwen',None)]

columns = ["FirstName","Points"]

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('NameNum').getOrCreate()
df = spark.createDataFrame(data=nums, schema = columns)

#replacing nulls with 0s
df = df.fillna(value=0, subset=['Points'])

#displaying the dataframe
df.show()

In a nutshell, you learned how to handle null/None values present in a dataframe by different means. You can also check out Apache Spark Question to challenge your Spark knowledge. Keep learning.

Code editor-Google Colab
PySpark-3.2.1
py4j-0.10.9.3
