How to replace null values with mean in pyspark dataframe

Most of the time, while cleaning data, you have to make choices based on how usable the data is. If the dataset you are working on contains null values, you need to decide on the most practical way of handling them. You can either replace the null values with some alternative or remove those rows entirely. Usually the first option is preferred, because it reduces the risk of data loss. Let's take a look at our pyspark example dataset.

#creating a pyspark dataframe with a column holding integer value
nums=[('Tim',8),
          ('Stephen',1),
          ('Steve',None),
          ('Jack',7),
          ('Adam',3),
          ('Gwen',None)]

columns = ["FirstName","Points"]

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('NameNum').getOrCreate()
df = spark.createDataFrame(data=nums, schema = columns)

df.show()
pyspark example dataframe output

You need to fill nulls with something meaningful. If the column holds int, double, or other numeric values, the easiest option is usually to put 0 in place of null. However, you can also replace null with the mean of the column in a pyspark dataframe.

Pyspark provides multiple ways to replace null with the average value of a column. Let's first explore the ways to calculate the mean. When the column contains no null values, taking the average causes no trouble at all.
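
The two straightforward options, dropping the null rows or filling them with 0, take one line each. Here is a minimal sketch, reusing the df created earlier (df_dropped and df_zero are just illustrative names):

#option 1: drop the rows whose Points is null (loses two rows)
df_dropped = df.na.drop(subset=['Points'])

#option 2: replace null Points with 0 (keeps all rows)
df_zero = df.fillna(value=0, subset=['Points'])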

Replace null values in PySpark dataframe column

How to calculate average in pyspark dataframe

The average is simply the sum divided by the number of records, i.e. pyspark dataframe rows. However, if there are null values in the column, the mean(col('columnName')) function skips the null records, so the row count used in the calculation decreases and the resulting mean may not be the one you intended. agg({'columnName': 'mean'}) and agg({'columnName': 'avg'}) behave the same way. You can replace null with the mean taken this way. To get a clearer picture, let's dive into the code.

Calculate pyspark mean by ignoring null values

#creating a pyspark dataframe with a column holding integer value
nums=[('Tim',8),
          ('Stephen',1),
          ('Steve',None),
          ('Jack',7),
          ('Adam',3),
          ('Gwen',None)]

columns = ["FirstName","Points"]

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('NameNum').getOrCreate()
df = spark.createDataFrame(data=nums, schema = columns)

df_mean = df.agg({'Points': 'avg'}).collect()
df_mean[0][0]
Output: 4.75 — the mean of the Points column calculated with the agg function, skipping the two null rows.

Furthermore, you can use 'mean' in place of 'avg'; the code will look something like this.

#In aggregate function 'mean' or 'avg' can be used alternatively.
df_mean = df.agg({'Points': 'mean'}).collect()
df_mean[0][0]

Alternatively, this can be done by importing mean from pyspark.sql.functions and using the mean() function. It produces exactly the same result for the above pyspark dataframe.

#importing the mean and col functions
from pyspark.sql.functions import mean, col

df_mean = df.select(mean(col('Points')).alias('avg')).collect()

avg = df_mean[0]['avg']
avg
Output: 4.75 — the mean of the Points column using the mean function.

As you can see, taking the average itself is not a big deal. But it is not calculated the way we want here: rows holding null are simply skipped. You can overcome this by replacing null with 0 in the pyspark dataframe beforehand. Once the nulls are replaced, no dataframe row is skipped, so the mean is calculated over the whole dataset and not only over the records (rows) with non-null values. This can be achieved by calling df.fillna() before calculating the mean.

Calculate pyspark mean without ignoring nulls

#importing the mean and col functions
from pyspark.sql.functions import mean, col

#replacing null with 0 to calculate the exact mean value
df = df.fillna(value=0, subset=['Points'])
df_mean = df.select(mean(col('Points')).alias('avg')).collect()

avg = df_mean[0]['avg']
avg
Output: 3.1666666666… — the mean of the Points column with the nulls counted as 0.

Ultimately, you can replace the null values in your pyspark dataframe with this mean value, all at once. Afterwards there are no null values left, and the mean value sits in place of each null. Let's walk through the pyspark code for that.

pyspark replace null with mean

#creating a pyspark dataframe with a column holding integer value
nums=[('Tim',8),
          ('Stephen',1),
          ('Steve',None),
          ('Jack',7),
          ('Adam',3),
          ('Gwen',None)]

columns = ["FirstName","Points"]

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('NameNum').getOrCreate()
df = spark.createDataFrame(data=nums, schema = columns)

#importing the mean and col functions
from pyspark.sql.functions import mean, col

#replacing null with 0 to calculate the exact mean value
df_mean = df.fillna(value=0, subset=['Points']).select(mean(col('Points')).alias('avg')).collect()
avg = df_mean[0]['avg']

df = df.fillna(value=avg, subset=['Points'])
df.show()

Noticeably, avg is a float, whereas the Points column holds values of integer type. So the mean value (3.1666666666…) gets cast to 3, and all null values are replaced by 3 rather than by the exact mean.
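
If you want to keep the fractional mean rather than the truncated integer, one option is to cast the column to double before filling. This is only a sketch, assuming you start again from the original data and that avg still holds the mean computed above (df_exact is an illustrative name):

#recreate the original dataframe and cast Points to double so fillna keeps the fraction
from pyspark.sql.functions import col
df_exact = spark.createDataFrame(data=nums, schema=columns)
df_exact = df_exact.withColumn('Points', col('Points').cast('double'))
df_exact = df_exact.fillna(value=avg, subset=['Points'])
df_exact.show()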

Alternatively, a PySpark udf can be used to fill null with the exact mean value, as elaborated below.

#recreating the original dataframe so the Points column still contains nulls
df = spark.createDataFrame(data=nums, schema=columns)
import pyspark.sql.functions as F

#replacing null with 0 to calculate the exact mean value
df_mean = df.fillna(value=0, subset=['Points']).select(F.mean(F.col('Points')).alias('avg')).collect()
avg = df_mean[0]['avg']

#the udf returns a double, so the exact mean value is kept for the null rows
df = df.withColumn('Points', F.udf(lambda x: avg if x is None else float(x), 'double')(F.col('Points')))
df.show()
Output: null Points filled with the exact mean value.
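
As a side note, a udf is not strictly required for this; the built-in coalesce() function achieves the same thing. A minimal sketch, again starting from the original data and reusing the avg computed above (df_coalesce is an illustrative name):

#coalesce keeps Points where it is not null and falls back to the mean literal otherwise
from pyspark.sql.functions import coalesce, col, lit
df_coalesce = spark.createDataFrame(data=nums, schema=columns)
df_coalesce = df_coalesce.withColumn('Points', coalesce(col('Points'), lit(avg)))
df_coalesce.show()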

When you deal with huge data, you need to be careful about what it contains. Datasets holding industrial data, or data related to transactions, production, commodity usage, or surveys, can be gigabytes in size, with row counts running into the millions. Machine or sensor readings taken daily, hourly, or even every second are also stored as CSV files and later transformed into datasets. Inevitably, some erroneous values, nulls/NaNs, and ambiguous records get generated and carried forward.


Data engineers and analysts are the people who take this data and create value out of it. They collect such data, clean it, and make sense of it in order to solve a problem, or use it for analytical or operational projects. Big data technologies like Hadoop and Spark make this job smooth, and pyspark is a way of writing code to carry out these transformations.
