
Most of the time, while cleaning data, you have to make choices based on how usable the data is. If the dataset you are working on contains null values, you need to decide how to handle them. You can either replace the null values with some alternative or remove those rows entirely. The first option is usually preferred, since it reduces the risk of data loss. Let's take a look at our example PySpark dataset.
#creating a pyspark dataframe with a column holding integer value
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('NameNum').getOrCreate()

nums = [('Tim', 8),
        ('Stephen', 1),
        ('Steve', None),
        ('Jack', 7),
        ('Adam', 3),
        ('Gwen', None)]
columns = ["FirstName", "Points"]
df = spark.createDataFrame(data=nums, schema=columns)
df.show()

You need to fill null with something meaningful. If the column contains int, double, or other numeric values, the easiest option is ordinarily to put 0 in place of null; that is shown in the quick sketch below. Nonetheless, you can also replace null with the mean of the column in a PySpark dataframe.
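As a minimal sketch of that easiest option (using the df created above), filling the nulls in the Points column with 0 takes a single fillna() call:
#a quick sketch: replace null in the Points column with 0
df.fillna(value=0, subset=['Points']).show()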
PySpark provides multiple ways to replace null with the average value of a column. Let's first explore the ways to calculate the mean. Taking the average of a PySpark dataframe column with no null values causes no trouble; the complications start when nulls are present.
Replace null values in PySpark dataframe column
How to calculate average in pyspark dataframe
The average is simply the sum of the values divided by the number of records (rows) in the PySpark dataframe. However, if there are null values in the column, the mean(col('columnName')) function skips the null records, so the row count used in the division decreases and the mean is computed only over the non-null rows, which is not what we want here. agg({'columnName': 'mean'}) and agg({'columnName': 'avg'}) behave the same way. You can then replace null with whichever mean you choose. To get a more vivid look, let's dive into the code.
Calculate pyspark mean by ignoring null values
#creating a pyspark dataframe with a column holding integer value
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('NameNum').getOrCreate()

nums = [('Tim', 8),
        ('Stephen', 1),
        ('Steve', None),
        ('Jack', 7),
        ('Adam', 3),
        ('Gwen', None)]
columns = ["FirstName", "Points"]
df = spark.createDataFrame(data=nums, schema=columns)

df_mean = df.agg({'Points': 'avg'}).collect()
#the two null rows are skipped, so this gives (8+1+7+3)/4 = 4.75
df_mean[0][0]

Furthermore, you can use 'mean' in place of 'avg'; the code will look something like this.
#In the aggregate function, 'mean' and 'avg' can be used interchangeably.
df_mean = df.agg({'Points': 'mean'}).collect()
df_mean[0][0]
Alternatively, this can be done by importing mean from pyspark.sql.functions and using the mean() function. It produces exactly the same result for the above PySpark dataframe.
#importing the col and mean functions
from pyspark.sql.functions import col, mean

df_mean = df.select(mean(col('Points')).alias('avg')).collect()
avg = df_mean[0]['avg']
#again 4.75, since the null rows are ignored
avg

You see, taking the average is not a big deal. But it is not being calculated as desired here, because the null rows are skipped. You can overcome this by replacing null with 0 in the PySpark dataframe beforehand. Once the nulls are replaced, no row is skipped, so the mean is calculated over the whole dataset rather than only over the rows with no null value. This can be achieved by calling df.fillna() before calculating the mean.
Calculate pyspark mean without ignoring nulls
#importing the col and mean functions
from pyspark.sql.functions import col, mean

#replacing null with 0 so that no rows are skipped when the mean is calculated
df = df.fillna(value=0, subset=['Points'])
df_mean = df.select(mean(col('Points')).alias('avg')).collect()
avg = df_mean[0]['avg']
#3.1666..., i.e. (8+1+0+7+3+0)/6, computed over all six rows
avg

Ultimately, you can replace the null values in your PySpark dataframe with this mean value, all at once. Afterwards there are no null values left; the mean value takes their place. Let's walk through the PySpark code for that.
pyspark replace null with mean
#creating a pyspark dataframe with a column holding integer value
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, mean

spark = SparkSession.builder.appName('NameNum').getOrCreate()

nums = [('Tim', 8),
        ('Stephen', 1),
        ('Steve', None),
        ('Jack', 7),
        ('Adam', 3),
        ('Gwen', None)]
columns = ["FirstName", "Points"]
df = spark.createDataFrame(data=nums, schema=columns)

#replacing null with 0 to calculate the exact mean value
df_mean = df.fillna(value=0, subset=['Points']).select(mean(col('Points')).alias('avg')).collect()
avg = df_mean[0]['avg']
#filling the null values with the mean
df = df.fillna(value=avg, subset=['Points'])
df.show()

Noticeably, avg is a float, whereas the Points column holds integer values. So the mean value (3.1666...) gets truncated to 3 when it is filled in, and all null values end up replaced by 3.
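You can confirm this by checking the schema; the Points column, built from Python integers, is inferred as a long:
#checking the inferred column types (Points is a long, hence the truncation)
df.printSchema()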
If you want to keep the exact mean value instead, a PySpark udf can be used, as elaborated below (run it on the dataframe while it still contains the nulls).
#importing the functions needed below
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

#replacing null with 0 to calculate the exact mean value
df_mean = df.fillna(value=0, subset=['Points']).select(F.mean(F.col('Points')).alias('avg')).collect()
avg = df_mean[0]['avg']
#returning DoubleType keeps the column numeric, so the exact mean (3.1666...) replaces null
df = df.withColumn('Points', F.udf(lambda x: avg if x is None else float(x), DoubleType())(F.col('Points')))
df.show()
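Alternatively, if you prefer to avoid a Python udf, a minimal sketch (starting from the original dataframe with the nulls still present) is to cast the Points column to double before filling, so that fillna() can hold the exact mean:
#a sketch of an alternative: cast Points to double, then fill the nulls with the exact mean
from pyspark.sql import functions as F
df = df.withColumn('Points', F.col('Points').cast('double')).fillna(value=avg, subset=['Points'])
df.show()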

When you deal with huge data, you need to be careful about what it contains. Datasets that hold industrial data, data related to transactions, production, commodity usage, or surveys can be gigabytes in size, with row counts running into the millions. Machine or sensor readings taken daily, hourly, or even every second are also stored as CSV files and later transformed into datasets. Inevitably, some erroneous values, nulls/NaNs, and ambiguous records get generated along the way and carried forward.
Data Engineers and Analysts are the people who take this data and create value out of it. They collect it, clean it, and make sense of it in order to solve a problem, or use it for analytical or operational projects. Big data technologies like Hadoop and Spark make this job run smoothly, and PySpark is the way to write the code that carries out these transformations.