50+ Most Important Apache Spark and Databricks MCQ Questions

Apache pyspark mcq questions

We provide Apache Spark objective questions (PySpark MCQs) along with Apache Spark interview questions and a quiz. These Apache Spark multiple choice questions include some of the most important Hadoop and Spark objective questions, along with Databricks Apache Spark MCQ questions. Studying the Apache Spark MCQ quiz and Azure Databricks MCQ questions below can help you prepare for the Databricks exam. The MCQs are also a useful resource for Azure exams and for the data analyst training tests given by companies like Capgemini, Infosys, Accenture, Cognizant, and TCS.

Apache Spark multiple choice questions

Q.1. Spark was initially started by ____________ at UC Berkeley AMPLab in 2009.
A : Mahek Zaharia
B : Matei Zaharia
C : Doug Cutting
D : Stonebraker

Matei Zaharia

Q.2. ____________ is a component on top of Spark Core.
A : Spark Streaming
B : Spark SQL
C : RDDs
D : All of the mentioned

Spark SQL

Q.3. Spark SQL provides a domain-specific language to manipulate ___________ in Scala, Java, or Python.
A : Spark Streaming
B : Spark SQL
C : RDDs
D : All of the mentioned

RDDs

Q.4. ______________ leverages Spark Core fast scheduling capability to perform streaming analytics.
A : MLlib
B : Spark Streaming
C : GraphX
D : RDDs

Spark Streaming

Q.5. ____________ is a distributed machine learning framework on top of Spark.
A : MLlib
B : Spark Streaming
C : GraphX
D : RDDs

MLlib

Q.5. Given a dataframe df, select the code that returns its number of rows:
A : df.take('all')
B : df.collect()
C : df.count()
D : df.numRows()

df.count()

Q.6. Users can easily run Spark on top of Amazon’s __________
A : Infosphere
B : EC2
C : EMR
D : None of the mentioned

EC2

Q.7. Which of the following can be used to launch Spark jobs inside MapReduce?
A : SIM
B : SIMR
C : SIR
D : RIS

SIMR (Spark In MapReduce)

Q.8. Which of the following languages is not supported by Spark?
A : Java
B : Pascal
C : Scala
D : Python

Pascal

Q.9. Spark is packaged with higher level libraries, including support for _________ queries.
A : SQL
B : C
C : C++
D : None of the mentioned

SQL

Q.10. Spark includes a collection of over ________ operators for transforming data and familiar data frame APIs for manipulating semi-structured data.
A : 50
B : 60
C : 70
D : 80

80

Q.10. Given a DataFrame df that includes a number of columns among which a column named quantity and a column named price, complete the code below such that it will create a DataFrame including all the original columns and a new column revenue defined as quantity*price:
A : df.withColumnRenamed("revenue", expr("quantity*price"))
B : df.withColumn(revenue, expr("quantity*price"))
C : df.withColumn("revenue", expr("quantity*price"))
D : df.withColumn(expr("quantity*price"), "revenue")

df.withColumn("revenue", expr("quantity*price"))

Apache Spark mcq questions

Q.11. Spark is engineered from the bottom-up for performance, running ___________ faster than Hadoop by exploiting in memory computing and other optimizations.
A : 100x
B : 150x
C : 200x
D : None of the mentioned

100x

Q.12. Spark powers a stack of high-level tools including Spark SQL, MLlib for _________
A : regression models
B : statistics
C : machine learning
D : reproductive research

machine learning

Q.13. For Multiclass classification problem which algorithm is not the solution?
A : Naive Bayes
B : Random Forests
C : Logistic Regression
D : Decision Trees

Decision Trees

Q.14. Which of the following is a tool of Machine Learning Library?
A : Persistence
B : Utilities like linear algebra, statistics
C : Pipelines
D : All of the above

All of the above

Q.15. Which of the following is true for Spark core?
A : It is the kernel of Spark
B : It enables users to run SQL / HQL queries on the top of Spark.
C : It is the scalable machine learning library which delivers efficiencies
D : Improves the performance of iterative algorithm drastically.

It is the kernel of Spark

Q.15. Given a DataFrame df that has some null values in the column created_date, find the code below that sorts rows in ascending order based on the column created_date, with null values appearing last.
A : orderBy(asc_nulls_last("created_date"))
B : sort(asc_nulls_last("created_date"))
C : orderBy(col("created_date").asc_nulls_last())
D : orderBy(col("created_date"), ascending=True)

orderBy(col("created_date").asc_nulls_last())
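
To make the null-ordering behaviour concrete, here is a plain-Python sketch (no Spark required) of what an ascending, nulls-last sort produces. The row dictionaries and the `sort_asc_nulls_last` helper are hypothetical names introduced only for illustration; they are not part of the Spark API.

```python
def sort_asc_nulls_last(rows, column):
    """Return rows sorted ascending by `column`, with None values placed last."""
    # The key is a tuple: False sorts before True, so non-null rows come first.
    # The second element orders the non-null values among themselves.
    return sorted(rows, key=lambda r: (r[column] is None, r[column] or ""))


# Hypothetical sample data: one row has a null created_date.
rows = [
    {"id": 1, "created_date": "2023-05-01"},
    {"id": 2, "created_date": None},
    {"id": 3, "created_date": "2021-01-15"},
]
ordered_ids = [r["id"] for r in sort_asc_nulls_last(rows, "created_date")]
```

The earliest date comes first and the row with the missing date ends up at the bottom, which is the ordering the question asks for.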

Q.16. Which of the following is true for Spark MLlib?
A : Provides an execution platform for all the Spark applications
B : It is the scalable machine learning library which delivers efficiencies
C : enables powerful interactive and data analytics application across live streaming data
D : All of the above

It is the scalable machine learning library which delivers efficiencies

Q.17. Which of the following is true for RDD?
A : We can operate Spark RDDs in parallel with a low-level API
B : RDDs are similar to the table in a relational database
C : It allows processing of a large amount of structured data
D : It has built-in optimization engine

We can operate Spark RDDs in parallel with a low-level API

Q.18. RDD is fault-tolerant and immutable
A : True
B : False
C : Both
D : None

True

Q.19. The read operation on RDD is
A : Fine-grained
B : Coarse-grained
C : Either fine-grained or coarse-grained
D : Neither fine-grained nor coarse-grained

Either fine-grained or coarse-grained

Q.20. The write operation on RDD is
A : Fine-grained
B : Coarse-grained
C : Either fine-grained or coarse-grained
D : Neither fine-grained nor coarse-grained

Coarse-grained

Databricks mcq questions

Q.20. Which one of the following commands does NOT trigger an eager evaluation?
A : df.collect()
B : df.take()
C : df.show()
D : df.join()

df.join()

Q.20. Which one of the following commands triggers an eager evaluation?
A : df.filter()
B : df.select()
C : df.show()
D : df.limit()

df.show()
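
The difference between lazy transformations and eager actions can be sketched without Spark. The `LazyFrame` class below is a hypothetical toy model, not Spark's implementation: `filter` and `select` only record a step, and nothing runs until the `collect` action is called.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: transformations are recorded, not run."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of pending (kind, function) steps

    def filter(self, predicate):
        # Transformation: returns a new LazyFrame with a longer plan.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, projection):
        # Transformation: also just extends the plan.
        return LazyFrame(self._rows, self._plan + (("select", projection),))

    def collect(self):
        # Action: only now is the recorded plan executed.
        out = list(self._rows)
        for kind, fn in self._plan:
            if kind == "filter":
                out = [r for r in out if fn(r)]
            else:
                out = [fn(r) for r in out]
        return out


calls = []


def is_even(n):
    calls.append(n)  # record that the predicate really ran
    return n % 2 == 0


lazy = LazyFrame([1, 2, 3, 4]).filter(is_even).select(lambda n: n * 10)
calls_before_action = len(calls)  # the predicate has not been invoked yet
result = lazy.collect()           # the action triggers execution
```

In this model the predicate is never called while the chain is being built, which mirrors why filter(), select() and join() return immediately while show() and collect() do the actual work.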

Q.21. Is it possible to mitigate stragglers in RDD?
A : Yes
B : No
C : Both
D : None

Yes

Q.22. Fault Tolerance in RDD is achieved using
A : Immutable nature of RDD
B : DAG (Directed Acyclic Graph)
C : Lazy-evaluation
D : None of the above

DAG (Directed Acyclic Graph)
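
The idea behind lineage-based recovery can be illustrated with a small plain-Python model. `ToyRDD` is a hypothetical class written for this sketch: it remembers the chain of functions used to build it, so a lost partition can be recomputed from the source data instead of being restored from a replica.

```python
class ToyRDD:
    """Hypothetical, minimal model of an RDD that remembers how it was built."""

    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions  # stable input data
        self.lineage = lineage                      # ordered per-element functions
        self.cache = {}                             # computed partitions by index

    def map(self, fn):
        # Coarse-grained transformation: one function applied to every element.
        return ToyRDD(self.source_partitions, self.lineage + (fn,))

    def compute(self, index):
        # Build (or rebuild) a partition by replaying the lineage over its source.
        if index not in self.cache:
            data = self.source_partitions[index]
            for fn in self.lineage:
                data = [fn(x) for x in data]
            self.cache[index] = data
        return self.cache[index]


rdd = ToyRDD([[1, 2], [3, 4]]).map(lambda x: x + 1).map(lambda x: x * 10)
first = rdd.compute(1)      # computed and cached
rdd.cache.pop(1)            # simulate losing the partition
recovered = rdd.compute(1)  # replayed from the recorded lineage
```

Because every transformation is immutable and recorded, replaying the same steps over the same input always yields the same partition.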

Q.23. What is action in Spark RDD?
A : The ways to send result from executors to the driver
B : Takes RDD as input and produces one or more RDD as output.
C : Creates one or many new RDDs
D : All of the above

The ways to send result from executors to the driver

Databricks apache spark questions

Q.24. The shortcomings of Hadoop MapReduce were overcome by Spark RDD by
A : Lazy-evaluation
B : DAG
C : In-memory processing
D : All of the above

All of the above

Q.25. Spark is developed in which language
A : Java
B : Scala
C : Python
D : R

Scala

Q.25. Which of the following is NOT an action?
A : foreach()
B : printSchema()
C : first()
D : reduce()

printSchema()

Q.25. Which of the following is an action?
A : foreach()
B : printSchema()
C : cache()
D : sort()

foreach()

Azure databricks mcq questions

Q.25. Which of the following is a transformation?
A : foreach()
B : flatMap()
C : save()
D : count()

flatMap()

Q.26. Which of the following is not a component of the Spark Ecosystem?
A : Sqoop
B : GraphX
C : MLlib
D : BlinkDB

Sqoop

Q.27. Which of the following algorithms is not present in MLlib?
A : Streaming Linear Regression
B : Streaming KMeans
C : Tanimoto distance
D : None of the above

Tanimoto distance

Q.28. Which of the following is not a feature of Spark?
A : Supports in-memory computation
B : Fault-tolerance
C : It is cost-efficient
D : Compatible with other file storage system

It is cost-efficient

Q.29. Which of the following is the reason Spark is faster than MapReduce?
A : DAG execution engine and in-memory computation
B : Support for different language APIs like Scala, Java, Python and R
C : RDDs are immutable and fault-tolerant
D : None of the above

DAG execution engine and in-memory computation

Q.30. Which of the following is true for RDD?
A : RDD is a programming paradigm
B : RDD in Apache Spark is an immutable collection of objects
C : It is a database
D : None of the above

RDD in Apache Spark is an immutable collection of objects

databricks objective questions

Q.30. Which of the following statements are NOT true for broadcast variables?
A : Broadcast variables are shared, immutable variables that are cached on every machine in the cluster instead of being serialized with every single task.
B : A custom broadcast class can be defined by extending org.apache.spark.utilbroadcastV2 in Java or Scala or pyspark.Accumulatorparams in Python.
C : It is a way of updating a value inside a variety of transformations and propagating that value to the driver node in an efficient and fault-tolerant way.
D : It provides a mutable variable that Spark cluster can safely update on a per-row basis.

B : A custom broadcast class can be defined by extending org.apache.spark.utilbroadcastV2 in Java or Scala or pyspark.Accumulatorparams in Python.
C : It is a way of updating a value inside a variety of transformations and propagating that value to the driver node in an efficient and fault-tolerant way.
D : It provides a mutable variable that Spark cluster can safely update on a per-row basis.

Q.32. Broadcast variables are shared, immutable variables that are cached on every machine in the cluster instead of being serialized with every single task.
A : True
B : False
C : Can’t Specify
D : None

True

Q.33. broadcast variables are ______ and lazily replicated across all nodes in the cluster when an action is triggered
A : mutable
B : immutable
C : both
D : None of above

immutable

pyspark mcq questions

Q.34. The code below should return a new DataFrame with 50 percent of random records from DataFrame df without replacement.
A : df.sample(False, 0.5, 5)
B : df.random(False, 0.5, 5)
C : df.sample(False, 5, 25)
D : df.sample(False, 50, 5)

df.sample(False, 0.5, 5)
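
In df.sample(False, 0.5, 5) the three positional arguments are withReplacement, fraction and seed. The fraction is a per-row probability rather than an exact share, so the result holds roughly, not exactly, half of the rows. The plain-Python sketch below models that behaviour under the assumption of independent per-row (Bernoulli) sampling; `bernoulli_sample` is a hypothetical helper, and its random numbers will not match Spark's for the same seed.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (no replacement)."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return [row for row in rows if rng.random() < fraction]


rows = list(range(1000))
sampled = bernoulli_sample(rows, 0.5, 5)
sampled_again = bernoulli_sample(rows, 0.5, 5)
```

Each row appears at most once in the output, and repeating the call with the same seed returns the same sample. This also shows why options passing 5 or 50 as the fraction are wrong: the value must be a probability between 0 and 1.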

Q.35. Which of the following DataFrame commands will NOT generate a shuffle of data from each executor across the cluster?
A : df.map()
B : df.collect()
C : df.orderBy()
D : df.repartition()

df.map()

Q.36. Which of the following DataFrame commands will NOT generate a shuffle of data from each executor across the cluster?
A : df.union()
B : df.collect()
C : df.orderBy()
D : df.repartition()

df.union()

Q.37. Which of the following DataFrame commands is a narrow transform?
A : df.drop()
B : df.collect()
C : df.orderBy()
D : df.repartition()

df.drop()

Apache Spark mcq Questions and Answers

Q.38. Which of the following DataFrame commands is a wide transform?
A : df.drop()
B : df.contains()
C : df.filter()
D : df.repartition()

df.repartition()

Databricks spark interview questions

Q.39. Which of the following DataFrame commands is a wide transform?
A : df.drop()
B : df.intersection()
C : df.filter()
D : df.map()

df.intersection()

Q.40. Which of the following DataFrame commands is a narrow transform?
A : df.distinct()
B : df.MapPartition()
C : df.cartesian()
D : df.reduceByKey()

df.MapPartition()
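
The narrow versus wide distinction in the questions above comes down to data movement. In the plain-Python sketch below, partitions are modelled as a list of lists. Both helper functions and the modulo-based partitioner are hypothetical simplifications chosen for illustration, not how Spark assigns rows.

```python
def narrow_map(partitions, fn):
    """Narrow: every output partition depends on exactly one input partition."""
    return [[fn(x) for x in part] for part in partitions]


def wide_repartition(partitions, num_partitions):
    """Wide: rows are redistributed, so each output may draw from many inputs."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for x in part:
            out[x % num_partitions].append(x)  # toy partitioner for integers
    return out


partitions = [[1, 2, 3], [4, 5, 6]]
mapped = narrow_map(partitions, lambda x: x * 2)  # partition layout unchanged
shuffled = wide_repartition(partitions, 3)        # rows cross partition boundaries
```

After the narrow step each partition still holds only its own rows, while after the wide step every output partition mixes rows from both inputs, which is what a shuffle means.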

Q.41. When Spark runs in Cluster Mode, which of the following statements about nodes is correct ?
A : There is one single worker node that contains the Spark driver and all the executors.
B : The Spark Driver runs in a worker node inside the cluster.
C : There is always more than one worker node.
D : There are fewer executors than the total number of worker nodes.

The Spark Driver runs in a worker node inside the cluster.

pyspark multiple choice questions

Q.42. The DataFrame df includes a time string column named timestamp_1. Which is the correct syntax that creates a new DataFrame df1 that is just made by the time string field converted to a unix timestamp?
A : df1 = df.select(unix_timestamp(col("timestamp_1"), "MM-dd-yyyy HH:mm:ss").as("timestamp_1"))
B : df1 = df.select(unix_timestamp(col("timestamp_1"), "MM-dd-yyyy HH:mm:ss", "America/Los Angeles").alias("timestamp_1"))
C : df1 = df.select(unix_timestamp(col("timestamp_1"), "America/Los Angeles").alias("timestamp_1"))
D : df1 = df.select(unix_timestamp(col("timestamp_1"), "MM-dd-yyyy HH:mm:ss").alias("timestamp_1"))

df1 = df.select(unix_timestamp(col("timestamp_1"), "MM-dd-yyyy HH:mm:ss").alias("timestamp_1"))
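
To see what the conversion does to a single value, the sketch below reproduces it in plain Python. The Java-style pattern MM-dd-yyyy HH:mm:ss corresponds to the strptime format %m-%d-%Y %H:%M:%S. The helper name `to_unix_timestamp` is hypothetical, and the string is assumed to be in UTC, whereas Spark interprets it in the session time zone.

```python
from datetime import datetime, timezone


def to_unix_timestamp(time_string, fmt="%m-%d-%Y %H:%M:%S"):
    """Parse `time_string` with `fmt` and return whole seconds since the epoch.

    Assumption: the string is interpreted as UTC.
    """
    parsed = datetime.strptime(time_string, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


epoch_start = to_unix_timestamp("01-01-1970 00:00:00")
one_day_later = to_unix_timestamp("01-02-1970 00:00:00")
```

The function takes the value and the format only, matching the two-argument form in the correct option; a time zone is not passed as an extra argument.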

Q.43. If you wanted to:
1. Cache a df as SERIALIZED Java objects in the JVM and;
2. If the df does not fit in memory, store the partitions that don’t fit on disk, and read them from there when they’re needed;
3. Replicate each partition on two cluster nodes.
which command would you choose ?

A : df.persist(StorageLevel.MEMORY_ONLY)
B : df.persist(StorageLevel.MEMORY_AND_DISK_SER)
C : df.persist(StorageLevel.MEMORY_AND_DISK_SER_2)
D : df.cache(StorageLevel.MEMORY_AND_DISK_SER_2)

df.persist(StorageLevel.MEMORY_AND_DISK_SER_2)
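
A storage level is essentially a small bundle of flags, and the three requirements in the question map onto them one by one. The dataclass below is a hypothetical plain-Python model of that idea (it omits the off-heap flag for brevity), not Spark's own StorageLevel class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyStorageLevel:
    use_disk: bool      # spill partitions that do not fit in memory to disk
    use_memory: bool    # keep partitions in memory when possible
    deserialized: bool  # False means the data is stored in serialized form
    replication: int    # number of cluster nodes holding each partition


def level_for(serialized, spill_to_disk, replicas):
    """Translate the question's requirements into storage flags."""
    return ToyStorageLevel(
        use_disk=spill_to_disk,
        use_memory=True,
        deserialized=not serialized,
        replication=replicas,
    )


# Requirements 1 to 3 from the question: serialized, spill to disk, two replicas.
requested = level_for(serialized=True, spill_to_disk=True, replicas=2)
```

Reading the flags back explains the answer: memory plus disk, serialized, replicated twice. It also shows why the MEMORY_ONLY option and the option without replication fall short.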

Apache Spark interview questions and answers for experienced

Q.1. Spark is best suited for ______ data.
A : Real-time
B : Virtual
C : Structured
D : All of the above

Real-time

Q.2. Which of the following are features of Apache Spark?
A : Speed
B : Supports multiple languages
C : Advanced Analytics
D : All of the above

All of the above

Q.3. In how many ways does Spark use Hadoop?
A : 2
B : 3
C : 4
D : 5

2

Q.4. When was Apache Spark developed ?
A : 2007
B : 2008
C : 2009
D : 2010

2009

Q.5. Which of the following is an incorrect way to deploy Spark?
A : Standalone
B : Hadoop Yarn
C : Spark in MapReduce
D : Spark SQL

Spark SQL

Q.7. ________ is a distributed graph processing framework on top of Spark.
A : MLlib
B : Spark Streaming
C : GraphX
D : None of the above

GraphX

Q.8. Point out the correct statement.
A : Spark enables Apache Hive users to run their unmodified queries much faster
B : Spark interoperates only with Hadoop
C : Spark is a popular data warehouse solution running on top of Hadoop
D : All of the above

Spark enables Apache Hive users to run their unmodified queries much faster
