Top Pyspark Interview Questions - Solved

PySpark is the Python interface to Apache Spark, used for large-scale data processing and data engineering, including machine learning. Here we will explore PySpark interview questions that will help you understand the types of problems and scenarios asked in an interview, and what the solutions to those queries could be.

Q: How would you load CSV data into Spark using PySpark?

We can read a CSV file with the csv function provided by Spark.
Below is an example of the code :

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# The line above creates a Spark session,
# which is required every time we work with PySpark.

df_pyspark=spark.read.option('header','true').csv('/path/name.csv')
 

The above code reads every value in the data as a string.
To let Spark recognize each column's data type, we need to modify the code like below :

df_pyspark=spark.read.option('header','true').csv('/path/name.csv', inferSchema=True)
 

Q: Different ways of dropping rows in PySpark.

* To drop all the rows having any field value as null :

df_pyspark= df_pyspark.na.drop()
 

* To drop all the rows having all field values as null :

df_pyspark= df_pyspark.na.drop(how='all')
 

* To drop all the rows having fewer than two non-null field values (thresh=n keeps rows with at least n non-null values; how is ignored when thresh is set) :

df_pyspark= df_pyspark.na.drop(thresh=2)
 

* To drop all the rows having a particular field value as null :

df_pyspark= df_pyspark.na.drop(how='any', subset=['column_name'])
 

Q: How would you fill missing or null values in a column with PySpark ?

We would use the fill function of PySpark to do this :

df_pyspark= df_pyspark.na.fill('value_to_fill', ['column_name1', 'column_name2'])
 

With the above code we can replace null values with the desired value in the required columns.

Q: Consider a scenario where you have data with records of a company's employees: their name, department and salary. How would you check the average salary within each department using PySpark?

PySpark provides different aggregation functions. Here we will use the 'avg' function while grouping on department, as per the requirement.


from pyspark.sql.functions import avg

df_avg_sal= df_pyspark.groupby("department").agg(avg("salary").alias("avg_salary"))
df_avg_sal.show()
 

We use the alias function to name the column that holds the average salary.

Q: We need to select the employee getting the highest salary in each department.

To solve this scenario we will add a new column giving a row number within each department, ordered by salary in descending order, and then filter the top employee in each department.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col
ranking = Window.partitionBy("department").orderBy(col("salary").desc())
top_sal_emp= df_pyspark.withColumn("Rank", row_number().over(ranking))
top_sal_emp.filter(top_sal_emp.Rank==1).show()
 

We use the "withColumn" function in the above code to add the new column "Rank", on which we apply a filter to get the top employee by salary.

Q: Need to select the employees who work in Marketing and have a salary greater than 50k.

We will use a filter condition to get this :

df_top_marketer= df_pyspark.filter( (df_pyspark.salary>50000) & (df_pyspark.department=="Marketing"))
df_top_marketer.show()
 

Q: We need to compare the salary of employees with the average salary within their department.

We will first compute the average salary within each department and then compare it with the employee's salary in another column :

from pyspark.sql.window import Window
from pyspark.sql.functions import avg, col
partition_col = Window.partitionBy("department")
avg_sal_emp= df_pyspark.withColumn("avg_sal", avg(col("salary")).over(partition_col))
diff_sal_comp = avg_sal_emp.withColumn("sal_diff", col("salary") - col("avg_sal"))
diff_sal_comp.show()
 

Q: Need to calculate total salary given to all the employees whose name starts with letter 'A'.

Here we will take the sum of salaries of those employees whose name starts with the letter 'A'. Note that we import PySpark's sum function so it is not confused with Python's built-in sum :

from pyspark.sql.functions import col, sum

total_sal = df_pyspark.filter(col("name").startswith("A")).agg(sum(col("salary")).alias("total_salary"))
total_sal.show()
 

Q: We have a scenario with sales data in which the amount column is prefixed with a "$" sign. We need to convert that column to float, without the dollar sign.

Here we will remove the dollar sign from the desired column and then cast it to the float datatype :

sales_df = sales_df.withColumn("new_price", sales_df.price.substr(2, 100).cast('float'))  # substr is 1-based: start at position 2 to skip the "$"
sales_df.show()
 

Q: How to sort data on the basis of salary in pyspark ?

We can use the sort function like below :

from pyspark.sql.functions import col

emp_df = emp_df.sort(col('salary').desc())
emp_df.show()
 

Q: How to count distinct rows in a dataframe in pyspark ?

We can use the distinct function like below. Note that count() returns a plain Python integer, so we print it rather than calling show() :

print(df.distinct().count())
 

Q: How to rename a column name in a dataframe in pyspark ?

We can use the withColumnRenamed function like below :

df = df.withColumnRenamed('name', 'first_name')
 

Q: How to run sql on a dataframe in pyspark ?

This can be done in three steps :
Step 1 : Create a dataframe of your data
Step 2 : Register a temporary view with that dataframe
Step 3 : Query that temporary view with a sql query

df=spark.read.option('header','true').csv('/path/name.csv', inferSchema=True)
df.createOrReplaceTempView("temp_table")  # registerTempTable is deprecated
query=spark.sql('select * from temp_table')
query.show()
query.show()
 

Q: How to write dataframe data to a file ?

We can use the below functions to write data to files in different formats like csv, json and parquet :

df.write.csv('abc.csv')
df.write.json('abc.json')
df.write.parquet('abc.parquet')
 

Q: Consider a scenario where we have employee data recording the time each employee spent in the office on particular days. We need to convert the days into columns (one column per day).


We can use pivot functionality of the pyspark while grouping on the employee name :

from pyspark.sql.functions import first

df_new = df.groupby("emp_name").pivot("day").agg(first(df.time_spent))
df_new.show()
 

Q: Consider a scenario where we have denormalized data, with a column holding a list of items per record, but we want normalized data with a different row for each item.


To normalize data like this we can use the explode function in PySpark, which is the typical choice in these kinds of scenarios :

from pyspark.sql.functions import explode

df_normalized = df.select("Id", explode("Items").alias("Item"))
df_normalized.show()
 

Q: Explain pyspark architecture in simple terms.


  • PySpark architecture consists of a cluster of nodes.

  • One node is the driver node and the others are worker nodes.

  • The driver is the master node and the workers are the slave nodes.

  • The main operations are performed on the worker nodes, which are controlled by the driver node with the help of the cluster manager.

  • PySpark itself is a Python wrapper around Spark's JVM-based core (written in Scala); it communicates with the JVM through Py4J.

Q: Consider a scenario where data is separated by a delimiter and we have to divide each such record into multiple records based on that delimiter.

  • We will first divide the data based on the given delimiter using the split function.
  • Then we will use the "explode" function to create a different record for each split value.
  • We will select the "Id" column along with the split column to show the result.

Below is the PySpark code to perform the same.
    
from pyspark.sql.functions import col, explode, split
df1 = spark.createDataFrame([
    (1, "Adam, Amani , Singh"),
    (2, "Ram"),
    (3, "Shiv, Gupta")
], schema='Id long, Name string')
df2= df1.select( col("Id"), explode(split(col("Name"), ",") ).alias("Name") )
df1.show()
df2.show()