
Data Lab PreTest
Authored by Sinh Hoàng Văn
Instructional Technology
Professional Development

50 questions
1.
MULTIPLE SELECT QUESTION
1 min • 1 pt
You have the following dataset and need to calculate the average salary per department, but only include departments where the average salary is greater than 4000:
df.groupBy("Department").avg("Salary").filter("avg(Salary) > 4000")

df.createOrReplaceTempView("employees")
spark.sql("SELECT Department, AVG(Salary) AS AvgSalary FROM employees GROUP BY Department HAVING AVG(Salary) > 4000")

from pyspark.sql.functions import col, avg
df.groupBy("Department").agg(avg(col("Salary")).alias("AvgSalary")).filter(col("AvgSalary") > 4000)

df.filter("Salary > 4000").groupBy("Department").avg("Salary")
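For reference, the grouped-average-then-filter logic this question targets (GROUP BY Department ... HAVING AVG(Salary) > 4000) can be sketched in plain Python; the departments and salaries below are illustrative toy data, not the quiz's dataset:

```python
from collections import defaultdict

# Toy rows (illustrative only; the quiz's actual dataset is not shown).
rows = [
    {"Department": "Sales", "Salary": 5000},
    {"Department": "Sales", "Salary": 4500},
    {"Department": "HR", "Salary": 3000},
    {"Department": "HR", "Salary": 3500},
]

# Group salaries by department, then keep only the groups whose
# average exceeds 4000 — the HAVING step happens after aggregation.
by_dept = defaultdict(list)
for r in rows:
    by_dept[r["Department"]].append(r["Salary"])

result = {d: sum(s) / len(s) for d, s in by_dept.items() if sum(s) / len(s) > 4000}
```

The key point the options probe is ordering: filtering on the aggregated average must come after the grouping, not before it.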
2.
MULTIPLE SELECT QUESTION
1 min • 1 pt
You need to find all employees in departments with a budget greater than 8000. Which code snippet achieves this?
employees_df.join(departments_df, "Department").filter("Budget > 8000").select("Name", "Department")

employees_df.createOrReplaceTempView("employees")
departments_df.createOrReplaceTempView("departments")
spark.sql("SELECT e.Name, e.Department FROM employees e JOIN departments d ON e.Department = d.Department WHERE d.Budget > 8000")

from pyspark.sql.functions import col
employees_df.join(departments_df, "Department").where(col("Budget") > 8000).select("Name", "Department")

employees_df.join(departments_df, employees_df.Department == departments_df.Department).filter("Budget > 8000")
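The join-then-filter shape of this question can be sketched in plain Python on toy tables (names and budgets below are illustrative, not from the quiz):

```python
# Toy tables (illustrative only).
employees = [
    {"Name": "An", "Department": "Sales"},
    {"Name": "Binh", "Department": "HR"},
]
departments = [
    {"Department": "Sales", "Budget": 10000},
    {"Department": "HR", "Budget": 5000},
]

# Join employees to departments on Department, then keep only rows
# whose department budget exceeds 8000 (the WHERE d.Budget > 8000 step).
budgets = {d["Department"]: d["Budget"] for d in departments}
result = [
    (e["Name"], e["Department"])
    for e in employees
    if budgets.get(e["Department"], 0) > 8000
]
```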
3.
MULTIPLE SELECT QUESTION
1 min • 1 pt
From the dataset below, find the top-earning employee in each department, including their department name and salary. Use Spark SQL:
df.createOrReplaceTempView("employees")
spark.sql("SELECT Department, Name, Salary FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Salary DESC) AS rank FROM employees) t WHERE rank = 1")

df.createOrReplaceTempView("employees")
spark.sql("SELECT Department, Name, MAX(Salary) AS Salary FROM employees GROUP BY Department, Name")

df.createOrReplaceTempView("employees")
spark.sql("SELECT Department, Name, MAX(Salary) AS MaxSalary FROM employees GROUP BY Department")

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
window_spec = Window.partitionBy("Department").orderBy(col("Salary").desc())
df.withColumn("rank", row_number().over(window_spec)).filter("rank == 1").select("Department", "Name", "Salary")
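The top-earner-per-department logic (ROW_NUMBER over a partition ordered by salary, keeping rank 1) can be sketched in plain Python; the rows below are toy data, not the quiz's dataset:

```python
# Toy rows (illustrative only).
rows = [
    {"Department": "Sales", "Name": "An", "Salary": 5000},
    {"Department": "Sales", "Name": "Chi", "Salary": 5200},
    {"Department": "HR", "Name": "Binh", "Salary": 3500},
]

# Keep the highest-paid row per department — the same effect as
# ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Salary DESC) ... WHERE rank = 1.
best = {}
for r in rows:
    d = r["Department"]
    if d not in best or r["Salary"] > best[d]["Salary"]:
        best[d] = r
```

Note why plain GROUP BY distractors fail: grouping by (Department, Name) returns every employee, and grouping by Department alone cannot also select the matching Name.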
4.
MULTIPLE SELECT QUESTION
1 min • 1 pt
Which of the following codes computes the total sales per region efficiently in Spark SQL?
df.createOrReplaceTempView("sales")
spark.sql("SELECT Region, SUM(Sales) AS TotalSales FROM sales GROUP BY Region")

df.groupBy("Region").sum("Sales")

from pyspark.sql.functions import col, sum
df.groupBy(col("Region")).agg(sum(col("Sales")).alias("TotalSales"))

df.filter("Sales IS NOT NULL").groupBy("Region").sum("Sales")
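The per-region total this question asks for is a plain grouped sum; sketched in Python on toy data (regions and amounts are illustrative):

```python
from collections import defaultdict

# Toy rows (illustrative only).
rows = [
    {"Region": "North", "Sales": 100},
    {"Region": "North", "Sales": 200},
    {"Region": "South", "Sales": 150},
]

# Sum sales per region — the effect of GROUP BY Region with SUM(Sales).
totals = defaultdict(int)
for r in rows:
    totals[r["Region"]] += r["Sales"]
```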
5.
MULTIPLE SELECT QUESTION
1 min • 1 pt
How can you rank products within each category by their sales in descending order?
from pyspark.sql.window import Window
from pyspark.sql.functions import col, rank
window_spec = Window.partitionBy("Category").orderBy(col("Sales").desc())
df.withColumn("rank", rank().over(window_spec))

df.createOrReplaceTempView("products")
spark.sql("SELECT Category, Product, RANK() OVER (PARTITION BY Category ORDER BY Sales DESC) AS Rank FROM products")

df.groupBy("Category", "Product").agg(rank().alias("Rank"))

from pyspark.sql.functions import row_number
window_spec = Window.partitionBy("Category").orderBy("Sales DESC")
df.withColumn("row_number", row_number().over(window_spec))
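SQL RANK() semantics — ties share a rank and the following rank is skipped — can be sketched in plain Python per category; the products and sales figures below are illustrative toy data:

```python
from itertools import groupby

# Toy rows (illustrative only).
rows = [
    {"Category": "A", "Product": "P1", "Sales": 50},
    {"Category": "A", "Product": "P3", "Sales": 50},
    {"Category": "A", "Product": "P2", "Sales": 30},
    {"Category": "B", "Product": "P4", "Sales": 10},
]

# Within each category, order by Sales descending and assign RANK():
# tied sales get the same rank, and the next distinct value skips ahead.
ranked = []
ordered = sorted(rows, key=lambda r: (r["Category"], -r["Sales"]))
for _, grp in groupby(ordered, key=lambda r: r["Category"]):
    prev_sales, rank = None, 0
    for i, r in enumerate(grp, start=1):
        if r["Sales"] != prev_sales:
            rank = i
        prev_sales = r["Sales"]
        ranked.append({**r, "rank": rank})
```

This also highlights the distractors: rank() is a window function, not an aggregate for agg(), and ordering must be expressed as a descending column, not the string "Sales DESC".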
6.
MULTIPLE SELECT QUESTION
1 min • 1 pt
Which function would you use to replace all NULL values in a Spark DataFrame with default values?
df.na.fill({"Column1": 0, "Column2": "N/A"})

df.na.replace(NULL, {"Column1": 0, "Column2": "N/A"})

df.fillna({"Column1": 0, "Column2": "N/A"})

df.replace(NULL, {"Column1": 0, "Column2": "N/A"})
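The fill-with-defaults behavior (a per-column mapping where only the named columns are filled) can be sketched in plain Python; the column names reuse the question's mapping, and the rows are illustrative:

```python
# Toy rows (illustrative only).
rows = [
    {"Column1": None, "Column2": "x", "Column3": None},
    {"Column1": 7, "Column2": None, "Column3": "keep"},
]

defaults = {"Column1": 0, "Column2": "N/A"}

# Mirror a per-column fill: only columns named in the mapping are
# filled; NULLs in other columns are left untouched.
filled = [
    {k: (defaults[k] if v is None and k in defaults else v) for k, v in row.items()}
    for row in rows
]
```

Note that `NULL` is not a Python name, which is what makes two of the options above non-runnable.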
7.
MULTIPLE SELECT QUESTION
1 min • 1 pt
How do you efficiently join two DataFrames on multiple keys in PySpark?
df1.join(df2, (df1.Key1 == df2.Key1) & (df1.Key2 == df2.Key2))

df1.join(df2, "Key1", "inner")

df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
spark.sql("SELECT * FROM df1 JOIN df2 ON df1.Key1 = df2.Key1 AND df1.Key2 = df2.Key2")

df1.merge(df2, "Key1")
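An inner join on a composite key — both equality conditions at once — can be sketched in plain Python with a hash index; the key and value columns below are illustrative toy data:

```python
# Toy tables (illustrative only; column name clashes are ignored here).
df1_rows = [
    {"Key1": 1, "Key2": "a", "v": 10},
    {"Key1": 2, "Key2": "b", "v": 20},
]
df2_rows = [
    {"Key1": 1, "Key2": "a", "w": 100},
]

# Index the right side by the composite key (Key1, Key2), then keep
# only left rows whose pair of keys matches — an inner join on both keys.
index = {(r["Key1"], r["Key2"]): r for r in df2_rows}
joined = [
    {**left, **index[(left["Key1"], left["Key2"])]}
    for left in df1_rows
    if (left["Key1"], left["Key2"]) in index
]
```

Joining on a single key when two are required would incorrectly match rows like (2, "b") against any (2, ...) on the right.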