
Data Lab PreTest

Authored by Sinh Hoàng Văn

Instructional Technology

Professional Development


50 questions


1.

MULTIPLE SELECT QUESTION

1 min • 1 pt

You have a DataFrame df with Department and Salary columns and need to calculate the average salary per department, but only include departments where the average salary is greater than 4000:

a) df.groupBy("Department").avg("Salary").filter("avg(Salary) > 4000")

b) df.createOrReplaceTempView("employees")
   spark.sql("SELECT Department, AVG(Salary) AS AvgSalary FROM employees GROUP BY Department HAVING AVG(Salary) > 4000")

c) from pyspark.sql.functions import col, avg
   df.groupBy("Department").agg(avg(col("Salary")).alias("AvgSalary")).filter(col("AvgSalary") > 4000)

d) df.filter("Salary > 4000").groupBy("Department").avg("Salary")
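
For reference, a minimal runnable sketch of the aggregate-then-filter pattern this question targets (the DataFrame analogue of SQL's HAVING clause). The sample rows and session setup are illustrative assumptions, since the quiz's original dataset is not reproduced here:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import avg, col

    spark = SparkSession.builder.getOrCreate()
    # Hypothetical sample data with the columns the options reference.
    df = spark.createDataFrame(
        [("Sales", 3500), ("Sales", 3800), ("IT", 4200), ("IT", 5100)],
        ["Department", "Salary"],
    )
    # Aggregate first, then filter on the aggregated column.
    (df.groupBy("Department")
       .agg(avg(col("Salary")).alias("AvgSalary"))
       .filter(col("AvgSalary") > 4000)
       .show())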

2.

MULTIPLE SELECT QUESTION

1 min • 1 pt

You need to find all employees in departments with a budget greater than 8000. Which code snippets achieve this?

a) employees_df.join(departments_df, "Department").filter("Budget > 8000").select("Name", "Department")

b) employees_df.createOrReplaceTempView("employees")
   departments_df.createOrReplaceTempView("departments")
   spark.sql("SELECT e.Name, e.Department FROM employees e JOIN departments d ON e.Department = d.Department WHERE d.Budget > 8000")

c) from pyspark.sql.functions import col
   employees_df.join(departments_df, "Department").where(col("Budget") > 8000).select("Name", "Department")

d) employees_df.join(departments_df, employees_df.Department == departments_df.Department).filter("Budget > 8000")
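
A minimal sketch of the join-then-filter approach, assuming illustrative employees and departments tables (the actual data is not shown in the quiz):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    employees_df = spark.createDataFrame(
        [("Alice", "IT"), ("Bob", "HR")], ["Name", "Department"])
    departments_df = spark.createDataFrame(
        [("IT", 12000), ("HR", 6000)], ["Department", "Budget"])
    # Joining on the shared column name keeps a single Department column;
    # the optimizer can push the Budget filter below the join.
    (employees_df.join(departments_df, "Department")
                 .filter(col("Budget") > 8000)
                 .select("Name", "Department")
                 .show())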

3.

MULTIPLE SELECT QUESTION

1 min • 1 pt

Given an employees DataFrame df with Name, Department, and Salary columns, find the top-earning employee in each department, including their department name and salary. Use Spark SQL:

a) df.createOrReplaceTempView("employees")
   spark.sql("SELECT Department, Name, Salary FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Salary DESC) AS rank FROM employees) t WHERE rank = 1")

b) df.createOrReplaceTempView("employees")
   spark.sql("SELECT Department, Name, MAX(Salary) AS Salary FROM employees GROUP BY Department, Name")

c) df.createOrReplaceTempView("employees")
   spark.sql("SELECT Department, Name, MAX(Salary) AS MaxSalary FROM employees GROUP BY Department")

d) from pyspark.sql.window import Window
   from pyspark.sql.functions import col, row_number

   window_spec = Window.partitionBy("Department").orderBy(col("Salary").desc())
   df.withColumn("rank", row_number().over(window_spec)).filter("rank == 1").select("Department", "Name", "Salary")
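
For reference, a runnable sketch of the window-function approach in Spark SQL, using hypothetical sample rows:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("Alice", "IT", 5100), ("Bob", "IT", 4200), ("Carol", "HR", 3900)],
        ["Name", "Department", "Salary"],
    )
    df.createOrReplaceTempView("employees")
    # ROW_NUMBER() keeps exactly one row per department (RANK() would keep
    # ties); filtering on the window column requires an outer query.
    spark.sql("""
        SELECT Department, Name, Salary
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY Department
                                      ORDER BY Salary DESC) AS rn
            FROM employees
        ) t
        WHERE rn = 1
    """).show()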

4.

MULTIPLE SELECT QUESTION

1 min • 1 pt

Which of the following code snippets computes the total sales per region efficiently in Spark SQL?

a) df.createOrReplaceTempView("sales")
   spark.sql("SELECT Region, SUM(Sales) AS TotalSales FROM sales GROUP BY Region")

b) df.groupBy("Region").sum("Sales")

c) from pyspark.sql.functions import col, sum
   df.groupBy(col("Region")).agg(sum(col("Sales")).alias("TotalSales"))

d) df.filter("Sales IS NOT NULL").groupBy("Region").sum("Sales")
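
A minimal sketch of a grouped sum with an explicit alias; the sample data is invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, sum as sum_  # avoid shadowing the builtin

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("East", 100), ("East", 250), ("West", 300)], ["Region", "Sales"])
    # Alias the aggregate so the output column has a stable name;
    # SUM ignores NULLs, so a separate NOT NULL filter is unnecessary.
    df.groupBy("Region").agg(sum_(col("Sales")).alias("TotalSales")).show()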

5.

MULTIPLE SELECT QUESTION

1 min • 1 pt

How can you rank products within each category by their sales in descending order?

a) from pyspark.sql.window import Window
   from pyspark.sql.functions import col, rank

   window_spec = Window.partitionBy("Category").orderBy(col("Sales").desc())
   df.withColumn("rank", rank().over(window_spec))

b) df.createOrReplaceTempView("products")
   spark.sql("SELECT Category, Product, RANK() OVER (PARTITION BY Category ORDER BY Sales DESC) AS Rank FROM products")

c) df.groupBy("Category", "Product").agg(rank().alias("Rank"))

d) from pyspark.sql.functions import row_number

   window_spec = Window.partitionBy("Category").orderBy("Sales DESC")
   df.withColumn("row_number", row_number().over(window_spec))
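
A runnable sketch of ranking within partitions via the DataFrame window API, with hypothetical sample rows:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, rank
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("Books", "A", 90), ("Books", "B", 120), ("Toys", "C", 70)],
        ["Category", "Product", "Sales"],
    )
    # Descending order must be expressed on a Column (col(...).desc());
    # a string like "Sales DESC" is not a valid orderBy argument here.
    w = Window.partitionBy("Category").orderBy(col("Sales").desc())
    df.withColumn("Rank", rank().over(w)).show()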

6.

MULTIPLE SELECT QUESTION

1 min • 1 pt

Which function would you use to replace all NULL values in a Spark DataFrame with default values?

a) df.na.fill({"Column1": 0, "Column2": "N/A"})

b) df.na.replace(NULL, {"Column1": 0, "Column2": "N/A"})

c) df.fillna({"Column1": 0, "Column2": "N/A"})

d) df.replace(NULL, {"Column1": 0, "Column2": "N/A"})
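
A short sketch of dictionary-based null replacement, on an invented two-column DataFrame:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, None), (None, "x")], ["Column1", "Column2"])
    # df.na.fill(...) and df.fillna(...) are equivalent; a dict maps each
    # column to its own default value.
    df.na.fill({"Column1": 0, "Column2": "N/A"}).show()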

7.

MULTIPLE SELECT QUESTION

1 min • 1 pt

How do you efficiently join two DataFrames on multiple keys in PySpark?

a) df1.join(df2, (df1.Key1 == df2.Key1) & (df1.Key2 == df2.Key2))

b) df1.join(df2, "Key1", "inner")

c) df1.createOrReplaceTempView("df1")
   df2.createOrReplaceTempView("df2")
   spark.sql("SELECT * FROM df1 JOIN df2 ON df1.Key1 = df2.Key1 AND df1.Key2 = df2.Key2")

d) df1.merge(df2, "Key1")
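
A minimal sketch of a multi-key equi-join with an explicit boolean condition, using hypothetical single-row DataFrames:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(1, "a", 10)], ["Key1", "Key2", "v1"])
    df2 = spark.createDataFrame([(1, "a", 99)], ["Key1", "Key2", "v2"])
    # & combines Column predicates; each comparison needs its own parentheses.
    cond = (df1.Key1 == df2.Key1) & (df1.Key2 == df2.Key2)
    # Both sides keep their key columns, so qualify them when selecting.
    df1.join(df2, cond, "inner").select(df1.Key1, df1.Key2, "v1", "v2").show()

Passing a list of column names instead, df1.join(df2, ["Key1", "Key2"]), also joins on both keys and keeps a single copy of each key column in the result.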
