Data Lab PreTest

Professional Development

50 Qs


Similar activities

ISPFL6 - Midterm Exam • Professional Development • 55 Qs

YONO • Professional Development • 50 Qs

Javascript • Professional Development • 45 Qs

Final Drive Assemblies • Professional Development • 50 Qs

Carpentry review chp 1 • Professional Development • 50 Qs

OSHA, NFPA, Workplace Safety • 8th Grade - Professional Development • 53 Qs

Chapter 5 (PART I) Securing Hosts and Data • Professional Development • 47 Qs

GWS Administrator • Professional Development • 47 Qs

Data Lab PreTest

Assessment • Quiz • Instructional Technology • Professional Development • Hard

Created by Sinh Hoàng Văn

50 questions


1.

MULTIPLE SELECT QUESTION

1 min • 1 pt

You have the following dataset and need to calculate the average salary per department, but only include departments where the average salary is greater than 4000:
A. df.groupBy("Department").avg("Salary").filter("avg(Salary) > 4000")

B. df.createOrReplaceTempView("employees")
   spark.sql("SELECT Department, AVG(Salary) AS AvgSalary FROM employees GROUP BY Department HAVING AVG(Salary) > 4000")

C. from pyspark.sql.functions import col, avg
   df.groupBy("Department").agg(avg(col("Salary")).alias("AvgSalary")).filter(col("AvgSalary") > 4000)

D. df.filter("Salary > 4000").groupBy("Department").avg("Salary")
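
For context, a minimal runnable sketch of the aggregate-then-filter pattern (the DataFrame equivalent of SQL's GROUP BY ... HAVING), assuming a hypothetical employees DataFrame; the sample rows and app name are illustrative, since the quiz's actual dataset is not reproduced on this page:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import avg, col

    spark = SparkSession.builder.appName("avg-salary-sketch").getOrCreate()

    # Illustrative rows only; not the quiz's dataset.
    df = spark.createDataFrame(
        [("Alice", "Sales", 5000), ("Bob", "Sales", 4500), ("Carol", "HR", 3000)],
        ["Name", "Department", "Salary"],
    )

    # Aggregate first, then filter on the aggregated column; filtering before
    # the groupBy (as in option D) would drop rows, not departments.
    (df.groupBy("Department")
       .agg(avg(col("Salary")).alias("AvgSalary"))
       .filter(col("AvgSalary") > 4000)
       .show())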

2.

MULTIPLE SELECT QUESTION

1 min • 1 pt

You need to find all employees in departments with a budget greater than 8000. Which code snippet achieves this?
A. employees_df.join(departments_df, "Department").filter("Budget > 8000").select("Name", "Department")

B. employees_df.createOrReplaceTempView("employees")
   departments_df.createOrReplaceTempView("departments")
   spark.sql("SELECT e.Name, e.Department FROM employees e JOIN departments d ON e.Department = d.Department WHERE d.Budget > 8000")

C. from pyspark.sql.functions import col
   employees_df.join(departments_df, "Department").where(col("Budget") > 8000).select("Name", "Department")

D. employees_df.join(departments_df, employees_df.Department == departments_df.Department).filter("Budget > 8000")
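
A minimal runnable sketch of the join-then-filter pattern, assuming hypothetical employees_df and departments_df frames; the sample rows and app name are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("join-filter-sketch").getOrCreate()

    # Illustrative rows only.
    employees_df = spark.createDataFrame(
        [("Alice", "Sales"), ("Bob", "HR")], ["Name", "Department"])
    departments_df = spark.createDataFrame(
        [("Sales", 10000), ("HR", 5000)], ["Department", "Budget"])

    # Joining on the column name (rather than an expression) keeps a single
    # Department column in the result, so the final select is unambiguous.
    (employees_df.join(departments_df, "Department")
        .where(col("Budget") > 8000)
        .select("Name", "Department")
        .show())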

3.

MULTIPLE SELECT QUESTION

1 min • 1 pt

From the dataset below, find the top-earning employee in each department, including their department name and salary. Use Spark SQL:
A. df.createOrReplaceTempView("employees")
   spark.sql("SELECT Department, Name, Salary FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Salary DESC) AS rank FROM employees) t WHERE rank = 1")

B. df.createOrReplaceTempView("employees")
   spark.sql("SELECT Department, Name, MAX(Salary) AS Salary FROM employees GROUP BY Department, Name")

C. df.createOrReplaceTempView("employees")
   spark.sql("SELECT Department, Name, MAX(Salary) AS MaxSalary FROM employees GROUP BY Department")

D. from pyspark.sql.window import Window
   from pyspark.sql.functions import col, row_number
   window_spec = Window.partitionBy("Department").orderBy(col("Salary").desc())
   df.withColumn("rank", row_number().over(window_spec)).filter("rank == 1").select("Department", "Name", "Salary")
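
A minimal runnable sketch of the top-one-per-group pattern with a window function, on hypothetical sample rows (the quiz's dataset is not shown on this page):

    from pyspark.sql import SparkSession
    from pyspark.sql.window import Window
    from pyspark.sql.functions import col, row_number

    spark = SparkSession.builder.appName("top-earner-sketch").getOrCreate()

    # Illustrative rows only.
    df = spark.createDataFrame(
        [("Alice", "Sales", 5000), ("Bob", "Sales", 4500), ("Carol", "HR", 3000)],
        ["Name", "Department", "Salary"],
    )

    # row_number over a per-department window ordered by Salary descending
    # assigns 1 to each department's top earner.
    w = Window.partitionBy("Department").orderBy(col("Salary").desc())
    (df.withColumn("rn", row_number().over(w))
       .filter(col("rn") == 1)
       .select("Department", "Name", "Salary")
       .show())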

4.

MULTIPLE SELECT QUESTION

1 min • 1 pt

Which of the following codes computes the total sales per region efficiently in Spark SQL?
A. df.createOrReplaceTempView("sales")
   spark.sql("SELECT Region, SUM(Sales) AS TotalSales FROM sales GROUP BY Region")

B. df.groupBy("Region").sum("Sales")

C. from pyspark.sql.functions import col, sum
   df.groupBy(col("Region")).agg(sum(col("Sales")).alias("TotalSales"))

D. df.filter("Sales IS NOT NULL").groupBy("Region").sum("Sales")
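
A minimal runnable sketch of a grouped sum, with illustrative rows and a hypothetical app name; sum is imported under an alias here so it does not shadow Python's builtin:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, sum as sum_

    spark = SparkSession.builder.appName("total-sales-sketch").getOrCreate()

    # Illustrative rows only.
    df = spark.createDataFrame(
        [("East", 100), ("East", 250), ("West", 300)], ["Region", "Sales"])

    # A single wide aggregation; SUM already ignores NULLs, so no explicit
    # null filter is needed beforehand.
    df.groupBy("Region").agg(sum_(col("Sales")).alias("TotalSales")).show()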

5.

MULTIPLE SELECT QUESTION

1 min • 1 pt

How can you rank products within each category by their sales in descending order?
A. from pyspark.sql.window import Window
   from pyspark.sql.functions import col, rank
   window_spec = Window.partitionBy("Category").orderBy(col("Sales").desc())
   df.withColumn("rank", rank().over(window_spec))

B. df.createOrReplaceTempView("products")
   spark.sql("SELECT Category, Product, RANK() OVER (PARTITION BY Category ORDER BY Sales DESC) AS Rank FROM products")

C. df.groupBy("Category", "Product").agg(rank().alias("Rank"))

D. from pyspark.sql.functions import row_number
   window_spec = Window.partitionBy("Category").orderBy("Sales DESC")
   df.withColumn("row_number", row_number().over(window_spec))
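
A minimal runnable sketch of per-category ranking, on hypothetical product rows:

    from pyspark.sql import SparkSession
    from pyspark.sql.window import Window
    from pyspark.sql.functions import col, rank

    spark = SparkSession.builder.appName("rank-sketch").getOrCreate()

    # Illustrative rows only.
    df = spark.createDataFrame(
        [("Phone", "Electronics", 900), ("Laptop", "Electronics", 700),
         ("Desk", "Furniture", 400)],
        ["Product", "Category", "Sales"],
    )

    # rank() requires an ordered window; .desc() on the column expression is
    # how descending order is expressed (a "Sales DESC" string is not).
    w = Window.partitionBy("Category").orderBy(col("Sales").desc())
    df.withColumn("Rank", rank().over(w)).show()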

6.

MULTIPLE SELECT QUESTION

1 min • 1 pt

Which function would you use to replace all NULL values in a Spark DataFrame with default values?
A. df.na.fill({"Column1": 0, "Column2": "N/A"})

B. df.na.replace(NULL, {"Column1": 0, "Column2": "N/A"})

C. df.fillna({"Column1": 0, "Column2": "N/A"})

D. df.replace(NULL, {"Column1": 0, "Column2": "N/A"})
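
A minimal runnable sketch of null replacement with per-column defaults, on hypothetical rows; df.na.fill and df.fillna are two spellings of the same operation:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fillna-sketch").getOrCreate()

    # Illustrative rows only; an explicit schema keeps the null cells from
    # complicating type inference.
    df = spark.createDataFrame(
        [(1, None), (None, "x")],
        "Column1 INT, Column2 STRING",
    )

    # The dict maps column names to defaults; each default is applied only
    # where the column's type matches the value.
    df.na.fill({"Column1": 0, "Column2": "N/A"}).show()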

7.

MULTIPLE SELECT QUESTION

1 min • 1 pt

How do you efficiently join two DataFrames on multiple keys in PySpark?
A. df1.join(df2, (df1.Key1 == df2.Key1) & (df1.Key2 == df2.Key2))

B. df1.join(df2, "Key1", "inner")

C. df1.createOrReplaceTempView("df1")
   df2.createOrReplaceTempView("df2")
   spark.sql("SELECT * FROM df1 JOIN df2 ON df1.Key1 = df2.Key1 AND df1.Key2 = df2.Key2")

D. df1.merge(df2, "Key1")
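
A minimal runnable sketch of a multi-key join, with hypothetical frames; passing a list of column names joins on all of them at once:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multikey-join-sketch").getOrCreate()

    # Illustrative rows only.
    df1 = spark.createDataFrame([(1, "a", "x")], ["Key1", "Key2", "Val1"])
    df2 = spark.createDataFrame([(1, "a", "y")], ["Key1", "Key2", "Val2"])

    # A list of key names is equivalent to the expression form in option A,
    # but keeps a single copy of Key1 and Key2 in the output.
    df1.join(df2, ["Key1", "Key2"], "inner").show()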
