Which of the following statements about feature engineering are correct?

One-hot encoding transforms a categorical column into multiple binary columns.

Binning can be used to reduce the impact of minor observation variations by grouping continuous values into bins.

One-hot encoding is used for numerical columns to reduce skew.

Binning should never be used because it always leads to information loss.

You have the following histogram (hypothetical description): It shows a single peak around 50. The tail extends to the right up to around 200. Most data points are between 40 and 70. Which conclusion(s) can you draw?

The distribution is right-skewed.

There might be outliers on the right end.

The distribution is left-skewed.

Data are likely symmetric around the mean.

You have a numeric feature with unknown distribution in a large dataset. Which approach(es) would you choose to quickly understand its distribution?

Generate a histogram or density plot.

Create a box plot to see outliers and quartiles.

Automatically assume it's normal without any plots.
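A quick numeric counterpart to the plots named in the options above can be sketched as follows. The data is synthetic and purely illustrative (a hypothetical lognormal sample, not part of the quiz): `describe()` returns the quartiles a box plot would draw, and `np.histogram` returns the bin counts a histogram would show.

```python
import numpy as np
import pandas as pd

# Hypothetical feature: a synthetic, right-skewed sample (assumption, not quiz data)
rng = np.random.default_rng(0)
feature = pd.Series(rng.lognormal(mean=3.0, sigma=0.5, size=10_000))

# Count, mean, std, min, quartiles, max: the numbers behind a box plot
summary = feature.describe()
print(summary)

# Bin counts and bin edges: the numbers behind a histogram
counts, edges = np.histogram(feature, bins=10)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:8.1f} to {right:8.1f}: {count}")
```

Looking at these numbers (or the plots built from them) before modeling avoids assuming normality without evidence.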

Why do we often look at the variance or standard deviation in EDA?

To understand how spread out the data is from the mean.

To evaluate if data is sparse in a high-dimensional setting.

To estimate the risk of bias in a linear model.

Consider the following code snippet:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.DataFrame(
        {'Age': [23, 45, 23, 30, 35, 45, 50],
         'Salary': [50000, 80000, 45000, 60000, 62000, 80000, 100000]})
    sns.boxplot(x=df['Age'])
    plt.show()

Which of the following statements could be true about the resulting box plot?

We will see the distribution of Age values with possible outliers.

The median of Age might be around 35.

The box plot will show the relationship between Age and Salary.

The plot is a vertical box plot by default.
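The statistics that a box plot of Age would draw can be checked numerically. This sketch reuses the Age values from the snippet in the question and assumes the common 1.5 × IQR whisker rule; it does not draw anything.

```python
import pandas as pd

# Age values taken from the quiz snippet
ages = pd.Series([23, 45, 23, 30, 35, 45, 50], name="Age")

median = ages.median()                # centre line of the box
q1, q3 = ages.quantile([0.25, 0.75])  # lower and upper edges of the box
iqr = q3 - q1

# Points outside these fences would be drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(median, q1, q3)   # 35.0 26.5 45.0
print(outliers.empty)   # True: no value lies beyond the fences
```

Only the Age column enters the computation, which is why the plot says nothing about Salary.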

Look at the code snippet:

    import pandas as pd

    data = {'Product': ['A', 'B', 'C', 'A', 'B'],
            'Sales': [10, 20, 15, 5, 30]}
    df = pd.DataFrame(data)
    pivot_df = df.pivot_table(index='Product', values='Sales', aggfunc='sum')
    print(pivot_df)

What is printed?

A table with a single column Sales indexed by Product.

Sum of sales for each product in rows.

A table with columns A, B, C and sums of Sales.

It will throw an error because pivot_table requires multiple values.
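The shape of the result can be inspected directly. The sketch below reruns the snippet from the question and adds a `groupby` equivalent for comparison; the `grouped` name is introduced here for illustration only.

```python
import pandas as pd

data = {'Product': ['A', 'B', 'C', 'A', 'B'],
        'Sales': [10, 20, 15, 5, 30]}
df = pd.DataFrame(data)

# One row per product, one column named Sales holding the summed values
pivot_df = df.pivot_table(index='Product', values='Sales', aggfunc='sum')
print(pivot_df)

# The same aggregation written as a groupby (illustrative alternative)
grouped = df.groupby('Product')[['Sales']].sum()

print(list(pivot_df.columns))        # ['Sales']
print(pivot_df['Sales'].to_dict())   # {'A': 15, 'B': 50, 'C': 15}
```

Because no `columns` argument is passed, the products stay in the index rather than becoming column headers.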

You need pandas code that reads a CSV file called houses.csv into a DataFrame and prints the first 5 rows. Which of the following code blocks is correct?

    import pandas as pd
    df = pd.read_csv('houses.csv')
    print(df.head())

    import pandas as pd
    df = pd.read('houses.csv')
    df.head()

    import pandas as pd
    data = pd.read_csv('houses_data.csv')
    print(data.head(5))

    import pandas as pd
    df = pd.read_excel('houses.csv')
    df.head(5)
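The `read_csv` plus `head()` pattern can be tried without the actual file. In this sketch an in-memory string stands in for houses.csv; its column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of houses.csv (assumption)
csv_text = """price,bedrooms
250000,3
310000,4
199000,2
420000,5
275000,3
330000,4
185000,2
"""

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default; print() is needed in a script
print(df.head())
```

In a plain script, calling `df.head()` without `print()` produces no output, which is one reason some of the options above fall short.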

You have the following Python code:

    import pandas as pd

    df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue'],
                       'Value': [10, 20, 5, 15]})
    # Your code here to one-hot encode the 'Color' column

Complete the code snippet to one-hot encode the 'Color' column and display the first 5 rows. Which is correct?

    encoded_df = pd.get_dummies(df, columns=['Color'])
    print(encoded_df.head())

    encoded_df = df
    encoded_df['Color'] = df['Color'].factorize()
    print(encoded_df.head())

    encoded_df = df.drop('Color', axis=1)
    print(encoded_df.head())

    encoded_df = df.explode('Color')
    print(encoded_df.head())
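To see how one-hot encoding differs from integer label encoding, the sketch below applies both to the DataFrame from the question. The `codes` and `uniques` names are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue'],
                   'Value': [10, 20, 5, 15]})

# One-hot encoding: one indicator column per distinct colour
encoded_df = pd.get_dummies(df, columns=['Color'])
print(encoded_df.head())
print(list(encoded_df.columns))
# ['Value', 'Color_Blue', 'Color_Green', 'Color_Red']

# factorize returns a pair (integer codes, unique values), not indicator columns
codes, uniques = pd.factorize(df['Color'])
print(list(codes))     # [0, 1, 2, 1]
print(list(uniques))   # ['Red', 'Blue', 'Green']
```

Each row has exactly one active indicator column, whereas the integer codes impose an arbitrary ordering on the colours.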

You have a feature representing the number of website visits, which is heavily right-skewed (some users have thousands of visits while most have few). Which technique(s) might you try first to normalize this distribution?

Log transformation of the visits feature

Box-Cox transformation of the visits feature

One-hot encoding for the visits feature

Dropping all observations above the 95th percentile
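A log transformation of a count feature can be sketched as below. The visit counts are hypothetical, and `np.log1p` (log of 1 + x) is used on the assumption that zero visits may occur; Box-Cox, by contrast, requires strictly positive inputs.

```python
import numpy as np
import pandas as pd

# Hypothetical visit counts: most users have few visits, one has thousands
visits = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 40, 2500], name="visits")

# log1p compresses the long right tail and is defined at zero
log_visits = np.log1p(visits)

skew_before = visits.skew()
skew_after = log_visits.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transformed feature is still somewhat right-skewed in this tiny sample, but noticeably less so, and no observations were discarded.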

You have a massive dataset (hundreds of GB) stored in HDFS. Which statement(s) are true regarding using PySpark for EDA?

PySpark allows distributed DataFrame operations via Spark DataFrames.

PySpark includes methods like describe() for basic statistics.

PySpark cannot handle operations larger than a single machine's memory.

PySpark is only for streaming data, not EDA.

In which scenario(s) would using Azure Data Factory be appropriate during EDA?

You need to schedule and automate data ingestion workflows from on-premises to Azure.

You require a pipeline to move and transform data before EDA in Azure Synapse.

You want an interactive development environment for Python-based EDA.

You want to build a real-time model deployment pipeline.

Which of the following statements correctly match the tools/services to their primary usage?

Azure Synapse: Unified analytics platform that can handle SQL, Spark, and data warehouse functionalities.

MLflow: A library that helps track experiments, parameters, and models.

Azure Databricks: Collaborative Apache Spark-based analytics platform.

Azure Synapse: A tool used only for real-time streaming.

You are analyzing transaction amounts in a financial dataset. The box plot shows a very long upper whisker and some points far above the whisker. What does this typically indicate?

Right-skewed distribution with potential outliers

You have a continuous variable 'age,' and you decide to bin it into intervals: 0-18 (minor), 19-65 (adult), 66+ (senior). In which scenario(s) might this be useful?

You suspect non-linear relationships and want simpler categories.

You want to reduce model complexity by turning a numeric feature into nominal categories.

You want to perform a correlation matrix on continuous features.

You want to do principal component analysis (PCA) on numerical features.
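The three age intervals from the question can be produced with `pd.cut`. This is a minimal sketch: the sample ages, the lower edge of -1 (so that age 0 is included), and the upper edge of 120 are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical ages chosen to hit each boundary
ages = pd.Series([5, 18, 19, 42, 65, 66, 80], name="age")

# Right-inclusive bins: (-1, 18], (18, 65], (65, 120]
age_group = pd.cut(ages,
                   bins=[-1, 18, 65, 120],
                   labels=["minor", "adult", "senior"])

print(list(age_group))
# ['minor', 'minor', 'adult', 'adult', 'adult', 'senior', 'senior']
```

The result is a categorical column, which suits models that benefit from simpler categories but is no longer a continuous input for correlation matrices or PCA.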

You generated a correlation matrix for 5 numeric variables in a dataset. One pair of variables shows a correlation of 0.95. Which conclusion(s) might you draw?

These two variables are highly correlated.

Redundant features might cause multicollinearity in modeling.

This means they have a cause-and-effect relationship.

One of the variables should always be dropped immediately.
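Flagging highly correlated pairs for review, rather than dropping them automatically, might look like the sketch below. The data is synthetic, the column names are hypothetical, and the 0.9 cutoff is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Synthetic data: x_noisy is built from x, unrelated is independent (assumption)
rng = np.random.default_rng(42)
x = rng.normal(size=500)
df = pd.DataFrame({
    "x": x,
    "x_noisy": x + rng.normal(scale=0.3, size=500),
    "unrelated": rng.normal(size=500),
})

corr = df.corr()

# Keep only the upper triangle so each pair appears once, then filter by |r|
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
high = pairs[pairs.abs() > 0.9]
print(high)
```

A flagged pair signals possible redundancy and multicollinearity; whether to drop, combine, or keep both variables remains a modeling decision, and the correlation itself says nothing about causation.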

You have a new project: Data size is moderate (10 GB). You need scheduling for nightly ingestion. You want to track model versions, metrics, and parameters. You prefer an interactive environment for quick EDA. Which combination of tools/services might be most effective?

Use Azure Data Factory to schedule and orchestrate data ingestion.

Use MLflow to track model experiments.

Use Azure Databricks for interactive EDA with Spark if needed.

Store data in Azure Synapse, do EDA using PySpark notebooks.

You conduct a two-sample t-test comparing the means of two groups (Group A and Group B). The resulting p-value is 0.03, and you set your significance level (α) at 0.05. Which conclusions can be drawn?

The difference between Group A and Group B means is statistically significant at α=0.05.

There is about a 3% probability of observing these data (or more extreme) if there were truly no difference between groups.

The difference between Group A and Group B is not statistically significant.

The p-value tells you directly how large the effect size is.
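The distinction between statistical significance and effect size can be made concrete. This sketch assumes SciPy is available, uses synthetic groups, runs Welch's variant of the two-sample t-test, and computes Cohen's d separately; none of the numbers come from the quiz.

```python
import numpy as np
from scipy import stats

# Synthetic groups with a small true difference in means (assumption)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=100.0, scale=15.0, size=200)
group_b = rng.normal(loc=104.0, scale=15.0, size=200)

# Welch's t-test: does not assume equal variances
result = stats.ttest_ind(group_a, group_b, equal_var=False)

# Significance decision at the chosen alpha
alpha = 0.05
significant = result.pvalue < alpha

# Effect size (Cohen's d) is a separate calculation from the p-value
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd

print(f"p = {result.pvalue:.4f}, significant = {significant}, d = {cohens_d:.2f}")
```

With large samples even a small d can yield a p-value below α, which is why the two quantities are reported together.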

You have a contingency table of observed frequencies for two categorical variables. After performing a Chi-squared test, you obtain a p-value of 0.001. What can you conclude?

There is a statistically significant association between the two categorical variables.

Since the p-value is below typical significance thresholds (such as 0.05), you reject the null hypothesis of 'no association.'

There is no relationship between the two variables.

The test proves that one variable causes changes in the other.
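A Chi-squared test of independence on a contingency table can be sketched with SciPy, assuming it is installed. The 2 × 2 counts below are hypothetical; note that `chi2_contingency` applies Yates' continuity correction by default for tables with one degree of freedom.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows are one variable, columns the other
observed = np.array([[30, 10],
                     [10, 30]])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.2f}, p = {p_value:.5f}, dof = {dof}")
print(expected)  # counts expected if the two variables were independent

# Reject the null hypothesis of no association when p is below alpha
alpha = 0.05
reject_null = p_value < alpha
```

Rejecting the null hypothesis indicates an association between the variables; it does not establish which one, if either, causes the other.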

EDA Quiz 1

Authored by Vijay Agrawal

Computers

Professional Development

27 questions

MULTIPLE SELECT QUESTION

30 sec • 1 pt

You are ingesting data from multiple sources (CSV, JSON, and Parquet). Which of the following statements are correct?

CSV files cannot handle hierarchical data structures.

JSON files are human-readable and can handle nested objects.

Parquet files store data in a columnar format and allow efficient compression.

CSV files always load faster than Parquet files.
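The difference between nested JSON and flat CSV shows up when converting one to the other. In this sketch the records and field names are hypothetical; `pd.json_normalize` flattens nested objects into dotted column names so the data fits a tabular layout.

```python
import pandas as pd

# Hypothetical nested records, as they might arrive from a JSON source
records = [
    {"id": 1, "customer": {"name": "Ana", "address": {"city": "Lisbon"}}},
    {"id": 2, "customer": {"name": "Ben", "address": {"city": "Oslo"}}},
]

# Nested keys become flat columns such as customer.address.city
flat = pd.json_normalize(records)
print(list(flat.columns))

# Only after flattening can the data be written as CSV text
csv_text = flat.to_csv(index=False)
print(csv_text)
```

The hierarchy survives only as a naming convention in the column headers, which is the sense in which CSV cannot represent nested structures.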

MULTIPLE SELECT QUESTION

30 sec • 1 pt

You have a 50 GB CSV file you need to ingest and analyze. Which approach(es) could be most practical?

Use Python's built-in open() and read line by line in a loop.

Use pandas.read_csv() without specifying chunksize.

Use chunking in pandas (chunksize parameter) to process data in smaller batches.

Convert the CSV into a more compressed format like Parquet and use a distributed environment (e.g., PySpark).
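Chunked processing with the `chunksize` parameter can be sketched on a small in-memory stand-in for the large file. The `amount` column and the chunk size of 25 are assumptions for illustration; a real 50 GB file would use a path and a much larger chunk size.

```python
import io
import pandas as pd

# Small stand-in for a large CSV: one column holding the values 1 to 100
csv_text = "amount\n" + "\n".join(str(i) for i in range(1, 101))

total = 0
rows = 0
# With chunksize set, read_csv yields DataFrames of at most 25 rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=25):
    total += chunk["amount"].sum()
    rows += len(chunk)

mean_amount = total / rows
print(rows, total, mean_amount)   # 100 5050 50.5
```

Only running aggregates are kept between iterations, so memory use depends on the chunk size rather than on the file size.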

MULTIPLE SELECT QUESTION

30 sec • 1 pt

Which of the following summary statistics are typically useful during EDA?

Mean, median, mode

Standard deviation, variance

Range, interquartile range

Confusion matrix

MULTIPLE SELECT QUESTION

30 sec • 1 pt

When examining a dataset's distribution, which are signs that the data might be right-skewed?

The mean is greater than the median.

The mean is less than the median.

A histogram shows a longer tail to the right.

The mode is greater than the median.
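The relationship between mean, median, and mode in a right-skewed sample can be checked on a small example. The values below are hypothetical, chosen to have a cluster near 50 and one large value at 200.

```python
import pandas as pd

# Hypothetical sample: most values between 40 and 70, one far to the right
values = pd.Series([40, 45, 48, 50, 50, 52, 55, 60, 70, 200])

mean = values.mean()         # pulled toward the long right tail
median = values.median()     # stays near the bulk of the data
mode = values.mode().iloc[0]
skewness = values.skew()     # positive for a right-skewed sample

print(mean, median, mode)    # 67.0 51.0 50
print(skewness > 0)          # True
```

Here the ordering is mode < median < mean, the typical pattern when the tail extends to the right.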

MULTIPLE SELECT QUESTION

30 sec • 1 pt

You have a dataset of housing prices. You notice that some houses are extremely expensive compared to the rest. Which methods can help you identify outliers effectively?

Box plot to detect points more than 1.5 × IQR beyond the quartiles.

Z-scores to find values far from the mean.

Dropping all data above the median price.

Calculating the difference between max and min values.
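The IQR rule and z-scores can be compared on the same data. The prices below are hypothetical (in thousands), and the z-score cutoff of 2.5 is an assumption: with only ten points, a single extreme value inflates the standard deviation enough that its z-score stays below the more common cutoff of 3.

```python
import pandas as pd

# Hypothetical house prices in thousands, with one extreme value
prices = pd.Series([210, 225, 240, 250, 255, 260, 275, 290, 310, 1500],
                   name="price_k")

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]

# Z-scores: distance from the mean in units of the sample standard deviation
z = (prices - prices.mean()) / prices.std()
z_outliers = prices[z.abs() > 2.5]

print(list(iqr_outliers))   # [1500]
print(list(z_outliers))     # [1500]
```

Because the quartiles are barely affected by the extreme value, the IQR rule tends to be the more robust of the two on small or heavily skewed samples.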

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

What does a high variance in a dataset indicate?

The data points are spread out from the mean

The data points are closely clustered around the mean

The dataset has a low level of variability

The dataset has many missing values
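Two samples with the same mean but different spread make the meaning of variance concrete. The values are hypothetical; pandas computes the sample variance (dividing by n - 1) by default.

```python
import pandas as pd

# Same mean (50), very different spread
tight = pd.Series([48, 49, 50, 51, 52])
wide = pd.Series([10, 30, 50, 70, 90])

print(tight.mean(), wide.mean())   # 50.0 50.0
print(tight.var(), wide.var())     # 2.5 1000.0
print(tight.std() < wide.std())    # True
```

The mean alone cannot distinguish the two samples, which is why a measure of spread is reported alongside it during EDA.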

MULTIPLE SELECT QUESTION

30 sec • 1 pt

Which transformations are commonly used to stabilize variance or reduce skew in data?

Box-Cox transformation

Log transformation

Min-Max scaling

One-hot encoding
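Why rescaling differs from a skew-reducing transformation can be shown by measuring skewness before and after each. The sample is hypothetical; Min-Max scaling is written out by hand, and `np.log1p` stands in for the log transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample
values = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 40, 2500], dtype=float)

# Min-Max scaling: a linear map onto [0, 1], so the shape is unchanged
min_max = (values - values.min()) / (values.max() - values.min())

# Log transformation: a nonlinear map that compresses the right tail
logged = np.log1p(values)

print(f"original skew: {values.skew():.3f}")
print(f"min-max skew:  {min_max.skew():.3f}")   # same as the original
print(f"log skew:      {logged.skew():.3f}")    # smaller than the original
```

Skewness is invariant under linear rescaling, so Min-Max scaling changes the range of a feature but not the shape of its distribution.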
