CSF 7.3 Assessment - Data Cleaning

Assessment

Flashcard

Computers

9th - 12th Grade

Hard

Created by

Quizizz Content

8 questions

1.

FLASHCARD QUESTION

Front

Which wombat do you most relate to at this moment?

Back

No single correct answer (opinion check-in).

2.

FLASHCARD QUESTION

Front

What are the two main steps in data cleaning?

Back

Explore the data to identify the unclean data, and then clean the data using a programming language.

Answer explanation

The first step in data cleaning is to explore the dataset and identify any errors, inconsistencies, or missing values that need to be cleaned. This involves analyzing the data, visualizing it, and checking for outliers or inconsistencies. Once the unclean data has been identified, the second step is to clean the data using a programming language. This may involve removing duplicates, filling in missing values, standardizing data formats, and correcting errors or inconsistencies. After the data has been cleaned, it can be used to make a model or perform other types of analysis.
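
The two steps described above can be sketched in a few lines of Python. This is only an illustration, not part of the original assessment: the records, the column names, and the `explore` and `clean` helpers are all hypothetical, and a real project would more likely use a data library.

```python
from collections import Counter

# Hypothetical records; the column names and values are made up for illustration.
rows = [
    {"name": "Ava", "state": "VA", "age": "17"},
    {"name": "Ben", "state": "va", "age": ""},
    {"name": "Ava", "state": "VA", "age": "17"},  # exact duplicate of the first row
    {"name": "Cy", "state": "Virginia", "age": "16"},
]


def explore(rows):
    """Step 1: report problems without changing the data."""
    seen = Counter(tuple(sorted(r.items())) for r in rows)
    return {
        "duplicates": sum(n - 1 for n in seen.values()),
        "blank_cells": sum(1 for r in rows for v in r.values() if v == ""),
        "state_spellings": sorted({r["state"] for r in rows}),
    }


def clean(rows):
    """Step 2: standardize formats, mark missing values, drop exact duplicates."""
    state_map = {"va": "VA", "virginia": "VA"}  # assumed mapping for this example
    cleaned, seen = [], set()
    for r in rows:
        fixed = dict(r)
        fixed["state"] = state_map.get(r["state"].strip().lower(), r["state"])
        fixed["age"] = int(r["age"]) if r["age"] else None  # None marks a missing age
        key = tuple(sorted(fixed.items()))
        if key not in seen:
            seen.add(key)
            cleaned.append(fixed)
    return cleaned


print(explore(rows))
print(clean(rows))
```

Keeping the exploring step separate from the cleaning step makes it possible to review what is wrong before anything is changed.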

3.

FLASHCARD QUESTION

Front

Can you spot the unclean data? Options: There are misspellings, There are duplicate entries, There are blank data cells, There is data that does not make sense for the category, There are data values in different formats

Back

There are misspellings

Answer explanation

Cells G5 and G11 contain misspellings.

4.

FLASHCARD QUESTION

Front

Why is clean data necessary for a predictive model?

Back

Dirty data can skew the results of the predictive model, leading to inaccurate predictions.

Answer explanation

Predictive models are built to make accurate predictions based on patterns in the data. If the data used to build the model is dirty (that is, it contains errors, inconsistencies, outliers, or missing values), the model may be trained on inaccurate patterns, which can lead to incorrect predictions. Clean data helps ensure that the model is trained on accurate patterns, which generally leads to better predictive performance.
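
As a rough illustration of how one dirty value can skew a prediction, the sketch below uses a deliberately naive "model" that predicts the next score as the average of past scores. The scores and the `predict_next` helper are hypothetical and chosen only to make the effect visible.

```python
from statistics import mean

# Hypothetical test scores; in the dirty list one score was mistyped as 850 instead of 85.
scores_dirty = [78, 82, 85, 850, 90]
scores_clean = [78, 82, 85, 85, 90]


def predict_next(scores):
    """Naive model: predict the next score as the average of past scores."""
    return mean(scores)


print(predict_next(scores_dirty))  # 237, an impossible score on a 100-point test
print(predict_next(scores_clean))  # 84
```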

5.

FLASHCARD QUESTION

Front

Which of the following is most effective when a row of data needs to be cleaned? Options: Drop the unclean row from the table. Fix each affected row of unclean data by hand. Use a programming language to fix multiple rows of data at a time. Keep the data unclean, since unclean data has no effect on a predictive model.

Back

Use a programming language to fix multiple rows of data at a time.

Answer explanation

Cleaning a large dataset by hand is time-consuming and error-prone, and dropping rows may discard valuable information. Using a programming language is therefore usually more efficient and accurate. Languages like Python and R have widely used libraries for data cleaning (for example, pandas for Python and the tidyverse packages for R) that can automate much of the work of identifying and correcting errors or inconsistencies. This approach saves time, reduces the risk of human error, and lets data scientists clean and analyze large datasets more effectively.
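
A minimal sketch of fixing many rows at once: one rule is written a single time and then applied to every value in a column. The price strings and the `parse_price` helper are hypothetical, and the example uses only plain Python rather than any particular data library.

```python
def parse_price(text):
    """Convert strings like '$1,200.50' to a float; return None for blank cells."""
    stripped = text.strip().replace("$", "").replace(",", "")
    return float(stripped) if stripped else None


# Hypothetical column with the same kind of value written in different formats.
raw_prices = ["$1,200", "1200", " 1,200.00 ", "", "$950.5"]

# The same rule is applied to every row in one pass.
prices = [parse_price(p) for p in raw_prices]
print(prices)  # [1200.0, 1200.0, 1200.0, None, 950.5]
```

The same function works whether the column has five rows or five million, which is where the advantage over fixing rows by hand comes from.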

6.

FLASHCARD QUESTION

Front

Which type of data contributes to bad models? Options: Unclean data only, Biased data only, Both unclean data and biased data, Neither unclean data nor biased data

Back

Both unclean data and biased data

Answer explanation

When building a predictive model, it is important to use high-quality data that is both accurate and unbiased. Unclean data can lead to inaccurate or inconsistent results, while biased data can result in models that are skewed or incomplete. For example, if a dataset contains missing values or outliers, it can affect the accuracy of the model predictions. Similarly, if the dataset is biased towards a particular group or demographic, the model may not be representative of the broader population and may produce inaccurate or unfair results. Therefore, data scientists must carefully evaluate their datasets for both cleanliness and bias to ensure that their models are reliable and accurate.
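
One simple way to start checking a dataset for bias is to look at how each group is represented in it. The sketch below is an assumption-laden illustration: the sample, the `group_shares` and `underrepresented` helpers, and the 25% threshold are all hypothetical, and a low share is only a prompt for a closer look, not proof of bias.

```python
from collections import Counter


def group_shares(records, key):
    """Return the fraction of records that fall into each group."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def underrepresented(shares, threshold):
    """List groups whose share of the data is below the chosen threshold."""
    return sorted(group for group, share in shares.items() if share < threshold)


# Hypothetical survey sample: 8 twelfth graders and only 2 ninth graders.
sample = [{"grade": 12}] * 8 + [{"grade": 9}] * 2

shares = group_shares(sample, "grade")
print(shares)                          # {12: 0.8, 9: 0.2}
print(underrepresented(shares, 0.25))  # [9]
```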

7.

FLASHCARD QUESTION

Front

Which of the following is NOT an example of data that a programming language could easily identify as invalid? Options: A U.S. phone number with 8 digits, The annual tuition of a Virginia college listed as "-12,345.67", A stock market ticker (required to contain only letters in the US) with an @ symbol, A misspelling in a person's name

Back

A misspelling in a person’s name

Answer explanation

A person can recognize a misspelling in someone's name as incorrect, but a program cannot easily flag it. Names have many legitimate spellings and do not follow a fixed set of rules or patterns, so there is nothing reliable to check them against. The other examples (a U.S. phone number with 8 digits, a negative value for annual tuition, and a stock market ticker with an @ symbol) can all be easily identified as invalid by a programming language, because each type of data has an established rule or format.
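
The rule-based checks mentioned above might look like the following sketch. The function names are hypothetical, and the rules are simplified assumptions: a U.S. phone number is treated as exactly 10 digits with punctuation ignored, tuition must not be negative, and a ticker must contain only letters.

```python
import re


def valid_us_phone(text):
    """Assumed rule: exactly 10 digits once punctuation and spaces are removed."""
    digits = re.sub(r"\D", "", text)
    return len(digits) == 10


def valid_tuition(text):
    """Assumed rule: tuition must parse as a number and must not be negative."""
    try:
        return float(text.replace(",", "")) >= 0
    except ValueError:
        return False


def valid_ticker(text):
    """Rule from the question: a US ticker contains only letters."""
    return re.fullmatch(r"[A-Za-z]+", text) is not None


print(valid_us_phone("5551-2345"))   # False: only 8 digits
print(valid_tuition("-12,345.67"))   # False: negative amount
print(valid_ticker("AAP@"))          # False: contains an @ symbol
```

No comparable rule exists for names: a hypothetical entry such as "Jonh Smith" is made of ordinary letters and would pass any format check like the ones above.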

8.

FLASHCARD QUESTION

Front

How confident do you feel about this topic?

Back

Very confident, Mostly confident, Somewhat confident, Not confident at all