Python 2

Information Technology (IT) • 10th Grade • Practice Problem • Easy

Created by Deema Alluhayb

275 Slides • 26 Questions

2

Open Ended

Why is it important to avoid data leakage when applying imputation strategies in data cleaning?
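
One way to picture the leakage concern is the minimal sketch below. It assumes a hypothetical `income` column and a simple positional train/test split (both are illustrative, not part of the lesson material): the fill value is learned from the training rows only and then reused on the test rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column name and values are illustrative only.
df = pd.DataFrame({"income": [30.0, 40.0, np.nan, 50.0, 1000.0, np.nan]})

# Split first, then learn the imputation statistic from the training rows only.
train = df.iloc[:4].copy()
test = df.iloc[4:].copy()

train_median = train["income"].median()  # computed from 30, 40, 50 only

train["income"] = train["income"].fillna(train_median)
test["income"] = test["income"].fillna(train_median)  # reuse, do not refit

# For contrast: a median computed on all rows also "sees" the test value 1000.
leaky_median = df["income"].median()
```

Comparing `train_median` with `leaky_median` shows how a statistic computed before splitting can carry information from the test rows into training.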

4

Multiple Choice

Which of the following is a possible effect of mean or median imputation on data distribution?

1

It increases the variance of the data

2

It shrinks the variance of the data

3

It creates more missing values

4

It has no effect on the data
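
To explore this question empirically, the following sketch compares the variance of a small, made-up series before and after mean imputation. The values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up series with two missing entries.
s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 9.0])

var_before = s.var()         # NaN values are skipped by default
filled = s.fillna(s.mean())  # every gap receives the same central value
var_after = filled.var()

print(var_before, var_after)
```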

6

Multiple Choice

Which imputation method is generally preferred when the data distribution is highly skewed, such as income?

1

Mean imputation

2

Median imputation

3

Mode imputation

4

Random imputation

8

Open Ended

Describe a scenario where using a special category like 'Unknown' for categorical imputation might be more informative than using the mode.
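
As a starting point for this scenario, the sketch below fills the same categorical column in two ways. The `employment` column and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
df = pd.DataFrame(
    {"employment": ["salaried", np.nan, "salaried", "self-employed", np.nan]}
)

# Option A: fill with the most frequent category (the mode).
mode_value = df["employment"].mode()[0]
df["employment_mode"] = df["employment"].fillna(mode_value)

# Option B: keep missingness visible as its own category.
df["employment_unknown"] = df["employment"].fillna("Unknown")

print(df)
```

With option B, a downstream model can still tell which rows were originally missing; with option A, that information is gone.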

10

Multiple Select

Which of the following are objectives when handling duplicate records in a dataset?

1

Distinguish entity vs event duplicates

2

Design deduplication rules with business logic

3

Aggregate events to entity-level features

4

Normalize schema types
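
For reference, aggregating event rows to entity-level features might look like the sketch below. The `customer_id` and `amount` columns, and the feature names, are assumptions made for the example.

```python
import pandas as pd

# Hypothetical event log: one row per order event.
events = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2],
        "amount": [20.0, 20.0, 35.0, 50.0],
    }
)

# Collapse events to one row per customer (the entity).
entity = (
    events.groupby("customer_id")
    .agg(n_orders=("amount", "size"), total_amount=("amount", "sum"))
    .reset_index()
)

print(entity)
```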

12

Fill in the Blank

Fill in the blank: Multiple logs of the same event, such as retries or re-sends, are called ___ duplicates.

14

Open Ended

What is the difference between full-row and subset-based duplicate detection, and when would you use each approach?
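
The two detection modes can be tried side by side with `DataFrame.duplicated`. The sketch below uses a hypothetical event table; treating `user_id` plus `event` as the key is an assumption made for the example, not a rule from the lesson.

```python
import pandas as pd

# Hypothetical event table; the last two rows differ only in timestamp.
df = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2],
        "event": ["click", "click", "view", "view"],
        "ts": ["10:00", "10:00", "11:00", "11:05"],
    }
)

# Full-row detection: every column must match an earlier row.
full_row_dupes = df.duplicated()

# Subset-based detection: only the chosen key columns must match.
subset_dupes = df.duplicated(subset=["user_id", "event"])

# Dropping by subset keeps the first row seen for each key.
deduped = df.drop_duplicates(subset=["user_id", "event"], keep="first")
```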

16

Multiple Select

Which of the following are important steps to ensure data type consistency in a pandas DataFrame?

1

Convert numeric strings and currencies safely

2

Handle mixed-type columns

3

Parse datetimes and extract features

4

Aggregate duplicate events

18

Multiple Choice

Which of the following is NOT a problem caused by incorrect data types in machine learning pipelines?

1

Aggregations on string numerics concatenate instead of summing

2

Mixed-type columns become 'object', hiding structure

3

Broken datetime parsing leads to NaT and missing features

4

All numeric columns are automatically converted to float

20

Open Ended

Explain how using pd.to_numeric(errors='coerce') helps in handling messy numeric data entries. What are the benefits and what should you inspect after conversion?
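
A small experiment can help when answering this. The sketch below converts a made-up column of messy entries and then pulls out the original strings that failed to parse, which is one possible thing to inspect after conversion.

```python
import pandas as pd

# Made-up messy entries: valid numbers mixed with unparseable text.
raw = pd.Series(["12", "7.5", "n/a", "1e3", "abc"])

# Unparseable entries become NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce")

# Inspect which original values were lost in the conversion.
failed = raw[numeric.isna() & raw.notna()]

print(numeric)
print(failed)
```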

22

Multiple Select

Which steps are recommended for cleaning currency data before converting to numeric types? Select all that apply.

1

Remove non-numeric characters except dot and minus

2

Convert with pd.to_numeric(errors='coerce')

3

Treat failed parses as missing

4

Ignore validation of result ranges
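
The cleaning steps named in the options can be sketched as follows. The sample strings and the 0 to 10,000 plausibility range are assumptions; the regular expression also assumes that the dot is the decimal separator and that commas are thousands separators.

```python
import pandas as pd

# Hypothetical raw currency strings.
prices = pd.Series(["$1,200.50", "-$35", "USD 80", "free", None])

# Keep only digits, the dot, and the minus sign.
cleaned = prices.str.replace(r"[^0-9.\-]", "", regex=True)

# Failed parses (including the now-empty "free") become NaN, i.e. missing.
amounts = pd.to_numeric(cleaned, errors="coerce")

# Validate the result against an assumed plausible range.
in_range = amounts.between(0, 10_000)

print(amounts)
print(in_range)
```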

24

Fill in the Blank

Fill in the blank: A single datetime64 dtype enables time features; with errors='coerce', invalid entries become ___, and the format can be inferred automatically.
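
To check an answer, the behaviour can be observed directly with `pd.to_datetime`. The input strings below are made up; the second one is an impossible calendar date.

```python
import pandas as pd

# Made-up entries: one valid date, one impossible date, one non-date string.
raw = pd.Series(["2024-01-15", "2024-02-30", "not a date"])

# With errors="coerce", entries that cannot be parsed do not raise.
dates = pd.to_datetime(raw, errors="coerce")

# A datetime64 column exposes time features through the .dt accessor.
months = dates.dt.month

print(dates)
print(dates.isna())
```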

26

Open Ended

Why is it important for feature-generation functions to be identical in both offline training and online inference when extracting features from dates?
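
One common way to keep the two paths identical is to route both through a single function. The helper below, `date_features`, is hypothetical and only sketches the idea; the chosen features are assumptions.

```python
import pandas as pd


def date_features(ts: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: the single definition of the date features."""
    ts = pd.to_datetime(ts, errors="coerce")
    return pd.DataFrame(
        {
            "month": ts.dt.month,
            "dayofweek": ts.dt.dayofweek,  # Monday=0 ... Sunday=6
            "is_weekend": ts.dt.dayofweek >= 5,
        }
    )


# Offline training batch and a single online request use the same code path.
offline = date_features(pd.Series(["2024-03-02", "2024-03-04"]))
online = date_features(pd.Series(["2024-03-02"]))
```

Because both calls share one implementation, the same input date yields the same feature row in training and at inference.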

28

Multiple Choice

Which method can be used to mitigate the impact of extreme outliers in a dataset?

1

IQR and z-score methods

2

np.log1p transformation

3

Clipping values

4

All of the above

30

Multiple Choice

Which of the following best describes the difference between statistical outliers and domain outliers?

1

Statistical outliers are always errors, while domain outliers are always valid.

2

Statistical outliers are values far from the bulk of the distribution, while domain outliers are determined by domain knowledge.

3

Domain outliers are detected using IQR, while statistical outliers are detected using Z-score.

4

Statistical outliers are always valid, while domain outliers are always errors.

32

Multiple Select

Which of the following are true about the IQR (Interquartile Range) method for outlier detection?

1

It is based on quartiles and not affected by extreme values.

2

It uses the mean and standard deviation to detect outliers.

3

It is the standard method for box plots.

4

It is less effective for symmetric distributions.
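
For reference, the IQR rule as it is commonly applied (fences at 1.5 × IQR beyond the quartiles, the usual box-plot convention) can be sketched on a made-up series:

```python
import pandas as pd

# Made-up values with one obvious extreme.
s = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond each quartile.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(lower, upper, outliers.tolist())
```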

34

Fill in the Blank

Fill in the blank: The Z-score method for outlier detection assumes approximate ___ in the data.
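
A minimal sketch of the z-score rule follows, assuming the common (but not universal) cutoff of 3 standard deviations and a made-up series with one injected extreme value:

```python
import numpy as np
import pandas as pd

# Made-up data: values 45..55 repeated ten times, plus one extreme value.
s = pd.Series(np.append(np.tile(np.arange(45, 56), 10), 120.0))

# Standardize using the sample mean and standard deviation.
z = (s - s.mean()) / s.std()

# Flag points more than 3 standard deviations from the mean (assumed cutoff).
flagged = s[z.abs() > 3]
print(flagged.tolist())
```

Note that the mean and standard deviation used here are themselves pulled by the extreme value, which is one reason the method depends on the shape of the data.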

36

Multiple Choice

After handling outliers using removal, capping, or transformation, what is a potential risk when removing outliers that are actually valid extreme cases?

1

The dataset becomes more balanced.

2

Important signals may be lost.

3

The mean and median become equal.

4

The model becomes less sensitive to rare events.

39

Open Ended

Describe a scenario where capping could disadvantage a specific group and how you would detect that risk.

41

Open Ended

Explain why it is important to inspect missingness, duplicates, and outliers by subgroup rather than only globally when cleaning data.
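
The contrast between a global and a per-subgroup view can be seen in a small sketch. The `region` and `income` columns and their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: all missing incomes fall in one region.
df = pd.DataFrame(
    {
        "region": ["north"] * 4 + ["south"] * 4,
        "income": [30.0, 32.0, 31.0, 29.0, np.nan, np.nan, 28.0, np.nan],
    }
)

# Global missingness rate: a single number for the whole column.
global_missing = df["income"].isna().mean()

# Missingness rate per subgroup.
by_group = df["income"].isna().groupby(df["region"]).mean()

print(global_missing)
print(by_group)
```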

42

Multiple Choice

Which of the following is NOT a key pattern to check during data cleaning as per the checklist?

1

Missing Values

2

Imputation

3

Model Training

4

Duplicates

46

Fill in the Blank

Fill in the blank: Outliers may represent specific subgroups, leading to potential ___ in data analysis.

47

Multiple Choice

Which of the following strategies is used for handling outliers in real datasets?

1

Capping and flooring

2

Ignoring outliers

3

Random sampling

4

Feature scaling

49

Multiple Choice

When is it reasonable to use capping for outliers instead of removing them?

1

When tails are real but shouldn't dominate

2

When the goal is robust prediction

3

When sample size is limited

4

All of the above
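
For comparison with removal, capping can be sketched with `Series.clip`. Using the 1.5 × IQR fences as cap values is an assumption made for this example; percentile-based or domain-defined limits are equally possible.

```python
import pandas as pd

# Made-up values with one extreme observation.
s = pd.Series([5, 7, 8, 9, 10, 11, 12, 300])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Capping keeps every row but pulls extreme values back to the limits.
capped = s.clip(lower=lower, upper=upper)

print(capped.tolist())
```

Unlike removal, the row count is unchanged, which matters when the sample is small.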

300

Multiple Choice

Which of the following is a potential effect of mean or median imputation on data distribution?

1

It increases variance

2

It shrinks variance

3

It creates artificial spikes

4

It strengthens all correlations

301

Open Ended

Reflecting on today's lesson about imputation strategies, what is one key takeaway you have about why imputation is never neutral?
