
Exam Questions
Authored by Wayground Content

16 questions
1.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
Intuitively, why are individual decision trees brittle and sensitive to individual feature values? What do random forests do that alleviates this limitation? What mechanism can be used in a random forest to come to a final decision?
Individual decision trees are robust and not sensitive to feature values, while random forests use a single tree for decision making.
Random forests use majority voting for classification and averaging for regression, which helps in making more stable decisions.
Individual decision trees are complex and require many features, while random forests use only one feature to make decisions.
Random forests rely on a single decision tree to make predictions, which is less sensitive to feature changes.
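A minimal NumPy sketch (with made-up per-tree predictions) of the majority-voting mechanism described in the correct answer; for regression, the vote would be replaced by a mean over the trees' outputs:

```python
import numpy as np

def forest_predict(tree_preds):
    """Combine per-tree class predictions (one row per tree) by majority vote."""
    # np.bincount counts the votes for each class label in a column;
    # argmax picks the most common label for that sample
    return np.array([np.bincount(col).argmax() for col in tree_preds.T])

# Three hypothetical trees voting on four samples
tree_preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
])
print(forest_predict(tree_preds))  # -> [0 1 1 0]
```

Note how the first and third samples get a stable prediction even though one tree disagrees; a single brittle tree would have flipped its answer.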
2.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
What common activation function that we discussed in class has a derivative that is the step function?
ReLU
Sigmoid
Tanh
Softmax
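For reference, ReLU(x) = max(0, x), so its derivative is 0 for x < 0 and 1 for x > 0, i.e. the step function. A small NumPy sketch:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x) elementwise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: the step function (0 for x < 0, 1 for x > 0)."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 3.0])
print(relu(x))       # -> [0.  0.  0.5 3. ]
print(relu_grad(x))  # -> [0. 0. 1. 1.]
```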
3.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
What are the hyperparameters of the perceptron algorithm? From your experience, which of those hyperparameters have an effect on its accuracy? Explain!
The number of epochs and learning rate
The number of hidden layers and activation function
The batch size and dropout rate
The optimizer and regularization strength
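A minimal sketch of the classic perceptron on toy data, making the two hyperparameters from the correct answer explicit as the `lr` and `epochs` arguments (the data and defaults here are illustrative assumptions, not from the course):

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=10):
    """Classic perceptron for labels in {-1, +1}; lr and epochs are the hyperparameters."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # misclassified (or on the boundary)
                w += lr * yi * xi       # push the hyperplane toward the correct side
                b += lr * yi
    return w, b

# Linearly separable toy data
X = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]])
y = np.array([-1, 1, 1, 1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))  # -> [-1.  1.  1.  1.]
```

On separable data the number of epochs matters most (too few and the loop stops before converging); for the vanilla update the learning rate only rescales w and b.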
4.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
What is the issue in performing PCA separately on the training set and the test set? In code this might look like the pseudo-code shown below, where we have chosen to map the data into a ten-dimensional space:

X_train_pca = PCA(n_components=10).fit_transform(X_train)
X_test_pca = PCA(n_components=10).fit_transform(X_test)

# The fit_transform method of a PCA object computes the
# principal components and transforms the given feature matrix
# into the space of the principal components.
# Assume that X_train is the matrix containing the training set
# and X_test is the matrix containing the test set.

What is the correct way of doing this?
Training PCA separately on the training and test set creates incompatible feature representations.
Training PCA on the entire dataset is OK since it does not make use of the labels, so it creates no leak of label information.
PCA should only be applied to the test set to avoid data leakage.
PCA can be performed on the training set only, and the test set should be ignored.
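A NumPy-only sketch of the correct procedure (sklearn's `PCA` follows the same `fit`-then-`transform` pattern): fit on the training set only, then project both sets with the same mean and components, so train and test live in one consistent space. The toy matrices are assumptions standing in for `X_train` and `X_test`:

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA on the TRAINING data only: store its mean and top principal axes."""
    mean = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal components
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project any data into the space learned from the training set."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))
X_test = rng.normal(size=(30, 20))

# Correct: one fit on X_train, then transform BOTH sets with that same fit
mean, components = pca_fit(X_train, n_components=10)
X_train_pca = pca_transform(X_train, mean, components)
X_test_pca = pca_transform(X_test, mean, components)
print(X_train_pca.shape, X_test_pca.shape)  # -> (100, 10) (30, 10)
```

Fitting a second PCA on `X_test` would yield different axes (even with different signs), so the two transformed matrices would not be comparable.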
5.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
What do SVMs aim for in terms of the separating hyperplane?
The steepest slope of the hyperplane
The maximum margin from the closest points
The minimum distance from all training points
The average slope of the hyperplane
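A small sketch of the quantity an SVM maximizes: the distance from the separating hyperplane w·x + b = 0 to the closest training point. The hyperplane and points below are made-up illustrations, not the output of an actual SVM solver:

```python
import numpy as np

# Hypothetical separating hyperplane and a few training points
w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[1.0, 1.0], [4.0, 4.0], [0.0, 2.0], [5.0, 2.0]])

# Geometric distance of each point to the hyperplane: |w·x + b| / ||w||
dists = np.abs(X @ w + b) / np.linalg.norm(w)
print(dists.min())  # the margin is the distance to the CLOSEST point
```

An SVM chooses w and b to make this minimum distance as large as possible; the closest points become the support vectors.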
6.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
My lab is collaborating with a biologist who studies the microbiome of both living and dead animals and humans. They sampled the microbiomes of dead mice and humans, and we helped them develop a random forest-based approach to predict the time since death based on the microbiome composition. This has potential forensic applications since standard forensic techniques work for a limited range of time.

To generate the data, each body was sampled at regular intervals and its microbiome composition determined by appropriate experimental protocols.

In order to evaluate the ability of a classifier to predict time-since-death, they performed leave-one-cadaver-out cross-validation, where a classifier was trained on all measurements performed on all but one cadaver. The classifier was evaluated on all the measurements performed on the left-out cadaver. This is iterated until obtaining predictions on all cadavers.

Explain the value of this evaluation procedure over a procedure that pools all the samples from all the cadavers, i.e. mixes them all up and then performs cross-validation over the pooled samples.
Leave-one-cadaver-out is better because it allows the system to work on cadavers that it has not seen during training.
Pooling samples provides a larger dataset, which improves the classifier's performance.
Leave-one-cadaver-out is less computationally intensive than pooling samples.
Pooling samples allows for a more generalized model that can predict across different cadavers.
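A minimal sketch of the grouped splitting scheme from the correct answer (sklearn's `LeaveOneGroupOut` implements the same idea): all samples sharing a cadaver ID are held out together, so no cadaver appears in both train and test. The group IDs are an illustrative assumption:

```python
import numpy as np

def leave_one_group_out(groups):
    """Yield (train_idx, test_idx) pairs, holding out one group (cadaver) at a time."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test = np.where(groups == g)[0]   # every sample from the held-out cadaver
        train = np.where(groups != g)[0]  # all samples from the other cadavers
        yield train, test

# Hypothetical cadaver IDs: samples from the same body share an ID
groups = [0, 0, 1, 1, 1, 2]
for train, test in leave_one_group_out(groups):
    print("train:", train, "test:", test)
```

Pooled cross-validation would instead put samples from the same cadaver on both sides of the split, letting the classifier exploit cadaver-specific microbiome signatures and overstating how well it generalizes to a new body.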
7.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
Why can this function not be used as an activation function in a multilayer perceptron neural network (hint: think about how we have to incorporate activation functions when calculating gradient descent)?
It is not differentiable at x = 0, and its derivative at all points where x ≠ 0 is equal to 0
It is too complex to compute during backpropagation
It produces outputs that are not bounded between 0 and 1
It requires too much computational power to evaluate
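Assuming the function in question is a step function (as the answer choices suggest), a short numerical check of why it breaks gradient descent: away from the jump, its derivative is exactly zero, so backpropagation through such an activation passes no gradient signal to earlier layers:

```python
import numpy as np

def step(x):
    """Heaviside-style step function: 0 for x <= 0, 1 for x > 0."""
    return (x > 0).astype(float)

# Central-difference derivative away from the jump at x = 0
x = np.array([-1.0, 0.5, 2.0])
h = 1e-6
grad = (step(x + h) - step(x - h)) / (2 * h)
print(grad)  # -> [0. 0. 0.] — zero gradient everywhere except at the jump
```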