Reinforcement Learning Quiz

Assessment • Quiz • Professional Development • Hard

Created by Sai Ganesh

42 questions

1.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

In the update rule Q_{t+1}(a) ← Q_t(a) + α(R_t − Q_t(a)), select the value of α that we would prefer for estimating Q values in a non-stationary bandit problem. (A small numerical sketch follows the options below.)

α = 1/(n_a + 1)

α = 0.1

α = n_a + 1

α = 1/(n_a + 1)^2
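
For intuition, here is a minimal Python sketch contrasting the sample-average step size 1/(n_a + 1) with a constant α = 0.1 on a single arm whose true mean drifts; the random-walk drift, noise model, and horizon are assumptions made purely for illustration.

```python
import random

def run(step_size, steps=10000, seed=0):
    """Estimate the value of one non-stationary arm with an incremental update."""
    rng = random.Random(seed)
    q, n, true_mean = 0.0, 0, 0.0
    for _ in range(steps):
        true_mean += 0.01 * rng.gauss(0, 1)   # assumed drift: the arm's mean follows a random walk
        r = true_mean + rng.gauss(0, 1)       # noisy reward sample
        n += 1
        q += step_size(n) * (r - q)           # Q_{t+1}(a) <- Q_t(a) + alpha * (R_t - Q_t(a))
    return q, true_mean

q_avg, m = run(lambda n: 1.0 / n)             # sample-average step size: weights all history equally
q_const, _ = run(lambda n: 0.1)               # constant alpha = 0.1: recency-weighted, tracks the drift
print(f"true mean ~{m:.2f}, sample-average estimate {q_avg:.2f}, constant-alpha estimate {q_const:.2f}")
```

Because both runs use the same seed, they see an identical reward stream, so any difference in the final estimates comes from the step size alone.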

2.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

The “credit assignment problem” is the issue of correctly attributing accumulated rewards to the action(s) that produced them. Which of the following is/are reasons for the credit assignment problem in RL? (Select all that apply)

Reward for an action may only be observed after many time steps.

An agent may get the same reward for multiple actions.

The agent discounts rewards that occurred in previous time steps.

Rewards can be positive or negative.

3.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

Assertion 1: In stationary bandit problems, we can achieve asymptotically correct behaviour by selecting exploratory actions with a fixed non-zero probability, without decaying exploration. Assertion 2: In non-stationary bandit problems, it is important that we decay the probability of exploration to zero over time in order to achieve asymptotically correct behaviour.

Assertion 1 and Assertion 2 are both True.

Assertion 1 is True and Assertion 2 is False.

Assertion 1 is False and Assertion 2 is True.

Assertion 1 and Assertion 2 are both False.

4.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

We are trying different algorithms to find the optimal arm for a multi-arm bandit. The expected payoff for each algorithm corresponds to some function of time t (with time starting from 0). Given that the optimal expected payoff is 1, which of the following functions corresponds to the algorithm with the least regret? (Hint: Plot the functions; a numerical comparison sketch follows the options below.)

tanh(t/5)

1−2^−t

t/20 if t < 20, and 1 after that

Same regret for all the above functions.
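
As the hint suggests, one way to compare the curves is to plot or numerically sum the gap between each payoff function and the optimal payoff of 1. Below is a quick sketch, assuming regret is accumulated over integer time steps up to an arbitrary horizon.

```python
import math

def cumulative_regret(payoff, horizon=100):
    """Sum of (optimal payoff 1 minus expected payoff) over t = 0 .. horizon-1."""
    return sum(1.0 - payoff(t) for t in range(horizon))

curves = {
    "tanh(t/5)":              lambda t: math.tanh(t / 5),
    "1 - 2^-t":               lambda t: 1 - 2 ** (-t),
    "t/20 if t < 20, else 1": lambda t: t / 20 if t < 20 else 1.0,
}
for name, f in curves.items():
    print(f"{name:25s} cumulative regret = {cumulative_regret(f):.2f}")
```

The horizon of 100 steps is arbitrary; since all three curves reach (or approach) the optimal payoff of 1, the ranking of their cumulative regrets stabilises well before that point.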

5.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

Which of the following is/are correct and valid reasons to consider sampling actions from a softmax distribution instead of using an ε-greedy approach?

Softmax exploration makes the probability of picking an action proportional to the action-value estimates. By doing so, it avoids wasting time exploring obviously ‘bad’ actions.

We do not need to worry about decaying exploration slowly like we do in the ε-greedy case. Softmax exploration gives us asymptotic correctness even for a sharp decrease in temperature.

It helps us differentiate between actions whose action-value estimates (Q values) are very close to that of the action with the maximum Q value.

6.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

Consider a standard multi-arm bandit problem. The probability of picking an action under the softmax policy is given by: Pr(a_t = a) = e^(Q_t(a)/β) / Σ_b e^(Q_t(b)/β). Now, assume the following action-value estimates: Q_t(a_0) = 1, Q_t(a_1) = 0.2, Q_t(a_2) = 0.5, Q_t(a_3) = -1, Q_t(a_4) = 0.02, and Q_t(a_5) = -2. What is the probability that action 2 is selected? (use β = 0.1; a worked computation follows the options below)

0

0.13

0.232

0.143
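
A direct way to check this is to plug the given estimates into the softmax formula above; a minimal sketch:

```python
import math

beta = 0.1                                        # temperature given in the question
q = [1, 0.2, 0.5, -1, 0.02, -2]                   # Q_t(a_0) .. Q_t(a_5)

weights = [math.exp(v / beta) for v in q]         # e^(Q_t(a)/beta) for every action
total = sum(weights)
probs = [w / total for w in weights]              # Pr(a_t = a) after normalisation
print(f"Pr(a_t = a_2) = {probs[2]:.4f}")
```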

7.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

What are the properties of a solution method that is PAC Optimal?

Both (b) and (c)

Both (a) and (b)

It always reaches optimal behaviour faster than an algorithm that is simply asymptotically correct.

It is guaranteed to find the correct solution.

It minimizes sample complexity to make the PAC guarantee.
