Reinforcement Learning Quiz

Quiz • Other • Professional Development • Hard
Sai Ganesh

42 questions
1.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
In the update rule Q_(t+1)(a) ← Q_t(a) + α(R_t − Q_t(a)), which value of α would we prefer for estimating Q values in a non-stationary bandit problem? (A small simulation sketch follows the options.)
α=1/(n_a+1)
α=0.1
α=n_a+1
α=1/(n_a+1)^2
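A minimal sketch of the comparison this question is getting at, assuming an illustrative drifting-reward model that is not part of the quiz: a constant step size is contrasted with a sample-average step size while the true action value slowly changes.

import random

# Illustrative setup: a single arm whose true mean reward drifts over time.
random.seed(0)
true_mean = 0.0
q_const = 0.0      # estimate updated with constant alpha = 0.1
q_avg = 0.0        # estimate updated with sample-average alpha = 1/n
n = 0

for _ in range(10000):
    true_mean += random.gauss(0, 0.01)     # slow random-walk drift
    reward = random.gauss(true_mean, 1.0)  # noisy reward sample
    q_const += 0.1 * (reward - q_const)    # constant step size
    n += 1
    q_avg += (reward - q_avg) / n          # sample-average step size

print("true mean            :", round(true_mean, 3))
print("constant-alpha (0.1) :", round(q_const, 3))
print("sample-average (1/n) :", round(q_avg, 3))

The constant-α estimate weights recent rewards exponentially more heavily, whereas the 1/n estimate weights all past rewards equally.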
2.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
The “credit assignment problem” is the issue of correctly attributing accumulated rewards to the action(s) that produced them. Which of the following is/are reasons for the credit assignment problem in RL? (Select all that apply)
Reward for an action may only be observed after many time steps.
An agent may get the same reward for multiple actions.
The agent discounts rewards that occurred in previous time steps.
Rewards can be positive or negative.
3.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
Assertion 1: In stationary bandit problems, we can achieve asymptotically correct behaviour by selecting exploratory actions with a fixed non-zero probability, without decaying exploration. Assertion 2: In non-stationary bandit problems, it is important that we decay the probability of exploration to zero over time in order to achieve asymptotically correct behaviour. (A small simulation sketch follows the options.)
Assertion 1 and Assertion 2 are both True.
Assertion 1 is True and Assertion 2 is False.
Assertion 1 is False and Assertion 2 is True.
Assertion 1 and Assertion 2 are both False.
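A rough sketch for experimenting with the two exploration schedules being compared; the two-armed bandit, the 1/t decay, and the horizon below are illustrative assumptions.

import random

def run_eps_greedy(eps_schedule, steps=50000, means=(0.0, 1.0)):
    # epsilon-greedy on a stationary two-armed bandit with sample-average updates
    random.seed(1)
    q = [0.0, 0.0]
    n = [0, 0]
    for t in range(1, steps + 1):
        if random.random() < eps_schedule(t):
            a = random.randrange(2)      # explore uniformly
        else:
            a = q.index(max(q))          # exploit current estimate
        reward = random.gauss(means[a], 1.0)
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]
    return [round(v, 2) for v in q]

print("fixed eps = 0.1  :", run_eps_greedy(lambda t: 0.1))
print("decaying eps = 1/t:", run_eps_greedy(lambda t: 1.0 / t))

Swapping in a drifting means vector turns this stationary setup into the non-stationary case that Assertion 2 refers to.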
4.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
We are trying different algorithms to find the optimal arm of a multi-armed bandit. The expected payoff of each algorithm is some function of time t (with time starting from 0). Given that the optimal expected payoff is 1, which of the following functions corresponds to the algorithm with the least regret? (Hint: plot the functions; a numeric sketch follows the options.)
tanh(t/5)
1−2^−t
t/20 if t < 20, and 1 after that
Same regret for all the above functions.
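One way to follow the hint numerically is to accumulate the per-step regret 1 − f(t) against the optimal payoff of 1; the 100-step horizon below is an illustrative choice.

import math

def cumulative_regret(payoff, horizon=100):
    # regret = sum over t of (optimal payoff 1 minus achieved expected payoff)
    return sum(1.0 - payoff(t) for t in range(horizon))

curves = {
    "tanh(t/5)"   : lambda t: math.tanh(t / 5),
    "1 - 2^-t"    : lambda t: 1.0 - 2.0 ** (-t),
    "t/20, then 1": lambda t: t / 20 if t < 20 else 1.0,
}

for name, f in curves.items():
    print(f"{name:12s} regret = {cumulative_regret(f):.2f}")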
5.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
Which of the following is/are correct and valid reasons to consider sampling actions from a softmax distribution instead of using an ε-greedy approach?
Softmax exploration makes the probability of picking an action proportional to the action-value estimates. By doing so, it avoids wasting time exploring obviously ‘bad’ actions.
We do not need to worry about decaying exploration slowly like we do in the ε-greedy case. Softmax exploration gives us asymptotic correctness even for a sharp decrease in temperature.
It helps us differentiate between actions with action-value estimates (Q values) that are very close to the action with maximum Q value.
6.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
Consider a standard multi-armed bandit problem. The probability of picking an action under the softmax policy is given by: Pr(a_t = a) = e^(Q_t(a)/β) / Σ_b e^(Q_t(b)/β). Now, assume the following action-value estimates: Q_t(a_0) = 1, Q_t(a_1) = 0.2, Q_t(a_2) = 0.5, Q_t(a_3) = -1, Q_t(a_4) = 0.02, and Q_t(a_5) = -2. What is the probability that action 2 is selected? (Use β = 0.1; a worked sketch follows the options.)
0
0.13
0.232
0.143
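The softmax distribution can be evaluated directly; the sketch below plugs in the Q values and β exactly as the formula in the question is written.

import math

beta = 0.1
q = {"a0": 1.0, "a1": 0.2, "a2": 0.5, "a3": -1.0, "a4": 0.02, "a5": -2.0}

# Pr(a_t = a) = e^(Q_t(a)/beta) / sum_b e^(Q_t(b)/beta)
weights = {a: math.exp(v / beta) for a, v in q.items()}
total = sum(weights.values())
probs = {a: w / total for a, w in weights.items()}

for a, p in probs.items():
    print(a, round(p, 4))

Dividing by a small β sharpens the distribution toward the highest-valued action; a larger β spreads the probabilities out.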
7.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
What are the properties of a solution method that is PAC Optimal?
Both (b) and (c)
Both (a) and (b)
It always reaches optimal behaviour faster than an algorithm that is simply asymptotically correct.
It is guaranteed to find the correct solution.
It minimizes sample complexity to make the PAC guarantee.