What is the primary computational limitation of policy iteration?

It requires complete policy evaluation at every iteration

It cannot handle stochastic environments

It does not guarantee optimality

Why is dynamic programming considered computationally expensive?

It requires full knowledge of transition probabilities and reward function

It relies on random exploration

It cannot incorporate reward signals

It depends heavily on function approximation

What guarantees convergence of the Bellman expectation operator?

Discount factor satisfies ( \gamma \in [0,1) ) ensuring contraction

Which condition ensures stable convergence in learning algorithms?

( \sum_k \alpha_k = \infty ) and ( \sum_k \alpha_k^2 < \infty )

Constant learning rate ( \alpha_k = 1 )
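
The step-size conditions usually meant in this setting are the Robbins-Monro conditions, ( \sum_k \alpha_k = \infty ) and ( \sum_k \alpha_k^2 < \infty ). The following is a minimal sketch, not part of the worksheet, that assumes the schedule alpha_k = 1/k and a hypothetical helper `running_estimate`. It contrasts a decaying step size, which averages out noise, with the constant choice alpha_k = 1, which only tracks the most recent sample.

```python
import random

def running_estimate(samples, step_size):
    """Stochastic-approximation update: x <- x + alpha_k * (sample - x)."""
    x = 0.0
    for k, sample in enumerate(samples, start=1):
        x += step_size(k) * (sample - x)
    return x

rng = random.Random(0)  # fixed seed so the run is repeatable
# Noisy observations of an (assumed) true mean of 5.0.
samples = [5.0 + rng.gauss(0.0, 1.0) for _ in range(10_000)]

decaying = running_estimate(samples, lambda k: 1.0 / k)  # satisfies both conditions
constant = running_estimate(samples, lambda k: 1.0)      # sum of squares diverges

print(decaying)  # close to the true mean 5.0
print(constant)  # tracks the last noisy sample instead of the mean
```

With alpha_k = 1/k the update reduces to the running sample mean, which is why the first estimate settles near 5.0.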

A problem can be represented as a Markov Decision Process (MDP) when:

The next state depends only on the current state and action

The environment is strictly deterministic

In a Grid World environment, the primary objective of the agent is to:

Reach a goal state while maximizing cumulative reward

Maximize classification accuracy

When the discount factor ( \gamma = 0.9 ), the agent:

Takes future rewards into account

Considers only immediate rewards

Focuses entirely on long-term rewards without discounting
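
To make the role of the discount factor concrete, here is a small sketch that computes a discounted return for a made-up reward sequence. The function name `discounted_return` and the reward values are illustrative assumptions, not part of the worksheet.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], accumulated from the end."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

rewards = [1.0, 1.0, 1.0]  # hypothetical rewards at t = 0, 1, 2

print(discounted_return(rewards, 0.9))  # 1 + 0.9 + 0.81, about 2.71
print(discounted_return(rewards, 0.0))  # 1.0, only the immediate reward counts
```

With gamma = 0.9 later rewards still contribute, just with geometrically shrinking weight; with gamma = 0 everything after the first reward is ignored.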

The expression (P(s' | s, a)) denotes:

The probability of transitioning to the next state (s') given the current state (s) and action (a)

A policy in reinforcement learning specifies:

The action to take in each state

The value assigned to each state

The probability of transitioning to the next state

The policy improvement step involves:

Selecting actions that maximize the value function

Ignoring transition probabilities

Policy iteration is said to converge when:

The policy becomes stable and stops changing

The reward reaches its maximum value

The discount factor becomes zero

An important element in designing a reinforcement learning system is:

Defining states, actions, and rewards

Dynamic Programming (DP) methods require:

A complete model of the environment (transition probabilities and rewards)

Dynamic Programming techniques:

Do not require exploration and rely on a known model

Require exploration of the environment

What condition guarantees the convergence of the Bellman expectation operator (L^\pi)?

Discount factor (\gamma \in [0,1)) ensuring contraction and convergence

(R(s,a) = 0) for all state-action pairs
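
The contraction argument behind this answer can be sketched as follows. The notation is an assumption: the operator for a fixed policy \pi is written L^\pi here, rewards are taken to be bounded, and \|\cdot\|_\infty is the maximum over states.

```latex
(L^\pi V)(s) = \sum_{a} \pi(a \mid s) \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big]

\big| (L^\pi U)(s) - (L^\pi V)(s) \big|
  = \gamma \, \Big| \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \big( U(s') - V(s') \big) \Big|
  \le \gamma \, \| U - V \|_\infty

\| L^\pi U - L^\pi V \|_\infty \le \gamma \, \| U - V \|_\infty
```

The reward terms cancel in the difference, and the remaining weights \pi(a \mid s) P(s' \mid s, a) sum to one, which gives the bound. For \gamma < 1 the operator is therefore a contraction, so repeated application converges to a unique fixed point; for \gamma = 1 the bound no longer shrinks the distance.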

When the discount factor ( \gamma = 0 ), the agent:

Considers only immediate rewards

If an agent has (V(s_1) = 10) and (V(s_2) = 15), which statement is correct?

(s_2) provides a higher long-term reward

Policy evaluation is used to compute:

The value function for a given policy

Policy iteration consists of:

Policy evaluation followed by policy improvement
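
The evaluation-then-improvement loop can be sketched on a tiny example. Everything below is an illustrative assumption rather than the worksheet's own material: a two-state, two-action MDP with made-up transitions and rewards, gamma = 0.9, and hypothetical helper names.

```python
GAMMA = 0.9

# P[s][a] is a list of (probability, next_state, reward) triples.
# Action 0 means "stay", action 1 means "move" (hypothetical toy MDP).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}

def q_value(V, s, a):
    """Expected one-step reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def evaluate(policy, tol=1e-10):
    """Policy evaluation: sweep until the value function stops changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = q_value(V, s, policy[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

def improve(V):
    """Policy improvement: act greedily with respect to V."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}

def policy_iteration():
    policy = {s: 0 for s in P}  # start with "stay" everywhere
    while True:
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:  # stable policy: stop
            return policy, V
        policy = new_policy

policy, V = policy_iteration()
print(policy)  # {0: 1, 1: 0}
```

In this toy problem the loop stops after the policy stays unchanged across one improvement step, which is the stopping rule the questions above refer to.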

In a reinforcement learning setting with partial observability, which statement best describes the agent–environment interaction?

The agent receives an observation (O_t) correlated with the true state (S_t)

The agent directly observes the true state without uncertainty

The agent receives only rewards without any state-related information

In a stochastic Grid World, if an action intended to move right instead results in moving up or down with certain probabilities, this indicates:

The action outcomes are probabilistic with multiple possible next states

The system depends on past history

The reward remains fixed for a given state-action pair
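
A stochastic transition of this kind can be written down explicitly. The sketch below assumes a 4 x 4 grid, a slip model where the intended move succeeds with probability 0.8 and each perpendicular move occurs with probability 0.1, and walls that keep the agent in place; all of these numbers and names are illustrative assumptions.

```python
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# Assumed slip model: perpendicular directions for each intended action.
PERPENDICULAR = {
    "up": ("left", "right"),
    "down": ("left", "right"),
    "left": ("up", "down"),
    "right": ("up", "down"),
}

def transition_distribution(state, action, size=4):
    """Return {next_state: probability} for one action in a size x size grid."""
    outcomes = [(action, 0.8)] + [(side, 0.1) for side in PERPENDICULAR[action]]
    dist = {}
    for move, prob in outcomes:
        dr, dc = MOVES[move]
        row = min(max(state[0] + dr, 0), size - 1)  # clamp to the grid (walls)
        col = min(max(state[1] + dc, 0), size - 1)
        dist[(row, col)] = dist.get((row, col), 0.0) + prob
    return dist

print(transition_distribution((1, 1), "right"))
# {(1, 2): 0.8, (0, 1): 0.1, (2, 1): 0.1}
```

One action from one state yields several possible next states, yet the distribution depends only on the current state and action, so the Markov property still holds.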

Which statement correctly defines the Markov property?

The next state depends only on the current state and is independent of past states

The next state depends on the full history of states

Which of the following best describes an optimal policy (π*)?

It maximizes expected cumulative discounted reward over time

It maximizes only immediate reward

It is always unique in all scenarios

Policy iteration primarily becomes computationally expensive because:

It requires full policy evaluation at each iteration

It cannot handle stochastic environments

It does not guarantee optimality

Why is dynamic programming considered computationally expensive?

It requires full knowledge of transition probabilities and reward functions

It requires only partial knowledge of the environment

It is only used for deterministic environments

It does not require any knowledge of rewards

Which condition ensures stable convergence in stochastic approximation methods?

( \sum_k \alpha_k = \infty ) and ( \sum_k \alpha_k^2 < \infty )

Constant learning rate ( \alpha_k = 1 )

Reinforcement Learning Concepts Worksheet

Authored by Reena Anbhazhagan

Engineering

University



38 questions


MULTIPLE CHOICE QUESTION

30 sec • 1 pt

Which statement best characterizes the interaction between the agent and environment when the environment is partially observable?

The agent has direct and complete access to the true state at every time step

The agent receives observations (O_t) that are correlated with the true state (S_t)

The agent receives no feedback from the environment

The agent only obtains reward signals without any state-related information

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

Which condition ensures that a process satisfies the Markov property?

The next state depends on the entire history of past states

The next state depends only on the action taken, ignoring the current state

The next state depends only on the reward signal

The next state depends only on the current state and not on past states

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

Which statement correctly describes an optimal policy?

It maximizes the expected cumulative discounted reward over time

It focuses only on maximizing immediate rewards

It minimizes variability in rewards instead of maximizing returns

It is always unique for all decision problems

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

In a stochastic grid world, if an intended action results in other movements with certain probabilities, what does this indicate?

The policy is deterministic

The system depends on past states

Action outcomes are probabilistic, leading to multiple possible next states

Rewards remain constant for every state-action pair

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

Which tuple correctly defines an MDP?

(S, A, P, R, γ)

(S, A, R, γ)

(S, P, R, γ)

(A, P, R, γ)
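
As a rough illustration of what the five components hold, the tuple can be written as a small data structure. The class name `MDP`, the dictionary layout, and the example values are assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MDP:
    states: tuple   # S: the set of states
    actions: tuple  # A: the set of actions
    P: dict         # P[(s, a)] -> {next_state: probability}
    R: dict         # R[(s, a)] -> expected immediate reward
    gamma: float    # discount factor

    def validate(self):
        """Check that gamma is in [0, 1] and every P[(s, a)] is a distribution."""
        if not 0.0 <= self.gamma <= 1.0:
            return False
        for s in self.states:
            for a in self.actions:
                probs = self.P[(s, a)].values()
                if min(probs) < 0.0 or abs(sum(probs) - 1.0) > 1e-9:
                    return False
        return True

# Hypothetical two-state example with a single action.
toy = MDP(
    states=("s0", "s1"),
    actions=("go",),
    P={("s0", "go"): {"s1": 1.0}, ("s1", "go"): {"s1": 1.0}},
    R={("s0", "go"): 1.0, ("s1", "go"): 0.0},
    gamma=0.9,
)
print(toy.validate())  # True
```

Dropping any one field shows why the shorter tuples fall short: without P there are no dynamics, and without A there is nothing for the agent to choose.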

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

Which of the following is the correct tuple for a Markov Decision Process (MDP)?

((S, A, R))

((S, A, P, R, \gamma))

((S, P, R))

((A, P, R, \gamma))

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

What ensures convergence in iterative policy evaluation?

The discount factor satisfies ( \gamma < 1 )

The policy is stochastic

Rewards are always zero

The state space is infinite
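
A one-state example shows why the discount factor matters for convergence. The sketch assumes a single state that loops back to itself with reward 1 under the evaluated policy; the function name `evaluate_self_loop` is hypothetical.

```python
def evaluate_self_loop(reward, gamma, sweeps):
    """Iterate V <- reward + gamma * V for a single self-looping state."""
    v = 0.0
    history = []
    for _ in range(sweeps):
        v = reward + gamma * v
        history.append(v)
    return history

converging = evaluate_self_loop(reward=1.0, gamma=0.9, sweeps=200)
print(converging[-1])  # approaches 1 / (1 - 0.9) = 10

diverging = evaluate_self_loop(reward=1.0, gamma=1.0, sweeps=200)
print(diverging[-1])  # 200.0: with gamma = 1 the iterates keep growing
```

For gamma = 0.9 the gap to the fixed point shrinks by a factor of 0.9 per sweep, while for gamma = 1 each sweep simply adds another reward.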
