What is the purpose of reward signals?
To provide feedback to the agent about the quality of its actions in reinforcement learning.
True or false: Exploration is important in reinforcement learning.
TRUE
Exploration helps agents discover new strategies and improve performance.
Fill in the blank: Q-learning is a type of _______ learning.
value-based
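A minimal sketch of the tabular Q-learning update may make "value-based" concrete: the agent learns action values and acts on them, rather than learning a policy directly. The function name `q_learning_update`, the dictionary-backed table `Q`, and the default `alpha` and `gamma` values are illustrative assumptions, not part of the original card.

```python
from collections import defaultdict

def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-learning update and return the new Q-value.

    Q is a mapping from (state, action) to an estimated value.
    alpha (step size) and gamma (discount factor) are assumed defaults.
    """
    # Off-policy target: bootstrap from the best action in the next state.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]

Q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
new_value = q_learning_update(Q, "s0", "right", 1.0, "s1", ["left", "right"])
print(new_value)
```

The `max` over next-state actions is what separates Q-learning from on-policy methods such as SARSA.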
What does policy refer to in reinforcement learning?
A mapping from states to actions (deterministic) or to probability distributions over actions (stochastic) that determines how the agent behaves.
Define Markov Decision Process (MDP).
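A mathematical framework for modeling sequential decision-making where outcomes are partly random and partly under the control of a decision maker, specified by states, actions, transition probabilities, rewards, and usually a discount factor.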
What is the discount factor in reinforcement learning?
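A value between 0 and 1, usually written γ, that sets how much future rewards count relative to immediate ones: values near 0 make the agent short-sighted, values near 1 make it weigh long-term reward heavily.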
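As a small illustration, the discounted return of a reward sequence can be computed as below. The function name `discounted_return` and the sample rewards are assumptions made for this sketch.

```python
def discounted_return(rewards, gamma):
    """Return the sum over t of gamma**t * rewards[t]."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 0.5: 1 + 0.5 * 1 + 0.25 * 1 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

With `gamma=0.0` only the first reward counts, and with `gamma=1.0` the result is the plain sum.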
True or false: Temporal Difference Learning combines ideas from dynamic programming and Monte Carlo methods.
TRUE
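Like Monte Carlo methods, it learns from sampled experience without a model; like dynamic programming, it bootstraps, updating estimates from other learned estimates without waiting for a final outcome.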
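The bootstrapping idea can be sketched with a TD(0) state-value update. The name `td0_update`, the dictionary `V`, and the step-size and discount defaults are assumptions for illustration only.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9, done=False):
    """Apply one TD(0) update to V in place and return the TD error."""
    # Bootstrap: use the current estimate of the next state instead of
    # waiting for the episode's final return.
    bootstrap = 0.0 if done else V.get(next_state, 0.0)
    td_error = reward + gamma * bootstrap - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return td_error

V = {"B": 1.0}
error = td0_update(V, "A", reward=0.0, next_state="B", alpha=0.5, gamma=0.9)
print(error, V["A"])
```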
What is the role of value functions?
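To estimate how good it is for an agent to be in a given state (or to take a given action in that state), measured as the expected return when following a particular policy.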
Fill in the blank: SARSA stands for _______.
State-Action-Reward-State-Action
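The acronym names the five quantities used in one update: the state, the action, the reward, the next state, and the next action actually chosen. A minimal sketch, assuming a dictionary-backed table and the hypothetical name `sarsa_update`:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, done=False):
    """Apply one SARSA update to Q in place and return the new Q(s, a)."""
    # On-policy target: use the action the agent will actually take next,
    # not the greedy maximum over next actions.
    next_q = 0.0 if done else Q.get((s_next, a_next), 0.0)
    target = r + gamma * next_q
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (target - old)
    return Q[(s, a)]

Q = {("s1", "left"): 2.0, ("s1", "right"): 10.0}
# The agent's policy picked "left" in s1, so the update ignores the
# larger value stored for "right".
print(sarsa_update(Q, "s0", "right", 1.0, "s1", "left", alpha=0.5, gamma=0.5))
```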
Define exploitation in the context of reinforcement learning.
Choosing the action with the highest estimated value based on current knowledge, rather than trying something new.
What is a policy gradient method?
A reinforcement learning method that optimizes a parameterized policy directly, by adjusting its parameters along the gradient of expected return.
True or false: Deep Q-Networks use neural networks to approximate Q-values.
TRUE
Approximating Q-values with a network lets the method handle high-dimensional state spaces, such as raw images, where a lookup table is impractical.
What does experience replay do?
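Stores past transitions in a buffer and samples them at random for training, which reuses data more efficiently and breaks the correlation between consecutive samples, improving stability.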
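A minimal replay buffer might look like the following. The class name `ReplayBuffer`, its method names, and the transition layout are assumptions for this sketch, not a specific library's API.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.push(i, 0, 0.0, i + 1, False)
print(len(buf))  # 3: the two oldest transitions were evicted
batch = buf.sample(2)
```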
Fill in the blank: Actor-Critic methods involve both an _______ and a critic.
actor
Define exploration-exploitation tradeoff.
The balance between exploring new actions and exploiting known rewarding actions.
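One common way to manage this balance is epsilon-greedy action selection, sketched below. The function name `epsilon_greedy` and its parameters are illustrative assumptions; other schemes exist.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest current estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # always greedy: 1
```

Setting `epsilon` to 0 gives pure exploitation and setting it to 1 gives pure exploration; in practice it is often decayed over time.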
What is the Bellman equation?
A recursive equation that relates the value of a state to the values of its successor states.
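Written out, the recursion takes the following form. The notation is an assumption, since the card does not fix one: \pi is the policy, P the transition probabilities, R the reward, and \gamma the discount factor. The first line is the Bellman expectation equation for a fixed policy and the second is the Bellman optimality equation.

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{\pi}(s') \right]

V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{*}(s') \right]
```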
True or false: Monte Carlo methods require complete episodes to update value estimates.
TRUE
These methods average returns from complete episodes for learning.
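A first-visit Monte Carlo value estimate can be sketched as follows. The function name `mc_first_visit` and the episode format, a list of (state, reward received after leaving that state) pairs, are assumptions for this example.

```python
from collections import defaultdict

def mc_first_visit(episodes, gamma=1.0):
    """Estimate state values by averaging first-visit returns."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        # Walk the finished episode backwards, accumulating the return.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            # Later overwrites come from earlier time steps, so the value
            # left at the end belongs to the first visit.
            first_visit_return[state] = g
        for state, g_first in first_visit_return.items():
            returns[state].append(g_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

episodes = [[("A", 1.0), ("B", 2.0)], [("A", 5.0)]]
print(mc_first_visit(episodes))  # A: (3 + 5) / 2 = 4.0, B: 2.0
```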
What is reward shaping?
Adding extra reward signals to the environment's reward function to guide the agent toward good behavior, making learning easier and faster.
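One well-known form is potential-based shaping, which adds the discounted change in a potential function over states to the original reward. The sketch below assumes a hypothetical one-dimensional task with the goal at position 10; the names `shaped_reward` and `potential` are illustrative.

```python
def potential(state, goal=10):
    """Assumed potential: higher (less negative) when closer to the goal."""
    return -abs(goal - state)

def shaped_reward(reward, state, next_state, potential_fn, gamma=0.99):
    """Add the potential-based shaping term to the environment reward."""
    return reward + gamma * potential_fn(next_state) - potential_fn(state)

# Moving from position 3 to 4 earns a bonus even though the
# environment's own reward is 0.
print(shaped_reward(0.0, 3, 4, potential, gamma=1.0))  # 0 + (-6) - (-7) = 1.0
```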
Fill in the blank: Dyna-Q integrates direct reinforcement learning with _______ and planning.
model learning
Define policy iteration.
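An algorithm that alternates between policy evaluation (computing the value function of the current policy) and policy improvement (making the policy greedy with respect to that value function) until the policy stops changing.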
What is the value iteration algorithm?
An algorithm that repeatedly applies the Bellman optimality update to the value estimates until they converge, then derives the optimal policy by acting greedily with respect to them.
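A compact sketch of value iteration on a tiny, made-up MDP follows. The model format `P[s][a] = [(prob, next_state, reward), ...]`, the function name `value_iteration`, and the two-state example are all assumptions for illustration.

```python
def value_iteration(P, gamma=0.9, tol=1e-8):
    """Return (V, policy) for a model P[s][a] = [(prob, next_state, reward)]."""
    V = {s: 0.0 for s in P}

    def backup(s, a):
        # Expected one-step return of taking action a in state s.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        delta = 0.0
        for s in P:
            if not P[s]:  # terminal state: no actions, value stays 0
                continue
            best = max(backup(s, a) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: backup(s, a)) for s in P if P[s]}
    return V, policy

mdp = {
    "start": {
        "safe": [(1.0, "end", 1.0)],
        "risky": [(0.5, "end", 4.0), (0.5, "end", 0.0)],
    },
    "end": {},
}
V, policy = value_iteration(mdp)
print(V["start"], policy["start"])  # expected value 2.0 via "risky"
```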
True or false: Hierarchical reinforcement learning breaks tasks into smaller subtasks.
TRUE
Higher-level policies choose subgoals or subtasks, and lower-level policies learn the actions that accomplish them, which simplifies long, complex problems.
What is transfer learning in reinforcement learning?
Applying knowledge gained in one task to improve learning in a different but related task.
Fill in the blank: Multi-agent reinforcement learning involves _______ agents.
multiple