Paper: Download here
Authors: Matteo Turchetta et al.
Published in: NeurIPS 2020

Takeaway message

Curriculum learning can be used to perform safe RL with the help of a teacher. The teacher only intervenes and resets the student (agent) to a safe state whenever it starts behaving dangerously. The intervention strategy changes gradually as the performance of a set of students improves, which leads to a safe learning curriculum. The set of interventions in this work is fixed and pre-specified; the work focuses on determining the optimal sequence of interventions.

Motivations

The learning process proposed in this work is akin to teaching a child to ride a bike. The possible interventions that keep the child safe include training wheels, catching the child when they are about to fall, and elbow and knee guards. Devising a learning curriculum based on the child's progress leads to safe learning.

Proposed Solution

In the proposed algorithm, CISR (Curriculum Induction for Safe Reinforcement learning), the teacher plays a curriculum in every training round and evaluates the student's performance. It then tries an improved curriculum on a new student in the next round. The environment is modeled as a constrained MDP (CMDP) whose constraints bound the number of safety violations and the number of teacher interventions. The student and the teacher both learn online: the student learns to perform well on the task, while the teacher learns to improve the intervention curriculum.
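To make the intervention-induced CMDP concrete, here is a minimal Python sketch. This is not the authors' implementation: `base_env.step`, `base_env.set_state`, `is_unsafe`, and `safe_reset` are hypothetical stand-ins for the paper's pre-specified intervention set.

```python
class InterventionCMDP:
    """Sketch of an intervention-induced CMDP: when a pre-specified
    trigger fires, the teacher moves the student back to a safe state
    and one unit of constraint cost is incurred. The CMDP's constraint
    bounds the expected number of such interventions."""

    def __init__(self, base_env, is_unsafe, safe_reset, budget):
        self.base_env = base_env      # assumed to expose step() and set_state()
        self.is_unsafe = is_unsafe    # state -> bool: intervention trigger
        self.safe_reset = safe_reset  # state -> safe state to resume from
        self.budget = budget          # bound on the number of interventions
        self.interventions = 0

    def step(self, action):
        state, reward, done = self.base_env.step(action)
        cost = 0
        if self.is_unsafe(state):
            # Teacher intervention: reset to safety instead of failing.
            self.interventions += 1
            cost = 1
            state = self.safe_reset(state)
            self.base_env.set_state(state)
            done = done or self.interventions >= self.budget
        return state, reward, cost, done
```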

A high-level overview of the algorithm is as follows (a code sketch follows the list):

  1. The teacher adaptively constructs a sequence of intervention-induced CMDPs for a new student.
  2. The student interacts with the environment for a few episodes (or steps). In each episode, the student acts in a CMDP adaptively generated by the teacher and updates its policy, transferring knowledge across episodes. The teacher computes features summarizing the student's performance over the past episodes and generates the CMDP for the next episode based on these features.
  3. Based on the performance of the students taught so far, the teacher updates its decision rule for generating the CMDP sequence in the next round. A separate CMDP is used for evaluating the students' policies to create features and rewards for the teacher.
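A rough sketch of this outer loop in Python is below. All interfaces (`teacher.choose`, `teacher.update`, `induce_cmdp`, `evaluate`) are assumptions for illustration, not the API of the released codebase.

```python
def cisr_outer_loop(teacher, make_student, induce_cmdp, evaluate,
                    num_rounds, curriculum_len, episodes_per_unit=10):
    """Hypothetical sketch of CISR's teacher/student training loop."""
    for _ in range(num_rounds):
        student = make_student()             # step 1: fresh student each round
        features = None                      # no performance history yet
        for _ in range(curriculum_len):
            mode = teacher.choose(features)  # pick the next intervention mode
            cmdp = induce_cmdp(mode)         # intervention-induced CMDP (step 2)
            student.train(cmdp, episodes=episodes_per_unit)
            features = evaluate(student)     # features from the separate evaluation CMDP
        teacher.update(features)             # step 3: improve the curriculum policy
    return teacher
```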

Evaluation of the solution

The performance of CISR is compared against fixed intervention strategies: hard, soft, and no resets on a grid world, and narrow and wide interventions on the lunar lander domain. A hard reset sends the agent back to the start position, which limits exploration and causes the agent to fail to learn to reach the goal. A soft reset sends the agent back to the previous state, which leads to performance plateaus. CISR gets the best of both worlds by discovering a curriculum that uses soft resets in the beginning, so that the agent learns to reach the goal, and eventually switches to hard resets for better generalization. CISR also outperforms bandit-based curriculum learning since it learns across students, which results in faster training.
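For illustration, the two reset strategies could be written as `safe_reset` hooks for the earlier CMDP sketch, assuming the environment tracks the visited trajectory and exposes a start state (both assumptions, not taken from the paper's code):

```python
def hard_reset(trajectory, start_state):
    # Hard reset: send the agent back to the start position.
    # Safe, but exploration near the danger zone stays limited.
    return start_state

def soft_reset(trajectory, start_state):
    # Soft reset: roll back to the state visited just before the
    # unsafe one, so the agent keeps exploring around the hazard.
    return trajectory[-2]
```

A CISR-style curriculum would plug in `soft_reset` early in training and switch to `hard_reset` later.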

In the lunar lander domain, the narrow intervention helps the agent learn to reach the goal but causes its performance to plateau, while the wide intervention aids exploration and reaching optimal performance. CISR first uses the narrow intervention to speed up learning and subsequently switches to the wide one.

Analysis of the problem, idea, and evaluation

Safe exploration is essential for training RL agents in the real world. Achieving it using curriculum learning is promising since it can be combined with many other approaches, e.g., a baseline policy. The only requirement is that the teacher has prior knowledge of all the modes of failure. The results presented in the paper look convincing. However, hand-designing intervention strategies can be tedious for more complex tasks such as urban driving.

Contributions

CISR provides a principled way to optimize curriculum policies across generations of students while guaranteeing safe student training. The codebase has also been shared.

Questions

For learning tasks with different goals, e.g., lane following and obstacle avoidance for a mobile robot, is it better to learn them jointly using CISR or sequentially? My hunch is that sequential learning could lead to catastrophic forgetting, while joint learning could result in an exponential increase in the number of constraints and performance plateaus.