Schmied, T. (2022). Self-supervision, data augmentation and online fine-tuning for offline RL [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2022.89725
E194 - Institut für Information Systems Engineering
Date (published):
2022
Number of Pages:
131
Keywords:
offline reinforcement learning; self-supervised learning; data augmentation
Abstract:
Reinforcement learning (RL) methods learn through interaction with an environment. The RL paradigm is inherently designed to be performed in an online fashion. However, for many applications in the real world, learning online is not always feasible due to resource and/or safety constraints. Unlike online RL, offline RL, the main topic of this thesis, allows the agent to learn policies from previously collected datasets. Current RL algorithms have a number of other major limitations, among them data inefficiency. Two promising streams of research that address this limitation are self-supervised methods and data augmentation. These methods were, however, developed for online RL, and it is not yet clear if their benefits translate to the offline case. Moreover, it is not always ideal to eliminate online environment interaction altogether. Both online RL and offline RL have their individual advantages and disadvantages. Algorithms that combine both approaches, e.g., via offline pre-training and online fine-tuning, can draw from the best of both worlds. Consequently, there is a need for RL agents that can learn both online and offline in a data-efficient way. In this thesis, we improve the learning performance of offline RL algorithms by integrating existing self-supervised methods, data augmentations and online fine-tuning into the learning process. We select three established self-supervised online RL architectures (Curl, SPR, SGI) and five prominent data augmentations and adapt them for the offline setting. We then augment a state-of-the-art offline RL algorithm, Conservative Q-Learning (CQL), with the selected methods and compare them against five established baselines. We empirically evaluate all algorithms on both discrete and continuous control tasks using offline Atari and Gym-MuJoCo datasets, respectively. To this end, we select four Atari games (Pong, Breakout, Seaquest, QBert) and three Gym-MuJoCo tasks (Halfcheetah, Hopper, Walker-2d) for our experiments. Our results show that self-supervised methods and data augmentations can outperform the baseline agents and considerably improve the learning performance of offline RL algorithms on Gym-MuJoCo but are not beneficial on Atari. Furthermore, we investigate how offline pre-training followed by online fine-tuning affects the learning performance of the selected offline RL algorithm. Our results further demonstrate that hybrid algorithms that learn both offline and online can be far superior to learning online or offline alone.
Additional information:
Deviating title according to the author's own translation