Many real-world problems, such as the control of autonomous vehicles and drones, packet delivery, and many others, consist of a number of agents that need to take actions based on local observations; they can thus be formulated in the multi-agent reinforcement learning (MARL) setting. Furthermore, as more machine learning systems are deployed in the real world, they will start having an impact on each other, effectively turning most decision-making problems into multi-agent problems. In this thesis we develop and evaluate novel deep multi-agent RL (DMARL) methods that address the unique challenges arising in these settings, including learning to collaborate, to communicate, and to reciprocate amongst agents. In most of these real-world use cases, the final policies can rely only on local observations during decentralised execution. However, it is often possible to carry out centralised training, for example when training policies in a simulator or when extra state information and free communication between agents are available during the training process.
The first part of the thesis investigates the challenges that arise when multiple agents need to learn to collaborate to achieve a common objective. One difficulty is multi-agent credit assignment: since the actions of all agents affect the reward of an episode, it is difficult for any individual agent to isolate the impact of its own actions on that reward. To address this issue we propose Counterfactual Multi-Agent Policy Gradients (COMA). In COMA each agent estimates the impact of its action on the team return by comparing the estimated return with a counterfactual baseline that marginalises out that agent's own action while keeping the other agents' actions fixed. We also investigate the importance of common knowledge for learning coordinated actions: in Multi-Agent Common Knowledge Reinforcement Learning (MACKRL) we use a hierarchy of controllers that condition on the common knowledge of subgroups of agents in order to either act in the joint-action space of the group or delegate to smaller subgroups that have more common knowledge. The key insight is that all policies can still be executed in a fully decentralised fashion, since each agent can independently compute the common knowledge of its group. Since in MARL all agents are learning at the same time, the world appears non-stationary from the perspective of any given agent. This can cause learning difficulties for off-policy reinforcement learning methods, which rely on replay buffers. To overcome this problem we propose and evaluate a metadata fingerprint that disambiguates training episodes in the replay buffer based on the time of collection and the randomness of the policies at that time.
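The counterfactual baseline at the heart of COMA can be illustrated with a minimal sketch. The snippet below assumes a centralised critic that can be queried for the value of every alternative action of one agent, with the other agents' actions held fixed; the function name and tabular setting are illustrative, not the thesis implementation.

```python
import numpy as np

def coma_advantage(q_values, pi_a, chosen_action):
    """Counterfactual advantage for a single agent a (illustrative sketch).

    q_values:      shape (n_actions,); the centralised critic's estimate of
                   Q(s, (u^-a, u'^a)) for each alternative action u'^a of
                   agent a, with the other agents' actions u^-a held fixed.
    pi_a:          shape (n_actions,); agent a's policy pi(u'^a | tau^a).
    chosen_action: index of the action agent a actually took.
    """
    # Counterfactual baseline: marginalise out agent a's own action under
    # its current policy, keeping the other agents' actions fixed.
    baseline = np.dot(pi_a, q_values)
    # Advantage of the chosen action over the counterfactual baseline.
    return q_values[chosen_action] - baseline

# Toy example: 3 actions, uniform policy; baseline = 2.0.
q = np.array([1.0, 2.0, 3.0])
pi = np.array([1 / 3, 1 / 3, 1 / 3])
adv = coma_advantage(q, pi, chosen_action=2)  # 3.0 - 2.0 = 1.0
```

Because the baseline depends only on the agent's own policy and not on its sampled action, subtracting it reduces variance without biasing the policy gradient.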
So far we have assumed that the agents act in a fully decentralised fashion, i.e., without directly communicating with each other. In the second part of the thesis we propose three methods that allow agents to learn communication protocols. The first, Differentiable Inter-Agent Learning (DIAL), differentiates across a discrete communication channel (specifically a cheap-talk channel) during centralised training to discover a communication protocol suited to a given task. The second, Reinforced Inter-Agent Learning (RIAL), simply uses RL to learn the protocol, effectively treating messages as actions. Neither of these methods directly reasons over the beliefs of the agents. In contrast, when humans observe the actions of others, they immediately form theories about why a given action was taken and what it indicates about the state of the world. Inspired by this insight, our third method, the Bayesian Action Decoder (BAD), has agents directly consider the beliefs of other agents via an approximate Bayesian update and learn to communicate both through observable actions and through grounded communication actions. Using BAD we obtain the best known performance on the imperfect-information, cooperative card game Hanabi.
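The channel trick in DIAL can be sketched as follows. The idea is that during centralised training the message remains continuous (so gradients can flow back into the sender), while added noise pushes the sender away from the discretisation boundary; during decentralised execution the message is hard-discretised. This is a simplified one-bit sketch, and the function name and noise parameter are illustrative, not the thesis implementation.

```python
import numpy as np

def noisy_channel(message, sigma=1.0, training=True, rng=None):
    """Sketch of a discretise/regularise channel in the spirit of DIAL.

    Training:  pass the continuous message through a sigmoid after adding
               Gaussian noise, keeping the channel differentiable while
               regularising towards confident (near-binary) messages.
    Execution: hard-threshold the message to a single bit per channel.
    """
    if training:
        if rng is None:
            rng = np.random.default_rng()
        return 1.0 / (1.0 + np.exp(-(message + rng.normal(0.0, sigma))))
    return (message > 0).astype(float)

# Execution-time behaviour: one hard bit per channel.
bits = noisy_channel(np.array([-2.0, 0.5, 3.0]), training=False)  # [0., 1., 1.]
```

At execution time the receiver sees only the discrete bits, so the learned protocol remains usable without the centralised training machinery.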
While in the first two parts of the thesis all agents optimise a team reward, in the real world different agents commonly have conflicting interests. This can introduce learning difficulties for MARL methods, including unstable learning and convergence to poorly performing policies. In the third part of the thesis we address these issues with Learning with Opponent-Learning Awareness (LOLA). LOLA agents take into account the learning behaviour of the other agents in the environment and aim to find policies that shape the learning of their opponents in a way that is favourable to themselves. Indeed, instead of converging to the poorly performing defect-defect equilibrium in the iterated prisoner's dilemma, LOLA agents discover the tit-for-tat strategy: they effectively reciprocate with each other, leading to higher overall returns. We also introduce the Infinitely Differentiable Monte Carlo Estimator (DiCE), a new computational tool for estimating the higher-order gradients that arise when one agent accounts for the learning behaviour of the other agents in the environment. Beyond its use in LOLA, DiCE is a general-purpose objective that generates higher-order gradient estimators for stochastic computation graphs when differentiated in an auto-differentiation library.
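The two ideas can be sketched in equations; the notation loosely follows the LOLA and DiCE papers, and the learning rates $\eta$ and $\alpha$ are illustrative symbols rather than thesis-specific quantities.

```latex
% LOLA: agent 1 updates its parameters \theta^1 by differentiating its
% value through one anticipated naive-learning step of agent 2:
\Delta\theta^2 = \eta \,\nabla_{\theta^2} V^2(\theta^1, \theta^2), \qquad
\theta^1 \leftarrow \theta^1 + \alpha \,\nabla_{\theta^1}
    V^1\!\left(\theta^1,\; \theta^2 + \Delta\theta^2\right).

% DiCE: with the stop-gradient operator \perp(\cdot), the "MagicBox" of a
% set of stochastic nodes \mathcal{W} evaluates to 1 in the forward pass
% but reproduces the score-function terms under (repeated) differentiation:
\operatorname{MagicBox}(\mathcal{W})
  = \exp\!\Big( \sum_{w \in \mathcal{W}} \log p(w;\theta)
      \;-\; \perp\!\Big( \sum_{w \in \mathcal{W}} \log p(w;\theta) \Big) \Big),
\qquad
J_{\mathrm{DiCE}} = \sum_{t} \operatorname{MagicBox}(\mathcal{W}_t)\, r_t,
```

where $\mathcal{W}_t$ denotes the stochastic action nodes that influence reward $r_t$. Because the exponent cancels in the forward pass but not under differentiation, repeatedly differentiating $J_{\mathrm{DiCE}}$ in an auto-differentiation library yields the higher-order gradient estimators needed when one agent differentiates through another's learning step.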
To conclude, this thesis makes progress on a broad range of the challenges that arise in multi-agent settings and also opens up a number of exciting questions for future research. These include how agents can learn to account for the learning of other agents when their rewards or observations are unknown, how to learn communication protocols in settings of partial common interest, and how to account for the agency of humans in the environment.