The Role of Reinforcement Learning in the Emergence of Conventions: Simulation Experiments with the Repeated Volunteer’s Dilemma

We use reinforcement learning models to investigate the role of cognitive mechanisms in the emergence of conventions in the repeated volunteer's dilemma (VOD). The VOD is a multi-person, binary choice collective goods game in which the contribution of only one individual is necessary and sufficient to produce a benefit for the entire group. Behavioral experiments show that in the symmetric VOD, where all group members have the same costs of volunteering, a turn-taking convention emerges, whereas in the asymmetric VOD, where one "strong" group member has lower costs of volunteering, a solitary-volunteering convention emerges with the strong member volunteering most of the time. We compare three different classes of reinforcement learning models in their ability to replicate these empirical findings. Our results confirm that reinforcement learning models can provide a parsimonious account of how humans tacitly agree on one course of action when encountering each other repeatedly in the same interaction situation. We find that considering contextual clues (i.e., reward structures) for strategy design (i.e., sequences of actions) and strategy selection (i.e., favoring equal distribution of costs) facilitates coordination when optima are less salient. Furthermore, our models produce better fits with the empirical data when agents act myopically (favoring current over expected future rewards) and the rewards for adhering to conventions are not delayed.


Introduction
Conventions solve coordination problems that occur in everyday life, from how to greet each other in the street, to what to wear at a black-tie event, to what citation style to use in a paper. Conventions can be deliberately introduced (e.g., citation styles), but they can also emerge tacitly, as a consequence of individual actions and interactions (Centola & Baronchelli; Lewis; Sugden; Young). Results from experiments with economic games show that incentives matter for the conventions that can emerge in repeated social interactions (Diekmann & Przepiorka). That is, individuals engage more in certain behaviors the less costly it is to do so. However, how cognitive processes, such as learning, interact with structural properties of the situation in the emergence of conventions is less well understood (Przepiorka et al.; Simpson & Willer).
Here, we use agent-based simulations to investigate the role of learning in the emergence of conventions in the repeated, three-person volunteer's dilemma game (VOD). The VOD is a binary choice, n-person collective goods game in which a single player's volunteering action is necessary and sufficient to produce the collective good for the entire group (Diekmann). For example, a couple awakened by their crying baby face a VOD with regard to who should get up and calm the baby. A group of friends wanting to go out in town on Friday night face a VOD with regard to who should drive and abstain from drinking alcohol. The members of a work team that embarks on a new project face a VOD with regard to who should take the lead. In all these situations, every individual prefers to volunteer if no one else does, but prefers having someone else do it even more.
An important distinction can be made between symmetric and asymmetric VODs (Diekmann). In a symmetric VOD, all players have the same costs and benefits from volunteering. In an asymmetric VOD, at least two players have different costs and/or benefits from volunteering. Previous research shows that a turn-taking convention, in which each player incurs the cost of volunteering sequentially, can emerge in the symmetric VOD.
In an asymmetric VOD, in which one player has lower costs of volunteering, a solitary-volunteering convention emerges by which the individual with the lowest cost volunteers most of the time (Diekmann & Przepiorka; Przepiorka et al.).
Following the concept of "low (rationality) game theory" (Roth & Erev), we conduct simulation experiments with adaptive learning (rather than "hyperrational") agents to explain experimental data. That is, we compare three classes of reinforcement learning models to study how the process of learning contributes to the emergence of the turn-taking convention and the solitary-volunteering convention in the repeated symmetric and asymmetric VOD, respectively. Through simulation experiments and validation with human experimental data, we gain insights into potential mechanisms that can explain how individuals learn to conform to the expectations of others. Specifically, we address two research questions (RQs): (1) How well do different classes of reinforcement learning models fit the human experimental data? (2) Which model properties affect the fit with the experimental data?
By addressing these questions, we expect to learn whether reinforcement learning models can provide a cognitive mechanism that can explain the emergence of conventions (RQ 1) and what parameter settings are needed to simulate human behavior (RQ 2). Reinforcement learning models are widely used in cognitive modeling and cognitive architecture research. Having such an existing mechanism as a candidate explanation for a wider range of problems is preferable to introducing new mechanisms to explain behavior (cf. Newell). Our work provides such a critical test.
The remainder of the paper is structured as follows. We first review literature that has used cognitive modeling to explain the emergence of conventions in humans and outline the contribution our paper makes to this literature by focusing on the repeated, three-person VOD. We then recap the principles of reinforcement learning models and introduce the three model classes that we focus on in this study. The next three sections outline the simulation procedures, list and describe the parameters we systematically vary in our simulation experiments, and describe our approach to the analysis of the data produced in our simulation experiments. The results section presents our findings, and the final section discusses our findings in the light of previous research and points out future research directions.

Previous Literature
Reinforcement learning as cognitive mechanism of convention emergence

Cognitive science has shown a growing interest in the role of cognitive mechanisms in the emergence of social behavior. A review article by Hawkins et al. points out that a potentially fruitful area for cognitive science research is 'the real computational problems faced by agents trying to learn and act in the world'. Spike et al. identify three important factors to address this gap: (1) the availability of feedback and a learning signal, (2) a mechanism to cope with ambiguity, and (3) a mechanism for forgetting. Cognitive modeling studies of conventions have so far focused mainly on the mechanism of memory and forgetting in two-player social interaction games (e.g., Collins et al.; Gonzalez et al.; Juvina et al.; Roth & Erev; Stevens et al.). These models showed that learning to play the game requires fine-tuning of associated cognitive mechanisms (e.g., parameters for learning, forgetting), and that learned behavior can, within limits, transfer to other game settings (Collins et al.).
Here we focus on a factor that has been under-explored, namely the role of feedback and a learning signal. More specifically, we investigate whether social decision-making observed in a computerized lab experiment with the three-person VOD can be explained based on reinforcement learning models of cognitive decision-making (RQ 1), and what characteristics such reinforcement learning models should have (RQ 2).

The basic task of a reinforcement learning agent is to learn how to act in specific settings so as to maximize a numeric reward (Sutton & Barto). Reinforcement learning models have been used in two ways to study the emergence of conventions. The first family of reinforcement learning models takes an explanatory approach and is designed to study how empirical data come about. Roth & Erev, for example, used adaptive learning agents to study data collected in experiments on bargaining and market games. In a very abbreviated form, agents select actions based on probabilities for these actions, while probabilities are derived from the averaged rewards obtained for these actions in the past. Roth & Erev showed that the dominant experimental behavior of perfect equilibrium play is in principle explainable by reinforcement learning. However, agents required a large number of interactions before patterns stabilized. Other studies demonstrated how conventions can emerge from the interaction between reinforcement learning and other cognitive mechanisms, such as declarative memory (Juvina et al.). However, these models also required the introduction of separate, novel cognitive mechanisms (e.g., 'trust') to provide a good fit with empirical data (see also Collins et al.). Introducing new mechanisms to fit results from a new context (e.g., the VOD) is at odds with the general objective of cognitive modeling to have a fixed cognitive architecture that can explain performance in a variety of scenarios (Newell).
The second family of models uses reinforcement learning to identify the objectively best policy for handling a scenario. Such models allow one to identify whether humans apply the optimal strategies for solving a learning problem, or to compare whether (boundedly) rational agents opt for strategies predicted by game-theoretical considerations. Sun et al., for example, studied tipping points in the reward settings of a Hawk-Dove game variant beyond which conventions can be expected to emerge. Izquierdo et al. found that settings with high learning and aspiration rates move quickly from transient regimes of convention emergence to asymptotic regimes of stable conventions. Zschache showed that an instance of melioration learning, namely Q-Learning (Watkins & Dayan), produces stable patterns of game-theoretical predictions and that these patterns emerge much more quickly than in the Roth-Erev model. In contrast to explanatory models (e.g., Roth & Erev), this family of models takes an exploratory approach to study the general conditions that contribute to the emergence of conventions.
In our paper, we follow an explanatory approach (family 1), while avoiding the addition of a separate mechanism (e.g., trust), to investigate whether (RQ 1), and under what conditions (RQ 2), reinforcement learning alone can explain the emergence of conventions in the repeated three-person VOD. We explore three different classes of reinforcement learning models to investigate how generalizable results are across models and parameter settings. We evaluate our models against a previously reported empirical dataset collected by Diekmann & Przepiorka, described briefly at the end of this section.
The use of economic games to study convention emergence
In behavioral and experimental social science, the emergence of conventions is often observed in laboratory experiments with economic games in which the same group of participants interact with each other repeatedly (Camerer). The structure of the game that emulates the interaction situation in one round can vary in terms of payoffs of individual group members and how these payoffs are reached contingent on these group members' actions (i.e., strategies). The bulk of the experimental research uses symmetric games, i.e., games in which the roles of players are interchangeable because all players face the same payoffs from their actions. More recently, however, it has been recognized that asymmetric games may better capture the individual heterogeneity that occurs in real-life settings (Hauser et al.; Kube et al.; Otten et al.; Przepiorka & Diekmann; van Dijk & Wilke). In asymmetric games, players differ in their payoffs and their roles are therefore not interchangeable.
Moreover, the experimental literature on the emergence of cooperation in humans has mostly used linear collective goods games, in which each unit a player contributes increases the collective good by the same amount (Andreoni; Fischbacher et al.; Chaudhuri). With a few exceptions, only more recently have threshold and step-level collective goods games been used in experimental research (Rapoport & Suleiman; van de Kragt et al.; Milinski et al.; Dijkstra & Bakker).
The VOD falls into the realm of step-level collective goods games, and a distinction can be made between the symmetric and the asymmetric VOD. Figure 1 presents a three-person version of the VOD in normal form. In the symmetric VOD, all three players have the same costs of and benefits from volunteering (b = 30), while in the asymmetric VOD, one "strong" player has lower costs of volunteering, which manifests itself in a higher net benefit for that player (80 > b > 30; e.g., b = 70). The game has three Pareto optimal, pure-strategy Nash equilibria (circled in red in Figure 1), in which only one player volunteers while the other two players abstain from volunteering. Furthermore, it has one mixed-strategy Nash equilibrium in which all players volunteer with a certain probability (Diekmann). In the asymmetric VOD, rational players can tacitly coordinate on the pure-strategy Nash equilibrium in which only the strong player volunteers, even if the game is played only once (Diekmann; Harsanyi & Selten). In the symmetric VOD, tacit coordination on one of the pure-strategy equilibria is more difficult. If, however, the symmetric game is repeated over an indeterminate number of rounds, turn-taking among all three players becomes a salient, Pareto optimal Nash equilibrium (Lau & Mui). Although solitary volunteering by the strong player remains a salient Nash equilibrium in the asymmetric game if the game is repeated, it leads to an inequitable distribution of payoffs as the strong player obtains a lower payoff than the other two in every round (Diekmann & Przepiorka).

Figure 1:
The three-person VOD in normal form. In the three-person VOD, three players decide simultaneously and independently of each other whether to volunteer (choose v) and produce a collective good or not volunteer (choose ¬v). For example, if player 1 chooses v, the payoff is b, regardless of what the other two players do. If player 1 chooses ¬v, then the payoff depends on what the others do. If at least one of the other players chooses v, then player 1 will receive a payoff of 80. However, if all players choose ¬v, all will receive a payoff of 0, since in this case the collective good will not be produced.
In their experiment, Diekmann & Przepiorka matched participants randomly in groups of three and assigned them to either the symmetric VOD or an asymmetric VOD condition. In all experimental conditions, participants interacted with the same group members in the VOD over repeated rounds. In each round, they faced the same VOD, made their decisions individually and independently from each other, and were provided with full information feedback on the decisions each group member took and these group members' corresponding payoffs (see the figures in the Appendix). The payoffs each participant gained in each round were summed and converted into money that participants received at the end of the experiment. Diekmann & Przepiorka found that in the symmetric VOD, a turn-taking convention emerges, whereas in an asymmetric VOD, a solitary-volunteering convention emerges. While in turn-taking each group member volunteers and incurs a cost sequentially, in solitary-volunteering the group member with the lowest cost of volunteering volunteers most of the time. This finding has been replicated in several follow-up experiments (Przepiorka et al.).

Our paper contributes to the research on step-level collective goods games by investigating how differences in emergent conventions are reflected in cognitive models to better understand how conventions come about (Guala & Mittone; Tummolini et al.; Young). In order to draw meaningful conclusions about how humans learn to coordinate in social settings, and thus how conventions emerge, our research uses reinforcement learning models to explain the empirical observations made by Diekmann & Przepiorka in their behavioral lab experiment.

Reinforcement learning models
Generally speaking, reinforcement learning agents learn which actions to pick based on the rewards that their actions result in. Reinforcement learning has a solid theoretical basis in behavioral psychology (Herrnstein & Vaughan) and neuroscience (e.g., Schultz et al.), generates empirically validated predictions (e.g., Herrnstein et al.; Mazur; Tunney & Shanks; Vaughan Jr), and finds applications in multiple cognitive architectures (e.g., Anderson; Laird). From the available realizations of reinforcement learning models, we consider Q-Learning (Watkins & Dayan) the prime candidate for our study. That is because, on the one hand, Q-Learning has proven informative and efficient in previous game-theoretic contexts of convention emergence (Zschache). On the other hand, Q-Learning models are parsimonious in that they require neither additional mechanisms nor a full representation of their environment, but only a defined state they are in at a certain time (s_t) and the available actions in that state (a_t). Given that there is no need for a detailed model of the world, these Q-Learning models are also referred to as model-free. Quality values (Q-values) for state-action pairs (Q(s_t; a_t)) track how rewarding an action was in a given state, and form the basis of the decision-making process. That is, the most rewarding action in a given state (max Q(s_t; a_t)) is typically given priority. Moreover, Q-values are constantly updated based on the sum of their old values and anticipated new values (Sutton & Barto): new Q-values are composed of the actual reward r_{t+1} received after action a_t in state s_t and the maximum expected reward in a future state s_{t+1} from all available actions (max_a Q(s_{t+1}; a)). In our models, expectations depend on agents' social preferences and are set to be the same for all agents at the beginning of a simulation run.
Furthermore, setting expectations to high but achievable rewards ought to minimize the time of convention emergence (Izquierdo et al.). Selfish agents, therefore, expect to receive the reward for the optimal individual action, while altruistic agents expect to receive the reward for the optimal collective action.
The impact of future rewards on Q-value updates is moderated by a discount factor and a learning rate. The discount factor (0 ≤ γ ≤ 1) parameterizes the importance of expected future Q-values relative to immediate rewards. Smaller γ values make the agent more 'short-sighted' and rely more heavily on immediate rewards (r_{t+1}), whereas larger γ values place more weight on long-term rewards (max_a Q(s_{t+1}; a)). The learning rate (0 < α ≤ 1) defines the extent to which new experiences (both r_{t+1} and max_a Q(s_{t+1}; a)) override old information. Models with smaller α values adjust their Q-values more slowly and, as a result, rely more heavily on accumulated experience than on the most recent rewards.
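The update just described is the standard Q-learning rule, Q(s_t; a_t) ← Q(s_t; a_t) + α (r_{t+1} + γ max_a Q(s_{t+1}; a) − Q(s_t; a_t)). A minimal Python sketch may clarify how α and γ interact; the state label, reward values, and parameter settings below are illustrative assumptions, not the paper's implementation:

```python
def q_update(q, state, action, reward, expected_future, alpha=0.4, gamma=0.9):
    """One Q-learning update: move the old estimate toward the experienced
    target (immediate reward plus discounted expected future reward)."""
    old = q[(state, action)]
    q[(state, action)] = old + alpha * (reward + gamma * expected_future - old)
    return q[(state, action)]

# Example (hypothetical numbers): an agent volunteered in the symmetric VOD
# (net reward 30) and expects the full benefit of 80 in the next round.
q = {("start", "v"): 50.0}
new_value = q_update(q, "start", "v", reward=30, expected_future=80)
```

With α = 0.4, the estimate moves 40% of the way from the old value (50) toward the experienced target (30 + 0.9 × 80 = 102), yielding 70.8.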
Action selection typically involves a process to balance exploitation of the (currently) most rewarding actions (highest Q-value) against exploration of alternatives (lower Q-values that might also lead to success). A common approach is ε-greedy, which defines a probability ε for the selection of a random action, while the most rewarding action is selected with probability 1 − ε. This is especially useful in changing environments, so that low-rewarding actions are still considered and may become more rewarding when conditions change.
In our scenario, however, conditions do not change over time, so there is no reason for agents to change strategies when one strategy is superior. Furthermore, patterns in the experiment were mostly stable once they emerged, suggesting that humans stick to clearly best-performing actions. To account for the stability of conditions and human behavior, we add noise to the calculated Q-values (Q_noisy) each time an action is decided, rather than taking an ε-greedy approach.
In our models, the noise is drawn from a continuous uniform distribution with parameter η. Each time a Q_noisy-value is calculated, a value ρ within the range [−η, η] is randomly drawn, multiplied with the original Q-value, and added to it: Q_noisy = Q + Q × ρ, with ρ ~ U(−η, η). The action associated with the highest Q_noisy-value is then selected for execution by the model. For example, imagine there are two Q-values (Q_1 = 60, Q_2 = 50). If η = 0.2, then each Q-value is assigned a random ρ value between −0.2 and +0.2 and an updated Q-value is calculated. For example, Q_noisy1 = 60 + 60 × (−0.15) = 51 and Q_noisy2 = 50 + 50 × 0.09 = 54.5, where ρ_1 = −0.15 and ρ_2 = 0.09, respectively. In this case, the action of Q_2 would be selected, even though it originally had the lower Q-value. In other words, the model explores actions that have not yielded the highest rewards in the past. However, if η = 0.1, then the chance that Q_2 would be selected is much smaller, as the impact of noise on the Q-values is smaller (i.e., Q_noisy1 could range between [54, 66] and Q_noisy2 between [45, 55]), and the model will exploit past successful actions more.
Consequently, when the Q-value of one action is clearly superior, the corresponding action will be exploited. In cases where two or more actions have comparable Q-values, the noise factor regulates the balance between exploration and exploitation. That is, the higher the setting for η, the more explorative the model.
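This noise-based selection can be sketched as follows (a hypothetical implementation; the function name and action labels are ours, with `not_v` standing for ¬v):

```python
import random

def select_action(q_values, eta=0.2, rng=random):
    """Perturb each Q-value by a multiplicative noise factor drawn uniformly
    from [-eta, eta], then pick the action with the highest noisy value."""
    noisy = {a: q * (1 + rng.uniform(-eta, eta)) for a, q in q_values.items()}
    return max(noisy, key=noisy.get)

# With eta = 0 the rule degenerates to pure exploitation of the highest
# Q-value; larger eta makes reversals like the Q_2 example above possible.
choice = select_action({"v": 60.0, "not_v": 50.0}, eta=0.0)
```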

Model classes
Within the basic reinforcement learning framework outlined above, we define three model classes that differ in their assumptions. We call these model classes ClassicQ, SequenceX, and VolunteerX. This allows us to test how well the reinforcement learning models work in general, and to what degree they depend on additional model assumptions. All three models follow the concept of Markov chains (Gagniuc). That is, they consist, first, of well-defined states and, second, of events that define the transition between these states (see Figure 2).

Figure 2:
The three model classes represented as Markov chains. For ClassicQ, states are defined by the two previously performed actions. Consequently, ClassicQ requires two rounds at the beginning of each game to initialize. Transition events between states are actively selected actions (volunteering, not volunteering) by the agents, based on how beneficial previously taken actions were in the same state. For SequenceX and VolunteerX, states are mostly defined by actions planned to be taken in the coming rounds. Whenever there is no more action left to be taken (∅), the agents actively select a sequence of actions based on how beneficial previously selected action sequences were. Actions are then performed without reconsideration in the following rounds until the sequence ends.
The ClassicQ model class uses a typical approach for Q-Learning. In the three-person VOD, the actions are to either volunteer or not volunteer: a ∈ {v, ¬v}. States are defined by the actions an agent took in the previous two rounds: s ∈ {vv, v¬v, ¬vv, ¬v¬v}. The available Q-values for ClassicQ are therefore: Q(vv; v), Q(vv; ¬v), Q(v¬v; v), Q(v¬v; ¬v), Q(¬vv; v), Q(¬vv; ¬v), Q(¬v¬v; v), Q(¬v¬v; ¬v). Furthermore, ClassicQ requires an initialization phase of two rounds (Figure 2, top three states). Here, every agent volunteers with a fixed probability in each of the two rounds, as only one of the three agents is required to volunteer for the optimal outcome. All following actions are selected based on Q-values. Consider, for example, an agent that did not volunteer in the last two rounds (s_t = ¬v¬v; Figure 2, bottom right state of the ClassicQ model). Depending on the larger (noisy) Q-value for the currently available state-action pairs (Q(¬v¬v; v) and Q(¬v¬v; ¬v)), the agent decides to volunteer (Figure 2, active transition from state ¬v¬v to state ¬vv) or not to volunteer (Figure 2, active transition from state ¬v¬v back to state ¬v¬v). As ClassicQ's actions are influenced by two past states, we refer to this as a backward-looking perspective.
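As an illustration, ClassicQ's state space and Q-table can be set up as follows (a sketch in our own notation: `not_v` stands for ¬v, and 67.5 is one example initial Q-value ι):

```python
from itertools import product

ACTIONS = ("v", "not_v")                   # volunteer / not volunteer
STATES = list(product(ACTIONS, repeat=2))  # the agent's last two actions

# Eight state-action pairs, all starting from the same initial Q-value.
q_table = {(s, a): 67.5 for s in STATES for a in ACTIONS}

def next_state(state, action):
    """Sliding window over the last two actions: drop the older one,
    append the action just taken."""
    return (state[1], action)
```

After an agent in state (¬v, ¬v) volunteers, its new state becomes (¬v, v), matching the transition described above.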
By contrast, the SequenceX and VolunteerX model classes follow a forward-looking perspective. That is, actions are defined by sequences of consecutive future actions (Figure 2, orange states), which are automatically and without reconsideration performed in the following rounds: a ∈ {vvv, vv¬v, v¬vv, . . . }. Agents select a new action sequence whenever no actions are left to perform (s = ∅). Our models look forward at most three actions.
The agent selects an action sequence depending on the largest (noisy) Q-value across all action sequences (e.g., v¬vv; Figure 2, bottom left state), and performs the corresponding actions in the following three rounds (a_t = v; a_{t+1} = ¬v; a_{t+2} = v).
VolunteerX, in contrast, minimizes the length of action sequences by only defining when to volunteer. We used a strategy space from "immediately" to "in the third round": Q(∅; v), Q(∅; ¬vv), Q(∅; ¬v¬vv). In addition, we included a strategy to not volunteer in the following round: Q(∅; ¬v).
To summarize, all three model classes make decisions based on Q-values, which are shaped by experience, and which consider a sequence of up to three states and/or actions. However, the models differ in whether they are backward-looking (ClassicQ: the last two actions determine which Q-learning options are available) or forward-looking (SequenceX and VolunteerX consider all state-action pairs when deciding the next action). From a practical perspective, each time ClassicQ needs to determine a next action, it can only choose between two Q-values out of all eight potential options (e.g., if the last two actions were v and ¬v, then the only choice is between Q(v¬v; v) and Q(v¬v; ¬v)). SequenceX and VolunteerX, in contrast, consider all Q-values whenever a new action sequence is selected (eight options for SequenceX, and four for VolunteerX).

Conditions
We tested two different payoff conditions, which were identical to the "Symmetric" and "Asymmetric" conditions in the experiment by Diekmann & Przepiorka (see also Figure 1). In the Symmetric condition, all agents experience the same benefit if the collective good is produced (r_{1,2,3} = 80) and they incur the same costs when they volunteer to produce the collective good (K_{1,2,3} = 50). In the Asymmetric condition (henceforth Asymmetric), all agents experience the same benefit when someone volunteers to produce the collective good (r_{1,2,3} = 80); however, the cost of volunteering is lower for one agent (K_1 = 10) compared to the other two agents (K_{2,3} = 50).
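The two payoff conditions can be written down compactly (values taken from the text; the dictionary layout is our own):

```python
# Benefit r (same for all agents when the good is produced) and
# per-agent volunteering costs K, for the two conditions.
CONDITIONS = {
    "symmetric":  {"r": 80, "K": (50, 50, 50)},
    "asymmetric": {"r": 80, "K": (10, 50, 50)},
}

# Net benefit of volunteering per agent: b = r - K
net_benefits = {
    name: tuple(c["r"] - k for k in c["K"]) for name, c in CONDITIONS.items()
}
```

The resulting net benefits (30 for all agents in the symmetric condition; 70 for the strong agent and 30 for the others in the asymmetric condition) match the b values in the game description.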

Simulation procedure
Each round of a simulation run proceeds in three steps:

1. Select an action (each agent), based on the (noisy) Q-values of the currently available actions or action sequences.
2. Determine utilities. Utility of agents 1, 2, and 3 = 0 if no agent volunteered; else:
   - Utility of agent 1 = 80, if its action was ¬v; = 30, if its action was v and the VOD is "symmetric"; = 70, if its action was v and the VOD is "asymmetric".
   - Utility of agents 2 and 3 = 80, if their action was ¬v; = 30, if their action was v.
3. Evaluate the action selected in step 1 (each agent). Set the expected reward (the "max" component of the Q-value update) to: 80, if the action was ¬v and the social preference is "selfish"; 80 − own costs, if the action was v and the social preference is "selfish"; (80 + 80 + 80 − lowest costs) / 3, if the social preference is "altruistic". Then update the Q-value of the action selected in step 1 using the Q-learning update rule.
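The utility step above amounts to the following round-payoff function (a sketch with our own naming; the benefit and cost values follow the two conditions):

```python
def utility(action, others_volunteered, cost, benefit=80):
    """Round payoff in the VOD: a volunteer always pays its cost and receives
    the benefit; a non-volunteer receives the benefit only if someone else
    volunteered, and 0 otherwise."""
    if action == "v":
        return benefit - cost
    return benefit if others_volunteered else 0
```

For example, a volunteering agent earns 30 in the symmetric condition (cost 50) and the strong agent earns 70 in the asymmetric condition (cost 10), while a free-riding agent earns 80 whenever another agent volunteers.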

Parameter settings
For each model class, five parameters (see Table 1) were varied systematically: • Discount rate (γ) describes the importance of expected future rewards relative to immediate rewards. As there is no consensus in the literature, we tested a wide range of values, from models that consider distant rewards equal to immediate rewards (γ = 1) to models that discount future rewards strongly (e.g., γ = 0.5, by which rewards one step away are halved).
• Learning rate (α) defines the extent to which new experiences override old information. As there is no consensus in the literature, we tested a wide range of values, from models that rely relatively heavily on previous experience (α close to 0) to models that rely more heavily on recent experience (α close to 1).
• Initial Q-values (ι) concern the initial value each state-action pair is assigned before running simulations. If the initial Q-value deviates a lot from the eventually learned value, the learning trajectory takes longer (as the new Q-value is impacted by α and γ). Moreover, in such cases the model might get stuck in local optima earlier (i.e., if an initial action yields a very high reward relative to the expected value, then noise cannot overcome this selection). Therefore, we tried two initial values: a payoff close to the theoretical optimum in the symmetric condition (67.5) and one that is substantially lower.
• Exploration rate (η) defines the range of how much noise is randomly added to the Q-values before an action is selected. We distinguish between two scenarios: agents that are either more conservative (η = 0.1) or less conservative (η = 0.2) in exploring new actions as a result of noise.
• Social preference (S) manipulates how the expected maximum reward is composed. For selfish agents, the expected reward (max_a Q(s_{t+1}; a)) corresponds to the highest possible individual reward. For altruistic agents, the expected reward corresponds to the highest possible collective reward.
The final assumption in our model is that the above parameters and settings are related to the general "cognitive architecture" (Anderson; Newell) of the agents. In other words, when specific parameter values were set (e.g., α = 0.4, γ = 0.9, η = 0.1, ι = 67.5), all three agents in the VOD assumed these parameter values. The only thing that might differ between them was the volunteering costs (K) in the asymmetric VOD. Although in practice different agents might differ in their cognitive architecture and in the way they weigh different types of information (e.g., have different learning rates), keeping the values consistent across agents allowed us to systematically test how model fit changes in response to parameter changes (RQ 2). This in turn helped us to understand what is more important for a good model fit: a change in model class (ClassicQ vs SequenceX vs VolunteerX) or a change in model parameters (e.g., whether each model achieves an equally good fit with its "best" parameter set).

Data and Analysis
We had three model classes (ClassicQ, SequenceX, and VolunteerX), tested in two experimental conditions (symmetric and asymmetric VOD), for two types of reward integration (selfish or altruistic), with 484 (11 × 11 × 2 × 2) parameter combination options (see Table 1). For each unique combination of these parameters, we ran 10 simulations. This resulted in a total of 58,080 (3 × 2 × 2 × 484 × 10) simulation runs. In each run, three agents interacted over repeated rounds of the VOD game.
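The bookkeeping behind these counts can be written out directly (all factors as reported above):

```python
# Simulation design: each factor below multiplies the number of cells.
model_classes = 3                      # ClassicQ, SequenceX, VolunteerX
conditions = 2                         # symmetric, asymmetric VOD
preferences = 2                        # selfish, altruistic
param_combinations = 11 * 11 * 2 * 2   # gamma x alpha x iota x eta = 484
runs_per_cell = 10

total_runs = (model_classes * conditions * preferences
              * param_combinations * runs_per_cell)
```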
To answer RQ 1 (how well each model class fits the human experimental results), we first compute the mean Latent Norm Index (LNI; Diekmann & Przepiorka) over the simulation runs per condition and model class. The LNI_{k,m} describes the proportion of behavioral pattern k that emerged in a group of size m in a given number of rounds. The behavioral patterns commonly found in experiments with the repeated three-player VOD are solitary-volunteering (k = 1), turn-taking between two players (k = 2), and turn-taking between all three players (k = 3) (see also Przepiorka et al.). Since the empirical data we aim to reproduce are based on interactions between three participants, we use three agents in our simulation experiments (m = 3).

Consider, for example, a sequence of actions in which positive numbers denote the index of a single volunteering player (i.e., player 1, 2, or 3) per round and 0 denotes rounds in which either no or more than one player volunteered. Suppose that in a game of 10 rounds, one player volunteers three times in a row and the remaining seven rounds show turn-taking between all three players. The game then results in 30% solitary-volunteering (LNI_{1,3} = 30%), 0% turn-taking between two players (LNI_{2,3} = 0%), and 70% turn-taking between all three players (LNI_{3,3} = 70%). Note that to avoid the detection of pseudo patterns, a pattern needs to be stable for at least m consecutive rounds to be counted towards the LNI. For a detailed description of how the LNI is calculated, refer to Diekmann & Przepiorka.
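A simplified sketch of the solitary-volunteering part of this counting is given below (our own reading of LNI_{1,3}; the full LNI computation in Diekmann & Przepiorka handles further cases, such as the turn-taking patterns):

```python
def solitary_lni(volunteers, m=3):
    """Fraction of rounds covered by runs of the SAME single volunteer
    lasting at least m consecutive rounds. `volunteers` lists the sole
    volunteer's index per round, with 0 for no/multiple volunteers."""
    covered, i, n = 0, 0, len(volunteers)
    while i < n:
        j = i
        while j < n and volunteers[j] == volunteers[i]:
            j += 1                      # extend the run of identical entries
        if volunteers[i] != 0 and j - i >= m:
            covered += j - i            # count only sufficiently long runs
        i = j
    return covered / n
```

For a 10-round game in which player 1 volunteers in the first three rounds and the players rotate afterwards, this yields 0.3, matching the 30% in the example above.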
To assess how well our models match the empirical data, we use the root-mean-square error (RMSE) between the mean LNI of the simulated data and the LNI observed in the empirical data; lower RMSE indicates a better model fit.
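In code, the fit measure is straightforward; the sketch below uses illustrative values, not the actual data, to compare mean simulated LNIs for k = 1, 2, 3 with hypothetical empirical counterparts:

```python
import math

def rmse(simulated, empirical):
    """Root-mean-square error between mean simulated LNIs and the
    empirically observed LNIs; lower values indicate a better fit."""
    return math.sqrt(sum((s - e) ** 2 for s, e in zip(simulated, empirical))
                     / len(simulated))

# hypothetical mean LNIs for patterns k = 1, 2, 3
print(round(rmse([0.30, 0.00, 0.70], [0.25, 0.05, 0.65]), 3))  # 0.05
```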
To answer RQ (which model properties affect model fit), we perform multiple linear regressions with RMSE as the dependent variable and the model parameters as independent variables. This allows us to understand the direction and size of the effect of each parameter on a model's fit to the empirical data. We fit a total of six regression models: for each of our three model classes, we fit one model to the data from the symmetric VOD condition and one to the data from the asymmetric VOD condition. All parameters are centered at their means and only main effects are considered. We also fitted models with interaction effects; since these models did not produce different insights (i.e., the main effects remained the same), we do not refer to them in the text. All regression model results are presented in the Appendix.

JASSS, ( ), http://jasss.soc.surrey.ac.uk/ / / .html
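The regression setup can be sketched as follows. This uses synthetic, noise-free data with made-up coefficients for hypothetical parameters alpha, gamma, and eta; the real analysis runs on the simulation output (and in R rather than Python):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# hypothetical model parameters sampled over their ranges
alpha, gamma, eta = (rng.uniform(0, 1, n) for _ in range(3))
# synthetic RMSE values with made-up main effects
rmse = 0.4 - 0.2 * alpha + 0.3 * gamma + 0.1 * eta

X = np.column_stack([alpha, gamma, eta])
Xc = X - X.mean(axis=0)                     # center parameters at their means
design = np.column_stack([np.ones(n), Xc])  # intercept + main effects only
coef, *_ = np.linalg.lstsq(design, rmse, rcond=None)
print(coef[1:])  # estimated main effects, approximately (-0.2, 0.3, 0.1)
```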

Results
RQ : Model fit to empirical data

Figure shows the relative frequencies of the different conventions that emerged in the asymmetric VOD condition (left) and the symmetric VOD condition (right) in the empirical study of Diekmann & Przepiorka ( ) (gray bars) and in our simulation experiments (colored bars), in terms of LNIs. The bars in the "Solitary volunteering" and the two "Turn taking" panels show the average LNI_{1,3}, LNI_{2,3}, and LNI_{3,3}, respectively. The differently colored bars denote the results of the best fitting model instances of each model class (ClassicQ, SequenceX, and VolunteerX).

Figure : Simulation results comparing the relative frequencies of emerging patterns (solitary volunteering, turn-taking between two agents, and turn-taking between three agents) between the experimental data (gray bars) and the best fitting models (yellow, orange, and red bars) in the asymmetric (left) and symmetric (right) condition. Error bars show standard errors of the mean.

In the asymmetric condition, where one agent has lower costs of volunteering, the human data show more frequent solitary volunteering. Of the three model classes, ClassicQ reproduces the dominant pattern of the empirical data (solitary volunteering) best. By contrast, in the symmetric condition, where all agents have the same costs and benefits of volunteering, the human data show frequent turn-taking between all three agents. Here, ClassicQ is the worst fitting model, as it hardly shows such turn-taking between three agents. SequenceX and VolunteerX show comparable results, with percentages of trials with frequent turn-taking between three agents comparable to the human data. As we will see in more detail below, variations in the relative frequency of patterns result from different stabilization times for single emerging patterns (ClassicQ in the asymmetric VOD, SequenceX and VolunteerX in the symmetric VOD) or multiple emerging patterns (ClassicQ in the symmetric VOD, SequenceX and VolunteerX in the asymmetric VOD).
To further study the properties of the model classes, Table reports the best fitting (i.e., lowest mean RMSE) model instances (unique combinations of model class, reward integration, and parameter settings) per condition. The rows show the mean RMSE per condition. The columns show the best fitting model instances per model class and condition (e.g., CQ. for ClassicQ in the asymmetric VOD). Each model class can therefore have up to three different model instances that produce a best fit. A single model instance of the SequenceX model class (SX. ), however, produced the best fit both in the asymmetric condition and in the combination of both conditions. Model instances that produce the best fit per model class and condition are marked with a gray background. A * symbol denotes which of the three model classes provided the best overall fit across conditions. Note: the best fitting models are the models with the lowest RMSE. Within each column, the RMSE score is highlighted for the model that had the best score within that condition (i.e., combined, asymmetric, and/or symmetric). Per row (combined, asymmetric, symmetric), the overall best fitting model is marked with a *.
Consistent with Figure , a ClassicQ instance (CQ. ) produces the best fit of the three model classes in the asymmetric condition (RMSE = . ), while the instances producing the best fits for SequenceX (SX. with RMSE = . ) and VolunteerX (VX. with RMSE = . ) produce patterns less consistent with the empirical data. Further, we find good fits in the symmetric condition for SequenceX (SX. with RMSE = . ) and VolunteerX (VX. with RMSE = . ), while ClassicQ produces the worst fit of all best fitting models across both conditions (CQ. with RMSE = . ).

It is striking, however, that each of the models producing a best fit in one condition fits comparably poorly in the other condition (e.g., CQ. with RMSE = . for the asymmetric VOD and RMSE = . for the symmetric VOD). Further, when considering the model instance producing the best fit over the combination of both conditions (VX. with RMSE = . ), two things become apparent. First, its fit is markedly worse than that of the models producing the best fits per condition (asymmetric: CQ. with RMSE = . ; symmetric: VX. with RMSE = . ). Second, its fit per condition is also markedly worse than that of the models producing the best fits per condition (RMSE = . for the asymmetric VOD and RMSE = . for the symmetric VOD). It follows that no single model is able to closely replicate the human data in both conditions.

In the asymmetric VOD (Figure ), the example runs of the ClassicQ model instance CQ. show that this model is the first to reach consistent volunteering by a single agent (the agent with the lowest costs of volunteering). The two runs also show that the speed of learning can differ between runs. Furthermore, Figure shows that solitary volunteering by this agent, combined with continuous abstention from volunteering by the other two agents, is not only the dominant (most frequently occurring) pattern but the only pattern emerging for ClassicQ in the asymmetric VOD. In the SequenceX model runs, the agent with the lowest costs also learns to volunteer (Figure , SX. , run ). However, another agent occasionally volunteers as well, thereby limiting the percentage of trials with solitary volunteering. Furthermore, Figure reveals that the dominant pattern for SequenceX in the asymmetric VOD is a form of turn-taking between two agents (¬vvv / v¬v¬v), as shown in run for SX. in Figure . Note that this is an efficient outcome that is not captured by the LNI. For the VolunteerX model, the dominant pattern is solitary volunteering by the agent with the lowest costs (see Figure , VX. , run , and Figure ). However, turn-taking between three agents can also be found. Therefore, although the forward-looking models (SequenceX, VolunteerX) can, in principle, learn this pattern, they do not do so consistently.

Figure : Emerging cycles represented as Markov chains for the best fitting models per model class in the asymmetric VOD. All cycles shown were stable in the last rounds of a simulation run. Patterns that occurred most are referred to as dominant. Superscript numbers denote the agent for which a cycle occurred; these differ between agents, as one agent has lower costs of volunteering in the asymmetric VOD, resulting in different behavior.
For the symmetric VOD (Figures and ), all three models learn some pattern of turn-taking among all three agents. The SequenceX and VolunteerX models can both learn an (almost) perfect alternation (as is also observed in the behavioral experiment; see Diekmann & Przepiorka ). Note that alternation in the SequenceX model requires three different action sequences (¬v¬vv, v¬v¬v, ¬vv¬v) differing in when to volunteer, as all three agents select a new action sequence at the same time (every third round). In contrast, alternation in the VolunteerX model requires the agents to coordinate on when to start a single action sequence (vvv). Although turn-taking between three agents is the only stable pattern emerging for SequenceX and VolunteerX (see Figure ), it can take up to rounds to stabilize. That is, although these models produce better fits in the symmetric condition, they need on average more time to learn stable behavior than human participants. Furthermore, ClassicQ can show turn-taking with an alternation of ¬v¬vv (Figure , CQ. , top pattern). However, this typically does not involve all three agents (see Figure , CQ. , run ). The dominant pattern for ClassicQ in the symmetric condition remains some form of volunteering by a single agent (see Figure , CQ. , run ).
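The three SequenceX sequences listed above can be checked mechanically. The following Python sketch (illustrative only, not the simulation code) verifies that repeating them in lockstep, with each agent volunteering exactly once per cycle, yields exactly one volunteer in every round:

```python
# SequenceX view of perfect alternation: each agent repeats its own
# length-3 sequence (True = volunteer), all selected in the same round.
sequences = {
    1: (False, False, True),   # ¬v ¬v v
    2: (True, False, False),   # v ¬v ¬v
    3: (False, True, False),   # ¬v v ¬v
}
for rnd in range(12):
    volunteers = [a for a, s in sequences.items() if s[rnd % 3]]
    assert len(volunteers) == 1  # exactly one volunteer per round
print("perfect alternation over 12 rounds")
```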

RQ : Which model properties affect fit systematically
One interpretation of the results pertaining to RQ is that, in principle, different model classes can fit different aspects of the human data. This raises the question of which structural factors of the reinforcement learning models contributed to this fit. To this end, we analyzed the fit data generated by all model instances (i.e., all parameter combinations) of the three model classes by means of multiple regression models. This allowed us to identify the direction and effect size of each parameter on model fit. The regression results are summarized in Table (the full regression models are included in Tables through in the Appendix).
For all models and in all conditions, better fits are produced when agents act myopically, favoring immediate rewards over potential future rewards (low γ). Furthermore, half of the models fare better when agents rely on experience (low α) and exploration (large η), while most models produce better fits with altruistic reward expectations (i.e., maximizing group rather than individual reward). Especially the differences in parameter settings between conditions and models allow deeper insights into the learning process.
First, agents in the VolunteerX model class and the symmetric VOD condition produce a better fit with the human data when they rely more on recent rewards (high α) than on earlier (accumulated) experience. Our interpretation is that this effect is due to the inconsistent length of action sequences between different state-action pair alternatives. As a result, agents need to coordinate (and thus constantly reconsider) two things: (a) the same (for altruistic agents optimal) strategy (Q(∅; ¬v¬vv)) and (b) staggered points in time (i.e., each agent starting the sequence in a different round). This requires thorough exploration of the state-action pairs, while relying too much on experience before having coordinated on action strategies and time points may cause altruistic agents to get stuck in outcomes that are suboptimal for the group as a whole (e.g., turn-taking between two of the agents).
Second, agents in the ClassicQ model class and the symmetric VOD condition produce a better fit with the human data when they favor exploitation over exploration (low η) while acting selfishly (i.e., considering only their own rewards). Our interpretation is that this effect occurs because the expected selfish rewards provide a more salient optimum in the symmetric condition for all agents (e.g., vv → v: 30 + 30 + 30 = 90; ¬v¬v → v: 80 + 80 + 30 = 190; a difference of 100) compared to the expected altruistic rewards of the agent with the lowest costs of volunteering in the asymmetric condition (e.g., vv → v: 70 + 70 + 70 = 210; ¬v¬v → v: 80 + 80 + 70 = 230; a difference of only 20). The former allows agents to exploit quickly (low η), while the latter requires the agent with the lowest costs to explore in order to find the more subtle differences in payoffs.
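The saliency argument boils down to the gap between the expected rewards of the two competing state-action pairs; a quick check of the arithmetic above:

```python
# reward sums taken from the example above
# (state "vv" or "¬v¬v", action "volunteer")
symmetric_selfish_gap = (80 + 80 + 30) - (30 + 30 + 30)      # 190 - 90
asymmetric_altruistic_gap = (80 + 80 + 70) - (70 + 70 + 70)  # 230 - 210
print(symmetric_selfish_gap, asymmetric_altruistic_gap)      # 100 20
```

The fivefold difference in these gaps is what makes the symmetric optimum easy to exploit and the asymmetric one hard to find without exploration.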

In summary, the results show four main findings. First, no single model produces the best fit in both conditions. Second, quick coordination relies on myopic agents favoring current rewards over potentially higher gains in the future (low γ). Third, the settings of other parameters differ between model classes and conditions. Fourth, coordination requires less exploration when optima are salient.

General Discussion
We investigated whether reinforcement learning can provide a unifying computational mechanism to explain the emergence of conventions in the repeated volunteer's dilemma (VOD). Feedback and learning are at the core of reinforcement learning (Sutton & Barto ) and are considered an important factor in learning to act in the real world (Spike et al. ). Our results suggest that reinforcement learning models can fit, and thereby describe, the emergence of conventions in small human groups.
Furthermore, we find that the exact structure and details of the model matter, as no single model class provided the best fit in all conditions. While all three model classes (ClassicQ, SequenceX, VolunteerX) are based on Q-learning, and thus favor the best performing action in a given state, they differ in how actions and states are defined (see Figure ). ClassicQ follows a backward-looking perspective, in which a state is defined by the two previously performed actions, and the actions performed in the current round consist of volunteering and not volunteering. In contrast, SequenceX and VolunteerX follow a forward-looking perspective by defining a sequence of actions (consisting of volunteering and not volunteering), selected whenever there are no more actions left to be performed.
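A minimal tabular Q-learning sketch in the spirit of ClassicQ may make this concrete. The state is the pair of actions from the two previous rounds and the action is binary; the parameter values and the reward are illustrative assumptions, not the paper's settings:

```python
import random

ALPHA, GAMMA = 0.1, 0.05       # low gamma = myopic agent (cf. the results)
Q = {}                          # (state, action) -> learned value

def choose(state, eta=0.1):
    """Noisy greedy action selection: explore with probability eta."""
    if random.random() < eta:
        return random.choice([True, False])   # True = volunteer
    return max([True, False], key=lambda a: Q.get((state, a), 0.0))

def update(state, action, reward, next_state):
    """Standard Q-learning update."""
    best_next = max(Q.get((next_state, a), 0.0) for a in (True, False))
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)

# one illustrative update: after state ('v', '¬v') the agent volunteered
# and received a made-up reward of 30
update(('v', '¬v'), True, 30, ('¬v', 'v'))
print(Q[(('v', '¬v'), True)])   # alpha * reward = 3.0
```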

In the asymmetric condition, ClassicQ had the best fit with the empirical data. Our interpretation is that this fit emerged from the combination of model properties and a salient external reward structure. Specifically, ClassicQ has a structural advantage: it can make a decision each round (see Figure , only black states connected with black arrows), rather than at most every third round like SequenceX and VolunteerX. Moreover, each decision is only between two actions (to volunteer or not to volunteer), rather than between eight action strategies for SequenceX or four action strategies for VolunteerX (see Figure , one single black state with eight or four black outgoing arrows to orange states). When combined with the salient reward structure of the asymmetric VOD (one agent has lower costs of volunteering), ClassicQ faces a relatively easy learning problem compared to the other two model classes: a binary choice with feedback in each round. This aligns with the principle in Spike et al. ( ) that a salient reward is essential for behavior emergence, and it is in line with more recent behavioral data from experiments with the asymmetric VOD (Przepiorka et al. ).
In the symmetric condition, VolunteerX has the overall best fit, closely followed by SequenceX. VolunteerX also provides the best fitting model when both the symmetric and asymmetric conditions are considered together. Our interpretation of this result is that VolunteerX combines the advantages of two model concepts that usually impede each other: structure through forward-looking action sequences and timely assessment of success. As for the first advantage, VolunteerX (and SequenceX) retains contextual structure by defining forward-looking sequences of consecutive actions. This differs from the ClassicQ model class, which lacks the possibility to actively test combinations of actions. In contrast to SequenceX, which considers all possible action sequences of length three, VolunteerX only defines when to volunteer (i.e., in the 1st, 2nd, or 3rd round, or not at all). This reduces the number of state-action pairs that the model needs to consider from eight in SequenceX to only four in VolunteerX. Consequently, VolunteerX learns to coordinate relatively quickly in the difficult learning task of the symmetric condition, where turn-taking between three agents is dominant in the experimental data but it is ambiguous who should start the sequence.
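The reduction of the decision space can be made concrete (True = volunteer; illustrative sketch only):

```python
from itertools import product

# SequenceX: every possible length-3 action sequence
sequence_x = set(product([True, False], repeat=3))
# VolunteerX: volunteer in round 1, 2, or 3 -- or not at all
volunteer_x = {(True, False, False), (False, True, False),
               (False, False, True), (False, False, False)}

print(len(sequence_x), len(volunteer_x))  # 8 4
assert volunteer_x <= sequence_x          # a subset of SequenceX's options
```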
The differences in model fit between the various model classes suggest that different strategies are in play when humans form conventions in the VOD. In simple scenarios with salient optima, immediate evaluation of single actions leads to quick coordination. In more complex and ambiguous scenarios, inferring structure from the problem helps to coordinate joint actions. Thus, contextual cues may help humans to select potentially fruitful coordination strategies and to apply them successfully.
Concerning the parameters (RQ ), there is again some variation between model classes. However, a commonality across all model classes and all conditions is that better fits are obtained when agents act myopically, favoring current over expected future rewards (i.e., have a low discount rate γ). Furthermore, most models fare better when agents rely on experience (low α). Finally, most models produce better fits with altruistic reward expectations. That is, agents coordinate more quickly, and patterns are more stable, when agents maximize rewards for the entire group rather than considering only individual benefits. In the asymmetric VOD, altruistic reward expectations create a salient optimum for solitary volunteering by the agent with the lowest costs of volunteering. In the symmetric VOD, altruistic reward expectations create the same expected reward for all agents, thus facilitating patterns that allow an equal distribution of costs (i.e., turn-taking). The results therefore suggest that human participants in the repeated VOD favor immediate rewards, rely on experience, and consider rewards for the entire group.
These predictions are qualitatively consistent with other cognitive models of convention emergence. Good fits, for example, were produced in various two-person coordination games when agents acted myopically, giving less weight to future prospects of success (e.g., Zschache , ). Furthermore, conventions emerged with comparably low cognitive skill requirements; that is, agents do not actively consider the decisions of others, but observe and aggregate personal rewards over time (e.g., Zschache , ). In contrast to earlier work, our good fits also benefit from inherent parameters of the general reinforcement learning mechanism, rather than from mechanisms of memory and forgetting used to model convention emergence (Collins et al. ; Gonzalez et al. ; Juvina et al. ; Stevens et al. ). The underlying memory mechanism of some of these models (e.g., Anderson et al. ) also predicts that memories (and associated behaviors) are learned faster when actions are consistent and observed frequently (Anderson & Schooler ). This aligns with our observation of low γ (focus on the present) and low α (consistent learning).
We also showed that, in line with Zschache ( ), complete neglect of other agents can explain the experimental data (selfish ClassicQ in the symmetric VOD). In general, however, considering the rewards of others (altruism) produces better fits with the experimental data. This in turn corresponds to the Roth-Erev model (Roth & Erev ), which suggests that knowing each other's payoffs and salient optima leads to perfect equilibria faster. Our results additionally suggest that considering contextual clues (e.g., reward structure) for strategy design (e.g., sequences of actions) and strategy selection (e.g., favoring an equal distribution of costs) may help agents coordinate more quickly when optima are less salient (e.g., in the symmetric VOD).
Future work could test whether our qualitative model predictions hold. Specifically, our results suggest that stable social conventions are more likely to emerge when the rewards for adhering to the convention are not delayed, when rewards are provided reliably so that decisions can rely on experience, and when contextual clues, such as the reward structure of the entire group, can be integrated into the action selection process. Put differently, conventions will take longer to emerge when rewards are delayed and when the rewards for joint actions do not create salient optima. Applying and comparing these insights in additional settings would furthermore allow us to generalize our findings; a necessary step towards the definition of a formal theory of learning in the emergence of conventions.

Model Documentation
The R code of the simulation used to generate and analyze the data is available under the GPLv license in the GitHub repository https://github.com/hnunner/relavod (version: v . . ., commit: d e, DOI: . /zenodo. ).

Notes
Note that despite the use of noise in the calculation of which action to select, the actual Q-value is updated based on experience, as expressed in Equation .
Note that the total number of rounds is about three times the number of rounds in the experimental study by Diekmann & Przepiorka ( ). In a pilot study, we compared pattern emergence and stability for an increasing number of rounds ( , , , , , ). It showed that agents require about rounds to coordinate in the symmetric condition (fewer rounds in the asymmetric condition). Simulations with more than rounds, however, showed that the emerging coordination patterns are not necessarily stable, while simulations with more than rounds hardly ever showed pattern changes after the rounds. Simulations with rounds therefore combine two things: agents are able to learn to coordinate, and the emerging patterns are stable.
See Table in the Appendix.