Calibrating Agent-Based Models using Uncertainty Quantification Methods

Abstract: Agent-based models (ABMs) can be found across a number of diverse application areas, ranging from simulating consumer behaviour to infectious disease modelling. Part of their popularity is due to their ability to simulate individual behaviours and decisions over space and time. However, whilst there are plentiful examples within the academic literature, these models are only beginning to make an impact within policy areas. Whilst frameworks such as NetLogo make the creation of ABMs relatively easy, a number of key methodological issues, including the quantification of uncertainty, remain. In this paper we draw on state-of-the-art approaches from the fields of uncertainty quantification and model optimisation to describe a novel framework for the calibration of ABMs using History Matching and Approximate Bayesian Computation. The utility of the framework is demonstrated on three example models of increasing complexity: (i) Sugarscape, to illustrate the approach on a toy example; (ii) a model of the movement of birds, to explore the efficacy of our framework and compare it to alternative calibration approaches; and (iii) the RISC model of farmer decision making, to demonstrate its value in a real application. The results highlight the efficiency and accuracy with which this approach can be used to calibrate ABMs. This method can readily be applied to local or national-scale ABMs, such as those linked to the creation or tailoring of key policy decisions.


Introduction
Agent-based modelling has grown in popularity over the past twenty years (Heppenstall et al. ). Its ability to simulate the unique characteristics and behaviours of individuals makes it a natural metaphor for modelling and understanding the impacts of individual decisions within social and spatial systems. Applications include models of daily mobility (Crols & Malleson ), consumer behaviour (Sturley et al. ) and infectious disease (Li et al. ).
However, there remain important methodological challenges to be resolved if the full potential of agent-based modelling is to be realised. The maturation of the agent-based modelling approach is reflected in several recent position pieces (Polhill et al. ; Manson et al. ). These contributions, whilst wide-ranging in their perspectives, share a number of common themes, including common practice for the creation of rule sets, embedding behaviour, and establishing robust calibration and validation routines.
Within this paper, we focus on developing a more robust approach to the calibration of model parameters within ABMs. Calibration involves running the model with different parameters and testing, for each case, how well the model performs by comparing the output against empirical data. The goal is to find parameter sets that minimise the model's error and can be used to provide a range of predictions or analyses (Huth & Wissel ; Ge et al. ; Grimm et al. ; Purshouse et al. ).
There are three specific problems related to the calibration of ABMs that need to be addressed. First, computational cost: ABMs are often computationally expensive and thus calibration algorithms that require large numbers of model-runs can be infeasible. Finding optimal parameters while minimising the number of model-runs is essential. Second, there are often uncertainties in the real-world process that an ABM is designed to simulate. This can be because any observations taken are inaccurate, or because the system is stochastic and, therefore, multiple observations lead to different results. Third, model discrepancy: models will only ever be abstractions of real-world systems, so there will always be a degree of error between an optimised model and an observation (Strong et al. ).
Addressing these issues requires a different perspective to be taken by agent-based modellers. We present a novel framework that adapts established methods from the field of uncertainty quantification (UQ). We use history matching (HM) (Craig et al. ) to quickly rule out implausible models and reduce the size of the parameter space that needs to be searched prior to calibration. This also reduces the computational cost of calibration. To address uncertainties, we identify and quantify the various sources of uncertainty explicitly. The quantified uncertainties are used to measure the implausibility of parameters during HM, and to inform a threshold of acceptable model error during calibration. Finally, to gain a better understanding of model discrepancy, Approximate Bayesian Computation (ABC) is used to provide credible intervals over which the given parameters could have created the observed data (Csilléry et al. ; Marin et al. ; Turner & Van Zandt ; Sunnåker et al. ).
This framework is successfully applied to three models of increasing complexity. First, we use the well-documented model 'Sugarscape' (Epstein & Axtell ) as a toy example to show step-by-step how to apply the framework, highlighting its simplicity and effectiveness. Second, we use a model that simulates the population and social dynamics of territorial birds, which has previously been used in a detailed study of parameter estimation using a variety of methods, including ABC (Thiele et al. ). With this model, we highlight the advantages of using our framework over other methods of calibration in the literature. In this case the number of model-runs required for calibration is reduced by approximately half by using HM before ABC, compared to using ABC alone. Finally, we apply our framework to a more complex ABM that simulates the changes in the sizes of cattle farms in Scotland over a period of years (Ge et al. ).
The contribution of this paper is a flexible, more efficient and robust approach to ABM calibration through a novel framework based on uncertainty quantification. The code and results are available online at https://github.com/Urban-Analytics/uncertainty.

Background
Uncertainty and agent-based models

Understanding and quantifying sources of uncertainty in the modelling process is essential for successful calibration. Indeed, quantifying uncertainty (Heppenstall et al. ) as well as sensitivity analysis, calibration and validation more generally (Windrum et al. ; Crooks et al. ; Filatova et al. ) are seen as ongoing challenges in agent-based modelling.
Fortunately there is a wealth of prior research to draw on; the field of Uncertainty Quantification offers a means of quantifying and characterising uncertainties in models and the real world (Smith ). Typically, there are two forms of uncertainty quantification: forward uncertainty propagation investigates the impacts of random inputs on the outputs of a model (i.e., sensitivity analysis), whereas inverse uncertainty quantification is the process of using experimental data (outputs) to learn about the sources of modelling uncertainty (Arendt et al. ) (i.e., parameter estimation or calibration). Here we are concerned with the latter. In the context of ABMs, there are several acknowledged sources from which uncertainty can originate and then propagate through the model. These sources include: parameter uncertainty, model discrepancy/uncertainty, ensemble variance, and observation uncertainty (Kennedy & O'Hagan ).
Parameter uncertainty can stem from the challenge of choosing which parameters to use (the model may have too few or too many) as well as the values of the parameters themselves. The values may be incorrect because the measuring device is inaccurate, or the parameters may be inherently unknown and cannot be measured directly in physical experiments (Arendt et al. ).
A further complication with parameter uncertainty is that of identifiability or equifinality, where the same model outcomes can arise from different sets of parameter values. This problem is prevalent in agent-based modelling due to the large numbers of parameters that often characterise models. This means standard sensitivity analysis (a commonly used means of assessing the impact of parameters on model results) may have limited utility when performed on an ABM because the model may be insensitive to different parameter values (ten Broeke et al. ). This also makes model calibration problematic: at best it might be difficult to rule out implausible parameter combinations, and at worst there may be many 'optimal' parameter combinations with very different characteristics.
Model discrepancy, also referred to as model uncertainty, is the "difference between the model and the true data-generating mechanism" (Lei et al. ). The model design will always be uncertain as it is impossible to have a perfect model of a real-world process. Instead, a simplification must be made that relies on assumptions and imperfections (e.g., missing information). If model discrepancy is not accounted for in calibration then the estimated parameters, rather than representing physically meaningful quantities, will have values that are "intimately tied to the model used to estimate them" (Lei et al. ). A further difficulty is that it can be hard to separate parameter uncertainty from model discrepancy, which exacerbates the identifiability problem (Arendt et al. ).
The third form of uncertainty, ensemble variance, refers to the uncertainty that arises naturally with stochastic models. If the model is stochastic, then each time it is run the results will differ. Typically, stochastic uncertainty is accounted for by running a model a large number of times with the same parameter combinations (an ensemble of model instances); the variance in the ensemble output provides a quantitative measure of stochastic uncertainty.
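As a minimal sketch of this idea, consider a hypothetical one-parameter stochastic model; the model, its parameter value, the noise level and the ensemble size here are illustrative assumptions, not taken from any of the models in this paper:

```python
import numpy as np

def noisy_model(x, rng):
    """Hypothetical stochastic model: output depends on parameter x plus noise."""
    return 10.0 * x + rng.normal(scale=2.0)

rng = np.random.default_rng(42)
K = 200  # ensemble size: repeated runs with the same parameter value

# Run the ensemble with a fixed parameter (x = 0.5, chosen arbitrarily)
ensemble = np.array([noisy_model(0.5, rng) for _ in range(K)])

# The variance across the ensemble quantifies the model's stochastic
# uncertainty at this parameter setting.
ensemble_variance = ensemble.var()
```

Because the injected noise has a known scale here, the estimated ensemble variance converges on the true noise variance as K grows; for a real ABM the variance must be estimated empirically in exactly this way.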
Finally, observation uncertainty arises due to imperfections in the process of measuring the target system. This is typically the case either when the equipment used to collect observations provides imprecise or noisy data (Fearnhead & Künsch ), or when multiple observations differ due to the natural variability of the real world.

Calibration of agent-based models
The quantitative calibration of ABMs can be categorised into two groups: point estimation, and categorical or distributional estimation (Hassan et al. ). The former tries to find a single parameter combination that will produce the best fit-to-data, while the latter assigns probabilities to multiple parameter combinations over a range of plausible values. A variety of point estimation methods have been used, including minimum distance (e.g., least squared errors), maximum likelihood estimation (Zhang et al. ), simulated annealing (Neri ), and evolutionary algorithms (Heppenstall et al. ; Moya et al. ). Other methods include ordinary differential equations and linear regression (Pietzsch et al. ).

Examples of categorical or distributional calibration include Pattern Oriented Modelling (POM), HM, Bayesian networks (Abdulkareem et al. ) and ABC (van der Vaart et al. ). While point estimation methods aim to find a single parameter point with the best fit-to-data, a selection of the best-fitting parameters exposes how different mechanisms in the model explain the data (Purshouse et al. ). Furthermore, categorical and distributional estimation methods provide additional information on the uncertainty of the parameters and the model outputs.

Pattern Oriented Modelling (POM), also called inverse modelling, is an approach to develop models that will reproduce multiple patterns from various hierarchical levels and different spatial or temporal scales (Grimm et al. ). It is both a calibration approach and a design principle for individual or agent-based modelling. Model elements not needed to reproduce the desired patterns are removed or modified during the model development process. An advantage of this is that validation does not solely happen after creation of the model and on one set of final results alone. Instead, the model is validated while being built, on multiple patterns, which makes it more robust and credible (Waldrop ).
Bayesian approaches have been used to calibrate ABMs (Grazzini et al. ). The Bayesian approach to calibration is not new (Kennedy & O'Hagan ), but the difficulty of calculating a likelihood function for complex models hinders the use of this approach for models that are not entirely based on mathematically tractable probability distributions.
More recently, approximate Bayesian computation (ABC) has been proposed which, unlike traditional Bayesian statistics, does not require the calculation of a likelihood function (Turner & Van Zandt ). This is useful for ABMs because deriving a likelihood function for such models is usually infeasible. The goal of ABC differs fundamentally from that of simulated minimum distance: ABC does not attempt to find only the single maximum likelihood parameter, but instead estimates a full posterior distribution that quantifies the probability of parameter values across the entire sample space producing the observed data.
ABC involves sampling a set of parameters from a prior distribution and testing whether the model error using those parameters is less than a chosen threshold ε. If the error is smaller than ε then the parameter set is accepted, otherwise it is rejected. Testing a sufficiently large and diverse range of parameters facilitates the approximation of the posterior distribution, i.e., the probability of a parameter set given the data. However, sampling the parameter space and running the model for each set of parameters can be time-consuming, particularly if there are many dependent parameters or the model takes a long time to run. Therefore, efficient sampling methods are necessary. Many different ABC algorithms, such as rejection sampling and sequential Monte Carlo, have been applied in the literature; a selected summary can be found in Turner & Van Zandt ( ) and Thiele et al. ( ).

History Matching
History matching (HM) is a procedure used to reduce the candidate parameter space (Craig et al. ) to those regions that are plausible. HM has been applied to ABMs in recent literature (Li et al. ; Andrianakis et al. ; Stavrakas et al. ) but, despite its power, use of the method with ABMs remains limited.
HM involves sampling from the parameter space and measuring the implausibility of each sample. The implausibility is defined by the error of the simulation run (with the chosen parameters) and the uncertainties around the model and observation. Each sample is labelled as non-implausible (the model could produce the expected output) or implausible (the model is unlikely to produce the expected output with the chosen parameters). HM is carried out in waves. In each wave, the parameter space is sampled and split into implausible and non-implausible regions. Each subsequent wave samples from the non-implausible region found in the previous wave. Throughout the procedure, the non-implausible region should decrease as the waves narrow towards the best set of parameters.
Note that HM does not make any probabilistic statements (e.g., a likelihood or posterior distribution) about the non-implausible space (Andrianakis et al. ). By contrast, Bayesian calibration methods create a posterior probability distribution on the input space and do not discard implausible areas.

Methods
Here, we describe the process of HM, ABC and our framework that combines the two in more detail. Figure highlights the process used. Note that although the diagram describes the rejection sampling method of ABC, any alternative method may be used instead. Whilst HM and ABC are useful when used alone, they become more powerful when used together. Using HM before ABC is advantageous because HM takes the uncertainties of the model and observation into account whilst searching the parameter space. This enables the researcher to decide if a parameter may be plausible based on a single run of the model, instead of requiring an ensemble of runs for each parameter tested. This allows exploration of the parameter space with fewer runs of the model than with an ABC method alone.
We treat the non-implausible space found through HM as a uniform probability distribution of parameters that fit the model. We propose using this distribution as an informed prior for the ABC procedure, which will then provide a more detailed posterior distribution.

Figure : An overview of the proposed framework combining HM with ABC. Note that the figure demonstrates the ABC rejection sampling process, but any other ABC process may be used.
Consider a total of R observations and R simulation outputs that are intended to match the observations. Let z_r be the r th observation, and f_r(x) be the r th output from the simulator f with parameters x. For HM, we calculate the implausibility that x is an acceptable input (i.e., the possibility that the parameter will lead to the expected output). To achieve this, we can compare the model output against the expected output as (Craig et al. ):

$$I_r(x) = \frac{d(z_r, f_r(x))}{V_o^r + V_s^r + V_m^r}$$

where d(z_r, f_r(x)) is the squared error of the model output compared against the expected output. If not explicitly given, we assume this to be (z_r − f_r(x))². The variable V_o^r is the variance associated with the observation around z_r (observation uncertainty), V_s^r is the ensemble variance, and V_m^r is the model discrepancy. Note it is not always necessary to include all of these terms of uncertainty (Papadelis & Flamos ; Vernon et al. ). In the case of multiple outputs (i.e., R > 1), a separate measure of implausibility is calculated per output and the maximum implausibility is used (Andrianakis et al. ).
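The implausibility calculation described above, with the maximum taken over multiple outputs, can be sketched as follows; all variable names and numeric values are illustrative assumptions, not taken from the paper's models:

```python
import numpy as np

def implausibility(z, f_x, v_obs, v_ens, v_mod):
    """Implausibility per output: squared error divided by the summed variances.
    All arguments are length-R arrays, one entry per model output."""
    z, f_x = np.asarray(z, float), np.asarray(f_x, float)
    d = (z - f_x) ** 2  # squared error, the assumed default error measure
    return d / (np.asarray(v_obs) + np.asarray(v_ens) + np.asarray(v_mod))

# With multiple outputs (R > 1) the maximum implausibility is used.
z = [66.0, 12.0]    # observations (illustrative values)
f_x = [60.0, 11.0]  # simulator outputs for a candidate parameter set x
I = implausibility(z, f_x, v_obs=[4.0, 1.0], v_ens=[3.0, 0.5], v_mod=[5.0, 0.5])
I_max = I.max()     # the value compared against the threshold c
```

A candidate x would then be retained or discarded by comparing `I_max` against the chosen threshold.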
Ensemble variability is sometimes assumed to be independent of the inputs (Papadelis & Flamos ), and so only one possible input (or input vector) is tested. However, ensemble variability may actually differ depending on the input tested and, therefore, it is important to test a small selection of possible inputs in the non-implausible space (Andrianakis et al. ).
We use a constant threshold, denoted c, to determine whether x is implausible according to the implausibility score I_r(x) given by Equation .
If I_r(x) ≥ c then the error between the simulation output and observation is considered too great, even when considering all of the associated uncertainties. If I_r(x) < c, then x is retained as part of the non-implausible space. The value of c is usually chosen using Pukelsheim's 3σ rule (Pukelsheim ). This implies that the correct set of parameters x will result in I_r(x) < 3 with a probability of at least 0.95. This process is repeated for multiple points x within the sample space, discarding those values of x that are deemed implausible and retaining those that are not. The retained values are then explored in the next wave. For each subsequent wave we must:
• Re-sample within the space that was found to be non-implausible.
• Re-calculate the model discrepancy and ensemble variance. This is necessary because our search space has narrowed towards models that fit better and so it is expected that the uncertainties about this new space will also reduce.
• Perform a new wave of HM within the narrowed space using its newly quantified uncertainties. .
The HM process continues until one or more stopping criteria are met (Andrianakis et al. ). For example, it may be stopped when all of the parameters are found to be non-implausible. Other common criteria are when the uncertainty in the emulator is smaller than the other uncertainties, or if the output from the simulator is considered to be close enough to the observation. In our examples, we stop the process when either all of the parameters are found to be non-implausible, or the area of the non-implausible space does not decrease (even if some parameters within the area are found to be implausible).
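The wave-based procedure can be sketched with a hypothetical one-parameter simulator. For simplicity this sketch holds the combined variance fixed across waves, whereas in practice the model discrepancy and ensemble variance would be re-quantified at each wave; the simulator, observation and all numeric values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(x):
    """Hypothetical deterministic simulator (a stand-in for a real ABM)."""
    return 3.0 * x

z = 6.0               # observation; the 'true' parameter is x = 2
total_variance = 0.5  # V_o + V_s + V_m, assumed already quantified
c = 3.0               # implausibility threshold (Pukelsheim's 3-sigma rule)

lo, hi = 0.0, 10.0    # initial parameter space
for wave in range(3):
    xs = rng.uniform(lo, hi, size=200)            # sample the current space
    I = (z - simulate(xs)) ** 2 / total_variance  # implausibility per sample
    keep = xs[I < c]                              # non-implausible samples
    lo, hi = keep.min(), keep.max()               # narrowed space for next wave
```

After a few waves the retained interval [lo, hi] brackets the true parameter, illustrating how the non-implausible region shrinks wave by wave.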

Approximate Bayesian Computation (ABC)
We now describe the process of the rejection sampling ABC algorithm, which we use in our examples. Note, however, that alternative ABC methods that search the sample space more efficiently may be used, such as Markov chain Monte Carlo or sequential Monte Carlo (Turner & Van Zandt ; Thiele et al. ). To conduct ABC, samples are selected from a prior distribution. As HM does not tell us if any parameter set is more or less probable than another, our prior is represented as a uniform distribution. We use n particles that search the non-implausible parameter space. Large values of n (above 10,000) will provide accurate results, whereas a smaller n may be used at the expense of decreasing the power of the result. For each particle, we sample from our prior and run the model. If the error of the model output is less than ε we keep the sample for that particle. Otherwise, if the error is too large, we re-sample from the prior until we find a successful parameter set for that particle.
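A minimal sketch of this rejection-sampling loop, using a hypothetical stochastic simulator and an assumed uniform prior standing in for the non-implausible space found by HM; all numeric values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(x):
    """Hypothetical stochastic simulator; the 'true' parameter is x = 2."""
    return 3.0 * x + rng.normal(scale=0.5)

z = 6.0            # observed data
eps = 0.5          # acceptance threshold (epsilon)
n = 500            # number of particles
lo, hi = 1.0, 3.0  # uniform prior over the non-implausible space from HM

accepted = []
for _ in range(n):
    while True:  # re-sample until this particle finds an accepted parameter
        x = rng.uniform(lo, hi)
        if abs(z - simulate(x)) < eps:
            accepted.append(x)
            break

posterior_sample = np.array(accepted)
```

The accepted particles approximate the posterior, whose mass concentrates near the true parameter value; for an expensive ABM each call to `simulate` would be a full model run, which is why narrowing the prior with HM first saves so many runs.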
Choosing an optimum value for ε can be challenging. Setting ε close to zero would ensure a near-exact posterior, but will result in nearly all samples being rejected, making the procedure computationally expensive (Sunnåker et al. ). Increasing ε allows more parameters to be accepted but may introduce bias (Beaumont et al. ). One common method to deal with this is to choose a large value of ε in the initial run of the algorithm and gradually decrease it over subsequent runs. In this case, each value of ε may be set ahead of time (Toni et al. ) or may be adapted after each iteration (Del Moral et al. ; Daly et al. ; Lenormand et al. ). In our case, we propose using the uncertainties quantified in the final wave of HM to inform the initial choice of ε and then adapting the value in any further iterations.
Once each particle has a successful parameter set, the posterior distribution can be estimated. Kernel density estimation is a common approach to approximate the posterior (Beaumont et al. ). The posterior may then be further refined by re-running the ABC process using the posterior from the first run as the prior for the second run.
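A simple sketch of approximating the posterior density from accepted samples with a hand-rolled Gaussian kernel density estimate; the bandwidth rule (Silverman's rule of thumb) and the sample values are assumed defaults for illustration:

```python
import numpy as np

def gaussian_kde(samples, grid, bandwidth=None):
    """Gaussian kernel density estimate of `samples`, evaluated on `grid`."""
    samples = np.asarray(samples, float)
    if bandwidth is None:
        # Silverman's rule of thumb, a commonly assumed default
        bandwidth = 1.06 * samples.std() * len(samples) ** (-1 / 5)
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Illustrative accepted ABC samples, centred on a 'true' value of 2
rng = np.random.default_rng(2)
samples = rng.normal(2.0, 0.2, size=1000)
grid = np.linspace(1.0, 3.0, 201)
density = gaussian_kde(samples, grid)
```

The estimated density peaks near the centre of the accepted samples and integrates to approximately one; library implementations (e.g., in SciPy) provide the same functionality with more bandwidth options.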
A framework for robust validation: SugarScape example
In this paper, we propose using HM together with ABC to calibrate the structure and parameters of an ABM. Specifically, the proposed process consists of the following four steps:
• Define the parameter space to be explored.
• Quantify all uncertainties in the model and observation.
• Run HM on the parameter space.
• Run ABC, using the HM results as a prior.
From these steps, we gain a posterior distribution of the parameters, which can be sampled to obtain a distribution of plausible outcomes from the model. The uncertainties quantified in the second step can then be used to understand the reliability of the model's outputs.
In this section, we describe the process of these steps for the general case. We then demonstrate each step using the toy model SugarScape (Epstein & Axtell ).
Define the parameter space to be explored
We must first decide the ranges that each parameter could take (i.e., the parameter space to be explored). This may be decided qualitatively, through expert judgement (Cooke & Goossens ; Zoellner et al. ). In some cases, the potential values of a parameter are directly measurable (e.g., the walking speed of pedestrians) and, therefore, quantitative measurements can be used to choose an appropriate range of values for the parameter.
Quantify all uncertainties in the model and observation

Model discrepancy. To estimate the model discrepancy, we measure the variance of the model error across a set of sampled parameters:

$$V_m^r = \frac{1}{N}\sum_{n=1}^{N}\left(d(z_r, f_r(x_n)) - E_r(X)\right)^2$$

where N is the total number of samples used, d is the measure of error between the r th expected output (z_r) and the r th model output (f_r(x_n)) for the parameters x_n, and E_r(X) is the average model error over the N parameter sets:

$$E_r(X) = \frac{1}{N}\sum_{n=1}^{N} d(z_r, f_r(x_n))$$
We measure the error of multiple samples (instead of using only a single sample) because different samples within the space will likely result in different quantified errors, and the variance of different samples should provide a better overview of the whole space than a single sample.
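The calculation described above, the variance of the model error across N sampled parameter sets, can be sketched as follows; the observation and simulator outputs are illustrative values, and the squared-error measure is the assumed default:

```python
import numpy as np

def model_discrepancy(z, outputs):
    """Variance of the model error across N sampled parameter sets.
    `outputs` holds the simulator output f(x_n) for each of the N samples;
    the per-sample error d is the squared difference from the observation z."""
    errors = (np.asarray(outputs, float) - z) ** 2
    mean_error = errors.mean()  # the average model error over the samples
    return ((errors - mean_error) ** 2).mean()

# Illustrative values: observation z = 66, outputs from five parameter sets
V_m = model_discrepancy(66.0, [60.0, 64.0, 66.0, 70.0, 55.0])
```

The resulting value would then feed into the denominator of the implausibility score alongside the observation and ensemble variances.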
Ensemble variance. We select a subset of N samples that will be tested as part of HM. For each sample, we run the model K times; larger values of K will produce a more accurate result at the expense of increasing computational time. The variance is calculated between the K runs, and the average variance across the N samples is taken. Specifically, we measure:

$$V_s^r = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{K}\sum_{k=1}^{K}\left(f_r^k(x_n) - E_r(x_n)\right)^2$$

where K is the total number of runs in an ensemble, f_r^k(x_n) is the r th output from the k th run of the model with parameters x_n, and E_r(x_n) is the average model output across the ensemble:

$$E_r(x_n) = \frac{1}{K}\sum_{k=1}^{K} f_r^k(x_n)$$

Observation uncertainty. How observation uncertainty can be measured depends on how the observations are obtained. If the real-world process is directly measurable then multiple direct observations can be made and their variance used to quantify the uncertainty.
It may also be the case that only indirect measurements can be obtained, and so there will be uncertainty in transforming them so that they can be compared against the model output (Vernon et al. ). If an observation cannot be measured, expert judgements may be useful in determining the expected model output and the uncertainty around it. For example, experts may provide quantiles of the expected output, from which uncertainty can be understood (O'Hagan et al. ).

Run HM on the parameter space
Given the defined parameter space (in the first step) and the uncertainties quantified (in the second step), we next apply HM. In large or continuous parameter spaces, we use Latin Hypercube Sampling (LHS) to select samples.
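A minimal hand-rolled sketch of LHS for a two-dimensional parameter space; library implementations exist (e.g., in SciPy), but this shows the stratified structure, and the bounds are illustrative only:

```python
import numpy as np

def latin_hypercube(n, bounds, rng):
    """Latin hypercube sample of n points within per-dimension (low, high) bounds.
    Each dimension is split into n equal strata and each stratum is sampled
    exactly once, with the strata paired up at random across dimensions."""
    bounds = np.asarray(bounds, float)
    d = len(bounds)
    # One uniform draw per stratum in each dimension...
    u = (rng.random((n, d)) + np.arange(n)[:, None]) / n
    # ...then shuffle each column independently to randomise the pairing.
    for j in range(d):
        rng.shuffle(u[:, j])
    return bounds[:, 0] + u * (bounds[:, 1] - bounds[:, 0])

rng = np.random.default_rng(3)
# Illustrative two-parameter space, e.g. metabolism in [1, 4], vision in [1, 16]
samples = latin_hypercube(50, [(1, 4), (1, 16)], rng)
```

Compared with independent uniform sampling, every stratum of every parameter is guaranteed to be covered, which gives a more even exploration of the space for the same number of model runs.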
Run ABC, using the HM results as a uniform prior
The final non-implausible space found through HM is used as a uniform prior for ABC. Any appropriate ABC method may be used, such as rejection sampling or sequential Monte Carlo. We propose using the uncertainties measured in the final wave of HM to inform the initial value of ε. Based on Equation , let:

$$\epsilon_r = c\left(V_o^r + V_s^r + V_m^r\right)$$
The result of performing ABC is a posterior distribution over the non-implausible parameter space identified by HM. Note that we do not perform ABC on the full initial parameter space chosen in the first step.

A Step-by-Step Example: SugarScape

In this section, we provide a step-by-step example of the proposed framework using the Sugarscape model (Epstein & Axtell ). Sugarscape is an environment that contains a resource (sugar) and agents that collect and metabolise this resource. The environment is a grid in which sugar is distributed such that some regions are rich with sugar, whilst others are scarce or bare. Figure shows the environment, where darker regions indicate higher amounts of sugar. The amount of sugar in a given location is an integer in the range [0,4]. At the start of the simulation, agents are placed in random locations of the environment. We use the Sugarscape example provided by the Python Mesa toolkit by Kazil et al. ( ). This implements a simple version of the model, in which each agent's only goal is to search for sugar to stay alive. We made minor changes to the code (detailed in our source code) enabling us to change the maximum vision and metabolism of the agents.

Figure : The Sugarscape environment. The amount of sugar at each grid square is an integer ranging from 0 to 4, where 0 is indicated by the lightest green and 4 by the darkest.

In each time step, the agents move, collect sugar, and consume sugar. More specifically, the agent movement rule is (Epstein & Axtell ):
• Observe all neighbours within vision range in the von Neumann neighbourhood.
• Identify the unoccupied locations that have the most sugar.
• If multiple locations have the same amount of sugar choose the closest.
• If multiple locations have the same amount of sugar and are equally close, randomly choose one.
• Move to the chosen location.
• Collect all of the sugar at the new location.
• Eat the amount of sugar required to survive (according to the agent's metabolism).
• If the agent does not have enough sugar, they die.
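The movement rule above can be sketched as follows. This is a simplified illustration assuming a non-wrapping grid indexed as sugar[x][y]; it is not the Mesa implementation used in the paper:

```python
import random

def candidate_cells(pos, vision, width, height):
    """Cells within `vision` steps of pos along the four grid axes
    (the von Neumann neighbourhood, excluding the agent's own cell)."""
    x, y = pos
    cells = []
    for d in range(1, vision + 1):
        for nx, ny in ((x + d, y), (x - d, y), (x, y + d), (x, y - d)):
            if 0 <= nx < width and 0 <= ny < height:
                cells.append((nx, ny))
    return cells

def choose_move(pos, vision, sugar, occupied, rng):
    """Pick the unoccupied cell with the most sugar, breaking ties first
    by distance and then at random, following the movement rule."""
    options = [c for c in candidate_cells(pos, vision, len(sugar), len(sugar[0]))
               if c not in occupied]
    if not options:
        return pos  # nowhere to go: stay put
    best_sugar = max(sugar[x][y] for x, y in options)
    best = [c for c in options if sugar[c[0]][c[1]] == best_sugar]
    dist = lambda c: abs(c[0] - pos[0]) + abs(c[1] - pos[1])
    closest = [c for c in best if dist(c) == min(dist(b) for b in best)]
    return rng.choice(closest)

# Illustrative 3x3 grid, indexed sugar[x][y]
sugar = [[0, 1, 2],
         [3, 0, 1],
         [2, 4, 0]]
move = choose_move((1, 1), 1, sugar, set(), random.Random(0))
```

With vision 1 from the centre of this grid, the cell holding 4 sugar is the unique best option, so the agent moves there deterministically; the random tie-break only matters when several equally good, equally close cells exist.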

In each time step, the sugar grows back according to the sugar grow-back rule:
• Increase the sugar amount by one if not at the maximum capacity of the given space.
In this example, there are two parameters that we wish to explore. These are:
• The maximum possible metabolism of an agent

• The maximum possible vision of an agent
where metabolism defines how much an agent needs to eat in each step of the model, and vision defines how far an agent can see and, consequently, how far they can travel in each step. Both of these parameters take integer values, and the minimum metabolism or vision an agent may have is 1. The metabolism and vision given to an agent is a random integer between 1 and the calibrated maximum.
The simulation begins with an initial population of agents. However, some do not survive as there is not sufficient food. As such, our measured outcome is the size of the population that the model can sustain. We create observational data using the output of an identical twin model (a model run to gain an artificial real-world observation). With the chosen maximum metabolism and maximum vision, the model was able to sustain a population of 66 agents.
Stochasticity in the model arises from several sources. The initial locations of the agents are chosen randomly in each run. When moving, if multiple locations are equally fit for an agent to move to, the agent chooses between them at random. The metabolism and vision of each agent is an integer randomly chosen within the defined range. These random choices lead to a stochastic model that will produce different results with each run.
Define the parameter space to be explored

We choose the plausible values for maximum metabolism to be in the range [1,4]. We choose 4 as the maximum as there is no location in Sugarscape that has more than 4 sugar. We choose the values for maximum vision to be in the range [1,16]. We choose 16 as the maximum as this is the furthest distance from an empty location (i.e., with no food) to a non-empty location.

Quantify all uncertainties in the model and observation
Model discrepancy
We are interested in finding the parameters that lead to a model that sustains a population of 66 agents. To test this, we ran the model with three different parameter sets, each combining different values of {metabolism, vision}, over a number of model steps, and repeated each run multiple times to take into account ensemble uncertainty. For each case, we find that in all runs the agent population has stabilised by the step at which we take our measurement (see Figure ). To quantify the error of the model, we measure the absolute difference between the population at that step and the expected population observed (i.e., 66). Therefore, to measure model discrepancy, we use Equation with z = 66 and d(z, f(x_n)) = |z − f(x_n)| (note that we have only one measured output and observation in Sugarscape, so we omit r from the formulae).

Figure : The total agent population in Sugarscape for repeated runs of different parameter sets.
We tested changes in ensemble variance across increasing ensemble sizes to find an optimal size, using the same three {metabolism, vision} parameter sets used to determine when the model has stabilised. Figure shows how the ensemble variance changes with ensemble size. We find the variance stabilised after a modest number of runs, and so use this as our ensemble size; the measured variance at this point differs between the three parameter sets.

Run HM on the parameter space
As we have a reasonably small sample space, we measure the implausibility of each parameter pair. Note that typically the sample space is too large and, as described above, LHS sampling is used instead. We performed several waves of HM, stopping when a wave did not reduce the non-implausible parameter space further. Figure shows the results of the first and final waves, where dark grey represents implausible regions and light orange represents non-implausible regions. The figure shows the whole space explored; that is, prior to the first wave the full set of parameters is assumed to be non-implausible (and would be pictured entirely orange), and each set of parameters in the space was tested for implausibility.

In each wave, we retested each of the parameters that were found to be non-implausible in the previous wave. For example, Figure shows the results of the sample space at the end of wave . The non-implausible (orange) space here was used as the input for wave . In this example, the same parameters were retested because the parameter space is discrete, and so there were no other values to test. However, in a continuous space (demonstrated later), new values within the plausible region are typically tested instead of retesting the same values (Andrianakis et al. ). This example illustrates how HM can successfully reduce the space of possible parameter values and ABC can quantify the probability that these non-implausible parameter values could have produced the observed data.
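A single HM wave over a discrete parameter grid can be sketched as below. The implausibility form (absolute standardised distance) and the threshold c = 3 follow the standard HM literature; the `run_mean` response surface is a hypothetical stand-in for the Sugarscape model.

```python
import math

def implausibility(mean_output, z, v_obs, v_ens, v_disc):
    """I(x) = |z - E[f(x)]| / sqrt(v_o + v_s + v_m)."""
    return abs(z - mean_output) / math.sqrt(v_obs + v_ens + v_disc)

def history_matching_wave(candidates, run_mean, z, v_obs, v_ens, v_disc, c=3.0):
    """One HM wave over a discrete grid: keep the parameter sets
    whose implausibility falls below the threshold c."""
    return [x for x in candidates
            if implausibility(run_mean(x), z, v_obs, v_ens, v_disc) < c]

# Hypothetical toy response surface standing in for Sugarscape.
def run_mean(x):
    metabolism, vision = x
    return 60 + 2 * vision - 3 * metabolism

grid = [(m, v) for m in range(1, 5) for v in range(1, 7)]
wave1 = history_matching_wave(grid, run_mean, z=66,
                              v_obs=1.0, v_ens=1.0, v_disc=4.0)
```

In a discrete space like this, later waves simply re-apply the same filter to `wave1`; in a continuous space, new candidates would be sampled within the surviving region instead.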

Experiments and Results
In this section, we apply our HM and ABC framework to two ABMs of real-world processes. The first model simulates the movement of territorial birds (Thiele et al. ). This model was chosen because many traditional calibration methods have been demonstrated using it, enabling us to compare our approach with a range of commonly used alternatives. The second model is more complex, simulating changes in Scottish cattle farms in response to external policies over time (Ge et al. ). This model was chosen because (i) it provides a real-world test for the proposed framework and (ii) the calibration results themselves can provide interesting insight into the behaviour of the real system.

Case Study : Comparing against alternative calibration methods
We now compare our proposed approach to alternative calibration methods, including point-estimation methods (such as simulated annealing) and distribution-estimation methods (such as ABC alone, without HM).

Overview of the model
We use an example model by Railsback & Grimm ( ). The purpose of this model is to explore the social groups that occur as a result of birds scouting for new territories. The entities of the model are birds and the territories that they may occupy. There are locations arranged in a one-dimensional row that wraps into a ring. Each step of the model represents one month, and the simulation runs for years, the first two of which are ignored for the results. In each step, 'non-alpha' birds decide whether to scout for a location free of an alpha bird, where they can become an alpha bird. A full ODD description of the model used, as well as a link to the source code, is provided by Thiele et al. ( ).
Two model parameters are considered for calibration. These are the probability that a non-alpha bird will explore new locations, and the probability that it will survive the exploration. These are described as the scouting probability and survival probability, respectively. Calibration of the model involves finding values for these two parameters that enable the model to fit three criteria. Measured over a period of years, the criteria are: ) the average total number of birds, ) the standard deviation of total birds, and ) the average number of locations that lack an alpha bird. The model aims to match these three criteria based on observational data. The criteria are combined to obtain a single measurement of error (therefore only one model output) as provided by Railsback & Grimm ( ). Specifically, we measure the error as given in Equation , where f_1(x) is the first model output, and z_1 is the corresponding empirical data, given as an interval of acceptable values that f_1(x) can fall in. The error in Equation is used with HM in Equation .
An interval is used instead of a single value to represent the uncertainty of the observation. The error e is given in Equation , where z̄_r is the mean value of the range z_r. We use Equation as the measurement of error for both HM and ABC.
Model discrepancy. To measure model discrepancy, we select a random sample of parameter sets using an LHS design. We then measure the model discrepancy using Equation with the model error given in Equation .
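An interval-based error of this kind can be sketched as follows. The exact scaling in the paper's equation is not recoverable from the extracted text, so the form below (zero inside the interval, deviation scaled by the interval midpoint z̄ outside it) is one plausible reading, shown only to make the idea concrete.

```python
def interval_error(output, low, high):
    """Error against an observation given as an acceptable interval.
    Zero inside [low, high]; outside it, the deviation from the interval
    midpoint is scaled by that midpoint (assumed form, not the paper's
    exact equation)."""
    midpoint = (low + high) / 2.0
    if low <= output <= high:
        return 0.0
    return abs(output - midpoint) / midpoint
```

Any output landing inside the acceptable range therefore contributes no error, which is what allows many distinct parameter sets to be judged non-implausible.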
Ensemble variance. We measure ensemble variance as given in Equations and . To choose the size of an ensemble, we selected samples using an LHS design and ran the model for each sample across a variety of ensemble sizes, measuring the variance in each case. Figure shows the resulting ensemble variance as the total runs of the model within an ensemble is increased. Figure a shows that one sample stands out as having a much higher ensemble variance than the remaining samples, which are indistinguishable from each other in the figure. Figure b shows the ensemble variance for these remaining samples. The variance appears to stabilise after about runs, so we choose this as our ensemble size.

We use LHS to generate samples within the initial plausible space (this is the same number of LHS samples as used by Thiele et al. ).
In the first wave, the HM procedure judged that of these samples were non-implausible. We then re-sampled (using a new set of samples) within the new non-implausible region. After the third wave of this procedure, HM was unable to decrease the area of non-implausible space any further. These results are similar to those found with ABC methods by Thiele et al. ( ).
Figure : Results of three waves of HM on the birds model. Each point represents an explored parameter that was found to be implausible (grey) or non-implausible (orange).
Next, we use the ABC rejection sampling method with particles to obtain a probability distribution within the sample space. Our prior is a uniform distribution over the non-implausible space found by HM. Figure a shows the resulting posterior distribution, where lighter shades indicate a higher probability. The results show the model is able to match the criteria best if the scouting probability is in the approximate range [0.2, 0.5] and the survival probability is approximately 0.98. By contrast, we also performed ABC with the same number of particles but without the HM-informed prior. Figure b shows that the posterior is much broader when information from HM is not used. The results of ABC alone in Figure b are not much more informative than the results from HM alone in Figure . These results are similar to those found using ABC alone (Thiele et al. ).
We achieved a posterior with runs of the model using HM and ABC. By contrast, Thiele et al. ( ) obtained similar results with ABC alone using over , runs. They also demonstrate that simulated annealing and evolutionary algorithms can be used to explore the parameter space. While these methods require fewer runs of the model ( and , respectively) than HM alone ( ), they are intended only to provide the best-fitting parameters that were tested, whereas HM has the advantage of discovering a region of parameters that fits well, and ABC provides a posterior distribution within this region. Note that fewer runs may be used to estimate the ensemble variance, making HM computationally faster, but this may decrease its accuracy and lead to non-implausible samples being quantified as implausible.

To test the accuracy of our proposed approach of using HM followed by ABC, we compare results from this approach against using ABC alone across a range of parameter values. To do this, we generated pseudorandom pairs of the survival probability and scouting probability parameters. The values were chosen using an LHS design to ensure the samples effectively cover the sample space. For each parameter pair, we use the model to produce synthetic data against which we then calibrate the model. In the previous sections, we calibrated the model using the acceptable ranges of model outputs described in paragraph . , which were derived from observational studies. We have seen in Figure that a relatively small range of scouting and survival probabilities are likely to fall within these criteria. Since we are now using the model to produce 'ground truth' data over a range of parameter values, we need to identify corresponding ranges of outputs that are acceptable. Thus, for each parameter pair, we ran the model times to produce observations, then set the acceptable range of each of the model fitting criteria to be the corresponding minimum and maximum from those observations.
Using these observational ranges, we then performed HM followed by ABC rejection sampling using the same process as described in the previous section (i.e., using Equation to measure error, a total of model runs to measure ensemble variance, a total of samples across the plausible space for each wave of HM, and samples for ABC rejection sampling). This was carried out for each of the parameter pairs.
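The generation of synthetic acceptance ranges described above can be sketched as follows. The `run_birds_model` function is a hypothetical stand-in (its outputs are invented); the range construction, taking the min and max of each criterion over repeated runs at the 'true' parameters, follows the procedure in the text.

```python
import random

def run_birds_model(scouting_p, survival_p, seed):
    """Stand-in for the territorial-birds model: returns the three
    fitting criteria (mean birds, sd of birds, mean vacant sites).
    The response surface here is purely illustrative."""
    rng = random.Random(seed)
    mean_birds = 100 * survival_p + 20 * scouting_p + rng.gauss(0, 2)
    sd_birds = 5 + 10 * scouting_p + abs(rng.gauss(0, 1))
    vacant = 30 * (1 - survival_p) + rng.gauss(0, 1)
    return (mean_birds, sd_birds, vacant)

def acceptance_ranges(scouting_p, survival_p, n_obs=10):
    """For one 'true' parameter pair, run the model n_obs times and set
    each criterion's acceptable range to the observed min and max."""
    obs = [run_birds_model(scouting_p, survival_p, s) for s in range(n_obs)]
    return [(min(col), max(col)) for col in zip(*obs)]
```

Each parameter pair thus yields its own triple of acceptance intervals, against which HM and ABC are then run exactly as for the observational data.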
Across the parameter pair tests, there were nine examples where the model output for total vacant locations (see criteria in paragraph . ) was zero when the model was run to generate observational data. This occurred when the input survival probability was close to . The measurement error used by Thiele et al. ( ) is not fit for this situation (resulting in division by zero), and so HM (or any form of calibration) could not be performed against these observations. In the remaining tests, HM retained the correct parameters in cases and failed in one case, where the expected value for scouting probability was not found. However, in all other tests the plausible range of scouting probability contained the correct test value. On average, the non-implausible space for scouting probability was reduced to % of its original size, and the space for survival probability was reduced to % of its original size.
We next performed ABC rejection sampling with particles on the cases where HM could be run. We wish to compare the results of using HM before ABC against using ABC alone. Therefore, we ran ABC once using the non-implausible space found through HM as an informed prior, and once using an uninformed prior. We found that, on average, ABC required more model runs when given an uninformed prior compared to when given the HM-informed prior. The HM procedure required between and runs. Taking this into account, if we subtract the total model runs used to carry out HM from the total runs saved in the ABC process by using the HM-informed prior for each of the tests, then we find we saved an average of total model runs by using HM before ABC, compared to not using HM. If more particles are used to generate a more accurate posterior, this difference is likely to increase.

We analysed the results by measuring the percentage of times the expected value was within the % credible interval (CI) across the tests for both scouting probability and survival probability. We also measured the average mean absolute error (MAE) and the size of the % CI across all tests. Table shows the results. The results indicate that combining HM with ABC produces slightly smaller MAEs and slightly narrower CIs compared with using ABC alone. However, the true parameter is less often contained within the % CI. Therefore, using HM with ABC provides results that are more precise (as shown by the smaller MAEs) and more efficient (requiring fewer model runs), but at the cost of CIs that may be too small.
Table : The percentage of tests where the true parameter is contained within the % credible interval (CI) of the model runs, the average mean absolute error between the expected and actual results across all runs, and the average size of the % CI when using ABC alone or using HM before ABC.
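The three evaluation metrics above (coverage, MAE, CI width) can be computed from posterior samples as sketched below. The equal-tailed interval construction is an assumption; the paper does not state how its credible intervals were formed.

```python
import statistics

def credible_interval(samples, level=0.95):
    """Equal-tailed credible interval from a list of posterior samples
    (assumed construction; simple index-based quantiles)."""
    s = sorted(samples)
    lo = s[int((1 - level) / 2 * (len(s) - 1))]
    hi = s[int((1 + level) / 2 * (len(s) - 1))]
    return lo, hi

def evaluate(posteriors, truths, level=0.95):
    """Coverage (%), mean absolute error and mean CI width across
    repeated calibration tests. `posteriors` is a list of sample lists,
    `truths` the matching true parameter values."""
    covered, maes, widths = 0, [], []
    for samples, truth in zip(posteriors, truths):
        lo, hi = credible_interval(samples, level)
        covered += lo <= truth <= hi
        maes.append(abs(statistics.mean(samples) - truth))
        widths.append(hi - lo)
    n = len(truths)
    return 100 * covered / n, statistics.mean(maes), statistics.mean(widths)
```

Running this once for the HM-informed posteriors and once for the ABC-alone posteriors reproduces the kind of comparison shown in the table.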
Model discrepancy. For each model and each output, we measure the model error using Equation . We then average the errors V_m^{rj} across all possible models, resulting in a single model discrepancy term per output, denoted V_m^r.
Ensemble variance. The models are stochastic, meaning each time we run them the results will be different. We want to know how much variance there is across multiple runs. We run each model a total of times and measure the variance in the results. We do this separately for each model and each output. Then, for each output, we average the variance across all models. Note that it is not necessary to calculate ensemble variance for each parameter set, but as there are only models, we have the computational resources to measure them all. Figure shows the results. We can see that the ensemble variance has stabilised for each output at an ensemble size of runs.
Figure : The ensemble variance in the RISC model across increasing ensemble sizes for each output (number of small, medium and large farms).
To calculate ensemble variance for a given model j, we measure the difference between each ensemble run and every other ensemble run using MASE (see Equation ). Given runs, this makes for a total of comparisons.
First, the set of comparisons is given as D_j^r = {d(f_a^r(x), f_b^r(x)) : a ≠ b}, where f_a^r(x) is the result of the r-th output in the a-th run of the model. Note that MASE (see Equation ) is nonsymmetric, so we compare f_a^r(x) with f_b^r(x) in both directions in Equation .
Note that to ensure ensemble variance and model discrepancy are directly comparable, we measure both using MASE.
Next, we calculate the variance of the set of ensemble runs given in Equation . This is V_s^{rj} = (1/|D_j^r|) Σ_a (D_{ja}^r − E(D_j^r))², where D_{ja}^r is the a-th value in the set D_j^r, and E(D_j^r) is the average of the set.
Finally, we calculate the average ensemble variance across the J models for output r as V_s^r = (1/J) Σ_j V_s^{rj}. These three steps (Equations – ) are performed for each output r. Therefore, we have a separate measure of ensemble variance per output.
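The pairwise-MASE construction for one model and one output can be sketched as follows. The MASE definition used (MAE scaled by a naive one-step persistence forecast of the reference series) is the standard one; whether the paper uses exactly this scaling is an assumption.

```python
import statistics

def mase(actual, predicted):
    """Mean absolute scaled error of `predicted` against `actual`,
    scaled by the MAE of a naive persistence forecast of `actual`.
    Note the measure is nonsymmetric in its two arguments."""
    mae = statistics.mean(abs(a - p) for a, p in zip(actual, predicted))
    naive = statistics.mean(abs(a1 - a0) for a0, a1 in zip(actual, actual[1:]))
    return mae / naive

def pairwise_ensemble_variance(runs):
    """Variance of the set of pairwise MASE comparisons between
    ensemble runs (one output series per run), taken in both
    directions because MASE is nonsymmetric."""
    comparisons = [mase(runs[a], runs[b])
                   for a in range(len(runs))
                   for b in range(len(runs)) if a != b]
    return statistics.pvariance(comparisons)
```

For S runs this produces S(S−1) comparisons, matching the both-directions note in the text.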

.
Three waves of HM were performed, after which HM was unable to reduce the non-implausible space any further, so the procedure was stopped. In the first wave, the non-implausible space was reduced from scenarios to seven. The second wave further reduced the plausible space to four scenarios. Table shows the four models that were found to be plausible and which factors were switched on. These results match those in Ge et al. ( ), who use POM to select plausible models. The common feature of these four models is that succession and leisure are always turned on, whereas the remaining two factors (diversification and industrialisation) are mixed.
Table : The plausible farming models, indicating their ID numbers and which of the four factors (Succ., Leisure, Divers., Indust.) are switched on (marked with an x).
Table shows the ensemble variances (V_e) and model discrepancies (V_m) in each wave. There is little stochasticity in the model, as shown by the relatively small ensemble variances, whereas there is noticeable discrepancy between the model results and the empirical data. Therefore ensemble variance has little effect on the implausibility score of each model, whilst model discrepancy has a strong effect.

HM has helped narrow down our list of plausible models from to four. This reduction of the parameter space was achieved quickly (with model runs) compared to the number of runs that would be required with alternative calibration methods. The result also matches that found using POM (Ge et al. ). However, HM does not provide the probability that these four models can accurately match the empirical data. Instead, each model is considered equally plausible. To gain insight into the probabilities, we use the rejection sampling method of ABC.
We initially set the threshold to be the sum of the uncertainties found in the final wave of HM, as given in Table . We ran each of the four models times, resulting in runs of model accepted, and no runs of the remaining three models. Increasing this threshold by 0.95, however, resulted in most runs being accepted for models and , and less than a quarter of the runs accepted for models and (see Figure ). If the threshold is any lower, then no runs for or are accepted, but over are accepted for and . These results suggest that models and have the best probability of successfully simulating changes in the size of Scottish cattle farms over time. Both of these models include industrialisation (as well as succession and leisure) as part of the farm holder's decision making. Models and , which are less likely to match the empirical data, do not include industrialisation.
Figure : Percentage of runs that produced an error smaller than the threshold for each output across models - , where the threshold was set as the sum of the uncertainties + 1.5 .
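Threshold-based rejection across a small set of competing model structures can be sketched as below. The two model variants are hypothetical stand-ins (their output distributions are invented); the mechanism, accepting a run when its error falls below a threshold derived from the quantified uncertainties, is the one described in the text.

```python
import random

def abc_rejection(models, observed_error, n_runs=500, threshold=1.0, seed=0):
    """ABC rejection over competing models: run each model repeatedly
    and record the fraction of runs whose error against the data falls
    below the threshold. Fractions act as relative evidence."""
    rng = random.Random(seed)
    acceptance = {}
    for name, run in models.items():
        accepted = sum(1 for _ in range(n_runs)
                       if observed_error(run(rng)) < threshold)
        acceptance[name] = accepted / n_runs
    return acceptance

# Hypothetical stand-ins for two RISC model variants; the variant
# 'with_industrialisation' is given outputs closer to the data.
models = {
    "with_industrialisation": lambda rng: rng.gauss(0.0, 0.5),
    "without_industrialisation": lambda rng: rng.gauss(1.5, 0.5),
}
rates = abc_rejection(models, observed_error=abs, threshold=1.0)
```

The resulting acceptance fractions play the role of the per-model percentages plotted in the figure.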
Through ABC rejection sampling, we have found the best models that fit the empirical data and have learnt the relative importance of each of the four factors considered. This was achieved through measuring the uncertainties associated with the model and using those uncertainties to help choose appropriate thresholds for the rejection sampling procedure.

Discussion
We used two models to demonstrate the utility of the proposed framework. First, we used the territorial birds model to compare our proposed approach against existing approaches. Thiele et al. ( ) demonstrate a range of calibration methods on the model, including random sampling, simulated annealing and ABC. We showed that by performing HM before ABC, we can obtain results equivalent to using ABC alone with considerably fewer runs of the model, thus saving computational time. We can also obtain an accurate prior with HM alone, which is useful for cases where performing ABC afterwards is computationally infeasible. This is achieved through identifying the uncertainties of the model.

We also demonstrate that HM provides more information than point-estimation calibration methods, as it provides a region of parameters that perform well whilst also accounting for uncertainties in the model and data. While this required more runs of the model than a point-estimation method, the information gained is valuable.
Second, for the farming example (the RISC model), HM helped to rule out the implausible models. Specifically, models without leisure and succession are not chosen, which is consistent with model selection using POM (Ge et al. ). Our approach enhances our understanding of the role of different processes that co-exist in Scottish dairy farms. We found that the lack of succession (which explains the increasing number of small farms) and leisure (which explains the existence of non-profitable small farms) are the primary driving forces behind the polarisation of Scottish dairy farms.
In addition to HM, ABC further distinguishes models with industrialisation (employing a professional manager to help expand a farm) as having a higher probability of matching the empirical data than those without. This indicates that although industrialisation may not be the primary driving force of the trend of polarisation, it is likely that it does play a role. This role may explain the increasing number of large farms. Without considering industrialisation, the model fails to capture the growth of large farms. This finding is new, and was not picked up by POM previously, or by HM alone. The reason is that both POM and HM are categorical, so they accept models both with and without industrialisation elements, as they are all plausible. ABC, however, estimates that models with industrialisation have a higher probability of matching the data than those without. This new insight is a direct consequence of combining HM with ABC to calibrate an ABM.
If a point-estimation method of calibration were used on the RISC model, we would only discover a single best-fitting model and would not have discovered that multiple models provide a good fit, which has led to a better understanding of the factors that affect farmers' decision making. The advantage of using ABC over POM is that we are able to find the probabilities that these factors affect decision making. Furthermore, by using HM before ABC we were able to discover this with significantly fewer runs than would have been required with ABC alone (because the sample space was narrowed to a more accurate region by HM), thereby saving computational time.

Conclusions
Designing and calibrating ABMs is a challenge due to uncertainties around the parameters, model structure and stochasticity of such models. We have illustrated a process of calibrating an ABM's structure and parameters that quantifies these uncertainties through the combined use of HM and ABC. The code and results used in this paper are all available online at https://github.com/Urban-Analytics/uncertainty.

We show that HM can be used to efficiently reduce parameter space uncertainties; moreover, by quantifying the model uncertainties it is only necessary to test each chosen parameter once. Following this, ABC provides a more detailed exploration of the remaining parameter space, quantifying uncertainties in terms of a probability distribution over non-implausible values.
We demonstrate this process with a toy example (Sugarscape) and two models of real-world processes, which simulate the movement of territorial birds and the changing sizes of cattle farms in Scotland. In the territorial birds model, we demonstrate that our approach is more informative than point-estimation calibration methods, and more efficient than Bayesian calibration methods alone without HM. We show that the number of model runs required for calibration is approximately halved if HM is used before ABC, compared to using ABC alone. While this is shown with a simple model and a simple ABC method, we believe that using HM will be beneficial with more complex models even when more efficient methods of ABC (e.g., sequential Monte Carlo) are used. In the farming model, we show that HM was able to test competing sociological theories and removed all models with a structure that was expected to be implausible based on an alternative POM approach (Ge et al. ). We then show that ABC provides insights into the factors that are important to Scottish cattle farmers when deciding to change the size of their farms.
As the number of parameters in a model increases, the resources required to calibrate the model grow to become prohibitive. We have suggested using HM to quickly find a narrow area of the search space, which can then be explored in more detail with a rigorous approach, such as ABC. For a particularly large parameter space or a computationally demanding model, one could use the non-implausible space found by HM to build a surrogate model. This simpler surrogate may be used in place of the real model to carry out ABC with more feasible resources (Lamperti et al. ).
In future work, we will explore our proposed approach further using new results generated with the RISC model. We will create true data generated using the best-fitting parameters found in this paper and investigate whether our method still finds the correct parameters to be the most likely sets to fit our true data.
If ABMs are to achieve their potential as a go-to tool for policymakers and academics, robust calibration and uncertainty quantification methods need to be developed. Using the proposed process, calibration of ABMs can be carried out efficiently whilst taking into account all uncertainties associated with the model and the real-world process.
Term: Meaning
R: total observations/outputs
z_r: the r-th observation
f_r(x): the r-th model output
f_rj(x): the r-th output of the j-th model (RISC example)
d(z_r, f_r(x)): error measure between the r-th model output and observation
N: total parameters sampled
x_n: the n-th parameter set
v_o^r: observation variance of the r-th observation
v_s^r: ensemble variance of the r-th output
v_m^r: model discrepancy of the r-th output
I_r: implausibility measure of the r-th output
c: implausibility threshold
S: total runs in an ensemble
Table : Mathematical notations used throughout the paper and their meaning.

Notes
Note that in three of the tests, although the plausible range for survival probability contains the true value, if our precision were higher (to d.p. instead of d.p.) these cases would have been rejected.