Calibrating Agent-Based Models of Innovation Diffusion with Gradients

Abstract: Consumer behavior and the decision to adopt an innovation are governed by various motives, which models find difficult to represent. A promising way to introduce the required complexity into modeling approaches is to simulate all consumers individually within an agent-based model (ABM). However, ABMs are complex and introduce new challenges. The calibration of empirical ABMs in particular has been identified as a key difficulty in many works. In this work, a general ABM for simulating the Diffusion of Innovations is described. The ABM is differentiable and can employ gradient-based calibration methods, enabling the simultaneous calibration of large numbers of free parameters in large-scale models. The ABM and calibration method are tested by fitting a simulation with 25 free parameters to the large data set of privately owned photovoltaic systems in Germany, where the model achieves a coefficient of determination of $R^2 \simeq 0.7$.


Introduction
1.1 Investigating the Diffusion of Innovations within society follows a long tradition (Ryan & Gross 1943), since the underlying question of how and when people adopt something new is fascinating for both the social and economic sciences. Empirical data of the diffusion process usually shows an s-shaped adoption curve, and the pioneering works of Rogers (1962) and Bass (1969) were able to generate this behavior using only simple assumptions. They started a major rise in research on Innovation Diffusion, continuing until today (Mahajan et al. 1990; Meade & Islam 2006; Rao & Kishore 2010; Peres et al. 2010; Lai 2017). Applications cover various fields, such as the diffusion of mobile phones (Singh 2008), electric vehicles (Kangur et al. 2017; Gnann et al. 2018), photovoltaic (PV) systems (Zhao et al. 2011; Palmer et al. 2015; Candas et al. 2019), or energy-efficient technologies (Moglia et al. 2017, 2018; Hesselink & Chappin 2019), to mention only a few. Those models differ in both their complexity and calibration effort, where a rise in model complexity is usually accompanied by a rise in the effort to calibrate the model. Our contribution is an ABM with comparably low complexity and low calibration effort.

ABMs of innovation diffusion

2.3
The question of what constitutes an agent-based model is not easy to answer, since the term ABM lacks a common definition. Macal (2016) offers different ideas of what could become a general definition. He proposes ABMs to be models where agents are represented individually with diverse characteristics, and he proposes that ABMs can have autonomous agents that interact with others and the environment. Empirically grounded ABMs are ABMs that are parameterized, initialized, or evaluated by empirical data sets. They have two features in common: (1) they rely on input data sets and (2) their output is compared to data from the real world.

2.4
In the field of modeling Innovation Diffusion with ABMs, there exist general reviews (Kiesling et al. 2012) and reviews with a focus on empirically grounded models (Scheller et al. 2019). Johanning et al. (2020) extracted and summarized common methods and patterns from existing empirical ABMs of Innovation Diffusion. In this work, we will focus on empirical ABMs that use a utility function approach, meaning that the models introduce a term for each agent that controls the decision process. This term is usually called utility. Agents then adopt if their utility exceeds a predefined threshold.

2.5
Exemplary models with the utility function approach and an empirical background can be found in several works: Günther et al. (2011) define a utility for each agent in their study on biomass fuel diffusion. An agent wants to buy the biofuel if the utility exceeds an individual threshold. McCoy & Lyons (2014) simulate the diffusion of electric vehicles by assigning a utility to each agent. Here, the agent adopts with a certain probability as soon as the utility reaches a threshold. The weights within the utility are not calibrated to some data set but drawn randomly, where the range of possible values depends on the subgroup the agent belongs to. Kaufmann et al. (2009) show that the Theory of Planned Behavior (TPB, see Ajzen 1991; Muelder & Filatova 2018) also fits into the utility function approach. Their agents adopt if the intention value exceeds a derived threshold, where the intention is a weighted sum over the agent's attitude, subjective norm, and perceived behavioral control.

2.6
For the case of studying the diffusion of photovoltaic systems, several works use a utility function approach. Zhao et al. (2011) calculate a desire level for each agent, which is a weighted sum with four free parameters.
Once again, agents adopt if the desire level becomes greater than some threshold. Instead of calibrating the weights to some data set, they decided to use a survey to find the importance of each factor for the decision process. Palmer et al. (2015) use a weighted utility sum with four terms to study the diffusion of PV systems in Italy. Those terms account for the payback period, the environmental benefit of the investment, the household income, and the influence of communication with other agents. The weights of the four terms are not equal for all agents; they depend on the Sinus-Milieu the agent was assigned to. To calibrate their model and find a good threshold value, they run the ABM several times and choose the parametrization that produces the best results. Pearce & Slade (2018) base their utility on agents' income, the decision of neighboring agents, the capital cost, and the payback period of the investment. Similar to most other models, agents adopt if their utility exceeds a threshold. To find values for the five free parameters (four weights and the threshold value), they used the method of Approximate Bayesian Computation and ran their model 500,000 times on the High Throughput Computing facility of the Imperial College High Performance Computing Service. Schiera et al. (2019) use the Theory of Planned Behavior, where agents adopt if their behavior is greater than a user-defined threshold. The behavior is a weighted sum over the Behavioral Intention (bi) and the Perceived Behavioral Control (pbc), where both bi and pbc are weighted sums of their own. The values for the weights are found by a parameter sweep with a much smaller model of a "dummy city". In their conclusion, they write that the "main limitation of the presented work is most likely in the calibration of the ABM model".

2.7
In a theoretical work on the dependence of Innovation Diffusion on network topology, Delre et al. (2010) define agents that adopt if they have been informed about the Innovation and if their utility is higher than some threshold. We mention this work here even though it uses two steps within the decision process, since agents first have to check their information status. But as we will discuss at the end of Section 3, an information barrier could also be included in the utility function, which avoids the first step in the decision process. McCullen et al. (2013) also contribute a theoretical work and test the influence of utility weight values and network topology on the outcome of the simulation.

2.8
Regarding the TPB, Muelder & Filatova (2018) summarize different approaches of merging the TPB and ABMs. Their finding is that many works translate the TPB into a multi-attribute utility, and that the functional form often is a weighted sum.

2.9
When calibrating their models to empirical data, none of the above-mentioned works make use of model gradients. Instead, they compare model outputs from different parameters to empirical data using some kind of distance function. To find good parameter values without knowing the gradients, methods from the well-researched field of black-box function optimization are used (Shan & Wang 2010; Lamperti et al. 2018; Seri et al. 2021; McCulloch et al. 2022). The advantage of black-box methods is evident: since they do not require any knowledge about the model, the same methods can be applied to a variety of different models. A good comparison of gradient-free calibration of ABMs can be found in Platt (2020) and Carrella (2021). A gradient-based calibration, on the other hand, is model-specific: if the model changes, the gradients change and the calibration algorithm has to be adapted. Its advantage, however, is a more efficient model calibration, since it can make use of model-internal knowledge during the calibration.

2.10
Our contribution to the field of simulating the Diffusion of Innovations with ABMs is the following: We first identify a common method already used in the literature, the utility function approach. We formulate this method in a generalized way in Section 3. Afterwards, we show that the model is differentiable. Hence, it can be efficiently calibrated using gradient-based approaches (Lee et al. 2018). We then calculate the gradients and apply a gradient-based calibration in one test case.

3.1
In the beginning of this section, we will discuss the input and output of an empirical ABM of Innovation Diffusion (Figure 2, left and right boxes). Inputs to an ABM are independent of the modeling procedure. They consist of external data sets and the model initialization. External data is highly relevant if the model aims to reproduce empirical data (Salgado & Gilbert 2013). As an example, if one is investigating the diffusion of a new product and there are regions where agents are wealthier, more of them might buy the new product. Therefore, a data set of income or wealth is helpful for the model to produce good results. If there is no data available, one can also use submodels to produce the desired attribute. If one knows the average income in a region, a very simple submodel could assign an income to each agent according to a general income distribution which has the known mean value. The performance of an empirically grounded ABM is always limited by the quality of the explanatory data, so researchers have to identify the relevant data sets.

Figure 2: Scheme of the input, model procedure and output of an ABM of Innovation Diffusion. The input consists of external data sets, which define the agents' attributes X at time t = 0. During the simulation, the attributes at time t + 1 are influenced by the agent itself, by other agents and by the input data. The output is a collection of adoption curves for the identified subgroups.

3.2
The output of the ABM is a single adoption curve or multiple adoption curves, which show the number of agents that have adopted the innovation over time. When agents are divided into subgroups based on common attributes (for example, agents from different towns, different age groups, etc.), we can also see the collection of adoption curves of the subgroups as the intended output. This output is then compared to real-world adoption data.

3.3
As pointed out in Section 2, many works use a similar method to simulate the Diffusion of Innovations with ABMs, which we have called the utility function approach. We now describe this method in a general form following the ODD protocol (Grimm et al. 2006).¹

3.4
The purpose of the presented ABM is the simulation of Innovation Diffusion. Its goal is to produce the adoption curves for different subgroups. Therefore, individual consumers who can adopt the innovation are identified as agents. Agents are characterized by their binary adoption state, which is 1 at time t if the agent adopts at that point in time. The decision process that is responsible for the adoption is described in the next section.

3.5
The attribute vector $\vec{X}_n(t) \in \mathbb{R}^M$ of agent $n$, where $M$ is the number of different attributes, is the only quantity that influences the agent's decision to adopt. Some of those attributes come from the input data and are already defined for all timesteps at the beginning of the simulation. For example, a data set of the income of agents over time can become one attribute. In Figure 2, the immutable parts of the agent attributes are represented by the database icon. In contrast to that, the attributes can also depend on the simulation. The number of adopters in the neighborhood of an agent is often used in ABMs. This attribute is initialized at $t = 0$ and then calculated at each time step from the adoption status of the neighboring agents. In general, the attributes of an agent are influenced by the agent itself, by other agents and by the empirical data. We concatenate the attributes of all agents to get the attribute matrix $X(t) \in \mathbb{R}^{M \times N}$, where $N$ is the total number of agents. The model then simulates the adoption decisions and keeps track of the binary adoption state of each agent over a given number of discrete time steps.

Decision process
3.6 Let us now take a closer look at the decision process of a single agent $n$, which is schematically shown in Figure 3. The decision of the agent at time $t$ to adopt the innovation, and hence to change its adoption status from 0 to 1, is only influenced by the attributes $\vec{X}_n(t)$. The attributes of agent $n$ are the input for the utility function

$$u_{n,t} = u(\vec{X}_n(t), \vec{\alpha}), \qquad (1)$$

which describes the willingness of agent $n$ to adopt the innovation at time $t$. The adoption becomes more likely if the value of the utility function is large. Here, $\vec{\alpha} \in \mathbb{R}^K$ is the vector of the $K$ free parameters that becomes important when calibrating the model. We do not want to give any specific function for $u_{n,t}$ to keep the ABM as general as possible for now, but we will discuss a specification of $u_{n,t}$ at the end of this section. Additionally, we impose one restriction on $u_{n,t}$: for the gradient-based calibration we need $u_{n,t}$ to be differentiable with respect to the free parameters $\alpha_k$. This restriction will also be discussed in more detail at the end of this section.

Figure 3: Scheme of the decision process of a single agent. The only quantities influencing the decision are the attributes $\vec{X}_n$ of agent $n$ and the free parameters $\vec{\alpha}$, which are used to calculate the utility $u_{n,t}$. The utility is mapped to the adoption probability $p_{n,t}$ using the function $F$. The binary output variable of agent $n$ is 1 at only one point in time $t$ (i.e. the point in time where the agent adopts). The exemplary blue agent has a growing utility and hence adopts at some time $t$, while the orange agent has a low utility and does not adopt.

3.7
The utility $u_{n,t}$ is then mapped to an adoption probability $p_{n,t}$ by using a mapping function $F$:

$$p_{n,t} = F(u_{n,t}), \qquad (2)$$

where $F$ satisfies

$$\lim_{u \to -\infty} F(u) = 0, \qquad (3)$$
$$\lim_{u \to +\infty} F(u) = 1, \qquad (4)$$
$$F \text{ is monotonically non-decreasing.} \qquad (5)$$

With properties 3-5, $F$ belongs to the family of cumulative distribution functions (CDF), which are typically s-shaped (see Figure 3).

3.8
Having an adoption probability, we continue by defining an adoption variable

$$\tilde{y}_{n,t} = \begin{cases} 1 & \text{if } \tilde{y}_{n,t'} = 0 \;\forall\, t' < t \text{ and } p_{n,t} > p \\ 0 & \text{else.} \end{cases} \qquad (6)$$

Here, $p$ is a random number drawn uniformly from $[0, 1]$. By using the tilde symbol, we highlight the quantities that are produced by the simulation. Since the model is stochastic, $\tilde{y}_{n,t}$ is a random variable. It can be 1 at time $t$ with probability $p_{n,t}$, but only if agent $n$ has not adopted at earlier times, else it is 0. Note that agents can only adopt once; multiple purchases of the same technology are not possible in this model.

3.9
Often it is pointless to consider the adoption behavior of every single agent, especially if the available data is not detailed enough. Building a model that predicts the time of adoption of each agent will probably fail when investigating many agents, but building a model that aggregates the decisions of agents in subgroups can be successful (Henry & Brugger 2017). To which subgroup an agent belongs can be defined by the agent's location, age, or comparable characteristics. Being able to define subgroups of agents in the model is especially useful to model spatiotemporal data. The number of new adopters at time $t$ in the $g$th subgroup,

$$\tilde{Y}_{g,t} = \sum_{n \in G} \tilde{y}_{n,t}, \qquad (7)$$

becomes the relevant model prediction. It consists of the adoption variable $\tilde{y}_{n,t}$ summed over all agents from the same subgroup $G$. In the remainder of this work, we will use Equation 7 as the prediction for the diffusion process of the ABM and assume a division of agents into subgroups. Our results can still be used for modeling the behavior of single agents, simply by defining one subgroup per agent. On the other hand, it is also possible to define only one group of agents and calibrate the model according to the behavior of all agents.
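To make the structure concrete, the following minimal sketch runs one stochastic realization of the general ABM in Python (Equations 6 and 7). It assumes the weighted-sum utility discussed in the next subsection and a Gaussian CDF as mapping function $F$; all names, sizes and the random attribute data are illustrative placeholders, not the authors' released implementation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

N, M, T = 1000, 3, 20          # agents, attributes, time steps
G = 5                          # number of subgroups
alpha = rng.normal(size=M)     # free parameters (placeholder values)
X = rng.random((T, M, N))      # attribute matrix X(t), random placeholders
group = rng.integers(0, G, N)  # subgroup index g of each agent

adopted = np.zeros(N, dtype=bool)
Y_tilde = np.zeros((G, T))     # new adopters per subgroup and time step

for t in range(T):
    u = alpha @ X[t]                     # utility u_{n,t} (Equation 8)
    p = norm.cdf(u)                      # adoption probability p_{n,t} = F(u_{n,t})
    draw = rng.random(N)
    adopt_now = (~adopted) & (p > draw)  # Equation 6: stochastic, adopt at most once
    adopted |= adopt_now
    # Equation 7: aggregate new adopters per subgroup
    Y_tilde[:, t] = np.bincount(group[adopt_now], minlength=G)
```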
The utility function

3.10
In the social sciences, a variety of theories explain individual decisions (Schlüter et al. 2017). For modelers, it is challenging to break down those qualitative theories into a quantitative model. Since we did not predefine the utility function in Equation 1, we keep our ABM as general as possible and give modelers the possibility to implement their interpretations of qualitative theories. Their concretization of the utility function $u_{n,t}$ becomes the core of the adoption model and the lever controlling the simulation's complexity. However, to utilize our ABM, it is crucial that the social science theory can be broken down to a one-step calculation of an agent's utility. In the following, we will highlight and discuss one possible realization of $u_{n,t}$.

3.11
A rather simple form of the utility can be obtained by introducing one free parameter per agent attribute and then summing over all attributes multiplied with the corresponding free parameter:

$$u_{n,t} = \vec{\alpha} \cdot \vec{X}_n(t) = \sum_{m=1}^{M} \alpha_m X_{m,n}(t). \qquad (8)$$

Its derivative with respect to a free parameter is particularly simple:

$$\frac{\partial u_{n,t}}{\partial \alpha_k} = X_{k,n}(t). \qquad (9)$$

Here, the parameter $\alpha_m$ can be interpreted as a weight. If the attributes $X_{m,n}(t)$ are normalized, the values of $\vec{\alpha}$ can be compared. They give information about the importance of the different attributes.

3.12
In the section on ABMs of Innovation Diffusion, we have already seen works that use a weighted sum as the agent's utility. Some of those ABMs introduce an income barrier, only allowing agents with an income higher than some threshold to calculate an adoption probability. Those approaches include a "two-step" calculation of the utility and seem incompatible with our general ABM, but notice that they are actually included. This can be seen by adding a term $D \cdot H(\theta - I_n)$ to the utility, where $I_n$ is the income of agent $n$, $\theta$ is the income threshold, $H(x)$ is the Heaviside step function and $D$ is a large, negative number. By doing so, agents that have an income smaller than $\theta$ have a large negative utility and will not adopt the innovation.
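A minimal sketch of how such an income barrier can be folded into a weighted-sum utility; the function and variable names are hypothetical:

```python
import numpy as np

def utility_with_income_barrier(alpha, X_t, income, theta, D=-1e6):
    """Weighted-sum utility plus the barrier term D * H(theta - income).

    D is a large negative number, so agents with income below the
    threshold theta receive a strongly negative utility and will
    effectively never adopt.
    """
    barrier = D * np.heaviside(theta - income, 0.0)  # H(x): Heaviside step
    return alpha @ X_t + barrier
```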

3.13
The weighted sum is of course only one of many possible forms of the utility function. Nonlinearities can be introduced by changing the functional form or by introducing additional parameters, for example, as exponents of attributes (Kahneman & Tversky 2013).

4.1
In this section, we will introduce a loss function that measures the quality of the model prediction. We will then explicitly calculate the gradients of that loss function with respect to the free parameters $\alpha_k$, and analyze the different terms of the gradient and their importance. A link to a Python implementation of the gradients can be found in Section 9.

4.2
To empirically ground the model, the free parameters $\vec{\alpha}$ in Equation 8 need to be calibrated to some data set. Therefore, a measure is needed that tells us how well the model performed. We will use the sum of squared residuals and introduce it as a loss function:

$$L(\vec{\alpha}) = \sum_{g,t} \left( Y_{g,t} - \tilde{Y}_{g,t}(\vec{\alpha}) \right)^2, \qquad (10)$$

where $Y_{g,t}$ is the data set of the technology diffusion process and $\tilde{Y}_{g,t}(\vec{\alpha})$ is the model prediction from Equation 7. The goal is to minimize $L$ by varying the free parameters $\vec{\alpha}$.
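In code, the loss is a one-liner; a sketch assuming observed and simulated adoption counts as (subgroups × time steps) arrays:

```python
import numpy as np

def loss(Y, Y_tilde):
    """Sum of squared residuals (Equation 10) over subgroups g and times t."""
    return np.sum((Y - Y_tilde) ** 2)
```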

4.3
In Equation 10, we see that $L$ is a function of the random variable $\tilde{Y}$ and therefore a random variable itself. Hence, we search for an expression for the gradient of the mean loss function $\langle L \rangle$ with respect to the $k$th free parameter $\alpha_k$. For better readability, the dependencies on time $t$, subgroup $g$ and parameter $\vec{\alpha}$ are dropped where they are not relevant:

$$\frac{\partial \langle L \rangle}{\partial \alpha_k} = \frac{\partial}{\partial \alpha_k} \sum_{g,t} \left( Y^2 - 2\,Y \langle \tilde{Y} \rangle + \langle \tilde{Y}^2 \rangle \right) \approx -2 \sum_{g,t} \left( Y - \langle \tilde{Y} \rangle \right) \frac{\partial \langle \tilde{Y} \rangle}{\partial \alpha_k}. \qquad (11)$$

The approximation in the second line is justified only for a large number of agents $N$, since we know from the Central Limit Theorem that the sum of $\tilde{y}$ in Equation 7 converges to a normal distribution for a growing number of agents in subgroup $G$, and that the variance decreases with $1/N$. So for a large number of agents $N$ we can say that $\langle \tilde{Y}^2 \rangle \approx \langle \tilde{Y} \rangle^2$.

4.4
Let us take a look at the remaining derivative

$$\frac{\partial \langle \tilde{Y} \rangle}{\partial \alpha_k} = \sum_{n \in G} \frac{\partial \langle \tilde{y}_{n,t} \rangle}{\partial \alpha_k}. \qquad (12)$$

Here, we need an explicit expression for the expectation value $\langle \tilde{y}_{n,t} \rangle$. From Equation 6 we know that the expected value of $\tilde{y}_{n,t}$ is $p_{n,t}$ if agent $n$ has not adopted at earlier times $t' < t$. Since the probability of agent $n$ not adopting at time $t'$ is $(1 - p_{n,t'})$, we conclude that

$$\langle \tilde{y}_{n,t} \rangle = p_{n,t} \prod_{t' < t} \left( 1 - p_{n,t'} \right). \qquad (13)$$
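Equation 13 has a convenient vectorized form: the product over earlier times is a cumulative product of survival probabilities. A sketch, with `p` an array of adoption probabilities of shape (T, N):

```python
import numpy as np

def expected_adoption(p):
    """<y_{n,t}> = p_{n,t} * prod_{t'<t} (1 - p_{n,t'})  (Equation 13)."""
    survival = np.ones_like(p)                       # survival[0] = 1
    survival[1:] = np.cumprod(1.0 - p[:-1], axis=0)  # prod over t' < t
    return p * survival
```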

4.5
To pursue the calculation in Equation 12, we need the derivative of Equation 13. We continue by calculating

$$\frac{\partial \langle \tilde{y}_{n,t} \rangle}{\partial \alpha_k} = \prod_{t' < t} \left( 1 - p_{n,t'} \right) \left[ \frac{\partial p_{n,t}}{\partial \alpha_k} - p_{n,t} \sum_{t'' < t} \frac{1}{1 - p_{n,t''}} \frac{\partial p_{n,t''}}{\partial \alpha_k} \right]. \qquad (14)$$

4.6
The only task left is to evaluate the derivative of $p_{n,t}$,

$$\frac{\partial p_{n,t}}{\partial \alpha_k} = f(u_{n,t}) \frac{\partial u_{n,t}}{\partial \alpha_k}, \qquad (15)$$

where $f$ is the probability density belonging to the CDF $F$. Since the differentiability of $u_{n,t}$ with respect to the free parameters is a requirement when introducing the utility, we have finished the calculation at that point. In conclusion, the gradient of $\langle L \rangle$ is

$$\frac{\partial \langle L \rangle}{\partial \alpha_k} \approx -2 \sum_{g,t} \left( Y_{g,t} - \sum_{n \in G} \langle \tilde{y}_{n,t} \rangle \right) \sum_{n \in G} \prod_{t' < t} \left( 1 - F(u_{n,t'}) \right) \left[ f(u_{n,t}) \frac{\partial u_{n,t}}{\partial \alpha_k} - F(u_{n,t}) \sum_{t'' < t} \frac{f(u_{n,t''})}{1 - F(u_{n,t''})} \frac{\partial u_{n,t''}}{\partial \alpha_k} \right]. \qquad (16)$$

The resulting gradient depends on the data sets $Y$, $X$ and the model specifications $F$, $f$ and $u_{n,t}$, and becomes more accurate with an increasing number of agents.
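The gradient in Equation 16 can be evaluated with array operations alone. The sketch below does this for the weighted-sum utility, where $\partial u_{n,t}/\partial \alpha_k = X_{k,n}(t)$ (Equation 9); shapes and names are illustrative, and a small epsilon guards the division where $F \to 1$:

```python
import numpy as np
from scipy.stats import norm

def gradient(alpha, X, Y, group, G):
    """Analytic gradient of <L> (Equation 16), Gaussian Model, weighted-sum utility.

    X: attributes, shape (T, M, N); Y: data, shape (G, T); group: (N,) subgroup ids.
    """
    T, M, N = X.shape
    u = np.einsum('m,tmn->tn', alpha, X)          # utilities, shape (T, N)
    F, f = norm.cdf(u), norm.pdf(u)               # CDF and density
    survival = np.ones_like(F)                    # prod_{t'<t} (1 - F)
    survival[1:] = np.cumprod(1.0 - F[:-1], axis=0)
    exp_y = F * survival                          # <y_{n,t}> (Equation 13)
    exp_Y = np.stack([np.bincount(group, weights=exp_y[t], minlength=G)
                      for t in range(T)], axis=1)  # <Y_{g,t}>, shape (G, T)
    residual = Y - exp_Y
    grad = np.zeros(M)
    for k in range(M):
        du = X[:, k, :]                           # du_{n,t}/d alpha_k (Equation 9)
        inner = f * du
        corr = np.zeros_like(u)                   # sum over t'' < t
        corr[1:] = np.cumsum(inner[:-1] / (1.0 - F[:-1] + 1e-12), axis=0)
        dexp_y = survival * (inner - F * corr)    # d<y_{n,t}>/d alpha_k (Equation 14)
        dexp_Y = np.stack([np.bincount(group, weights=dexp_y[t], minlength=G)
                           for t in range(T)], axis=1)
        grad[k] = -2.0 * np.sum(residual * dexp_Y)
    return grad
```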
Discussion of the gradient

4.7
Next, we want to discuss the meaning of the different terms within the gradient in Equation 16. For simplicity, let us assume $\partial u_{n,t} / \partial \alpha_k > 0$. This means that a rise in parameter $\alpha_k$ results in a rise of the utility of the $n$th agent and therefore in a rising adoption probability. In the opposite case, $\partial u_{n,t} / \partial \alpha_k < 0$ causes a changing sign, but the overall idea stays the same. Our goal now is to understand the influence of single terms on the overall sign of the gradient.

4.8
We start our analysis by recognizing that the gradient is a sum, where every summand comes from one data point and model prediction of subgroup $g$ and time step $t$. The sign of each summand is influenced by the term $Y_{g,t} - \sum_{n \in G} F(u_{n,t})$. This difference is negative if the model prediction is larger than the actual data, and positive otherwise. The meaning of this is intuitive: the parameter $\alpha_k$ grows if the predicted number of adopters in subgroup $g$ at time $t$ is smaller than in reality (and vice versa).

4.9
The remaining part of the gradient is more complicated and comes from the derivative of the model, $\partial \langle \tilde{Y} \rangle / \partial \alpha_k$. It consists of a sum over all agents that belong to the same subgroup $G$. In general, the contributions of the agents to this sum can be either negative or positive. The sum will have a positive sign if most of the agents in subgroup $G$ add a positive term. In the following, we focus on one agent and fix $n$.

4.10
The product $\prod_{t' < t} \left( 1 - F(u_{n,t'}) \right)$ reflects the larger impact of the adoption behaviour at early times compared to later times. This is rather intuitive: if the adoption probability of the agent is very high in the beginning, it is very likely that the agent adopts early, and the circumstances at later times do not matter anymore. Mathematically, we know that $0 < F(u_{n,t'}) < 1$, so all the factors $1 - F(u_{n,t'})$ also lie between 0 and 1. For large times $t$, more and more factors are multiplied and the value of the product decreases.

4.11
The next term, $f(u_{n,t}) \,\partial u_{n,t} / \partial \alpha_k$, captures the direct influence of $\alpha_k$ at time $t$ on agent $n$. If we only consider $\partial u_{n,t} / \partial \alpha_k > 0$ and remember that $f(u_{n,t})$ is a probability density and hence always larger than 0, we see that this term is positive. It tells us that an increase of $\alpha_k$ results in an increase of the utility, and more agents will adopt.

4.12
The second term, which includes the sum over $t''$, represents the influence of $\alpha_k$ on the group of non-adopters at time $t$. If agent $n$ has already adopted at earlier times, it cannot adopt at time $t$. Thus, if the model needs to predict a larger number of adopters at time $t$, it eventually has to reduce the value of $\alpha_k$ so that enough non-adopters remain at later times. Once more, with the assumption of $\partial u_{n,t} / \partial \alpha_k > 0$, this term is always negative. Thus, the two terms in the last bracket have opposite signs. The sign of the summand of the $n$th agent depends on which of the two terms is larger.

4.13
Having discussed the different terms within the gradients, we turn our focus to two specifications of $F$: (1) by setting $F$ to the CDF of the Gaussian distribution we get a Gaussian Model, and (2) by setting $F = H(u_{n,t})$, with $H(u_{n,t})$ being the Heaviside step function, we get a deterministic Threshold Model. The Gaussian Model has smooth gradients and will be used later in our work when testing the ABM on the data set of PV diffusion in Germany. The Threshold Model, on the other side, is already applied in many of the papers summarized earlier in this work. For this reason, we want to mention it explicitly as one specification of the ABM.

4.14
For the Gaussian Model, the gradient in Equation 16 is exact up to the approximation in Equation 11. In the deterministic Threshold Model, the probability density function $f$ appearing in all summands of the gradient is the Dirac delta function $\delta(u_{n,t})$. It is only nonzero for $u_{n,t} = 0$. This results in the gradients being zero, which makes the optimization impossible. Since the Threshold Model appears quite often in recent studies, an effective calibration method for this model is of special relevance. We bypass the problem of vanishing gradients by setting the probability distribution function $f(u_{n,t}) = 1$ in the gradient of the Threshold Model. The strongest argument supporting this idea is the fact that those approximated gradients can be used to achieve good results, comparable to the Gaussian Model in Section 5.

4.15
Additionally, we bring up the following argument for setting $f = 1$ in the Threshold Model. In the head or center region, the probability density is of order 1. In this region, the CDF has the steepest slope, and a change in the free parameters (and the resulting change in the utility) strongly influences the adoption probability of an agent. In the tail of the distribution, $f$ is much smaller than 1. Here, a change in the free parameters does not really affect the agent's behavior. This is normally reflected by $f \approx 0$, which suppresses the influence of this specific agent on the gradient. In the case of the Threshold Model, only agents with a utility infinitesimally close to the threshold value of 0 change their behavior for a change in the utility. Those agents are extremely rare. By setting $f = 1$, all agents contribute equally to the gradient, even though their behavior is not affected by a small change in the parameters. This should result in the gradient of the Threshold Model being much larger, but the direction of the gradient should still be right. The large gradient is one reason why the choice of the advanced gradient-based optimizer Adam will be important later on; vanilla gradient descent will in general have many problems handling such large gradients. In conclusion, we are not able to confirm this argument mathematically due to the complex nature of the gradients. The only justification for setting $f = 1$ is the strong calibration results, which are an indication that the idea is right.

4.16
Finally, we also want to discuss the differentiability of the utility function in Equation 8. It is important to mention that the gradient in Equation 9 is only valid if the input variables do not depend on the model outcome (and are therefore independent of $\alpha_k$). This is not a problem if $X_k$ is an external data set, such as the income of the agent. However, it becomes a problem for variables depending on the model. Imagine that we want to use the number of adopters in the subgroup as a factor influencing the adoption decision, often referred to as peer pressure. Then, $X_{k,n}(t)$ depends on the model output from earlier times, namely the number of adopters in the specific subgroup, and hence it depends on $\vec{\alpha}$. In this case, the gradient in Equation 9 would include a very complex term $\alpha_k \,\partial X_{k,n}(t) / \partial \alpha_k$. One has three possibilities to handle this term: one can derive an exact expression, one can find an approximation for the derivative to keep the gradients as exact as possible, or one can neglect the term. For the latter, special attention has to be paid to the convergence of the calibration, because the gradients are not exact anymore. It is also possible to calibrate the model with the imprecise gradients until it converges to a specific solution, followed by a random search around that solution. By doing so, one can check if better results are in the close vicinity. Since the specific form of such complex terms depends on the model specifications, it is irrelevant for further investigations of the general ABM. Note, however, that we neglect such a term in the gradient for three attributes in Section 5 and still get a satisfying result.
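One pragmatic realization of neglecting the term is to treat model-dependent attributes as plain data during the gradient step, i.e., to compute them from the previous step's simulated adopters and pass them into the analytic gradient as constants. A hedged sketch with illustrative names:

```python
import numpy as np

def peer_pressure_attribute(adopted, group, group_sizes):
    """Fraction of adopters per subgroup, broadcast back to each agent.

    The returned attribute depends on the model output (and hence on alpha),
    but it is fed into the analytic gradient as constant data, so the
    complex feedback term in Equation 9 is silently dropped. Convergence of
    the calibration should therefore be monitored, e.g., by a random search
    around the converged solution.
    """
    counts = np.bincount(group[adopted], minlength=len(group_sizes))
    return (counts / group_sizes)[group]
```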

5.1
To test the calibration with gradients, we will model the data set of small-scale photovoltaic adoption in Germany from 2000 to 2016. The model will be calibrated to match the number of new adopters in the 401 districts of Germany over 17 consecutive years. Special attention is paid to the convergence of the calibration and the quality of the fit. First, we will compare the gradient-based calibration with two gradient-free calibration methods. Second, we will evaluate the prediction of the calibrated ABM.

5.2
Works that use ABMs in the field of Innovation Diffusion often have difficulties when calibrating their models (Rand & Stummer 2021). Traditional approaches utilize black-box function optimizers, where the gradients of the function are not needed. However, gradient-free optimization struggles with both computationally expensive functions and high-dimensional parameter spaces. For a growing number of free parameters, gradient-based calibration becomes the only method that ensures good fitting results for large models within a finite computational time (Shan & Wang 2010).

5.3
To show this, we test one gradient-based and two gradient-free calibration methods on the data set $Y(g,t)$ of privately owned photovoltaic system diffusion in Germany. The data set of small-scale photovoltaic systems is chosen for the following two reasons. First, it consists of trustworthy data on a detailed time and location level (Netztransparenz 2019). Second, by restricting ourselves to small-scale photovoltaic systems with capacities < 10 kWp, we assume that most of the decisions are not made by professional investors and that many reasons besides economic profit play a role. The diffusion of PV systems is of special relevance in the research community, since it plays an important role in the transition towards a renewable energy system.

5.4
The data set $Y(g,t)$ consists of 6817 data points, each representing the number of adopters in one specific district $g$ and one year $t$. The chosen model utility is a weighted sum as in Equation 8 with 25 free parameters, and $F$ is the CDF of the Gaussian distribution, so that we end up with the Gaussian Model as described in Section 4. The largest model used during the calibration consists of 1.6 million agents. A detailed description of the PV diffusion model and its input data can be found in the appendix.

5.5
Our procedure is as follows. We run our model with an initial set of free parameters $\vec{\alpha}_0$. This produces an adoption prediction of the model, $\tilde{Y}(g, t, \vec{\alpha}_0)$. The quality of the model output is determined via the loss function $L$ in Equation 10. Our ultimate goal is to minimize $L$ by changing the free parameters $\vec{\alpha}$ during the calibration. From the variety of possible calibration methods (Salle & Yıldızoğlu 2014; Carrella 2021), we decided to test Bayesian optimization and Direct Search optimization as two gradient-free methods before applying a gradient-based approach.

5.6
Bayesian optimization is a machine-learning surrogate optimization method used to optimize expensive black-box functions. The method is well suited if the dimension of the parameter space is not too large (typically ≤ 20), the objective function is continuous, expensive to evaluate, and has no first- or second-order derivative available. Bayesian optimization then aims at finding a global optimum of the objective function.

5.7
In a surrogate method, the expensive-to-evaluate objective function is replaced by a surrogate function. During the optimization, the algorithm calculates a function estimate together with a confidence interval. The parameters for the next function evaluation are then chosen with the help of the surrogate. For a detailed description of the method, we refer to Frazier (2018).
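A hedged sketch of how such a run could look with the scikit-optimize package; the objective, bounds and budget are placeholders, not the exact setup used for the results below:

```python
import numpy as np
from skopt import gp_minimize  # scikit-optimize

def run_abm_loss(alpha):
    """Placeholder objective: run the ABM with parameters alpha and return
    the loss L (Equation 10). A dummy quadratic stands in here."""
    return float(np.sum(np.square(alpha)))

result = gp_minimize(
    func=run_abm_loss,              # expensive black-box objective
    dimensions=[(-5.0, 5.0)] * 25,  # assumed bounds for the 25 free parameters
    n_calls=100,                    # evaluation budget
    random_state=0,
)
print(result.x, result.fun)         # best parameters and loss found
```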

5.8
When used to calibrate the ABM on the data set of PV diffusion, Bayesian optimization does not perform well (see Figure 4a). After 700 optimization steps, the best model achieves $L \approx 2.8 \cdot 10^8$, which corresponds to $R^2 \approx -0.7$. The explanation for the bad result is the large dimensionality of the parameter space. The number of necessary function evaluations grows exponentially with the number of free parameters, and thus a method that aims to find a global minimum by examining the whole parameter space is not suited for the calibration of this expensive-to-evaluate ABM with 25 parameters.
Direct search optimization

5.9
The next gradient-free calibration method we use is a Direct Search method. In this method, a random starting point is chosen and the behavior of the function in the neighborhood is evaluated. If the function decreases in one direction, we follow this direction (Kolda et al. 2003). In the case of calibrating the ABM, we need an algorithm that can handle noisy functions and therefore use a pattern search optimizer from the Python package noisyopt (Mayer et al. 2016).
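A sketch of the corresponding call, assuming the noisyopt package's compass-search optimizer and the same placeholder objective as above:

```python
import numpy as np
from noisyopt import minimizeCompass

def run_abm_loss(alpha):
    """Placeholder objective standing in for a noisy ABM run."""
    return float(np.sum(np.square(alpha)) + np.random.normal(scale=0.1))

res = minimizeCompass(run_abm_loss, x0=np.zeros(25),
                      deltatol=0.1,   # stopping tolerance on the step size
                      paired=False)   # no paired sampling of the noise
```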

5.10
When calibrating the Gaussian Model with the direct search optimizer, the achieved $R^2$ is still smaller than 0 after 650 steps (see Figure 4a, blue curve); hence, the fit is still worse than simply taking the mean of the data. To save computational time, the pattern search algorithm was performed on a reduced model with roughly 640,000 agents.

5.11
With a Direct Search algorithm, we no longer aim to find a global minimum and hence do not have to scan the exponentially increasing parameter space. However, the convergence of the algorithm to some local minimum is still very slow. This can once again be explained by the high dimensionality of the parameter space, which slows down the evaluation of the local neighborhood.

5.12
The gradient-based optimization is performed with the Adam optimizer (Kingma & Ba 2014), which needs the first-order derivatives calculated in Section 4. Compared to vanilla gradient descent, Adam has the advantage of faster convergence due to its adaptive learning rate, the ability to escape local minima, and the ability to handle stochastic functions. The optimization ends after a given number of steps.
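For reference, the Adam update is compact enough to state in full; this is a generic textbook implementation with the common default hyperparameters, not necessarily the exact settings used for the results below:

```python
import numpy as np

def adam_calibrate(alpha0, grad_fn, steps=1000, lr=0.01,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """Calibrate alpha with Adam, using the analytic gradient (Equation 16)."""
    alpha = alpha0.copy()
    m = np.zeros_like(alpha)  # first-moment estimate
    v = np.zeros_like(alpha)  # second-moment estimate
    for s in range(1, steps + 1):
        g = grad_fn(alpha)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**s)   # bias correction
        v_hat = v / (1 - beta2**s)
        alpha -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return alpha
```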

5.13
To reduce the computational time in the gradient-based calibration, we start with a relatively small model with few agents and enlarge the model during the optimization process. The idea is that the rough direction of the gradients can also be calculated with a small model, especially if the calibration process has not yet advanced far from its random starting point. Only later on, a large model is needed to further improve the fitting result. We start the calibration with about 16,000 agents, where each agent represents 1,000 privately owned households, and enlarge the model during the calibration up to 1.6 million agents, where each agent represents 10 households.

5.14
The result can be seen in Figure 4a. The ABM can be fitted to $R^2 \simeq 0.7$ with the gradient-based Adam optimizer. For the first 600 steps, the number of agents in the model is increased linearly up to 640,000 agents. For the last steps, a larger number of agents is added per step, which explains the small bend of the curve. The model size used for the gradient-free direct search method is reached by the gradient-based optimization at the 600th step, where the quality of the fit is already above $R^2 = 0.6$.

5.15
We also want to add that we tested the gradient-based calibration on the Threshold Model as well. In the Threshold Model, the gradients are not exact, as we have discussed in Section 4. Nevertheless, the calibration result is equally good, with $R^2 \simeq 0.7$, and can be found in the Appendix.

5.16
Next, we want to evaluate the ABM that was calibrated with the gradient-based Adam optimizer and compare it to the PV data set. In Figure 4b, both $Y_{g,t}$ and $\tilde{Y}_{g,t}$ are shown for two of the 401 districts. We see that the model output (dotted line) is quite close to the actual data. In Figure 4c, the number of total adopters (lines) and the number of new adopters (bars) are shown for each year for Germany, which is the sum of $Y_{g,t}$ over all districts. Since the calibration occurs on the level of single districts, in Figure 4d the ratio of agents that have adopted is shown for all districts for the years 2000, 2008 and 2016, where the ratio of adopters is defined as $q_{\mathrm{adopt}} = N_{\mathrm{adopt}} / N_{\mathrm{total}}$. It can be seen that the model captures all relevant spatial and temporal patterns quite well.

Discussion

6.1
The topic of validation and overfitting a model was not discussed in this work. The reason for this is that the focus of the work lies on calibrating ABMs of Innovation Diffusion with gradients. Validating a model, however, depends on the specific model and data set. We still want to emphasize that the prevention of overfitting is of immense importance, especially when there is no upper limit on the number of free parameters. It is necessary to match the number of parameters to the size of the empirical data set and to spend time on validating the specific model, either by using cross-validation or other techniques (Miller 1998; Fagiolo et al. 2019).
Possible applications of the proposed ABM

6.2
To identify fields where the ABM of Innovation Diffusion can be applied, we shortly summarize the two main aspects of the ABM:
• The focus of the model lies on the decision of agents to adopt, which is manifested in the change of the adoption variable $\tilde{y}$. An agent can only adopt once and then stays an adopter.
• The decision process of an agent only depends on the differentiable utility function $u_{n,t}$. Decision processes that are more complex cannot be included.

6.3
For the evaluation of PV system diffusion, we delivered one example where the ABM can be applied. At least for smaller time horizons, one can fairly say that people who bought a PV system still own this system, and the number of dismantled systems can be neglected. Hence, we fulfill the requirement that agents can only adopt once and that they stay adopters afterwards. A counterexample, where the ABM cannot be applied, is using the ABM to simulate the number of agents that own a car. This is due to the fact that a data set of car owners includes people who once owned a car in the past but sold it. We then used a linear sum as the utility function, which is probably the simplest example of a differentiable function and fulfills the second requirement.

6.4
We also want to stress the drawback of the gradient-based calibration method: While other researchers worked on improving the calibration of ABMs on a very general level (Lamperti et al. 2018;Platt 2020;Carrella 2021), our method is only applicable in the described ABM of Innovation Diffusion and can make no statement about other types of models. Additionally, to use gradients, the ABM has to be rather simple and no complex decision rules for the agents are possible. While this seems applicable for the problem of Innovation Diffusion, other fields where ABMs are used need more complex models.

6.5
Regarding the question of when to use the ABM and when to use other methods, we make the following recommendation: If the goal is to find relations between input and output quantities, we recommend using statistical methods. If the goal is to make a forecast and the microscopic view is not of interest, equation-based models are the method of choice. If a detailed bottom-up view is of interest, the proposed ABM together with its gradient-based calibration will be an appropriate choice.

Conclusion

7.1
In this work, we described a gradient-based calibration method for ABMs of Innovation Diffusion that can efficiently calibrate models with many free parameters in large-scale simulations. Traditional approaches in this field are restricted to using black-box function optimizers for model parametrization. Our general ABM, together with its gradients, enables researchers to empirically ground models with many parameters on large data sets. This is an important step to scale up ABMs and model Innovation Diffusion on large scales, such as the country or even global level.

7.2
In the beginning of this work, we identified a common method to simulate the Diffusion of Innovations with ABMs in literature and formulated this method in a general ABM. We then discussed the importance of the utility function with special reference to its differentiability. We highlighted the cases where the restriction of a differentiable utility function causes difficulties and how this can be handled.

7.3
Thereafter, we calculated the gradients of the general ABM which are needed to efficiently calibrate the model and discussed the meaning of the different terms appearing in the gradients. We showed that the new calibration method achieves much better results than gradient-free approaches. In contrast to these gradient-free approaches, the new method is able to produce good fits even in the case of many free parameters.

7.4
Finally, we tested the calibration technique on the data set of privately owned photovoltaic systems in Germany from 2000 to 2016. We achieved a coefficient of determination of $R^2 \simeq 0.7$ by calibrating a model including 1.6 million agents to more than 6000 data points using 25 free parameters. In the discussion section, we made recommendations on when and when not to use the ABM for simulating the Diffusion of Innovations.

7.5
In future works, the ABM can be applied to model other diffusion processes. The calibrated parameters can be analyzed to gain information about the influence of different agent attributes. A calibrated ABM can then be used to make predictions about future diffusion pathways.

7.6
The formulated ABM together with its gradients enables researchers to create empirically grounded models of Innovation Diffusion with many free parameters that can be calibrated to large data sets. Simplifications to reduce the computational time of the calibration process, such as a prior fixation of some parameters or a reduced number of agents, are no longer obligatory.

Model documentation
This work is accompanied by an implementation of a toy model and its gradients in Python that can be customized for one's own purposes. The Python notebook, together with a step-by-step description, can be found on GitHub: https://github.com/FlorianK13/ABM-Calibration.
Appendix: The PV diffusion model

We use each privately owned household as a potential candidate for adopting a PV installation. Hence, we make two approximations. First, we neglect that other types of buildings, such as apartment houses or commercial buildings, can install these small systems as well. Second, households in private property could also install systems with a larger capacity and are not restricted to small systems. Both approximations are necessary due to the limited information in the available data. The number of households in private property for each district is obtained from the census data set (Zensus 2011). Overall, a total number of $\sim 1.6 \cdot 10^7$ houses in Germany are in private property, all of which are assumed to be candidates for installing a small PV system. After the time horizon of 17 years, roughly $10^6$ PV installations with a capacity < 10 kWp exist in Germany. We assume that each PV installation can be matched to one household, and therefore about 6% of the households in Germany became adopters.
Various temporal and spatial explanatory data sets are used to model the PV adoption (see Figure 5 and Table 1). The feed-in tariffs are taken from the German Bundesnetzagentur (BNetzA 2020). The capital cost per kWp of installed capacity is from BSW Solar (BSW-Solar 2016). Electricity retail prices come from the German Federal Statistical Office (Destatis 2020). The time series on lending interest rates (Bundesbank 2020b) and deposit interest rates (Bundesbank 2020a) are from the German Bundesbank. The spatial data of the annual global radiation is from the open data collection of the public German weather administration Deutscher Wetterdienst (DWD 2020). To decrease fluctuations in the global radiation, the data is averaged over the years 2015 to 2019. The population density of each district is calculated from the census data set and the area of each district, taken from the Federal Statistical Office of Germany (Destatis 2019), which is also the source of the average income taxation per head within each district (Regionalstatistik 2016).
Following Candas et al. (2019), we use a Net Present Value $NPV_{n,t}$ (Equation 17) as the economic profitability, which is the expected profitability per installed kWp, with $CC$ being the capital cost per kWp, $t_{\mathrm{life}} = 20$ the lifetime of a PV installation, $FIT$ the feed-in tariff, $SC$ the self-consumption ratio, $RP$ the retail price, and $E_{\mathrm{rad}}$ the solar radiation energy per m². The efficiency factor $\eta$ increases linearly over time, starting at $\eta(t = 0) = 0.125$ and ending at $\eta(t = 18) = 0.17$. The area needed for an installation of one kWp is $A_{\mathrm{kWp}} = 6\,\mathrm{m^2/kWp}$ (Mitsubishi 2020), $r_{\mathrm{dep}}$ is the deposit interest rate, and $af$ is the annuity factor

$$af(t) = \frac{r_{\mathrm{lend}}(t)\,\left(1 + r_{\mathrm{lend}}(t)\right)^{t_{\mathrm{life}}}}{\left(1 + r_{\mathrm{lend}}(t)\right)^{t_{\mathrm{life}}} - 1}$$

with the lending interest rate $r_{\mathrm{lend}}$.

Table 1: Sources of the temporal input data. The Feed-in Tariff is from BNetzA (2020), the Retail Price from Destatis (2020), the Lending and Deposit rates are from Bundesbank (2020a,b), and the Self-Consumption ratio is from BDEW (2016) and BNetzA (2019). Due to the lack of data for the capital cost in 2017, we use the value of 2016 twice, and for the Self-Consumption ratio, we use the even value 10% for 2017.
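A sketch of the annuity factor defined above, with illustrative names:

```python
def annuity_factor(r_lend, t_life=20):
    """Annuity factor af(t) for lending rate r_lend and lifetime t_life."""
    q = (1.0 + r_lend) ** t_life
    return r_lend * q / (q - 1.0)

# e.g., a 3% lending rate over the 20-year PV lifetime:
# annuity_factor(0.03) ~= 0.0672
```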

Opinion formation model
In the field of Innovation Diffusion, customers have various individual motives to adopt an innovation. The submodel of opinion formation reflects this bounded rationality. Here, we will describe the opinion formation model and explain how agents can interact with each other, with nearby adopters, and with a central agent (media).
The utilized model of opinion formation is similar to the kinetic model from Toscani (2006). It can be used to model the dynamical changes in opinion among $N$ agents. Each agent $n$ has a continuous opinion $w_{n,t} \in [-1, 1]$. At time $t = 0$, an opinion $\bar{w}_n$ is randomly drawn from a gamma distribution with $k = 7$ and $\theta = 0.1$. Then we set $w_{n,t=0} = \bar{w}_n - 1$, where the small number of opinions lying outside of the interval $[-1, 1]$ are set to the mean value of the shifted gamma distribution, $-0.4$. At each time step, agent $n$ first interacts directly with one other agent, then it interacts with its environment, and finally with the central agent.

Direct interaction
Two agents can interact directly to change their opinions. Their initial opinions $w_{n,t}$, $w_{n',t}$ are updated at each time step, where $w_{n,t+1}$, $w_{n',t+1}$ are the updated opinions, $\gamma = 0.7$ is a real constant, and $\eta$ is a random number uniformly drawn from the interval $I = [-(1-\gamma), 1-\gamma]$. The function $P(w)$ is a Gaussian distribution with $\sigma = 0.7$, $P(-1) = P(1) = 0$ and $\partial_w P|_{w=\pm 1} = \mp 1$. $D(w)$ is chosen to depend linearly on the absolute value of $w$: $D(w) = 1 - |w|$. The specific choice of $P(w)$, $D(w)$ and $I$ ensures that the new opinions at time $t + 1$ always stay within the bounded opinion interval $[-1, 1]$. For a detailed examination of this opinion diffusion model, we refer to Toscani (2006). In the implemented model, 90% of the agents are randomly chosen at each time step to interact with agents from the same district, while 10% of the agents interact with agents from any other district.

Interaction with the environment
The opinion of agent $n$ is also influenced by its environment, which is the number of new PV installations in the district. A large number of adopters in the district positively affects the opinion of other agents:

$$w_{n,t+1} = w_{n,t} + \gamma_{\mathrm{env}} \, P(w_{n,t}) \cdot \frac{\tilde{Y}(g, t-1)}{N_g},$$

with $\gamma_{\mathrm{env}} = 2$ and $N_g$ the total number of agents in district $g$. We highlight that the opinion of the agent at time $t$ depends on the predicted number of PV adopters at time $t - 1$ and therefore on the free parameters $\vec{\alpha}$. This is important for the differentiability of the utility function, which was discussed in Section 3.

Interaction with the central agent
The interaction with a central agent represents the media's influence on the individual agent.
with constant $\gamma_{CA} = 2$. The opinion of the central agent $w_{CA,t}$ changes over time and is connected to the total number of adopters,

$$w_{CA,t} = 0.3 + \frac{0.7}{N} \sum_{n} \sum_{t' < t} \tilde{y}_{n,t'}.$$

Based on a recent evaluation of the press coverage on renewable energies in Germany (Rochyadi-Reetz et al. 2019), the initial opinion of the central agent is set to $w_{CA}(t = 0) = 0.3$. This opinion grows linearly with the fraction of adopters in Germany, reaching $w_{CA} = 1$ if all households in Germany have adopted.

Utility function of the PV diffusion model
The specific form of the utility function is the core of the ABM. For the PV diffusion model, we chose the weighted sum $\vec{\alpha} \cdot \vec{X}_n(t)$ from Equation 8. The $M = 25$ attributes are listed below.
• $X_1 = NPV_{n,\mathrm{norm}}$ is the normalized economic profitability (see Equation 17), with $NPV_{n,\mathrm{norm}} = \dfrac{NPV_n - NPV_{n,\min}}{NPV_{n,\max} - NPV_{n,\min}}$.
• $X_2 = w_n$ is the opinion of agent $n$.
• $X_3 = \left( NPV_{n,\mathrm{norm}} \cdot w_n + 1 \right) / 2$ is the product of the normalized economic profitability and the opinion, shifted to the interval $[0, 1]$.
• $X_4$ is the normalized population density of the district the agent lives in.
• $X_5$ is the normalized average income taxation per head of the district the agent lives in.
• $X_6 = NPV_{n,\mathrm{norm}}(t+1) - NPV_{n,\mathrm{norm}}(t)$ is the difference of the normalized economic profitability at time $t + 1$ and time $t$.
• $X_7 = \sqrt{NPV_{n,\mathrm{norm}}}$ is the square root of the normalized economic profitability.
• $X_8 = \mathrm{FIT}(t+1) - \mathrm{FIT}(t)$ is the normalized difference of the Feed-in Tariff at times $t + 1$ and $t$.
• $X_9$ is the ratio of agents in each district that have adopted at the previous time step $t - 1$.
• $X_{10}$-$X_{25}$ are 16 variables representing the 16 states of Germany. They are binary and are 1 if the agent lives in the state they refer to. Since each agent lives in exactly one state, we know that $\sum_{m=10}^{25} X_{m,n} = 1 \;\forall n$.
Additionally, a threshold $\theta_n$ is subtracted from the utility in both the Threshold and the Gaussian Model. The threshold $\theta_n$ is randomly drawn from a Beta distribution

$$f(x; a, b) = \frac{1}{B(a, b)} \, x^{a-1} (1-x)^{b-1},$$

where $B(a, b)$ is the Euler beta function, and $a = 5$, $b = 2$ are fixed. The threshold introduces a time-independent heterogeneity among agents. In the Gaussian Model, we set $\sigma = 1$ in the mapping function $F$.
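Drawing the heterogeneous thresholds is a one-liner; a sketch with NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)
# One threshold theta_n per agent, drawn from Beta(a=5, b=2)
theta = rng.beta(5.0, 2.0, size=1_600_000)
```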

Model      α1     α2    α3    α4     α5     α6    α7     α8    α9
Threshold  -0.18  0.77  0.34  -0.15  -0.08  0.01  -0.09  0.05  2.87
Gaussian   -0.20  0.64  1.73  -0.79  -0.27  0.27  -1.08  0.66  4.20

Table 2: One set of optimal parameters found in the gradient-based calibration process for both models. The parameters correspond to the following attributes: α1 is the parameter from the economic profitability, α2 represents the opinion, α3 is from the product of the opinion and profitability, α4 is from the population density, α5 is from the income, α6 is from the difference of the profitability, α7 is from the root of the profitability, α8 is from the difference in the FIT, and α9 is from the adoption ratio in the district at earlier times.

Figure 6: One set of optimal state parameters α10 to α25 is shown for the Gaussian Model (a) and the Threshold Model (b). In (c), the loss function L is plotted for eight calibrations of the same model with randomly chosen starting points. The average L over all calibrations is shown in blue; single calibrations are in grey.

Optimal parameters and convergence
The convergence of the optimization process is highly dependent on the choice of the utility function, hence no general statements can be made for the ABM. In general, optimizers that work with the principle of gradient descent can get trapped in local minima. The Adam optimizer used in this work is able to overcome local minima, but it is not guaranteed to do so. For the PV diffusion model, we ran 8 calibrations of a smaller model with 320 thousand agents. The value of the loss function $L$ during the calibration can be seen in Figure 6, where grey lines represent single calibration processes. It can be seen that all optimizations roughly behave the same and find an equally good solution. The average distance of the final parameter vectors is

$$d_{\mathrm{avg}} = \frac{2}{N^2 - N} \sum_{i,j,\, i \neq j} \sqrt{(\vec{\alpha}_i - \vec{\alpha}_j)^2} \simeq 1.65,$$

where $N = 8$ is the number of calibration runs. This distance is not too large, but it cannot be neglected. It shows that there is no unique solution for the parameter values for this specific PV diffusion model. We assume that this is due to the irrelevance of some of the agents' attributes.
One set of optimal values for the free parameters of the utility in Equation 8 is presented in Table 2 and Figure 6 a) and b). These parameters are from a calibrated Gaussian Model with 1.6 million agents.

Calibration of the threshold model
The gradient-based calibration method was also tested for the Threshold Model, as the Threshold Model is predominantly used in the literature. As can be seen in Figure 7, the Threshold Model also achieves $R^2 \approx 0.7$.

Notes
¹ Since we describe a general method and not a specific ABM, some aspects of the ODD are not applicable.