Generation of Synthetic Populations in Social Simulations: A Review of Methods and Practices

To build realistic models of social systems, designers of agent-based models tend to incorporate considerable amounts of data, which influence the model outcomes. Data concerning the attributes of social agents, which compose synthetic populations, are particularly important but usually difficult to collect and therefore to use in simulations. In this paper, we review state-of-the-art methodologies and theories for building realistic synthetic populations for agent-based simulation models, together with the practices actually followed in social simulations. We also highlight the discrepancies between theory and practice and outline the challenges in bridging this gap, through a quantitative and narrative review of work published in JASSS between 2011 and 2021. Finally, we present several recommendations that could help modellers adopt best practices for synthetic population generation.


Introduction
with or without a sample, such as Bayesian-based generation (Sun & Erath ), while classical algorithms, such as the iterative proportional fitting (IPF) procedure (see Subsection . ), do not need to be used in the generation process. Therefore, instead of the input data type, we set the distinguishing criterion as the synthesis technique used to either create the properties of the entities or reproduce known real entities, termed synthetic reconstruction (SR) and combinatorial optimization (CO), respectively.
The first approach is based on the idea of synthetic reconstruction (Wilson & Pownall ) and consists of building populations through the random generation of individual characteristics. This process is usually conducted by drawing attribute values either from the available distributions (Gargiulo et al. ; Barthelemy & Toint ) or from an estimated joint distribution based on techniques such as the IPF algorithm (Stephan ) or Markov chain Monte Carlo techniques (Casati et al. ). When individual profiles are available, the generation can be a replication of the individual profiles to fit the macroscopic descriptors available. This approach refers to combinatorial optimization (Williamson et al. ). Although it is less popular, several promising techniques have recently been established (Harland et al. ; Ma & Srinivasan ). Figure summarizes the key ideas behind each procedure: As shown in the left part of the figure, the CO approach ultimately reproduces real records of individuals to fit a desired global statistical state of the synthetic population. In contrast, as shown in the right part of the figure, the SR approach is centered around extrapolation techniques to build the most relevant underlying distribution from which to draw synthetic entities.
Figure : Graphical description of the two main methods to build a synthetic population, i.e., combinatorial optimization (CO) on the left and synthetic reconstruction (SR) on the right.
The two techniques are paradigmatic, meaning that they define the principles of an approach to build a synthetic population rather than a procedure that may be applied to generate it. Therefore, most of the reviewed procedures in this section deviate from the basic principles of CO and SR. For instance, SR techniques can be used to enhance the results of the CO algorithm to add new information to an existing synthetic population (Thiriot & Sevenet ) or use CO techniques to mix SR-based synthetic populations at several scales (Huynh et al. ; Watthanasutthi & Muangsin ). In addition to combining these approaches, many models build on the concepts of one technique, for instance, by using weights attached to individual records of microdata to build the targeted marginal objective in the CO perspective, known as re-weighting of the population sample (Tanton et al. ; Yameogo et al. ), or by using statistical learning techniques, such as copula functions (Jeong et al. ) or hierarchical mixtures (Sun et al. ) to build the underlying distribution of attributes in the SR perspective. In both cases, the procedure follows the concept of CO and SR approaches, i.e., to replicate known individual entities from a sample of real records and draw characteristics of the synthetic entities from an estimated distribution of attributes, respectively.
In the following subsections, we present a panorama of studies of synthetic population generation for use in agent-based modeling. Our perspective first explores the available data before the generation process: We first review typical data that can be used to generate a synthetic population. We then describe the main algorithms and techniques to generate a synthetic population from these inputs, mainly focusing on the SR and CO archetypal methodologies. Finally, we conclude the section by discussing how researchers assess the quality of the generated synthetic population.
First step: Working with data
To realize synthetic population generation, researchers usually adopt two types of data. First, macrolevel data, which consist of distributions (income distribution, age structure, etc.) or aggregated values (average age, quantiles of revenue, etc.); and second, data that consist of a set of individual records at the microlevel that represents a portion of the whole population.
The first type of data may be available as a contingency or frequency matrix known as a distribution of attributes. For example, a cell of the distribution matrix could be the number or proportion of male laborers aged between and . The matrix is usually presented as table data in which the columns and rows describe attributes such as age and gender, specified using categorical values, e.g., under years old or male/female, respectively. The data table may have an unlimited number of dimensions and reflects the joint multiway distribution/contingencies of attributes. However, the data are often scattered, that is, one may access several tables, each displaying few attribute relationships. Because of the data heterogeneity associated with populations, the table often has missing values, usually in the form of unrecorded relationships between attributes (attributes that are not present in every data table). In certain cases, the data content may be relative to a specific attribute, especially in terms of the spatial distribution, for example, the proportion of males and females per ward. In this case, the frequency cannot be generalized and must be used according to the reference attribute distribution. A related issue arises when the attribute has a divergent encoding form across scattered data tables: for example, one matrix that crosses age encoded by ranges of years with gender, and another matrix in which the age is encoded by custom ranges, as is usually the case when crossed with the occupation status; in this case, the age usually starts at the legal working age and is grouped into career-dependent age ranges (e.g., - , or above ).
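To make the notions of joint distribution and marginals concrete, the following sketch (with purely hypothetical counts) builds a small age-by-gender contingency table and recovers the one-way marginals that are often the only data released:

```python
# Hypothetical counts: a joint age x gender contingency table and the
# scattered one-way marginals a modeller can recover from it.
joint = {
    ("0-17", "male"): 120, ("0-17", "female"): 110,
    ("18-64", "male"): 400, ("18-64", "female"): 410,
    ("65+", "male"): 80, ("65+", "female"): 105,
}

# One-way marginals obtained by summing the joint table over one attribute.
age_marginal, gender_marginal = {}, {}
for (age, gender), count in joint.items():
    age_marginal[age] = age_marginal.get(age, 0) + count
    gender_marginal[gender] = gender_marginal.get(gender, 0) + count

print(age_marginal)     # {'0-17': 230, '18-64': 810, '65+': 185}
print(gender_marginal)  # {'male': 600, 'female': 625}
```

The reverse step, recovering the joint table from the two marginals alone, is of course underdetermined, which is precisely the gap the estimation techniques reviewed below attempt to fill.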

Another type of macro data is statistical aggregate data, such as the average, median or quantile value of an attribute. These data concern a single numerical attribute, such as the age, salary or size of households. Such data are classically ignored in synthetic population generation but have been employed in recent methods (Gallagher et al. ; Saadi et al. ).
Most national statistical institutes release such data with open access. The data are often updated each year and reflect many social dimensions, such as work, consumption or opinion, in addition to basic demographic attributes.
The second type of data represents a limited sample of the whole population: The data can be composed of individual records of %, %, %, and in rare cases, % or more of the population. Because microdata directly depict real individual characteristics, they are often limited in scope due to practical and ethical considerations, either by the number of individual records or number of characteristics per individual, and in most cases, on both points. In certain cases, the records represent a class of entities rather than a particular individual and are assigned weights that represent the degree of importance of this class of entities in the sample, i.e., entities with the same vector of attribute values, considering the limited scope of represented attributes.
Samples or microdata are often presented as table-based data, in which each row represents an individual entity, and each column represents an attribute for which the entity has a particular value. As mentioned before, when rows do not refer to individual entities but a class of entity, a column is dedicated to the weights that the type of entity represents in the overall sample.
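As a minimal illustration of how such weighted class records can be turned into individual synthetic entities, the sketch below (hypothetical records and weights, standard-library Python) resamples records with probability proportional to their weights:

```python
import random

# Hypothetical weighted microdata: each record stands for a class of
# entities, with a weight giving its share in the sample.
records = [
    {"age": "18-64", "gender": "male", "weight": 3.0},
    {"age": "18-64", "gender": "female", "weight": 5.0},
    {"age": "65+", "gender": "female", "weight": 2.0},
]

def expand(records, n, rng):
    """Draw n synthetic individuals, picking each record with probability
    proportional to its weight (sampling with replacement)."""
    weights = [r["weight"] for r in records]
    drawn = rng.choices(records, weights=weights, k=n)
    # Drop the weight column: synthetic individuals carry attributes only.
    return [{k: v for k, v in r.items() if k != "weight"} for r in drawn]

population = expand(records, 1000, random.Random(42))
print(len(population))  # 1000
```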
Just like macro-data descriptors, the microdata of a population can usually be accessed through national statistical institutes on demand. A notable initiative led by the Integrated Public Use Microdata Series (IPUMS) makes it possible to freely access (identification required) country-level sample data for approximately countries worldwide.
Remarks on the data processing step in synthetic population generation
Most available tools to generate synthetic populations define their required input data either loosely or in heavily constrained ways. For instance, the iterative proportional updating (IPU) procedure proposed in Ye et al. ( ) must be fed with Public Use Microdata Series data (or PUMS, i.e., a sample of individual records with household identifiers, specified by the US census bureau), whereas the SPEW library (Gallagher et al. ) requires IPUMS microdata. In other words, even if modelers do have microdata, these must be formatted in the style of the US-based census microdata. Generalizing the principle means that for each synthetic population generation procedure, it is necessary to preprocess the data and adapt them to a specific data format. In this context, the existing synthetic population algorithms opt for one of two perspectives: focus on loosely defined types of data or manage a unique source of data. For the former approach, a considerable amount of effort is necessary to adapt data to the processing pipeline. The latter approach forces people to rely on a certain data format, leading to data manipulation to fit the required format and type, which is not always possible. Even if no processing pipeline can handle every data format, engaging in the generation of a synthetic population will always require data preprocessing to address several limitations.
In this section, we summarize the major issues encountered by modelers interested in building a synthetic population based on available loosely structured data.

Data incompleteness:
The first issue relates to the missing parts in the data. This aspect is critical because it represents the principle of population synthesis: having a complete view of the entities' attributes ensures the generation of the best possible synthetic population by simply reading the source data. In this regard, having access to a sample of the targeted population is often a simple and effective technique to build a synthetic population, which requires simply initializing agents as the exact reflections of individual records. When modelers want to adapt these microdata to particular constraints (e.g., specified spatial extensions, creation of more individuals than those in the data, or use of statistical weights attached to records), to add unrecorded attributes or when there is no sample available, missing data must be identified. In most cases, the lack of data is expressed by unrecorded relationships between attributes, and synthetic reconstruction attempts to combine multiple sources to compensate for this lack of information.

Data incongruity:
When several pieces of information target the same attribute but with various encoding forms, there exists a mismatch between data records. For example, age can be encoded in various forms from continuous integer to categorical values of various ranges, and each range can be used differently when crossed with other attributes. For example, in the data provided by the French Institute of Statistics (INSEE), the age category begins at and ends at when crossed with professional status, whereas it is usually expressed as an integer when crossed with gender.

Data inconsistency:
Available data can be present in various shapes, formats and contents. Except in the scenario in which modelers can base the generation process on a single source, data are usually scattered into pieces of information that first require harmonization to be used in conjunction for the population synthesis process. Generic tools to generate synthetic populations, e.g., those proposed by Gallagher et al. ( ) and Chapuis et al. ( ), emphasize this harmonization step. In contrast, most influential theoretically proposed approaches in this domain minimize or simply ignore this aspect (Müller & Axhausen ; Ye et al. ; Sun et al. ). In particular, the approaches are based on a given and controlled set of data, and thus, the procedure may not be effective when slightly different data are input. This aspect is a key concern for the usability of the proposed methods in generating synthetic populations.

Second step: Generation of a synthetic population
In this subsection, we describe the main procedures for generating synthetic populations in agent-based social simulations. The discussion comprises two parts: the first focuses on techniques similar to SR principles, in which the generation process is based on the estimation of the best possible underlying distribution of attributes; the second is centered on techniques related to CO principles, in which the generation process is based on the available individual records of real entities to reproduce in the synthetic population.
SR defines a set of methods that reconstruct individual entities. From the perspective of SR generation, synthetic individuals are attribute vectors and the generation process consists of fulfilling each individual vector with appropriate values. We detail this generation procedure known as sampling in the following sub-section.
In the next sub-section, we discuss distribution estimation algorithms that can be used to facilitate the sampling procedure.
Sampling procedure: Sampling methods include all methods based on the drawing of records associated with given probability distributions. When applied to synthetic population generation, sampling is performed to determine the characteristics of individuals using discrete probability distributions. In only very few cases, continuous distributions can be used. Sampling can be performed by selecting an entire vector of characteristics, e.g., an individual, or by gathering separately drawn characteristics.
The most basic sampling technique is the Monte Carlo method (Harland et al. ), which uses raw input data knowledge regarding the distribution of attributes. This default technique can be labeled direct sampling (DS) to emphasize that it relies on the raw available data. Individual characteristics can either be drawn sequentially or several at a time if the input data include records on relationships between attributes. However, as the number of population attributes increases, building the multiway joint distribution becomes infeasible. Therefore, a realistic dependency structure between attributes must be determined upstream. Subsequently, the synthetic population can be generated iteratively using Bayesian rules on conditional distributions (Gargiulo et al. ).
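A minimal direct-sampling sketch, assuming a fully known joint distribution over two attributes (all frequencies hypothetical):

```python
import random

# Hypothetical joint distribution over two attributes; each cell holds the
# probability of the corresponding attribute combination.
joint = {
    ("18-64", "employed"): 0.50,
    ("18-64", "unemployed"): 0.10,
    ("65+", "employed"): 0.05,
    ("65+", "unemployed"): 0.35,
}

rng = random.Random(0)
cells = list(joint)
probs = [joint[c] for c in cells]

# Each draw yields a complete synthetic individual in one step.
population = [dict(zip(("age", "status"), cell))
              for cell in rng.choices(cells, weights=probs, k=500)]
print(population[0])
```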

Hierarchical sampling (HS) (Barthelemy & Toint ) is the most basic model to account for this type of sampling method. This solution is extremely flexible, but the user must manually define the hierarchy of attributes: which attribute(s) should be drawn first, followed by the other attributes to be drawn given the previously determined attribute value(s), down to the last attribute.
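The hierarchy described above can be sketched as follows, with a hypothetical two-level hierarchy in which age is drawn first and occupation status is drawn conditionally on it:

```python
import random

# Hypothetical hierarchy: age is the root attribute; occupation status is
# drawn conditionally on the age value already assigned.
age_dist = {"18-64": 0.8, "65+": 0.2}
status_given_age = {
    "18-64": {"employed": 0.7, "unemployed": 0.3},
    "65+": {"employed": 0.1, "unemployed": 0.9},
}

def draw(dist, rng):
    values, weights = zip(*dist.items())
    return rng.choices(values, weights=weights)[0]

rng = random.Random(1)
population = []
for _ in range(200):
    age = draw(age_dist, rng)                  # draw the root attribute
    status = draw(status_given_age[age], rng)  # then its dependent attribute
    population.append({"age": age, "status": status})
print(population[0])
```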
The use of a Bayesian Network (BN) in synthetic population generation extends and generalizes the principle of HS by using a graphical model (Sun & Erath ). BN-based samplers work iteratively, drawing characteristics starting from the root node(s) and continuing down the network path of the graphical model. Such samplers have a notable advantage compared to HS as they can manipulate classical learning techniques to automatically determine the graphical model and/or missing parameters (see Paragraph . ). This approach is very flexible in terms of input data: it is relatively easy to learn parameters of the BN from macro and/or microdata (Thiriot & Kant ). The graphical model can be retrieved from the distribution of attributes when they are available but must be estimated when initialized using microdata (Sun & Erath ).
The Markov chain Monte Carlo (MCMC) approach (Farooq et al. ) has been used to directly sample individuals using a simulated distribution through Markov chains. Combined with a fitting algorithm, this procedure can generate complex and reliable synthetic populations (Casati et al. ).
In addition, SR methodologies based on graphical models, such as HS, BN and MCMC, are suitable in cases in which modelers need a multilayered synthetic population. In contrast to the basic Monte Carlo sampling methodologies, the graphical model can mix items - i.e., nodes with attached parameters or states with transition probabilities - to assess the attribute probability distribution of several types of entities. For example, Sun & Erath ( ) generated individuals into households using a unique graphical model to represent the distribution of attributes for both types of entities, while Casati et al. ( ) generated the same type of synthetic population using a Markov chain model to represent both individual and household entities.
Distribution estimation algorithm: The sampling algorithms described above draw characteristics from the known distribution(s) of attributes. A key drawback is that all unknown relationships between attributes end up statistically independent in the generated synthetic population. Because a higher distribution quality corresponds to a superior sampled population, estimating the underlying multiway joint distribution of attributes can enhance the sampling output. In the literature, this procedure is also referred to as the fitting step of SR methodologies. We would like to emphasize that this step is not mandatory, and many SR algorithms, such as DS or HS, do not use distribution estimation algorithms. Recently, algorithms such as Gibbs-sampling MCMC, which combine distribution estimation and sampling in one framework, have been adopted. Here, we review the most commonly used techniques to estimate the underlying multiway distribution.
The iterative proportional fitting (IPF) (Beckman et al. ) process is widely used in the domain of synthetic population generation (Müller & Axhausen ). The algorithm fits each cell of an n-dimensional matrix (distribution of attributes) according to known marginal controls (Stephan ). The algorithm uses sample data as a seed to fill the matrix that describes the distribution of attributes, and iteratively updates the matrix cells to fit the known contingency dimensions. For more details regarding the mathematical description and new insights, please refer to the work of Lovelace et al. ( ).
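A minimal two-dimensional sketch of the IPF procedure, with hypothetical seed and marginal values; real implementations handle n dimensions, convergence tests and zero cells:

```python
# Hypothetical 2x2 seed (e.g., taken from a sample) fitted to known
# row and column marginal targets.
seed = [[10.0, 20.0],
        [30.0, 40.0]]
row_targets = [40.0, 60.0]  # marginal of the first attribute
col_targets = [35.0, 65.0]  # marginal of the second attribute

def ipf(matrix, rows, cols, iterations=100):
    m = [r[:] for r in matrix]
    for _ in range(iterations):
        # Scale each row so that row sums match the row marginals.
        for i, target in enumerate(rows):
            s = sum(m[i])
            m[i] = [v * target / s for v in m[i]]
        # Scale each column so that column sums match the column marginals.
        for j, target in enumerate(cols):
            s = sum(row[j] for row in m)
            for row in m:
                row[j] *= target / s
    return m

fitted = ipf(seed, row_targets, col_targets)
print([round(sum(row), 3) for row in fitted])  # row sums converge to the targets
```

The fitted matrix preserves the interaction structure of the seed while matching the marginals, which is exactly the behavior (and the seed dependence) discussed in the criticisms below.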
IPF has been criticized in many aspects, including the "zero cell problem" (Choupani & Mamdoohi ), the "curse of dimensionality" (Casati et al. ) and the potential non-convergence of the algorithm. The most notable issue, however, is the inability to consider multiple statistical levels of constraints - i.e., multilayered populations (Guo & Bhat ). Hierarchical IPF (Müller & Axhausen ) and IPU (Ye et al. ) have been developed to overcome this issue. These methods compute the factors and weights associated with individual and household records either iteratively or by using household and individual categorization, respectively. The basic idea of proportional updating is the same as that of the original IPF procedure. The main difference pertains to the definition of the matrix dimensions and marginals. These techniques have been developed in the narrow context of individuals in households and have only been used with a two-layered population. Moreover, these approaches necessitate preprocessing of the input data to fit the algorithm requirements and are extremely stringent regarding the type of data needed: sample and aggregated data must be available at both the individual and household levels.
Recently, MCMC (Farooq et al. ) and BN (Sun & Erath ) models have been used to represent the population distribution of attributes. Both frameworks provide the user with a graphical model and techniques to estimate the missing data. For example, the Metropolis-Hastings algorithm can estimate the multiway joint distribution through the MCMC procedure (Kim & Lee ), while several fitness-based learning algorithms can be used to estimate the BN graphical model and its parameters (Sun et al. ). These two methodologies can help establish sampling algorithms based on a Markov chain or Bayesian network and joint/conditional distribution estimation techniques that can be used for single-layered populations and multilayered populations.
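The Gibbs-sampling flavor of MCMC can be sketched as follows for two binary attributes, assuming (hypothetically) that only the conditional distributions are known; each retained state of the chain is a candidate synthetic individual:

```python
import random

# Hypothetical conditionals for two binary attributes A and B.
p_a_given_b = {0: 0.3, 1: 0.7}  # P(A=1 | B=b)
p_b_given_a = {0: 0.4, 1: 0.6}  # P(B=1 | A=a)

rng = random.Random(7)
a, b = 0, 0
samples = []
for step in range(5000):
    a = 1 if rng.random() < p_a_given_b[b] else 0  # resample A given B
    b = 1 if rng.random() < p_b_given_a[a] else 0  # resample B given A
    if step >= 500:  # discard a burn-in period
        samples.append((a, b))
# Each retained (a, b) state is one candidate synthetic individual.
print(len(samples))  # 4500
```

In real applications the state is a full attribute vector and the conditionals are estimated from the available micro and macro data.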
Emerging deep learning trends: The latest addition to the set of SR techniques is deep learning generative methods. The use of a deep neural network (DNN) is straightforward. A DNN here refers to a set of techniques that learn a sophisticated, network-embedded version of the underlying distribution from a vector-based representation of the records in a data set. Here, the autoencoder approach is particularly effective for data synthesis.
The concept of such DNNs is to train two networks. The first network decreases the dimensionality of the data to a bottleneck representation, while the second network expands this shortened representation to a fully explicit record. Using unsupervised learning techniques, these two coupled networks can learn how to generate new records and have been used in the context of transport-related research, especially in the form of a variational autoencoder (VAE) (Garrido et al. ). However, the training requirements of such algorithms are a limitation. Although these algorithms are effective when provided with an extremely large set of data, their performance is inferior when there are few data records, when data are missing (i.e., data incompleteness) and when they appear in various shapes and formats (i.e., data incongruity and inconsistency), as is usually the case in population synthesis (see Subsection . ).
CO methods draw individuals from a sample, with or without replacement, to satisfy a fitness criterion. This convergence criterion is usually built using input data regarding the distribution of attributes at the macrolevel. In the following subsections, we briefly examine the fitness computation procedures and optimization algorithms used to monitor CO-based generation.

Fitness computation procedures:
The objective of fitness computation is to assess the distance between the distribution of attributes in a generated population and information available regarding the real distribution of attributes. The fitness can be computed using two types of aggregations: Numerical aggregation, based on an aggregated account of distance, such as the standard root mean square error (SRMSE) (Otani et al. ); and categorical aggregation, based on a Hamming-like distance, for example the total absolute error (TAE), which is the sum of misclassified records in a synthetic population (Williamson et al. ). In most cases, the fitness indicator is similar to that used to evaluate the quality of the generated population (Subsection . ). Hence, the CO principle is to maximize the general quality of the generated synthetic population through an iterative optimization process. A custom fitness function could be used based on several indicators, such as the statistical moment on different attributes (e.g., quartile of income or mean age), combined with several well-known indicators such as SRMSE and TAE. In most cases, a single aggregated fitness criterion fulfills the requirement even if a multi-criteria fitness function could be used. However, these functions are difficult to monitor and may considerably increase computation time. Finally, although fitness computation can be realized using only raw input data, CO methods generally rely on a distribution estimation algorithm such as IPF (Voas & Williamson ) to enhance the knowledge regarding the targeted distribution of attributes.
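The two families of fitness computations can be sketched as follows; the cell counts are hypothetical, and the SRMSE normalization shown (RMSE divided by the mean target count) is only one of several formulations found in the literature:

```python
import math

# Hypothetical cell counts for a target population and a candidate
# synthetic population.
target = {"male": 600, "female": 625}
generated = {"male": 580, "female": 645}

def tae(gen, tgt):
    # Total absolute error: sum of absolute cell differences.
    return sum(abs(gen[c] - tgt[c]) for c in tgt)

def srmse(gen, tgt):
    # One common standardization: RMSE over cells divided by the mean
    # target count.
    n = len(tgt)
    rmse = math.sqrt(sum((gen[c] - tgt[c]) ** 2 for c in tgt) / n)
    return rmse / (sum(tgt.values()) / n)

print(tae(generated, target))  # 40
print(round(srmse(generated, target), 4))
```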

Optimized sampling algorithm:
In principle, all fitness-based optimization algorithms can be used to generate a synthetic population from a CO perspective. Simulated annealing (Harland et al. ), hill climbing (Kurban et al. ), genetic algorithms (Said et al. ) and greedy heuristic (Srinivasan et al. ) approaches have been used in this context. The procedure involves the establishment of an initial random population from a sample and iterative modification of this initial solution to obtain a higher fitness (Williamson et al. ). The mechanism that changes the population depends on the algorithm and the modeler's choices. The most commonly used strategy is to swap a randomly selected individual with a potential replacement individual; however, the scope is not limited to this transition function. For example, in their hybrid solution to synthetic population generation, Barthelemy & Toint ( ) modified individuals' characteristics to enhance the overall fitness. Globally, the crucial concept is to compute a close solution (often termed a neighbor) and move along a path of randomly selected solutions to find the most satisfying one. No theoretical requirement is imposed on the movement from one solution to another, and several individuals can be swapped under stringent selection criteria (for example, two individuals whose vectors of attributes are within a given Hamming distance of each other can be swapped). In the optimization process, genetic algorithms have not been widely used but appear to be promising and flexible candidates (Williamson et al. ). These algorithms maintain multiple solutions and combine them iteratively to achieve a superior solution. Unlike simulated annealing, tabu search or hill climbing algorithms, genetic algorithms are less susceptible to becoming stuck in local optima (Otani et al. ). However, these algorithms involve a large number of parameters and require considerable modeling effort. Notably, the overall optimization procedure from the CO perspective depends considerably on the modelers' choices regarding the fitness criteria and the generation and exploration of neighbor solutions.
These questions do not have default or basic answers and must be addressed by the modelers.
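A compact sketch of swap-based simulated annealing for CO generation, with a hypothetical sample, target marginal and cooling schedule; the fitness is the TAE on a single attribute:

```python
import math
import random

# Hypothetical sample records, target marginal and annealing parameters.
sample = [{"age": "18-64"}, {"age": "18-64"}, {"age": "65+"}]
target = {"18-64": 70, "65+": 30}  # desired counts in the synthetic population
size = 100

def tae(pop):
    counts = {k: 0 for k in target}
    for ind in pop:
        counts[ind["age"]] += 1
    return sum(abs(counts[k] - target[k]) for k in target)

rng = random.Random(3)
pop = [rng.choice(sample) for _ in range(size)]
error = tae(pop)
temperature = 10.0
for _ in range(2000):
    i = rng.randrange(size)
    old, pop[i] = pop[i], rng.choice(sample)  # swap in a sampled record
    delta = tae(pop) - error
    if delta <= 0 or rng.random() < math.exp(-delta / temperature):
        error += delta       # accept the (possibly worse) swap
    else:
        pop[i] = old         # revert the swap
    temperature *= 0.995     # cool down
print(error)  # residual TAE, typically 0 after cooling
```

The transition function (replace one individual), the fitness (TAE) and the acceptance rule are exactly the modeler's choices discussed above; any of them can be swapped for an alternative without changing the overall scheme.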

Last step: Validation of the synthetic population
Validation is usually performed considering the distance metric between generated and input marginals for the attributes of synthetic entities, e.g., the distance between the distribution of age in the synthetic population and input data regarding the distribution of age. In most cases, the quality of the synthetic population is assessed using the same dataset that has been used for the generation process. Hence, the validation of the synthetic population is performed by measuring the distortion introduced by the generation process using the aforementioned distance metric. To this end, several indicators have been proposed.

Indicators:
The total absolute error (TAE) is the simplest quality indicator. This error is the count of misclassified entities in the synthetic population (Williamson et al. ), with the misclassification evaluated using the absolute differences in the table or matrix cells. The TAE index examines the number of entities with particular attribute characteristics, such as being male or unemployed, and compares it to the number of people with these characteristics in the targeted population. When the relationships between the attributes are available, the indicator can examine the cross-classification, such as married males aged between and years. As an absolute indicator, TAE may be difficult to interpret. To alleviate this difficulty, the proportion of good prediction (PGP), which normalizes the count of misclassified entities, can be used. In this case, the TAE is divided by the maximum absolute error, which depends on the TAE computation and the known relationships between attributes in the input data, i.e., the number of "classes" available in the input data and the number of classes compared to assess the TAE.
In addition to the error for the overall population, the absolute average percentage difference (AAPD) or relative absolute error (RAE) can be considered to assess the average disruption introduced by the generation process. In contrast to PGP and TAE, these indicators focus on the expected error for any class of attribute (or combination of attributes) characteristics. Moreover, it would be interesting to examine the standard deviation to better understand how misclassification is distributed along the distribution of attributes. The (standard) root mean square error (RMSE or SRMSE) is the most commonly used indicator. This indicator is similar to the two preceding indicators as its core mechanic is to aggregate the error over each class of records in the input data. However, this value can be computed in several ways, rendering it complex to set up and to interpret.
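A small worked example of these indicators, with hypothetical class counts; note that the normalization used for PGP below (dividing the TAE by the population size) is one simple choice among those used in the literature:

```python
# Hypothetical class counts for the target and generated populations.
target = {"employed": 500, "unemployed": 100, "inactive": 400}
generated = {"employed": 520, "unemployed": 90, "inactive": 390}

tae = sum(abs(generated[c] - target[c]) for c in target)
total = sum(target.values())

# PGP here normalizes the TAE by the population size; RAE gives the
# relative error for each class separately.
pgp = 1 - tae / total
rae = {c: abs(generated[c] - target[c]) / target[c] for c in target}

print(tae)  # 40
print(pgp)  # 0.96
```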
More promising but complex indicators include the relative sum of squared Z-scores (RSSZ) (Huang & Williamson ) and RSSZ* (with a modified Z-score) (Williamson ). These indicators can aggregate both errors for the entire population and each class of records into a single indicator.
Table presents an overview of the reported fitting measures used to assess the synthetic population quality. We have attempted to provide synthetic information regarding these measures: the type of difference that the measures encode (relative or absolute), the scale at which these measures operate (global or local) and a synthetic mathematical notation. It is difficult to specify a unique notation to describe how each measure is computed, mostly because different authors use their own notation and there is a lack of homogeneity in the definition of certain measures. For simplicity, we denote a vector of attribute values as x, where X is the set of all such vectors.
[Table content not reproduced here.]
Table : Main indices used to assess the synthetic population quality, with the name of the measure, the type of measure and the scale of reference used to build these measures. We provide an abstract formulation of each indicator to ensure that every indicator can be examined in the context of the other indicators. For details regarding the computation, please refer to the previously cited studies.
For an overview of indicators to assess the synthetic population goodness-of-fit, readers can refer to the related works focusing on that point, e.g., Voas & Williamson ( ) and Timmins & Edwards ( ).

Synthetic Population Generation in Social Simulations
As mentioned in the previous section, many methods have been established to build synthetic populations. As described in the following sections, we investigated how these methods are actually used by social simulation modelers. To answer this question, we analyzed published models and identified the methods used to generate the set of simulated agents.
In general, performing such analyses is time consuming because it requires the examination of the models regarding the initialization of agents and the code to identify the actual algorithm(s) used to generate agents' attributes. Following a semisystematic literature review methodology, we focused on models published solely in JASSS. This method is semisystematic because it limits the search to resources published in a single journal, while relying on the systematic framework to filter and select relevant articles and extract and analyze the content of the models.
We reviewed papers published over ten years in JASSS as an indicator of the practices of social simulation modelers related to synthetic agent population generation. This review was limited to a single journal for two reasons: First, this journal is a key resource in the field of social simulation, and second, a canonical systematic search of thousands of publication titles yielded certain results on methodological aspects, but almost none on the actual initialization of social entities in simulation models. For instance, as of July , a Google Scholar search with "synthetic population generation" returned mostly theoretical papers presenting a dedicated algorithm or approach to generate a synthetic population.

Methodology
With the exception of the search phase, we chose to adhere to the requirements of systematic review and the PRISMA statements (Moher et al. ). These requirements included the selection of relevant articles based on a careful reading of the titles, abstracts, and/or parts of the articles (selection phase in the Screening Subsection); the definition of a content extraction framework to build a coherent set of data from the relevant articles (content extraction phase in the Extraction Subsection); and the systematic analysis of the recorded content using a quantitative and narrative analysis review (analysis, synthesis and review phases in Section . ). Finally, we summarized the gaps and lessons learned from the practices to identify issues and challenges in using synthetic population generation in the field of agent-based social simulation.

Screening
The main criterion for selecting relevant papers for the analysis phase was simple: papers in JASSS that included a simulation model description were considered eligible. We excluded, a priori, all review papers and contributions that were not based on a simulation model, for example, a theoretical proposal such as that of Jager ( ). In several proposals, although a simulation model was included, the main focus was not the model itself but either the analysis of the simulation (e.g., Thiele et al. ), extensions of existing models (e.g., Taghikhah et al. ) or an application of a generic framework (e.g., Bourgais et al. ). Conversely, many papers presented theoretical models that were not prima facie concerned with synthetic population generation due to the abstract nature of the simulated entities. Nevertheless, we chose to include these papers in the analysis to identify how modelers in this context choose to obtain the values of agents' attributes and how this aspect relates to more refined synthetic population generation practices.
Moreover, we analyzed articles published between the rd issue of  (June ) and the nd issue of  (March , the last available issue when this search was conducted). This range involved  articles spread over  issues, among which  featured a simulation model. Figure  illustrates this distribution. Despite the wide variety in the number of papers per issue, most issues were composed of a majority of papers describing a simulation model (between  and ), with the exception of two special issues. Among the selected articles,  focus on the synthetic population, i.e., the main objective of the article is to present a method for generating a set of representative agents based on demographic data (Yameogo et al. ).

Content extraction
In the systematic review exercise, content extraction referred to the extraction of data that served as the basis for a systematic content analysis. Explicit rules were used to format the content in a way that decreases the reader's subjectivity.
The hypotheses presented in the introduction guided the codebook used for content extraction. Specifically, we identified key aspects to answer questions regarding the procedure, input data, application scenario and adopted models. Content extraction was performed by first examining eligibility in terms of the inclusion criteria, then describing the model features and the generation process.
• Following the PRISMA framework (Moher et al. ), the first step was to establish the eligibility criteria based on the title, abstract, and sequential reading of the article to assess inclusion or exclusion in the subsequent analysis step. If the article contained a simulation model description, it was selected and we continued to extract the content; otherwise, the article was excluded from further analysis.
• Once an article was considered for further analysis, we highlighted two main features of the model: first, the entities to be generated (i.e., agents) and their attributes, and second, information regarding the case study, including location, time, and whether GIS data were used. If the article involved a systematic protocol to describe the simulation model, such as ODD or STRESS, it was recorded, and most of the content extraction was based on parsing the corresponding sections of the protocol, e.g., in the case of ODD, the entities, initialization, and input data sections (Grimm et al. ). Otherwise, we identified the corresponding sections in the article and examined the source code when available.
• Finally, we examined the process of the agents' generation. First, we identified whether an explicit process was presented in the article. We noted all the algorithms used to choose the values of the attributes, the algorithm parameters and the data types and sources that the algorithm was based on. The algorithm and data type were recorded using closed lists: constant, random function, calibration, synthetic reconstruction, combinatorial optimization, raw data, and NA for the algorithm; and sample, survey, contingency, distribution, expert knowledge, statistical moment, and NA for the data type. Readers can refer to Appendix A for more details on these categories.
All the information was recorded on a record sheet in a tabular format. Each criterion extracted from the reviewed articles and their related resources answered a given question, and the set of questions that constituted the codebook guide is presented in Appendix A.

Quantitative and narrative review of the synthetic population generation practices of modelers in JASSS
The corpus was analyzed in both a quantitative and a narrative manner: In the quantitative analysis, basic statistics from the codebook (Appendix B) were derived, while the narrative analysis was based on a systematic narrative report by the researchers involved in building the codebook.

Algorithms and data used to initialize the synthetic population of agents in simulation models
First, we drew a simple distribution of the procedures and of the data used in these procedures within the corpus of reviewed models. In many cases, several techniques were used for one model, and thus, the distribution of procedures does not add up to 100%: For instance, Gore et al. ( ) used  techniques: constant values, a random function and a CO algorithm based on various input data to initialize the agent attributes.
In terms of the synthetic generation techniques, we could not identify  models, and  of the techniques referred to unclear procedures (i.e., the NA code). Table  presents the global distribution of algorithms used for the synthetic population initialization:  of the models relied on a generic random function, such as a uniform or normal distribution, and  models relied on a constant value. SR techniques were used in more than % of the models published in JASSS in the past ten years. Overall, Hypothesis H was confirmed by the data, despite a fair proportion of models referring to an explicit synthetic population generation procedure. Table : Number and proportion of models based on a specific generation procedure. Each model can use several algorithms to initialize the synthetic population of agents. The numbers of models that relied solely on the corresponding procedure are specified in parentheses.
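To make the distinction between these categories concrete, the sketch below contrasts a generic random-function initialization with a minimal sample-free synthetic reconstruction draw; the age classes and proportions are hypothetical marginals, not taken from any reviewed model.

```python
import random

random.seed(42)
N = 1000

# Generic "random function" initialization: attribute values drawn
# uniformly, with no reference to any target population.
agents_random = [{"age": random.randint(18, 90)} for _ in range(N)]

# Minimal sample-free synthetic reconstruction: each attribute value is
# drawn from an observed marginal distribution (hypothetical proportions).
age_classes = ["18-39", "40-64", "65+"]
proportions = [0.35, 0.41, 0.24]
agents_sr = [
    {"age_class": random.choices(age_classes, weights=proportions)[0]}
    for _ in range(N)
]
```

In the second case, the resulting class frequencies approximate the target marginals, which is the minimal sense in which a population is "synthetically reconstructed" rather than merely randomized.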

Valuable information regarding agent attribute initialization can be obtained by examining the input data used by modelers. We could not identify any empirical data grounding the process of synthetic population generation in a majority of the reviewed models, i.e.,  models ( . %) did not use any source of data to inform agent attribute generation.
Table  presents the number of models that initialize the agent attributes according to the data type. Samples, surveys and distributions constituted most of the real data regarding the population used to drive the generation process. However, the relatively high use rate of "expert knowledge" combined with unknown sources (the NA code) rendered it challenging to understand how the generation process builds upon data. In addition to the majority of models that initialize agents using no data, the loosely structured / empty data sources accounted for almost  of the models ( . %). Hypothesis H a was thus confirmed by the practice that the initialization of agent attributes was rarely based on data regarding the target population.

Crossing the procedure of synthetic population generation with input data types and use cases
When we crossed the models for which the input data type was available with their approach to synthetic population generation, the distribution of mobilized algorithms changed: As shown in Figure , synthetic reconstruction (yellow bars with a cumulative count of ) was the second most commonly used procedure to create agents after the random function ( ). As expected, when a sample of the original population was available, the model relied on the raw data (i.e., one record in the sample equaled one agent in the simulation model) and extremely few models ( ) built the synthetic population using SR/CO methods ( applied IPF and  applied CO). Modelers often implemented SR techniques when they could manipulate contingencies, distributions and statistical moments, i.e., aggregated data regarding the target population. This observation is consistent with early synthetic population methodological contributions in JASSS recommending sample-free procedures to generate the initial set of agents (Barthelemy & Toint ; Lenormand & Deffuant ). Another interesting outcome was the overall complexity of the generation procedure when modelers relied on rich data sources, such as samples and surveys. In these cases, the method to create the agents' set of attributes was often a combination of two or three approaches among raw data, random functions, constant values and SR techniques. When no data were used, modelers relied heavily on random functions ( , . % of the corpus) and constant values ( , . % of the corpus).
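As a reminder of how SR techniques operate on such aggregated data, the following is a minimal two-dimensional IPF sketch with a hypothetical seed and hypothetical marginals: a seed cross-tabulation, e.g., taken from a small sample, is alternately rescaled until it matches the known row and column totals.

```python
def ipf(seed, row_marg, col_marg, iters=100, tol=1e-9):
    """Fit a 2-D contingency table to known marginals by iterative
    proportional fitting."""
    table = [row[:] for row in seed]
    for _ in range(iters):
        # Rescale each row to match its target marginal.
        for i, row in enumerate(table):
            s = sum(row)
            if s > 0:
                table[i] = [v * row_marg[i] / s for v in row]
        # Rescale each column to match its target marginal.
        for j in range(len(table[0])):
            s = sum(row[j] for row in table)
            if s > 0:
                for row in table:
                    row[j] *= col_marg[j] / s
        # Stop once row totals are also satisfied after column scaling.
        if all(abs(sum(row) - m) < tol for row, m in zip(table, row_marg)):
            break
    return table


# Hypothetical seed (sex x employment) fitted to hypothetical census totals.
seed = [[10, 5], [6, 9]]                     # rows: M/F; cols: employed/not
fitted = ipf(seed, row_marg=[60, 40], col_marg=[70, 30])
```

Agents can then be drawn from the fitted cell weights, which is precisely where SR differs from using the raw sample record-for-record.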
In addition to source data, we also correlated synthetic population generation processes with the use of GIS data and reports of a real-world case study application. In both cases, SR techniques were overrepresented, with . % and % of the models using these techniques to build the initial population of agents. If raw data and CO were included in the pool of established synthetic population techniques, half of the models associated with a real case study relied on these procedures. In terms of the use of data and GIS/real-world case studies, models more often built their synthetic population with refined techniques. Hence, the studied corpus tended to confirm Hypothesis H : When modelers used data (including GIS data) and real-world applications, more models generated the synthetic population of agents with established procedures.

General trends over time and subdomains of the simulation models
To examine the evolution of the use of synthetic populations, we correlated the descriptive statistics with the years of publication and the keywords. Figure  shows the trends regarding both aspects of agent initialization: Figure a aggregates the proportion (number) of models that relied on real-world data (all models except those with the NA and expert knowledge codes) per issue, while Figure b depicts the cumulative proportion (number) of models that used data to implement a known procedure of synthetic population generation (i.e., using either raw data, synthetic reconstruction or combinatorial optimization).
Despite the deviations, we observed a trend toward the inclusion of data regarding the targeted real population and of explicit procedures related to synthetic population generation. Indeed, both best-fit linear regression models exhibited a gradual trend toward a more descriptive population of agents in the models published in JASSS. However, comparing the proportion of data use with the actual use of well-established methodologies to construct the synthetic population of agents indicated a significant difference: On average, . % of the published models used conventional data regarding the target population, and only % relied on a dedicated algorithm to generate the synthetic population.
To link practices with subdomains, we performed an occurrence-based analysis of the terms in the article keywords. Keywords can well approximate the subdomain, as they are specified by the model developers. We identified each keyword as a token and collected occurrences for each model; all keywords similar to "agent-based model and simulation" were removed.
The most mentioned keywords are plotted in Figure . Blank bars show the number of articles with the corresponding keywords, while yellow bars display the subset of models using synthetic reconstruction. Opinion dynamics and social networks represented the most influential keywords in the journal, followed by more topic-related items such as social influence, cooperation, segregation or trust. Except for social network-related models, most synthetic populations built using synthetic reconstruction did not pertain to the most prominent subdomains of simulation studies in the journal. The most influential themes in the corpus tended to use a random function or constants. Nonetheless, social network studies emphasized initialization, with an average of .  distinct procedures used to generate the synthetic population of agents per model. These approaches often relied on a random function and constants. However, these approaches were driven by a clear tendency to comply with the available data (Section on synthetic population and synthetic network).

As described next, we performed a more qualitative assessment of the reviewed articles. This narrative explanation of the field practices regarding synthetic population generation focused on three dimensions of the analyzed models: agent attributes and initialization, differences in these aspects for different types of models, and the actual methods to realize these aspects.

Description of the agent attributes and their initialization. The process used to generate the set of agents is often poorly described. This is alleviated by the use of the ODD protocol, as it forces modelers to explicitly discuss the model initialization at the start of the simulation and the data used to implement the process. However, even though the adoption of standardized narratives to describe this early simulation step facilitates the identification of the key aspects of synthetic population generation, these narratives do not provide adequate guidance in this regard. As a result, the descriptions of different models are difficult to compare. Identifying the method with which modelers initialize the population of agents requires examining the source code when available. Similarly, a standard procedure regarding the description of the simulation experiment is lacking, even when the ODD protocol is used. For instance, in numerous examples (Chen et al. ; Houssou et al. ; Xiong et al. ; Muelder & Filatova ), it is difficult to identify the number of agents simulated and the number of time steps for which the simulation is performed, either because information is missing or because several values are specified for various experimental setups. Moreover, for clarity of presentation, the attribute descriptions often do not match the variable names in the model code.

In any case, basic information regarding the simulated agent properties cannot be systematically extracted from articles, as no standard methods are available to describe the model and the simulation. The model, its implementation and the simulation experiments are three aspects of most reviewed models. Therefore, the description of these entities must be extensive to ensure that a reader can fill the gaps between these aspects. The synthetic population of agents lies between the model and the simulation: The model defines the attributes that must be assigned to the agents, while the simulation pertains to the initialization of their values. In both cases, considerable work must be done to ensure that a reader can clearly observe how these aspects are managed, implemented and transformed into simulation results. Furthermore, for certain models, the generation of the agent attributes is part of the simulation process or an outcome of the model: Agents may be created during the simulation rather than in the initial stage (Houssou et al. ). In other cases, agent attributes must be generated by the model (Silverman et al. ).
Synthetic populations are not considered in all models. While descriptive ABMs with social entities fit our extraction framework, there is an important set of models, ranging from extremely abstract agent-based models (e.g., bounded confidence and derivative models of opinion dynamics) to more classical agent-based systems (e.g., swarm or business process models), in which the population generation is not significant or is ignored. The initial values of the agent attributes are often randomly drawn in an interval (using a uniform or Gaussian distribution) or simply not considered important/relevant in the narrative of the model presentation. Even in the social simulation domain, having a coherent and well-generated synthetic population of agents is not mandatory in many cases. As mentioned previously, extremely abstract models subscribing to the KISS adage, and even models closer to the agent-based system paradigm, do not seek to build a realistic set of agents in their experiments (one extreme example is presented in Tang & Zeng ( ), in which an agent is not mentioned throughout the whole article).

However, the set of models that lies between abstract and descriptive models, including the more popular models such as opinion dynamics and game-theoretical models, would benefit from using realistic synthetic populations of agents (Flache et al. ). For instance, considerable effort is expended in modeling realistic social networks while maintaining a low representativeness of the agent attributes. In general, abstract models tend to focus more on global or aggregated determinants of the implemented mechanisms (e.g., social influence, cooperation, segregation or trust) rather than on determinants that lie within agents. In this regard, the status of agent attributes remains unclear, especially when compared to that of the inner state variables. Presumably, the attributes to be generated in a synthetic population of agents pertain to a category of agent variables that drive the behavior and decisions of agents instead of being determined during the simulation. Hence, there exists a blurred distinction between agent attributes considered state variables (e.g., opinion in opinion dynamics models or utility in game-theoretical models) and agent attributes as determinant variables. However, most of these determinant agent variables are inherently generated as a global property linked to the relative position of agents, i.e., the entities to which agents are connected, either in a grid or a network.
Synthetic network generation. Most initial setups of agent attributes lie in their position in a network (or a simplified grid, which is simply used as a lattice considering a Moore or Von Neumann neighborhood), with most effort being focused on synthetic network generation. The position of an agent may be considered partly an attribute (e.g., the agent's living address) and partly an environmental feature (e.g., the distance between agents is defined by an underlying grid). In most cases, the second option is chosen, and the parameters and/or data used to generate the synthetic network express aggregated characteristics rather than local properties or agent attributes. A set of models should arguably be attached to the network modeling tradition rather than to ABM, with most "agent" (node) attributes being related to their ties.
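As a minimal sketch of this kind of positional initialization (all parameters hypothetical), the following computes the Moore neighbourhood of an agent placed on a square lattice, which is often the only relational "attribute" such models generate:

```python
def moore_neighbours(i, j, size, radius=1, torus=True):
    """Return the Moore neighbourhood of cell (i, j) on a size x size lattice.

    With torus=True the lattice wraps around, as in many NetLogo-style grids.
    """
    cells = []
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            if di == 0 and dj == 0:
                continue  # exclude the agent's own cell
            ni, nj = i + di, j + dj
            if torus:
                cells.append((ni % size, nj % size))
            elif 0 <= ni < size and 0 <= nj < size:
                cells.append((ni, nj))
    return cells


print(len(moore_neighbours(0, 0, size=10)))               # 8 on a torus
print(len(moore_neighbours(0, 0, size=10, torus=False)))  # 3 in a corner
```

Note that the parameters here (lattice size, radius, torus topology) are aggregated, environment-level characteristics, not agent attributes, which illustrates the distinction made above.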
A key issue related to the generation of a set of agents is defining the scope and problem of the synthetic population. In most cases, the problem is not well defined; e.g., not all models require a heavily data-oriented synthetic population generation process, whereas models disregarding this aspect could consider the use of simple yet dedicated sub-models for synthetic population generation. The global disinterest in established approaches can be attributed to this low expectancy regarding the realism of agents. However, we recommend examining this limited use in terms of accessibility: (i) methods may not be well understood because the domain of synthetic populations may involve ambiguities (Section . ) and a dispersion of the proposed methodologies; (ii) available tools are difficult to find and adapt to a particular case study; and (iii) tools may not be incorporated in the platform used to implement models and conduct simulation experiments, which renders the inclusion of the tools in the simulation pipeline challenging.
In addition to the neglect of synthetic population generation procedures, there are issues related to the description of agent creation and agent attribute definitions. While the use of ODD can enhance the reusability and replicability of models, the initialization step remains a bottleneck in the model description.
• Conflict within ODD between initialization (what is the process to build all the elements needed for the model simulation?) and data (how are the processes of the model, rather than the agent attributes, based on data? What is the influence of the initial value for a changing attribute?)
• Models often rely on a constellation of loosely categorized inputs generalized into a parameter/data dichotomy, while there exist constants, parameters for the simulation, the environment (global) and the agents (local), attributes and inner states of agents, and raw and preprocessed data, among other types of inputs.
• The source code is more or less easy to interpret and understand depending on the language / platform / toolkit used. Notably, it is considerably easier to understand noncompiled and readable code (text files) developed with an agent-oriented language (NetLogo, GAMA, etc.) or an object-oriented language.
Within the scope of this review, limiting the search to a specific journal involves certain biases. In particular, several well-known references cited in the first part of the paper come from a related field of research, i.e., transport modeling. Such models might be under-represented in JASSS while, at the same time, considerably influencing how researchers generate synthetic populations for use in ABM. It would be interesting to focus on simulation models published in journals such as Transportation Research, for instance, to gain knowledge regarding practices from a related field. As kindly stated by a reviewer of this paper, "the strategy of basing a review around material in JASSS made reasonable sense in view of the importance of the journal and its large canon of relevant literature". We believe JASSS offers a relatively domain-free view of practices related to ABM, although future work should focus on synthetic population generation for other types of simulation models, such as microsimulations. In summary, outcomes from practices in JASSS cannot be generalized to all ABM, especially when simulation models use mixed modeling techniques, which is often the case in transport modeling research and, more recently, in epidemiological research.
We did not perform a fully systematic review due to the initial results that we obtained from a systematic search: In the ABM domain, synthetic population generation represents a subfield. In other words, in searches based on generic tools, such as Google Scholar, Semantic Scholar or Scopus, and dedicated search engines, such as iris.ai, most of the results were concerned with methodological aspects, i.e., the proposal of a procedure to generate a synthetic population rather than a model featuring a synthetic population generation process. The outcome of a systematic search, although interesting when studying synthetic population procedures, did not reflect our subject of interest: actual synthetic population generation processes in agent-based simulation models. To review the practices, we were required to review simulation models more broadly. To ensure manageability, we therefore selected a specific journal.

Conclusion
Despite the trend toward the integration of realistic synthetic populations of agents, our review underlines several practical biases in the domain. We observe that models tend to integrate increasingly more data while disregarding the methodologies proposed to guide the creation of agents. Our investigation validates Hypothesis H-: The use of synthetic population generation approaches is still uncommon in social simulation. Specifically, modelers more often use generic-purpose initialization procedures, such as the assignment of constant values or sampling from a given continuous distribution (H-b). While data regarding the target population remain of limited use (H-a), an increasing number of models use such data to drive the generation of agent attributes. Finally, we identify disparities in practices according to the modeling target: When models are attached to real-world applications, they more frequently apply well-established synthetic population generation methodologies (H-).
Several hypotheses can be made to explain this state. The first aspect is the lack of data concerning several attributes; in particular, the abstract attributes related to social attitudes and mental states render it impossible to consider all agent attributes using the current synthetic population generation procedures. Another aspect is the lack of knowledge and control of the population generation approaches by modelers: In this case, modelers rely heavily on simple random methods, such as the uniform sampling of attribute values. Even if we consider that dedicated methodologies are available and known by most modelers, there remains a lack of accessibility, because a specific tool (usually an API) or programming language (such as R, Python or Java) must be used, which differs from the one used to implement the model and conduct experiments.
Considering these aspects, we have outlined several ways to foster the use of dedicated methods to build a realistic set of agents and to describe how the synthetic population is built in simulation models. First, the methodologies presented in the first section must truly focus on data harmonization and the integration of agent population synthesis. Furthermore, the proposed methods must come as software that is easy to couple with simulation tools, in the form of plugins for generic platforms or comprehensive APIs. From the modeler's viewpoint, enhancing the model description, in particular the data and initialization steps of simulation experiments, can enable the identification of appropriate features and tools to address various requirements in terms of synthetic populations. Not all models require the same type of agent population, although little is known regarding the diversity of goals and the extent to which the agents must be realistic. While considerable effort has been expended to generate realistic social networks, future work must focus on establishing reliable and reproducible synthetic populations in social simulations.

• Combinatorial optimization: Attribute value drawn with replacement from a known real individual (Section )
• NA: Unclear or unknown procedure to assign attribute values

Categories of input data type to implement synthetic population generation (Q . )
• Sample: Equivalent to microdata (Subsection . )
• Contingency: Aggregated data regarding the population in the form of counts of corresponding people, e.g., the number of men and women (Subsection . )
• Distribution: Aggregated data regarding the population in the form of proportions, e.g., the percentage of people aged under  (Subsection . )
• Statistical moment: Aggregated data about the population in the form of a synthetic statistical indicator, e.g., the mean age of the population (Subsection . )
• Expert knowledge: Second-hand information without a clearly identified source of data
• Survey: Social endeavor in the form of a questionnaire, direct observation or any participatory survey focused on a particular subject, e.g., a time use survey, in which people are asked to describe in a closed form how their schedule is actually organized; see, for instance, Eurostat: https://ec.europa.eu/eurostat/web/microdata/time-use-survey
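To fix ideas, the data type categories above can be illustrated as follows for a population described by sex and age; all figures are hypothetical.

```python
# Sample / microdata: one record per observed individual.
sample = [
    {"sex": "F", "age": 34},
    {"sex": "M", "age": 61},
]

# Contingency: counts of corresponding people.
contingency = {"M": 48_500, "F": 51_500}

# Distribution: the same kind of information expressed as proportions.
distribution = {"M": 0.485, "F": 0.515}

# Statistical moment: a single synthetic indicator, e.g., the mean age.
mean_age = 41.2

# Expert knowledge: second-hand estimates without a clearly identified source.
expert_knowledge = {"share_of_retired_people": "roughly one in five"}
```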