Generating a Two-Layered Synthetic Population for French Municipalities: Results and Evaluation of Four Synthetic Reconstruction Methods

This article describes the generation of a detailed two-layered synthetic population of households and individuals for French municipalities. Using French census data, four synthetic reconstruction methods associated with two probabilistic integerization methods are applied. The paper offers an in-depth description of each method through a common framework. A comparison of these methods is then carried out on the basis of various criteria. Results show that the tested algorithms produce realistic synthetic populations, with the most efficient synthetic reconstruction methods assessed being the Hierarchical Iterative Proportional Fitting (HIPF) and relative entropy minimization algorithms. Combined with the Truncation Replication Sampling allocation method for performing integerization, these algorithms generate household-level and individual-level data whose values lie closest to those of the actual population.


Introduction
Agent-Based Models (ABMs) have grown in popularity since the 's and are now applied in a range of sectors: healthcare (Tomintz et al. ; Edwards & Clarke ), economic policy evaluation (Avram et al. ; Sutherland & Figari ), geography (O'Sullivan ), and transport (Kickhöfer & Kern ; Hörl et al. ). These models require comprehensive data on the demographic and socioeconomic characteristics of individuals and households. However, for privacy reasons, no complete dataset can be compiled on the socio-demographic characteristics of individuals at the small geographic scale. To perform a microsimulation, one necessary step consists of generating a "synthetic population" that is representative of the actual population. During this process, the characteristics of (all) the individuals within a given study area are normally inferred from the characteristics of individuals in the sample from that area, as well as from the marginal distributions (aggregate data). The resulting synthetic population is a simplified microscopic representation of the actual population because only the variables of interest are reproduced (Chapuis & Taillandier ). Most approaches developed to generate a synthetic population have focused on deriving either individual-centred or household-centred populations. For example, Iterative Proportional Fitting (IPF), which is by far the most widely used algorithm for generating a synthetic population, does not yield populations linking households and individuals (and thus controlling at both levels); this algorithm outputs a population yet does not link individual characteristics to household information. In many cases, this absence of a link clearly constitutes a shortcoming, since an individual's behaviour also depends on the characteristics of his or her household.

Literature Review

The methods utilized to generate a synthetic population can be grouped into three main categories: Synthetic Reconstruction (SR), Combinatorial Optimization (CO), and Statistical Learning (SL) (Sun et al. ). These methods will be described and compared hereinafter.

Synthetic reconstruction
This category of methods is the most widely used to generate synthetic populations. A synthetic population is produced according to a two-step procedure: fitting and allocation. The fitting step involves assigning positive weights to the individuals and households contained in the sample with the resulting weights typically being non-integers. During the allocation step, these non-integer weights are converted into integer weights in order to replicate individuals and households.
SR methods are deterministic, meaning that for a given sample, the weights obtained during the fitting step never vary. The prerequisite to applying SR methods is to possess both a sample and aggregate data. The underlying assumptions here are twofold: the sample represents the true correlation structure among the attributes (Farooq et al. ); and the interactions present in the sample are, to a great extent, preserved for the synthetic agents (Müller & Axhausen ). The sample therefore needs to be consistent, representative and composed of at least one observation for each type of individual in the actual population.
One commonly used SR technique is Iterative Proportional Fitting (IPF) (Beckman et al. ; Pritchard & Miller ), which adjusts a contingency table constructed from the sample so as to match marginal distributions.
In its original formulation, IPF cannot simultaneously estimate both household-level and individual-level attributes. Some IPF-based algorithms have attempted to address both household and individual attributes (Pritchard & Miller ). However, in all these studies, the joint distribution of household-level and individual-level attributes is fitted either separately or sequentially, which fails to guarantee consistency between the two levels. Another approach consists of a fitting stage using IPF and a simulation stage where individuals are grouped into households with a household allocation procedure using the concepts of "spouse matching" and "kids matching" (Rich ).
In order to generate a two-layered synthetic population, four main algorithms have been proposed: Iterative Proportional Update (IPU) (Ye et al. ), Hierarchical Iterative Proportional Fitting (HIPF) (Müller & Axhausen ; Müller ), relative entropy minimization (ent) (Lee & Fu ), and Generalized Raking (GR) (Deville et al. ). In effect, these techniques generate populations of individuals grouped into households by computing household-level weights that satisfy the marginals at both the household and individual levels. Such algorithms prove to be the most appropriate for the case study presented below, considering the available input data, and will be presented in greater detail in the third section.

Combinatorial optimization
The second category of approaches falls under Combinatorial Optimization (CO) techniques. CO-based techniques are two-layered since they can directly generate a list of households and individuals (Ma & Srinivasan ).
Similar to SR methods, CO requires information at both the sample and marginal levels, with the synthetic population being obtained by replicating individuals (without explicitly determining the joint distribution across all controlled attributes). But unlike SR methods, Combinatorial Optimization is a stochastic process. The data requirements for CO methods are less restrictive than those for SR methods (Templ et al. ), though on the other hand they do suffer from computational complexity when the population size is large (Lee & Fu ). A description of this method has been given by Voas & Williamson ( ) and Templ et al. ( ).

Statistical learning
The third approach available to generate a two-layered synthetic population is Statistical Learning (SL), also known as the simulation-based approach. SL focuses on the joint distribution of all attributes in the sample by directly estimating a probability for each combination, including those not observed in the sample (Sun et al. ).
SL methods offer greater flexibility in terms of data requirements and data sources; in general, they display good performance both in treating the lack-of-heterogeneity problem encountered in SR and CO (Sun et al. ) and with small samples (Borysov et al. ; Sun et al. ). However, a major drawback of SL methods is their inability to satisfy the conditional distributions and the marginal distributions of all variables simultaneously. During the population synthesis process, when marginals are available, it is indeed necessary to precisely match the observed marginal distributions with the population generated at the zonal level. Some of the two-layered SL-based algorithms derived for synthetic population generation include: the hierarchical Markov Chain Monte Carlo method (hMCMC) (Farooq et al. ), the Bayesian Networks-based method (Sun & Erath ; Zhang et al. ), hierarchical mixture modeling (HM) (Sun et al. ), and deep generative modeling based on a Variational Autoencoder (VAE) (Borysov et al. ). Ye et al. ( ) proposed a tensor decomposition method to guarantee consistency between three levels of constraints: individual, household and enterprise.

A comparison of methods
Two-layered SR methods (IPU, HIPF, ent and GR) can generate high-quality two-layered synthetic populations that closely represent the actual population. Nonetheless, such techniques require a major preprocessing effort and are very stringent in terms of data needs. In fact, they require a representative sample and aggregate statistics at both the individual and household levels (Chapuis & Taillandier ). CO methods are less restrictive on data quality than SR methods for generating a two-layered synthetic population, yet they cannot always guarantee an optimal solution with respect to matching the marginals and, moreover, require substantial computing time. The CO category is thus better suited to generating small synthetic populations. SL methods are able to produce consistent results even for small sample sizes and can, if necessary, generate a synthetic population from sample data alone. On the other hand, SL methods cannot exactly satisfy the marginal distributions of the variables, which constitutes a major drawback when these marginals are available. In some configurations, combining SR and SL methods could be the most relevant option for obtaining an accurate synthetic population satisfying the marginal distributions, for example combining a Variational Autoencoder model with IPF and quota-based random sampling (Borysov et al. ) or Bayesian Networks with Generalized Raking techniques (Sun & Erath ).
The aim of this paper is to generate two-layered synthetic populations using French census data. The particularity of this dataset is the availability of a representative sample at the municipality level. The sample size is roughly % of the total municipal population; furthermore, aggregate statistics for both individual and household attributes are available, which ensures data consistency. The data requirements for using SR methods to generate a synthetic population of individuals and households are therefore met. Hence, SR methods are best suited, since neither CO nor SL methods would provide any advantage over SR: CO methods would in fact limit the population size that can be generated, while SL methods prevent fitting to the marginals. The SR methods adopted to generate the synthetic population are detailed in the next section.

Synthetic Population Generation Methodology
The synthetic population is generated using a two-step procedure: fitting and allocation (see Figure ). In the first subsection, the four two-layered SR methods (Iterative Proportional Update (IPU), Hierarchical Iterative Proportional Fitting (HIPF), relative entropy minimization (ent) and Generalized Raking (GR)) available for use during the first step are presented. The second subsection then describes the two methods (the proportional probabilities approach and the TRS method) that convert the non-integer weights resulting from the fitting step into integer weights in order to replicate individuals and households.

The fitting step
The objective of this step is to find the vector of household weights $W = (w_h)$, $h = 1 \ldots n_h^s$, where $n_h^s$ is the number of households in the sample and $w_h$ is a positive real number measuring the importance of the corresponding household. This weight will be used in the allocation step to repeat or draw the corresponding household. Marginals are modeled as constraints on the weight vector.
We propose formulating this problem within the framework of the regularization of ill-posed inverse problems in order to clarify the comparison among the various algorithms. From this point of view, the objective here is to find $W$ that satisfies the marginal constraints, i.e.:

$$O_H W = M_H, \qquad O_I W = M_I$$

where $M_H$ (resp. $M_I$) is the $n_{mh} \times 1$ (resp. $n_{mi} \times 1$) vector of marginals for the households (resp. individuals), with $n_{mh}$ constraints on the households and $n_{mi}$ on the individuals, and $O_H$ (resp. $O_I$) is the $n_{mh} \times n_h^s$ (resp. $n_{mi} \times n_h^s$) occurrence matrix that codes the sample according to the marginals on the households (resp. on the individuals). The next section on IPU will detail these equations.
This problem is ill-posed inasmuch as there are more variables than constraints ($n_m = n_{mh} + n_{mi} < n_h^s$). The constraints are often inconsistent with one another. The solution therefore must be regularized. An intuitive regularization consists of seeking a solution that is not too far removed from the sample, i.e. a vector solution $W$ that is not too distant from the vector of prior weights, $W^{prior}$, which models the sample. The following optimization problem serves to translate this idea:
$$\min_{W \ge 0} \ \mathrm{dist}(W, W^{prior}) \quad \text{subject to} \quad O_H W \simeq M_H, \ O_I W \simeq M_I$$

with dist being a measurement of the distance between the vector $W$ and the prior weight vector $W^{prior}$. Without consistent information on sampling, all the components of the prior weight vector take the same value ($W^{prior} = \mathbf{1}$). The proposed methods tolerate small deviations from the marginal constraints, which is why the constraints are no longer strict equalities. In the following, $O$ denotes the concatenation of the occurrence matrices $O_H$ and $O_I$, and $M$ the concatenation of the marginal vectors $M_H$ and $M_I$.
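To make the occurrence matrices concrete, here is a toy R illustration (all data and category names are assumptions of this example, not taken from the case study): two sample households are coded against one household-level marginal (household size) and one individual-level marginal (sex).

```r
# Toy sample: two households; household-level marginal = household size,
# individual-level marginal = sex (all values assumed for illustration)
sample_households <- data.frame(hh_id = c(1, 2), size = c(1, 2))
sample_persons   <- data.frame(hh_id = c(1, 2, 2), sex = c("M", "F", "M"))

# O_H: one row per household-level category, one column per sample household
O_H <- rbind(size1 = as.numeric(sample_households$size == 1),
             size2 = as.numeric(sample_households$size == 2))

# O_I: one row per individual-level category; entries count occurrences
# of the category within each household
O_I <- rbind(male   = tapply(sample_persons$sex == "M", sample_persons$hh_id, sum),
             female = tapply(sample_persons$sex == "F", sample_persons$hh_id, sum))

O <- rbind(O_H, O_I)  # concatenated occurrence matrix used in the fitting step
```

The fitting step then seeks weights $W \ge 0$ such that $O W$ approaches the concatenated marginals $M$.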
The Synthetic Reconstruction (SR) methods described in this paper, i.e. Iterative Proportional Update (IPU), Hierarchical Iterative Proportional Fitting (HIPF), Relative Entropy Minimization (ent) and Generalized Raking (GR), can all be interpreted within this common framework: these methods are in fact different views of the regularized version of the ill-posed Problem . For ent and GR, the minimization is explicit, though the distance measurement differs. For IPU and HIPF, the minimization is implicit.
• Iterative Proportional Update offers a geometric point of view of Problem . The IPU method starts from the sample, with initial weights being uniform. This vector is projected onto the hyperplane corresponding to the household constraints before being projected onto a second hyperplane corresponding to the constraints on individuals. The process is iterated, in keeping with the purpose of this algorithm: finding a solution which is not too far removed from the initial sample and consistent with the constraints. It can be interpreted as a heuristic solution of Problem . IPU has been proven to have some limitations when generating a synthetic population at both the individual and household levels. In particular, Ye et al. ( ) have shown that, theoretically, IPU is unable to converge to an optimal population distribution that simultaneously satisfies the constraints at the individual and household levels. The authors have proposed an extension of IPU in order to address these failures. However, in our use case, IPU generates suitable solutions because the sample is large.
• Hierarchical Iterative Proportional Fitting presents a dual view of Problem : it minimizes the distance to both constraint types on households and individuals, starting from uniform initial weights. The weights are modified as little as possible while optimizing the distance to the constraints.
• Relative Entropy Minimization conveys a probabilistic point of view of Problem . The objective is to determine a probability, $p_h$, associated with each household, which can be interpreted as the weight, $w_h$, divided by the number of households in the target population, $n_h$. The solution must satisfy the marginal constraints and minimize the relative entropy to a prior, nearly uniform probability. By dividing the weight vectors in Problem by the size of the target population and then replacing the distance operator with the relative entropy, the entropy formulation can be derived.
• Generalized Raking provides an optimization point of view of Problem . It proposes solving this problem by setting up the Lagrangian.
After this more comprehensive introduction of the framework for treating inverse problems, the various methods will now be presented in greater depth.

Iterative proportional update
The Iterative Proportional Update (IPU), developed by Ye et al. ( ), is an iterative heuristic algorithm that simultaneously controls individual and household-level marginals during the fitting procedure. The corresponding mathematical optimization problem can be formulated with the following objective function (Ye et al. ):

$$\min_{W} \ \sum_{j} \left( \frac{\sum_{h} o_{j,h}\, w_h - m_j}{m_j} \right)^2$$

subject to $w_h \ge 0$, where: $h$ denotes a household ($h = 1, 2, \ldots, n_h^s$); $j$ denotes the constraint or population characteristic of interest ($j = 1, 2, \ldots, n_m$); and $o_{j,h}$ represents the frequency of constraint $j$ in household $h$ (i.e. the occurrence), as one element of the occurrence matrix $O$. Moreover, $w_h$ is the weight attributed to the $h$-th household and $m_j$ the value of constraint $j$.
The objective function measures the inconsistency between the weighted sample and the given constraints. At the first iteration, all households have a weight of one. IPU typically starts by adjusting weights to satisfy the household constraints first, then updates them to satisfy the individual constraints. At each iteration, a statistical measurement $\delta$ provides a goodness-of-fit result; it is the average of the absolute value of the relative difference between the weighted sums and the constraints, i.e.:

$$\delta = \frac{1}{n_m} \sum_{j=1}^{n_m} \frac{\left|\sum_{h} o_{j,h}\, w_h - m_j\right|}{m_j}$$

with $n_m = n_{mh} + n_{mi}$ being the number of marginals.
The gain in fit between two consecutive iterations is then calculated ($\Delta = |\delta_a - \delta_b|$, where $a$ and $b$ index two consecutive iterations). The entire process continues until the gain in fit is negligible or below a preset tolerance level. This tolerance level serves as the convergence criterion for terminating the algorithm (Ye et al. ).
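As an illustration of the procedure described above, the following R sketch implements a simplified IPU-style loop. The function and argument names are hypothetical; the paper's experiments rely on the MultiLevelIPF package rather than this code.

```r
# Simplified IPU-style fitting loop (illustrative sketch only).
# O: occurrence matrix (one row per constraint, one column per sample household)
# m: vector of marginals, in the same row order as O (household constraints first)
ipu_fit <- function(O, m, tol = 1e-6, max_iter = 1000) {
  w <- rep(1, ncol(O))              # all households start with a weight of one
  delta_prev <- Inf
  for (k in seq_len(max_iter)) {
    for (j in seq_len(nrow(O))) {   # adjust weights one constraint at a time
      sel <- O[j, ] > 0
      w[sel] <- w[sel] * m[j] / sum(O[j, ] * w)
    }
    # goodness of fit: mean absolute relative difference to the marginals
    delta <- mean(abs(O %*% w - m) / m)
    if (abs(delta_prev - delta) < tol) break  # negligible gain in fit
    delta_prev <- delta
  }
  w
}
```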
Hierarchical iterative proportional fitting
The HIPF algorithm (Müller & Axhausen ; Müller ) converts the household-level weights into individual-level weights and vice versa. It also proceeds in iterations; the procedure alternates between fitting to the household-level and individual-level control totals and adjusting the individuals-per-household ratio by relative entropy minimization (Müller & Axhausen ).
In the algorithm, $h$ denotes a household, $i$ an individual, $k$ the iteration number, $w_h$ the weight attributed to the $h$-th household, and $w_i$ the weight attributed to the $i$-th individual.
At the first iteration, all households have a weight of one. For all households, weights are computed to fit the household-level control totals and converted to individual-level weights. These weights are then used as initial values to estimate new individual-level weights fitting the individual-level control totals. The next step (Step ) is to convert these new individual-level weights back to household-level weights, by considering that the weight of each household equals the average of the weights of the individuals in that household.
The sixth step consists of recomputing new household weights ($w_h^{(k+5)}$) by minimizing the relative entropy from the weights obtained in Step ($w_h^{(k+4)}$) to these latest weights, as defined below:

$$\min \ \sum_h w_h^{(k+5)} \ln \frac{w_h^{(k+5)}}{w_h^{(k+4)}}$$

subject to the following constraints:

$$\sum_h w_h^{(k+5)} = n_h, \qquad \sum_h n_{mi}(h)\, w_h^{(k+5)} = n$$

where $n_h$ represents the household total, $n$ the individual total, and $n_{mi}(h)$ the number of individuals in household $h$. This process is then repeated until convergence.
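The household/individual weight conversions at the heart of HIPF can be sketched in a few lines of R. The helper names are hypothetical and the entropy-minimizing rescaling step is omitted here.

```r
# HIPF-style conversions between weight domains (hypothetical helper names).
# hh_of_ind: for each individual, the index of his or her household.
hh_to_ind <- function(w_h, hh_of_ind) {
  w_h[hh_of_ind]                           # each member inherits the household weight
}
ind_to_hh <- function(w_i, hh_of_ind) {
  as.vector(tapply(w_i, hh_of_ind, mean))  # household weight = mean of member weights
}
```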

Entropy minimization
A mathematical formulation using relative entropy minimization as the objective function to generate synthetic data was proposed by Bar-Gera et al. ( ) and Lee & Fu ( ). According to this approach, both household-level and individual-level characteristics are contained in the constraints. The entropy optimization (ent) method described in this section closely follows that of Lee & Fu ( ).
Let's consider the following notations: $n_h$ and $n$ are respectively the total number of households and the total population in the research area; $n_v$ and $n_u$ are respectively the number of household-level and individual-level characteristics (factors); $\alpha$ and $\beta$ are two subsets of respectively $\{1, 2, \ldots, n_v\}$ and $\{1, 2, \ldots, n_u\}$ ($\alpha$ and $\beta$ will be used to model the marginals); $x_v^h$ represents one household-level characteristic and $x_u^i$ represents one individual-level characteristic; $x_\alpha^h$ represents the household-level characteristics associated with subset $\alpha$. Using the above notations, $p_\alpha(h_\alpha)$, $E_\beta(nx_\beta^i)$, and $p_{[h,nx^i]}$ are defined as follows:
• $p_\alpha(h_\alpha)$ = joint distribution across household-level characteristics $x_\alpha^h$, where $p_\alpha(h_\alpha) = m_h(h_\alpha)/n_h$ and where $m_h(h_\alpha)$ is the aggregate summary count across household-level characteristics $x_\alpha^h$;
• $E_\beta(nx_\beta^i)$ = expected number of people in one household across person-level characteristics $x_\beta^i$;
• $p_{[h,nx^i]}$ = joint probability of a household having household-level characteristics $h$ and member counts $nx^i$ across individual-level characteristics.
This formulation is an implementation of Problem , in considering the probabilities $p_h = w_h / n_h$: the solution must satisfy the marginal constraints while minimizing the relative entropy to the prior probabilities.

Generalized raking

The Generalized Raking (GR) approach (Deville et al. ) stems from survey calibration, where the aim is to estimate a quantity of interest $y$ over a population $U$ from a sample $s$ with prior weights $w_h^{prior}$ (the inverses of the prior inclusion probabilities). However, we have no information about $y$ and only have $n_m$ auxiliary variables $X_h = (x_{h1}, \ldots, x_{hj}, \ldots, x_{hn_m}) \in \mathbb{R}^{n_m}$ for each member of the population. Also, the vector-valued population total $M := \sum_{h \in U} X_h$ is known accurately (i.e. the auxiliary variables and the vector-valued population total correspond respectively to the occurrence matrix and the marginals vector in Problem ). In order to estimate $y$, we must seek new weights, denoted $w_h$, by modifying the prior weights $w_h^{prior}$ in light of the auxiliary information while remaining close to the original weights. Let's consider a distance function $G$ to minimize the gap between $w_h$ and $w_h^{prior}$ subject to the constraints $\sum_{h \in s} w_h X_h = \sum_{h \in U} X_h = M$. $G$ must be positive and strictly convex, with $G(1) = G'(1) = 0$ and $G''(1) = 1$. In the context of synthetic population generation, we hold a sample to match the aggregate data, and the auxiliary variables are the marginals. The objective then is to minimize the difference between the initial weights and the final weights in order to fit the constraints at both the individual and household levels. The objective function is given by the following formula:

$$\min_{W} \ \sum_{h \in s} w_h^{prior}\, G\!\left(\frac{w_h}{w_h^{prior}}\right) \quad \text{subject to} \quad \sum_{h \in s} w_h X_h = M$$
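As a sketch of the optimization view shared by ent and GR, the relative entropy fitting step can be posed as a convex program with the CVXR package. This is an assumption of the illustration, not the paper's implementation, and the constraints are written as strict equalities for simplicity.

```r
# Relative entropy fitting step as a convex program (illustrative sketch).
library(CVXR)

fit_entropy <- function(O, M, w_prior = rep(1, ncol(O))) {
  w <- Variable(ncol(O))                          # one weight per sample household
  # Kullback-Leibler divergence to the prior weights:
  # kl_div(x, y) = x * log(x / y) - x + y, summed over households
  objective <- Minimize(sum(kl_div(w, w_prior)))
  # Marginal constraints at both levels (strict equalities here; the methods
  # described in the paper tolerate small deviations)
  constraints <- list(O %*% w == M, w >= 0)
  result <- solve(Problem(objective, constraints))
  as.vector(result$getValue(w))
}
# Generalized Raking keeps the same constraints but swaps the objective for
# another distance function G (linear, raking ratio, logit, truncated linear).
```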

Generation process comparison across the four methods .
Iterative Proportional Update (IPU), Hierarchical Iterative Proportional Fitting (HIPF) and relative entropy minimization (ent) all generate populations of individuals grouped into households by computing household-level weights that satisfy the marginals at both the household and individual levels. The HIPF algorithm constantly switches between the household and individual domains, employing an entropy-optimizing adjustment step (Müller & Axhausen ). With the IPU and ent algorithms, weights are adjusted to satisfy household-level constraints first and then updated to satisfy individual-level constraints. The difference between IPU and ent lies in the procedure applied to adjust weights for a given individual-level control: if a household contains two or more individuals of the same category, ent reweights this household more heavily than a household with just one individual from this category, while IPU makes no distinction (Müller ). The Generalized Raking method directly adjusts weights to satisfy both individual and household-level constraints.

The allocation step
All the methods described above generate fractional weights for households and individuals, making the results difficult to analyse. To construct the final population, we thus need to integerize these weights. The integerization process refers to converting these fractional weights into integer weights. To achieve this, two probabilistic methods are used: the proportional probabilities approach and the truncate replicate sample (TRS) method. According to Lovelace et al. ( ), both of them outperform deterministic methods (simple rounding, threshold approach) in terms of final population counts and accuracy.
The proportional probabilities approach
The proportional probabilities (PP) approach considers fractional weights as probabilities (Lovelace et al. ; Joubert ). For example, the probability $p_h$ that a given household lies in the final synthetic population is given by $p_h = w_h / \sum_{h'} w_{h'}$. The higher the fractional weight, the more likely an individual/household is to appear in the final population. As a result, an individual with a very high weight may be replicated several times, while one with a very low weight might not be included in the final synthetic population.
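A minimal R sketch of the PP approach (the function name is hypothetical): a target number of households is drawn with replacement, with probabilities proportional to the fractional weights.

```r
# Proportional probabilities integerization (illustrative sketch).
pp_integerize <- function(w, n_target) {
  draws <- sample(seq_along(w), size = n_target, prob = w / sum(w), replace = TRUE)
  tabulate(draws, nbins = length(w))  # integer replication count per sample household
}
```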

The TRS approach
The TRS approach (Lovelace & Ballas ) combines deterministic and probabilistic sampling in order to generate integer weights according to a three-step process: truncation, replication, and sampling.
The truncation step yields integer values by removing all information to the right of the decimal point. The decimal remainders (between 0 and 1) are then kept. As an illustration, a household with a weight of 2.7 will have a truncated value of 2; its decimal remainder is 0.7.
During the second step, individuals/households are replicated depending on the integer weights obtained during the truncation step. Only truncated weights greater than zero are replicated. For example, the household with a weight of 2.7 will be replicated 2 times; another household with a weight of 0.4 would not be replicated in this step (its truncated value is 0). When performing truncation and replication, no chance of oversampling exists (i.e. the sum of all integer weights will always be less than the population size).
During the last step, only the decimal weight remainders are used, in applying a weighted random sampling without replacement. The remaining individuals/households are selected from the entire sample, with selection probabilities set equal to the decimal weight remainders. In our example, the household with the starting weight of 2.7 will have a 0.7 probability of being chosen again, while the other household will have a 0.4 probability.
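The three TRS steps translate directly into a short R sketch (the function name is hypothetical, and edge cases such as having more units to draw than non-zero remainders are ignored here).

```r
# Truncate-replicate-sample integerization (illustrative sketch).
trs_integerize <- function(w) {
  w_int <- floor(w)                 # truncation: drop everything after the decimal point
  remainder <- w - w_int            # decimal remainders, between 0 and 1
  n_extra <- round(sum(remainder))  # number of households still to be drawn
  # weighted random sampling without replacement, probabilities = remainders
  extra <- sample(seq_along(w), size = n_extra, prob = remainder, replace = FALSE)
  w_int[extra] <- w_int[extra] + 1  # increment counts of the sampled households
  w_int                             # integer replication count per sample household
}
```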
This section has demonstrated how four SR methods function in abstract terms; a test scenario is now needed to conduct a practical comparison. The next section will describe the case study and data implemented.

Case Study
The performance of the various methods described above will now be assessed using data drawn from the French census. This dataset has been collected by the French National Institute of Statistics and Economic Studies (INSEE). Since 2004, this census has covered all municipalities and is valid over a five-year period. By compiling successive five-year surveys, an array of population statistics can be obtained. To build a more robust database, the collected data were then adjusted to a single reference date, thus ensuring that all municipalities were treated equally. This reference date was set on January 1st of the median five-year survey period.

The data provided by INSEE are available in two distinct forms: a sample of individuals and households, and control variables, both at the level of an IRIS (acronym for "aggregated units for statistical information"), which represents the basic unit for the dissemination of intra-municipal data. Municipalities with over 10,000 inhabitants, and a large proportion of those with 5,000 to 10,000 inhabitants, are divided into several IRIS units; by extension, all municipalities not divided into IRIS units constitute IRIS units in themselves.
We are specifically using census data from the Nantes Urban Area (NUA) from (these data were collected from to ). The total population of the NUA was approximately , individuals, residing in , households within IRIS or equivalent units. The sample included , individuals and , households. Each observation in the sample represents a unique individual with his or her personal characteristics, as well as the household and main residence characteristics. Table describes the attributes used in the generation process.

Descriptive statistics
The next section will compare the four previously described approaches to generating a synthetic population of households and individuals: IPU, HIPF, ent and Generalized Raking. For each of these, we have used the proportional probabilities (PP) and truncate replicate sample (TRS) methods to integerize the weights. We have thus evaluated not only the performance of the four generation approaches but also that of the two integerization techniques.
Two main aspects can be considered when evaluating the accuracy of a synthetic population: internal validation and external validation. Internal validation consists of comparing the variables of the synthetic population with the marginals in order to test the reliability of the generated data (e.g. does the estimated distribution of family composition correspond to the distribution given by the census data?). In other words, internal validation tests the ability of the population to fit the aggregate data. A validation is external if the estimated variables of the synthetic population are compared with another data source not used in the estimation process. Our case study does not feature a data source external to the French census at the IRIS level. Hence, we have focused solely on internal validation.
According to the literature, internal validation can be carried out on either variables (marginals are compared with corresponding ones in the synthetic population), cells, or the entire synthetic population. Many quantitative methods are available for internal validation (Lovelace et al. ; Timmins et al. ). The following performance metrics have been considered herein (a computational sketch follows the list):
• The coefficient of determination $R^2$ is the square of the Pearson correlation; it is a quantitative indicator that varies between 0 and 1 and reveals how closely the simulated values fit the census data. An $R^2$ value of 1 denotes a perfect fit, while an $R^2$ value close to zero suggests no correspondence between constraints and simulated values (Lovelace et al. ).
• Total absolute error (TAE) and standardized absolute error (SAE). TAE is the sum of the absolute differences between simulated values and the marginals, and SAE is TAE divided by the total population.
• Standardized Root Mean Squared Error (SRMSE). This indicator focuses on error dispersion and is used to evaluate the goodness of fit between the estimated synthetic population and the marginals; it is one of the most common indicators used (Lee & Fu ; Lovelace et al. ; Sun & Erath ; Saadi et al. ). A zero value indicates a perfect match between the census data and the synthetic population, while a high SRMSE value suggests a poor fit.
• The Bland-Altman method. Widely employed in healthcare studies to compare two measurements of the same variable, this graphical method can also be used to complement the other indicators (Timmins et al. ). The Bland-Altman method consists of plotting the difference between simulated and census counts versus the average of the two counts.
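The first three indicators can be computed with a few lines of R; the sketch below assumes `sim` and `cen` are vectors of simulated and census counts over the same categories (argument names are assumptions of this example).

```r
# Internal validation metrics (illustrative sketch).
validation_metrics <- function(sim, cen) {
  r2    <- cor(sim, cen)^2                        # squared Pearson correlation
  tae   <- sum(abs(sim - cen))                    # total absolute error
  sae   <- tae / sum(cen)                         # TAE standardized by population size
  srmse <- sqrt(mean((sim - cen)^2)) / mean(cen)  # standardized root mean squared error
  c(R2 = r2, TAE = tae, SAE = sae, SRMSE = srmse)
}
```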

Results and Discussion
This section presents the results of the internal validation procedure. The four SR algorithms have been implemented with the open-source MultiLevelIPF package for the R statistical software. A synthetic population has been generated for each IRIS of the NUA.
The validation results show that all the proposed methods produce synthetic populations that are representative of the actual population, yet some methods prove to be more efficient. The first indicator, $R^2$, revealed that even though all tested methods performed well, the TRS integerization method yielded better results than the proportional probabilities method. Moreover, the results of the Generalized Raking method were less accurate than those of the other three generation methods. A more detailed description of the $R^2$ results follows:
• HIPF or IPU combined with TRS (HIPF+TRS or IPU+TRS) yields coefficients above . for all individual and household-level variables;
• HIPF or IPU combined with the proportional probabilities method (HIPF+PP or IPU+PP), and entropy minimization combined with either the TRS or proportional probabilities method (ent+TRS or ent+PP), yield coefficients greater than or equal to . for all individual and household-level variables;
• GR combined with either the TRS or proportional probabilities method (GR+TRS or GR+PP) yields coefficients greater than or equal to . for all individual and household-level variables.
The $R^2$ validation method merely provides an indication of fit and is influenced by outliers. A further analysis based on three other indicators (TAE, SAE and SRMSE) is therefore displayed in Table . These results confirm that all methods are globally efficient, but entropy minimization and HIPF do outperform the others. Based on Table , the methods can be ranked in the following order from most to least accurate:

• entropy minimization, HIPF, IPU and GR for the individual level;
• HIPF, entropy minimization, IPU and GR for the household level;
• TRS, then proportional probabilities, for the integerization methods.
According to all the validation indicators considered ($R^2$, TAE, SAE and SRMSE), it can be concluded that only slight differences exist between the entropy minimization and HIPF generation methods, and that these two methods outperform IPU and GR. For the integerization methods, the TRS approach outperforms the proportional probabilities approach. HIPF and entropy minimization combined with TRS therefore provide the best possible approximation of the actual population.

IRIS-level analysis
In addition to the global analysis given above, a local analysis, IRIS by IRIS, has been conducted in order to identify the zones with the highest errors (i.e. the IRIS with the highest SAE values). For each method tested, whether at the individual or household level, three IRIS always stood out. A qualitative analysis of the constraints from these three IRIS underscores their particular characteristics. IRIS and are activity zones with a small number of households and individuals. The population of IRIS (resp. ) is (resp. ) households and (resp. ) individuals. In these two IRIS, % (resp. %) of households have just one member; also, most of the inhabitants of these IRIS are men ( % (resp. %)) and belong to the - age group. IRIS is a residential area with , households and , individuals. However, a significant portion of its territory is occupied by a psychiatric hospital. In this IRIS, % of the households are single-member and % of the individuals are between and years old. In conclusion, the simulation runs prove to be accurate for all IRIS except a few, owing to the particular population breakdown of these IRIS.

Bland-Altman approach
A Bland-Altman plot analysis of the data has been performed to compare the census and simulated values of each IRIS for a given variable. This graphical method studies the mean difference and constructs limits of agreement (Bland & Altman ). The X-axis corresponds to the mean of the two values, and the Y-axis to the difference between these two values. The limits of agreement are defined by ±1.96 × the standard deviation of the differences. Analysis of the plot can help to identify anomalies such as a systematic overestimation or underestimation of census values by a synthetic reconstruction approach (Kalra et al. ).
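A Bland-Altman plot as described above can be sketched in base R (assuming `sim` and `cen` hold the simulated and census counts per IRIS for one category; names and the conventional 1.96 factor are assumptions of this illustration).

```r
# Bland-Altman plot (illustrative sketch).
bland_altman_plot <- function(sim, cen) {
  avg   <- (sim + cen) / 2
  diffs <- sim - cen
  plot(avg, diffs, xlab = "Mean of simulated and census counts",
       ylab = "Simulated minus census counts")
  abline(h = mean(diffs), lty = 1)                               # mean difference
  abline(h = mean(diffs) + c(-1.96, 1.96) * sd(diffs), lty = 2)  # limits of agreement
}
```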
Our analysis has been applied to the possible cases ( categories of variables × synthetic reconstruction approaches). The average of the differences (in both real and absolute terms) between simulated and census values by category has been computed for each method. Figure shows the mean and standard deviation of these mean values for each fitting method. Let us note that the average of the differences between simulated and census values, expressed in real terms, lies close to zero for the HIPF and entropy methods, with a rather low standard deviation. The two measurements listed in Figure would seem to confirm that the HIPF and entropy methods outperform Generalized Raking and IPU.
For purposes of illustration, Figure plots Bland-Altman values for the five categories of the household-level variable "Family composition", generated with the HIPF method (in association with the TRS allocation approach). The Y-axis shows the difference between the two populations (synthetic and actual), while the X-axis depicts the average of the two values. Depending on the IRIS, the simulated values are in some cases higher and in other cases lower than the census values, which suggests that there is no consistent bias. The average of the differences (middle line) is close to zero. % of the data points lie within the 'limits of agreement' (top and bottom lines), which indicates agreement between census and simulated values.

Conclusions
This paper has provided a synopsis of the methods aimed at generating a synthetic population of individuals and households. We offered a detailed description of four synthetic reconstruction methods for the fitting step through the use of a common framework. These methods are Hierarchical Iterative Proportional Fitting (HIPF), Iterative Proportional Update (IPU), Generalized Raking (GR), and relative entropy minimization (ent). Two integerization methods were also discussed, namely proportional probabilities and truncation, replication and sampling (TRS). Next, an evaluation was performed of the most relevant method for generating a two-layered synthetic population. These methods were implemented using the R language. A case study involving the synthesis of agents ( , households, , individuals) from the Nantes Urban Area (western France) was considered, beginning with a sample of , households, including , individuals, and , marginals. The synthetic population was generated with four household-level attributes and six individual-level attributes.
Results were evaluated using four indicators: $R^2$, TAE, SAE, and SRMSE. The validation findings indicate that all methods considered yield good results, i.e. a two-layered synthetic population whose aggregate characteristics lie close to the census marginals. However, some methods output better results than others. For the fitting step, entropy minimization (ent) and Hierarchical Iterative Proportional Fitting (HIPF) prove to be the most efficient methods. For the allocation step, the truncation, replication and sampling (TRS) approach outperforms proportional probabilities.

We believe that the comparison of different synthetic reconstruction (SR) algorithms performed in this paper, with common notations and a common theoretical framework, will facilitate a better dissemination of existing algorithms. This common framework will stimulate the development of new algorithms and position them with respect to existing methods.
The next step is to spatially allocate households, or to enrich the demographic characteristics of households and individuals with other socio-economic variables, such as income, by using other databases (e.g. fiscal databases). The mathematical framework used in this article inspires our current research aimed at proposing data fusion algorithms that enrich the synthetic population.

Appendix: Model Documentation
In this appendix, we describe, in five steps and in greater detail, the approach used in the paper, with an R script. The first step presents the databases used and how to download them. The other steps describe the statistical analyses performed with a toy model. Interested readers can directly contact the authors to obtain the complete code.
For each control variable, we must have an R dataframe. In our case study, we have variables, so we must have R dataframes.
Conditions: Ensure data consistency. Each row of the sample dataframe must represent an individual with his or her personal and household characteristics, a unique personal ID number, a household ID and an IRIS ID. Each row of a control variable dataframe must represent a category, with the number of people/households in this category and the IRIS ID.
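As a toy illustration of these conditions (all identifiers, column names and counts below are assumptions of this example, not the actual INSEE variables):

```r
# Toy layout of the input dataframes (illustrative sketch).
sample_df <- data.frame(
  person_id = 1:3,
  hh_id     = c(1, 1, 2),
  iris_id   = c("A", "A", "A"),
  sex       = c("M", "F", "F"),
  hh_size   = c(2, 2, 1)
)
# One dataframe per control variable: category counts per IRIS
control_sex <- data.frame(
  iris_id  = c("A", "A"),
  category = c("M", "F"),
  count    = c(480, 520)
)
```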
Step : Allocation step. Use of the TRS method. Outputs: a synthetic population of households, to be merged with individuals by household ID.
Step : Validation step. Use of classical performance metrics ($R^2$, TAE, SRMSE and the Bland-Altman method) to compare algorithms.

Notes
An abundance of literature exists on this subject ever since the seminal work by Tikhonov & Arsenin ( ).
In this paper, distance is not intended in its strict mathematical definition.
In the literature on synthetic population generation, this method is often called cross-entropy minimization; from a strictly mathematical point of view, this is not valid. We have chosen to replace the term cross-entropy with relative entropy, also known as the Kullback-Leibler divergence, which is correct and consistent with the notation of D for the measurement of this relative entropy.
https://www.insee.fr/fr/information/2383265, consulted on April .
According to the INSEE Institute, an urban area is a group of adjoining municipalities, without pockets of clear land, encompassing an urban centre (urban unit) providing at least jobs, and whose neighboring rural districts or suburban units (urban periphery) account for at least % of the employed residents working in the centre or in the municipalities attracted by this centre.
For the Generalized Raking approach, four distance functions G can be used: the linear method, the raking ratio method, the logit method, and the truncated linear method. We tested all these functions, but convergence was only achieved for the logit method.
https://github.com/krlmlr/MultiLevelIPF, consulted on April .
We used a computer with × . GHz CPU cores and GB RAM.