Modeling Spatial Contacts for Epidemic Prediction in a Large-Scale Artificial City

Abstract: Spatial contacts among human beings are considered one of the influential factors in the transmission of contagious diseases, such as influenza and tuberculosis. Therefore, representing and understanding spatial contacts plays an important role in epidemic modeling research. However, most current research only considers regular spatial contacts such as contacts at home/school/office, or assumes static social networks for modeling social contacts and omits travel contacts in the epidemic models. This paper describes a way to model relatively complete spatial contacts in the context of a large-scale artificial city, which combines different data sources to construct an agent-based model of the city of Beijing. In this model, agents have regular contacts when executing their daily activity patterns, which is similar to other large-scale agent-based epidemic models. In addition, a microscopic public transportation component is included in the artificial city to model public travel contacts. Moreover, social contacts also emerge in this model due to the dynamic generation of social networks. To systematically examine the effect that the relatively complete spatial contacts have on epidemic prediction in the artificial city, a pandemic influenza disease progression model was implemented in this artificial city. The simulation results validated the model. In addition, the way to model spatial contacts in this paper shows potential not only for improving comprehension of disease spread dynamics, but also for use in other social systems, such as public transportation systems and city level evacuation planning.


Introduction
Transmission of an infectious disease may occur from one person to another by one or more of the following means (Straif-Bourgeois et al. ): direct physical contact (e.g., touching), indirect physical contact (e.g., contaminated food) or vector-borne contact (e.g., a droplet). However, most of the means can be summarized with the term 'spatial contact'. A spatial contact usually occurs between two persons in a geographical space, either an open environment or an interior space, where they can quickly or easily get in touch with each other directly or indirectly. For example, if an infected person coughs or sneezes in a bus, then the droplets containing microorganisms may enter another person's body, which causes a disease to spread. This is considered a transmission through a spatial contact. Based on this definition, spatial contacts among human beings are regarded as one of the most influential factors during the transmission of most diseases (Perez & Dragicevic ) and incorporating the contact patterns into epidemic modeling can bring a deeper understanding of the transmission patterns of a hypothetical epidemic among a susceptible population (Mossong et al. ).
Typical epidemic models are based on mathematical models or agent-based models (Ajelli et al. ). Mathematical models can estimate the speed of a disease outbreak based on the basic reproduction number which depends on the number of adequate contacts (Del Valle et al. ), while the contact details often rely on priori
• System scale and complexity of communication. When the number of agents increases linearly, the communication complexity could increase exponentially, which creates a scalability issue that is hard to deal with (Hawe et al. ). Thus, the current practical solutions mentioned above either use random/predefined contact networks to reduce the number of communications or implement the model on distributed architectures to improve the performance. However, there could be a huge overhead for enabling coordination between agents on distributed architectures as it increases the number of communication messages and leads to a higher communication complexity. As a matter of fact, to balance between performance and accuracy for large-scale agent-based models, reducing communications by simplifying the contact network model is an often-used compromise (see Stroud & Valle ; Parker & Epstein ; Ajelli et al. ; Rakowski et al. ; Bisset et al. ; Ge et al. ).
• The inclusion of a microscopic transportation component in the model. Since there is a lot of research on transport demand modeling which can easily monitor detailed traveling contacts (Zhang et al. , ; Zhao & Sadek ), it seems to be a rather simple task to include it in an epidemic model as it is easy to define a travel activity in the agent's schedule so that there is not much additional information required except the traffic networks. However, this is not the case in simulation practice as the simulation time resolutions in the microscopic traffic model and the epidemic model are not at the same level. Moreover, a large part of the traffic (e.g., by private car) seems to be less useful for studying disease spread, although a crowded bus can be an ideal location for spreading disease.
• The dynamics and unpredictability of social contacts. Social contacts, in the form of joint activities, can frequently change in real life and influence an individual's plans and schedules. As the plans for each person who will participate in a joint social activity have to be synchronized in both time and location, it is a more complicated task than it may seem (Ronald et al. ).
• Friendship formation. Friendship, as a special form of social networks, has many other characteristics, such as the 'small world effect' and the power-law distribution of the number of degrees of connectivity (Singer et al. ; Hamill & Gilbert ). To include these characteristics in a large-scale model, efficient algorithms and approaches which balance efficiency and memory usage are required.
The above discussion motivates the need to design novel algorithms and approaches to model spatial contacts including travel contacts and social contacts in a large-scale epidemic model. In this paper, we tried to achieve this in the context of a large-scale model of the city of Beijing.
In detail, the contributions and organization of the rest of the paper are as follows:
• Firstly, we constructed a model of the city of Beijing including four key model components by a data-driven approach in section . This artificial city is considered as the basis for modeling disease spread. In total million agents and million locations were modeled. The major algorithms and approaches are introduced in this section as well.
• Secondly, we presented a classification of spatial contacts and statistically analyzed the modeled spatial contacts by presenting a set of simulation results in section .
• Finally, we implemented a disease model in this artificial city and validated the model results in section , by which we show the effect of the modeled spatial contacts for epidemic prediction.

Agent-based Artificial City
What is an artificial city
An artificial city, as a city-scale artificial society, is a multi-agent simulation system where a set of autonomous agents carry out activities in parallel, move around the environment locations and communicate with each other (Sawyer ). It requires individual agents representing humans that have daily behaviors, together with locations (households, schools, workplaces, hospitals, stations, etc.) that have a function for agents' activities. Based on the artificial city model, fundamental collective behaviors are seen to "emerge" from the interaction of individual agents following a few simple rules (Epstein & Axtell ). There are many research topics relevant to modeling an artificial city, such as using agent-based modeling for urban simulation (Navarro et al. ), simulation of residential dynamics in the city (Bhaduri et al. ), and the dynamics of pedestrian behavior (Pelechano et al. ).
In this paper we define the artificial city we construct as a set of located agents and geo-referenced locations, together with a public transportation system. Located means that the agent has a location associated at any time in the simulation, both when performing activities in physical locations (for example, eating in a restaurant), and during traveling (walking or riding on a bus). As a matter of fact, every object in this artificial city has a geographic reference (longitude and latitude) assigned to it in order to locate it, either static (physical locations) or dynamic (agents). This definition gives a strict requirement for the completeness and consistency of data required for modeling.

Data preparation
Beijing, as the context of this case study, is the capital of the People's Republic of China and the second largest Chinese city by urban population. The population as of was . million.

In the preparation phase of this research, the difficulty for this case study is the source of the initial data, such as population and environment. Large-scale real world data sets are expensive to collect and it is difficult to obtain high fidelity ground truth for them (Bernstein & O'Brien ). Thus, there is a trilemma of inadequate data from real-world datasets, statistical simulation models, and agent-based simulation models. This difficulty is reflected in other similar research as well, such as the model of the spread of SARS in Beijing conducted by Huang ( ).

Table : Statistics of the synthetic population and physical locations (Population: number of agents; Age: scope of age; Location: number of physical locations; Families: number of families).

To solve this plight, we first acquired the raw data from independent research by Ge et al. ( ). They adopted a mixed method which collects real data (statistical data and geographic information) and uses it to generate the other minimum required data by algorithms, namely the synthetic population and the physical locations. More detailed information about the raw data on the synthetic population and physical locations is as follows:
• The statistical population and location data were collected from the National Bureau of Statistics (NBS) at the city scale, and from the Municipal Bureau of Statistics (MBS) at the district scale, which include population, age-sex distribution, number of children distribution among families, family size distribution and geographic distribution of families among districts.
• With the algorithms in Ge et al. ( ), each individual person is specified with the attributes of age, gender, family role, family index and social role to specify this individual's demographic characteristics. The family role can be defined as a set {grandparent, parent, child}. The social role is defined as a set {infant, student, worker, retired}. This design is based on findings from the China census data (available at http://www.stats.gov.cn) that households with more than three generations are a small proportion (less than %) of the total number of households.
• Besides individual persons, physical locations were generated where individuals can perform a variety of activities. Currently, there are location types, and these location types are classified into categories: houses, educational institutions, workplaces, consumption locations, entertainment locations, and medical institutions. Each location has a geographic reference and the distribution of these locations was generated according to both statistical data and the geographic distribution of the population.
• The consistency between the individual person and the physical location was guaranteed. For example, a student of age will be assigned a location which belongs to location type 'university' rather than 'primary school'.
The statistics of the synthetic population and physical locations are listed in Table . The statistical results of the generated synthetic population are shown in Figure in the form of an age distribution. According to the previous results, the standard deviation of errors between the generated age and the statistical data is . ( % confidence interval (CI) from . to . ).
With the generated data, Ge et al. ( ) constructed a large-scale agent-based epidemic model. Based on the same source of data, this research built a large-scale agent-based model in a new way. A key challenge in using the raw data for our model is the redundancy of the data, such as the agents' preferred location lists for shopping, eating and entertainment. Together with the predefined social networks for agents in the data, the size of the data is initially around Gb. Since our implementation of the large-scale agent-based model, unlike the method of Ge et al. ( ), does not require the predefined location choices and social networks, we post-processed the raw data by extracting only the relevant fields of data items from the original database. In addition, to speed up the initialization phase, we converted the data from the database (MySQL) to a compressed format (e.g., gzip) to reduce disk transfer time. With these post-processing steps, the time efficiency for loading the model could be improved by % in our case.

Location
With the data generated from the statistical information, we modeled each of the million physical locations in the artificial city Beijing, which represent schools, restaurants, shops, hospitals, etc. The exact numbers of locations are listed in Table . Each location is characterized by its geographic reference (longitude and latitude), and the total area in square meters. The total area parameter added to a location is unique to this research; it is used to generate sub-locations (e.g., classrooms in a school) and serves as an important parameter for disease spread in the location. From Table , we can see that currently there are location types which are categorized into location categories. Clearly, these cannot cover all the location types that exist in Beijing; for example, small shops in 'Consumption locations' and cinemas in 'Entertainment locations' are missing in the current data. Further research should be conducted on generating or collecting real data for these missing locations, which are important for disease spread as well.
We partition each location into sub-locations by giving each location an attribute 'sub-location size'. Sub-locations can represent separate classrooms in a school, stores in a shopping mall, or offices in a workplace. Agents can only have direct contacts when they are in the same location and assigned the same sub-location index.
An efficient method 'calculateDistance' is implemented in the location class, which calculates the distance between two locations based on the geographic coordinate information (latitude and longitude). Since calculating the distance between two locations is an indispensable step for including a transport component in the model, this method is one of the most frequently called methods during a simulation run. Thus, we optimized this method by using an approximation of the length of one degree in longitude and latitude when transforming geographic coordinates into Cartesian coordinates. Compared with the accurate calculation, it speeds up the calculation by up to %, while the relative errors are less than . %.
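The degree-length approximation described above can be sketched as follows. This is a minimal illustration rather than the model's actual 'calculateDistance': the class name, the Earth radius, and the reference latitude of about 39.9°N for Beijing are assumptions introduced here, and the haversine formula stands in for the "accurate calculation" only for comparison.

```java
final class GeoDistanceSketch {
    // Assumed values, not taken from the paper: mean Earth radius and a
    // reference latitude roughly at the centre of Beijing.
    static final double EARTH_RADIUS_M = 6_371_000.0;
    static final double REF_LAT_DEG = 39.9;

    // Length of one degree, precomputed once for the reference latitude.
    static final double M_PER_DEG_LAT = Math.toRadians(1.0) * EARTH_RADIUS_M;
    static final double M_PER_DEG_LON =
            M_PER_DEG_LAT * Math.cos(Math.toRadians(REF_LAT_DEG));

    /** Planar approximation: a few multiplications and one square root. */
    static double approxDistanceM(double lat1, double lon1, double lat2, double lon2) {
        double dy = (lat2 - lat1) * M_PER_DEG_LAT;
        double dx = (lon2 - lon1) * M_PER_DEG_LON;
        return Math.sqrt(dx * dx + dy * dy);
    }

    /** Great-circle (haversine) distance, used here only as the reference. */
    static double haversineDistanceM(double lat1, double lon1, double lat2, double lon2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = phi2 - phi1;
        double dLambda = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dPhi / 2) * Math.sin(dPhi / 2)
                + Math.cos(phi1) * Math.cos(phi2)
                * Math.sin(dLambda / 2) * Math.sin(dLambda / 2);
        return 2 * EARTH_RADIUS_M * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    public static void main(String[] args) {
        double approx = approxDistanceM(39.90, 116.40, 40.00, 116.50);
        double exact = haversineDistanceM(39.90, 116.40, 40.00, 116.50);
        System.out.printf("approx=%.1f m, haversine=%.1f m%n", approx, exact);
    }
}
```

Because the degree lengths are fixed constants, the approximation avoids all trigonometric calls per query; its error grows with the distance from the reference latitude, which is acceptable at city scale.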
To manage the locations in each location type category, a 'LocationType' class is created, which can be instantiated for each location type. Besides the necessary methods to manage locations, such as getting a location by index, the two most frequently called methods are 'getNearestLocation' and 'getLocationArrayMaxDistanceM'. The first method returns the nearest location of the current location type to any location, and the second one returns an array of locations of the current location type within a max distance to any location. These two methods will be frequently called due to the fact that some people are more willing to visit the nearest places for shopping, eating and leisure when they have no particular preference. Due to the fact that most activities of agents in the simulation need to ask for a list of closest locations for carrying out that activity, calling these two methods would take a lot of computing resources.
Thus, a three-level cache mechanism was designed to achieve a balance between CPU utilization and memory usage. The first cache is the nearest cache, which stores the closest location of the current location type to a certain location. New items will be added to this cache only after they have been calculated for the first time.
The second cache is the grid cache. We divide the whole city map into grids and keep indexes of locations in the grids. The third cache is the distance cache, which is used when no results can be found in the nearest cache or the grid cache. For any specific location, this cache can keep nearby locations ordered by distance. Based on this design, the algorithm to implement method 'getNearestLocation' is listed in Algorithm , and the method 'getLocationArrayMaxDistanceM' is listed in Algorithm . In order to save memory for the million locations, we keep the indexes of locations as values in these three caches and encode the key into a 'Long' data type as the reference of a location.
Algorithm  Get Nearest Location
Input: Start location SL
Output: Nearest location NL to SL
  Calculate key k_n of SL for nearest cache map M_n;
  if M_n contains key k_n then
    get location NL from M_n;
    return NL;
  else
    Calculate key k_g of SL for grid cache map M_g;
    get all locations L_g by retrieving k_g from M_g;
    if L_g not empty then
      min Distance D_m = Double.MAX_VALUE;
      for all L ∈ L_g do
        Calculate distance D_L between L and SL;
        [...]

  [...]
  for all G ∈ G_s do
    Calculate key k_g for each G;
    get all locations L_g in G from grid cache map M_g;
    for all L ∈ L_g do
      Calculate distance D_L between L and SL;
      if D_L <= D then
        add L into location array L_d;
      end if
    end for
  end for
  end if
  return L_d;
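To make the cache lookup order concrete, the following is a minimal sketch of a nearest-location query with a nearest cache and a grid cache. It is an illustration under stated assumptions, not the authors' implementation: the class and method names, the planar coordinates in metres, the key encoding, and the full scan used as a stand-in for the distance cache are all hypothetical. As in the listing above, a hit in the start location's own grid cell is accepted even though a slightly closer location could lie just across a cell border.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NearestLocationSketch {
    private final double[] xs;       // planar x of each location, in metres
    private final double[] ys;       // planar y of each location, in metres
    private final double cellSizeM;  // edge length of one grid cell
    private final Map<Long, Integer> nearestCache = new HashMap<>();
    private final Map<Long, List<Integer>> gridCache = new HashMap<>();

    NearestLocationSketch(double[] xs, double[] ys, double cellSizeM) {
        this.xs = xs;
        this.ys = ys;
        this.cellSizeM = cellSizeM;
        // Fill the grid cache once: cell key -> indexes of locations in that cell.
        for (int i = 0; i < xs.length; i++) {
            gridCache.computeIfAbsent(gridKey(xs[i], ys[i]), k -> new ArrayList<>()).add(i);
        }
    }

    /** Packs the grid row and column into a single long key. */
    long gridKey(double x, double y) {
        long col = (long) Math.floor(x / cellSizeM);
        long row = (long) Math.floor(y / cellSizeM);
        return (row << 32) | (col & 0xFFFFFFFFL);
    }

    /** Returns the index of the nearest location, or -1 if there are none. */
    int getNearestLocation(long startKey, double x, double y) {
        Integer cached = nearestCache.get(startKey);
        if (cached != null) {
            return cached;                        // level 1: nearest cache hit
        }
        List<Integer> cell = gridCache.get(gridKey(x, y));
        // Level 2: candidates from the same grid cell; otherwise scan everything
        // (a simplification standing in for the distance cache).
        int n = (cell != null) ? cell.size() : xs.length;
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < n; j++) {
            int i = (cell != null) ? cell.get(j) : j;
            double d = Math.hypot(xs[i] - x, ys[i] - y);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        nearestCache.put(startKey, best);         // remember for later calls
        return best;
    }

    int cachedEntries() {
        return nearestCache.size();
    }
}
```

Storing only integer indexes as values and long keys keeps the per-entry footprint small, which is the memory argument made in the text.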

Transportation
There are many papers on activity-based transportation simulation (see e.g. Raney & Nagel ; Nagel & Rickert ; Zhang et al. ). These papers mainly focus on the prediction of traffic peaks and congestion. In our implementation of the artificial city Beijing, a microscopic public transportation system is simulated and integrated with the daily activities of the population with the aim of modeling 'realistic' travel contacts.

The public transportation system is associated with the execution of travel activities, which are considered as a connection between two activities of agents in two different physical locations. An agent that has to commute by public transport between two locations to conduct its next activity will execute a travel activity in the modeled transportation system. The transportation system will determine a route for the commuting agent and calculate the travel duration for the simulation.
The public transportation component is microscopic as we modeled all lines and stops of the metro and the bus system in Beijing. No tram lines exist in Beijing's public transport system. We also exclude the rail lines in this model as the train lines in Beijing are only used as inter-city connections. During each simulation day, modeled buses and metro trains will execute their schedules on these routes based on timetables. The geographic information and routing data of the transportation infrastructure network were acquired from OpenStreetMap by using the Java library called Osmosis. It offers stop information as nodes and route information as links that together form a graph. This graph shows the topology of the whole public transportation network in Beijing.

For commuting vehicles (private cars) on the road networks, the real road network was not modeled but estimated travel duration can be calculated according to the distance and historical statistical data on congestion.
The metro stops and bus stops of the public transportation system are modeled as extensions of the general locations in Section . In addition to the functions of a general location, a bus/metro stop can 'move' the waiting agent from the current stop to the arriving transporter (bus/metro train) if this transporter has enough space and is on the right route for the waiting agent in the stop. Moreover, in order to keep the agents 'simple' enough for large-scale simulation but 'heterogeneous' enough for public transportation, only the stops know and record transfer information of the waiting agents, and will pass the information to the transporter when the agents are on board. Then the transporter will 'move' the agent from the bus to a stop when it arrives at the right transfer or destination stop.
Agents that transfer in/between stops cause realistic delays, while the transporter also takes a certain delay when arriving at a stop to 'move' agents out and accept new passengers. In order to be realistic, we also enabled the bus or metro train to operate through a timetable. This data-driven method enables this public transportation component to simulate people's real travel behavior.
To enable the modeled traffic infrastructure components to offer routing information for commuting agents, a graph for routing was constructed using an open source Java library called jgrapht to connect the bus and metro stops. Every two stops of the same bus/metro line are linked and the edge of each link is assigned a travel duration. We also link stops that are not on the same route but within walkable distance, and assign an estimated duration by foot on this edge of the link. By default, this graph can offer a shortest (in travel duration) path to a potential public transport user. Since this graph will be called millions of times per simulated day in our model of Beijing, we added a cache in each node (stop) to store the next transfer stop information with its destination node as the key in the cache.

However, there is a big challenge for an agent to use this graph to get a travel route, which is to find the first stop to use as there could be more than one public transport stop close to the agent. A straightforward solution is to compare all the nearby stops for every travel request, but this could decrease the simulation performance drastically. We solved this challenge by creating 'GridZones' as nodes and adding them to the existing graph. We divided the map into grid cells, and the resolution of the grid can be set flexibly. We call the center of each cell a 'GridZone'. Each 'GridZone' is a node and is linked to the graph by linking the 'GridZone' with all stops in this grid cell. The weight of each edge is an estimated walking duration. When an agent plans to use public transport, the public transportation model will use the agent's current 'GridZone' as the start node to calculate the shortest path. The destination location is treated in a similar manner. The details are shown in Figure .
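The 'GridZone' idea can be illustrated with a small self-contained routing sketch. It deliberately does not use the jgrapht API; a plain Dijkstra search over an adjacency map stands in for the library's shortest-path call. The node names, durations in minutes, and class name are hypothetical values chosen for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

final class GridZoneRoutingSketch {
    private static final class QueueItem {
        final String node;
        final double minutes;
        QueueItem(String node, double minutes) {
            this.node = node;
            this.minutes = minutes;
        }
    }

    // Undirected weighted graph: stops and GridZones are both plain nodes.
    private final Map<String, Map<String, Double>> adjacency = new HashMap<>();

    void link(String a, String b, double minutes) {
        adjacency.computeIfAbsent(a, k -> new HashMap<>()).put(b, minutes);
        adjacency.computeIfAbsent(b, k -> new HashMap<>()).put(a, minutes);
    }

    /** Dijkstra search; returns the shortest travel duration or infinity. */
    double shortestMinutes(String from, String to) {
        Map<String, Double> best = new HashMap<>();
        PriorityQueue<QueueItem> queue =
                new PriorityQueue<QueueItem>((p, q) -> Double.compare(p.minutes, q.minutes));
        best.put(from, 0.0);
        queue.add(new QueueItem(from, 0.0));
        while (!queue.isEmpty()) {
            QueueItem cur = queue.poll();
            if (cur.minutes > best.get(cur.node)) {
                continue;                          // stale queue entry
            }
            if (cur.node.equals(to)) {
                return cur.minutes;
            }
            Map<String, Double> edges =
                    adjacency.getOrDefault(cur.node, Collections.emptyMap());
            for (Map.Entry<String, Double> e : edges.entrySet()) {
                double candidate = cur.minutes + e.getValue();
                if (candidate < best.getOrDefault(e.getKey(), Double.POSITIVE_INFINITY)) {
                    best.put(e.getKey(), candidate);
                    queue.add(new QueueItem(e.getKey(), candidate));
                }
            }
        }
        return Double.POSITIVE_INFINITY;
    }

    /** Hypothetical mini network: three stops on one line plus two GridZones. */
    static GridZoneRoutingSketch demo() {
        GridZoneRoutingSketch g = new GridZoneRoutingSketch();
        g.link("S1", "S2", 10.0);                  // ride between stops
        g.link("S2", "S3", 8.0);
        g.link("GridZone_home", "S1", 4.0);        // walking edges from cell centres
        g.link("GridZone_work", "S3", 3.0);
        g.link("GridZone_work", "S2", 12.0);
        return g;
    }
}
```

With the GridZones in the graph, a single shortest-path query from the agent's cell to the destination cell implicitly chooses the boarding and alighting stops, so no per-request comparison of nearby stops is needed.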
Besides public transportation, an agent can also choose to commute by his or her own private car (taxis are not included in this research). An approximate duration of commuting by car will be given by the transportation system for the execution of the simulation.
When the location of an agent's next activity is within walkable distance, a travel activity 'walk' is conducted. Similar to taking a car, no actual road networks are modeled for walking agents in our model but a 'walk' location is created instead. This enables people to meet others by chance when walking, although the probability is rather small. In our model, there is a 'walk' location with a large area into which all walking agents will be put temporarily.

Agent
Artificial city Beijing simulates . million agents and their daily behavior. Typical implementations of agents' behavior in artificial city research are activity-based, where all activities for the whole simulation are predefined in the input data source (Ge et al. ) or generated before the simulation run (Stroud & Valle ), which consumes a lot of memory. Assume there are around million agents and each agent has activities per day, then the total number of activities for a weeks simulation period is . billion. To reduce memory consumption, we designed agents to be activity-pattern based. This design is based on the finding of Mossong et al. ( ) that human behavior patterns are remarkably similar among people in different countries and that the patterns are highly correlated with age.
Since the age of a person is highly related to the social role (Kite ), each agent was given a social role (infant, student, worker, elder, unemployed) in the dataset prepared in Section . We distinguish between roles by giving agents different week patterns. For instance, a university student will be assigned one of the university student week patterns, and a worker will be assigned a worker pattern. To increase the heterogeneity and richness of these schedules, more than one week pattern is designed for each social role. A week pattern is made up of seven day patterns. For a typical worker week pattern, the first five day patterns can be the same weekday pattern, and the last two can be the same weekend pattern. In the week pattern for retired agents, for instance, all seven day patterns can be the same.

In this research, we designed around different day patterns for all social roles in the artificial city Beijing, based on the conclusions of other independent research. Ta et al. ( ) distinguished the working people in the suburban area of Beijing into types by recording real GPS data and combining the differences in activity (work, eat and shop) distance and commuting frequency. To summarize, they differentiated between types of workers: ( ) people who work at home and seldom go out; ( ) people who work and do other activities nearby (within km); ( ) people who do activities at an average distance of km from home; ( ) people who do activities at an average distance of km from home; ( ) people who do activities further than km away. Based on this research, we first merged types ( ), ( ) and ( ), and then separated the resulting type into new types by the way of commuting to work, which are commuting by public transportation and by private vehicles. The people who commute by private vehicles were separated into further new types, which are those who need to drive their children to school every school day and those who do not. For workers during weekend days, types of day patterns were designed according to the conclusions of Yue et al. ( ), which are: ( ) people who stay at home during the weekend; ( ) people who do activities nearby (within km); ( ) people who do activities further than km away by public transportation; ( ) people who do activities further than km away by driving.
For people who are retired, Ta et al. ( ) concluded that they behave mostly like Type ( ) and ( ) of workers. Thus, we designed day patterns for them. The first type prefers to stay at home and the other prefers to do activities outside but nearby. Besides, there is no difference for retired people between weekdays and weekends in this research. For students, due to the scarce data, types of weekday patterns were designed for typical students according to the way they commute to school. For weekend days, types of day patterns were designed which are similar to those of workers. Since the commuting ways for students are highly correlated to the distance to schools in the initial dataset and the patterns of their parents (those who drive their children to school or to other shopping and entertainment places), the proportions for assigning patterns to students were determined by the simulation model, both for weekdays and weekends. For babies, we assumed there is only one typical day pattern for them which is associated with their parents who work at home. Since this model is used to predict epidemics, a special day pattern for hospitalized people was designed as well.
A list of all designed day patterns is presented in Table . An algorithm was implemented to pick the proper weekday patterns and weekend patterns to form a week pattern, and to assign the resulting week pattern to agents during the initialization phase of the simulation.
To give a detailed impression of the designed typical day patterns, a weekday pattern example for workers who drive their children to school on weekdays is presented in Table , and a day pattern example for workers who drive outside during weekends is presented in Table . Every activity in any day pattern belongs to an activity type, and we categorized the activity types into three root categories in Figure , which are the regular activity, the travel activity and the social activity. Typical activities, such as sleeping, staying at home, working, shopping and attending school belong to the regular activity.
Much like the agent life cycle in a FIPA agent (Poslad ), an agent realized in this model has an implicit life cycle describing the agent states with the execution of activities (see Figure ).
The difference between the life cycle of FIPA agents and agents in this model is how state transitions occur. Each FIPA agent keeps the exact current state in its life cycle and needs a specific transition instruction for updating to the next state. To achieve this, every agent should maintain a list of future instructions, which consumes a lot of memory. In our model, the current state of an agent is not stored explicitly as there are no explicitly defined states in the agents. Instead we keep a current activity index within the current day pattern of an agent. When executing an activity, the activity itself or the activity executor (if this activity is a travel activity or social activity) will specify a duration for this agent to schedule its next activity. During this period, the agent remains in an implicit state (e.g., suspended), which is shown in Figure . Based on this design, the day patterns and the week patterns are reusable for agents who have the same social role, which considerably reduces memory usage compared to the FIPA solution. Under the same assumption as above, with around million agents and each agent having activities per day, we can design day patterns instead of the initial . billion activities for a week simulation period, which are only around activities in total. Moreover, the week pattern of an agent in our model can be changed as a result of the state of the system (e.g., a policy intervention) as the week pattern is treated as an index attribute for an agent, which increases the flexibility of the model.
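The memory argument can be illustrated with a short sketch in which agents store only two integers and all schedules live in shared tables. The pattern contents, activity names, and class names below are hypothetical and much simpler than the patterns in the paper.

```java
final class ActivityPatternSketch {
    // Hypothetical shared day patterns; every agent reuses these arrays.
    static final String[][] DAY_PATTERNS = {
        {"sleep", "travel", "work", "travel", "stay_home"},  // 0: worker weekday
        {"sleep", "stay_home", "shop", "stay_home"},          // 1: weekend
    };

    // A week pattern is seven day-pattern indexes (Monday first).
    static final int[][] WEEK_PATTERNS = {
        {0, 0, 0, 0, 0, 1, 1},
    };

    static final class Agent {
        int weekPatternIndex;  // can be swapped at run time, e.g. by an intervention
        int activityIndex;     // position within the current day pattern

        Agent(int weekPatternIndex) {
            this.weekPatternIndex = weekPatternIndex;
        }

        String currentActivity(int dayOfWeek) {
            String[] day = DAY_PATTERNS[WEEK_PATTERNS[weekPatternIndex][dayOfWeek]];
            return day[activityIndex];
        }

        /** Moves to the next activity; returns false when the day pattern is finished. */
        boolean advance(int dayOfWeek) {
            String[] day = DAY_PATTERNS[WEEK_PATTERNS[weekPatternIndex][dayOfWeek]];
            if (activityIndex + 1 >= day.length) {
                activityIndex = 0;   // ready for the next day
                return false;
            }
            activityIndex++;
            return true;
        }
    }
}
```

Per-agent storage is independent of the simulation length in this design; only the shared pattern tables grow when new day or week patterns are added.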

Social networks
There are three types of social networks modeled in this research, which are family, colleagues/classmates and friendships. Family, colleague and classmate relations can easily arise from defining a complete topology that clearly specifies all relation connections, which is shown in Figure . Friendships, as the most complex social relation, are relatively difficult to define. The topology of friend connections changes over time due to the dynamics of friendship relations (Pujol & Flache ). This is even more complicated on a large scale (Gatti et al. ). Thus, egocentric friend networks are dynamically generated to represent friendship connections. In this research, friendships will be generated before planning and negotiating social activities based on an algorithm that we will present below. The candidates for the friends come from three kinds of sources: neighbors, classmates/colleagues and a random selection. When agent A is planning a social activity, the algorithm for generating friends can be described as follows:
First, the number of friends N s is assigned to A which follows a power-law distribution (Hamill & Gilbert ). According to the fact that Dunbar's number (Hill & Dunbar ) ranges from to , the largest size of friends in this research is set to the lower boundary to reduce the computational complexity. The skewness is set to . , which is an example experiment setting in Hamill & Gilbert ( ).
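As an illustration of this first step, the sketch below draws a friend count from a truncated discrete power law by inverse-CDF sampling. The maximum of 100 friends and the exponent 1.5 used in the usage example are placeholders chosen for the example only; they are not the settings of this research, and the class and method names are hypothetical.

```java
import java.util.Random;

final class FriendCountSketch {
    /**
     * Inverse-CDF sampling of k in [1, maxFriends] with P(k) proportional
     * to k^(-exponent). The argument u is a uniform random number in [0, 1).
     */
    static int sampleFriendCount(double u, int maxFriends, double exponent) {
        double total = 0.0;
        for (int k = 1; k <= maxFriends; k++) {
            total += Math.pow(k, -exponent);
        }
        double target = u * total;
        double cumulative = 0.0;
        for (int k = 1; k <= maxFriends; k++) {
            cumulative += Math.pow(k, -exponent);
            if (target < cumulative) {
                return k;
            }
        }
        return maxFriends;   // guard against rounding at the upper end
    }

    public static void main(String[] args) {
        Random rng = new Random(42L);
        // Placeholder parameters: at most 100 friends, exponent 1.5.
        for (int i = 0; i < 5; i++) {
            System.out.println(sampleFriendCount(rng.nextDouble(), 100, 1.5));
        }
    }
}
```

Most draws are small and a few are large, which is the heavy-tailed shape the power-law assumption is meant to capture.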
Second, the percentage of A's friends from different sources is calculated according to a combination of uniform distributions (see Table ) as the source composition of A's friends may differ from another agent. For example, agent A may like to make friends with neighbors while agent B may prefer making new friends randomly in places like shops or restaurants.
. Third, select one candidate randomly from the source and calculate the possibility that the candidate and agent A are friends. If the calculation result exceeds a predefined threshold (e.g., . as an initial setting), put the  Number of friends from random selection N r N s − N n − N c candidate in agent A's friends list. Otherwise, select a new candidate and repeat the calculation process till all A's friends are generated. If the new friends list is still not full, increase the threshold and repeat the calculation process again. The calculation process is based on a concept called 'social similarity', which is proposed in this paper. It calculates the similarity between two agents. The considered variables include age, social role (week pattern), family role and the number of friends. In this research, the 'social similarity' S(A, B) between two agents A and B is evaluated by a weighted Euclidean distance which is shown is in Equation , where a represents age, s represents social role (converted to an index), f represents family role, n represents the agent's friends size and µ represents the weights for di erent variables.
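The three steps can be sketched in Python under stated assumptions. The exact equation did not survive in this text, so the score below is an assumed form: all four attributes are taken as pre-scaled to [0, 1], the weights sum to 1, and the weighted Euclidean distance is turned into a score as 1 minus the distance. The names `Person`, `split_by_source` and `pick_friends` are hypothetical, and the threshold-adjustment step is omitted.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    age: float          # all four attributes are assumed pre-scaled to [0, 1]
    social_role: float
    family_role: float
    n_friends: float


def social_similarity(a, b, weights=(0.25, 0.25, 0.25, 0.25)):
    """1 minus a weighted Euclidean distance; 1.0 means identical profiles.

    With attributes in [0, 1] and weights summing to 1, the distance stays
    in [0, 1]. Converting the distance to a score this way is an assumption.
    """
    diffs = (a.age - b.age, a.social_role - b.social_role,
             a.family_role - b.family_role, a.n_friends - b.n_friends)
    distance = math.sqrt(sum(w * d * d for w, d in zip(weights, diffs)))
    return 1.0 - distance


def split_by_source(n_s, share_neighbors, share_colleagues):
    """N_n and N_c from the given shares; the remainder is N_r."""
    n_n = int(n_s * share_neighbors)
    n_c = int(n_s * share_colleagues)
    return n_n, n_c, n_s - n_n - n_c


def pick_friends(agent, candidates, quota, threshold,
                 rng: random.Random, max_tries=1000):
    """Draw random candidates and keep those whose score exceeds the threshold."""
    friends = []
    pool = list(candidates)
    tries = 0
    while pool and len(friends) < quota and tries < max_tries:
        tries += 1
        candidate = pool.pop(rng.randrange(len(pool)))
        if social_similarity(agent, candidate) > threshold:
            friends.append(candidate)
    return friends
```

In this sketch `pick_friends` would be called once per source (neighbors, classmates/colleagues, random selection) with the quotas returned by `split_by_source`.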

Architecture of the artificial city
Models of locations, agents, social networks and the public transportation component constitute the main part of the artificial city. The system architecture of the artificial city is summarized by a class diagram containing the major classes of our implementation, shown in Figure . Based on this architecture and our research interest in this paper, we built simulations to study how spatial contacts can be modeled and observed, which is detailed in Section .

Spatial Contacts
In Section we constructed an artificial city with a large population by combining diverse data sets, including generated data from census information, open map data, etc. With this model, spatial contacts emerge during the execution of the model. We separate the spatial contacts into three different types and describe in the following subsections how each type of contact can be observed and measured.
The simulation of the proposed artificial city is implemented using the DSOL package (Jacobs et al. ), which is a Java-based discrete event simulation architecture. We ran the simulation on a PC (Intel Core i -M CPU, . GB RAM) for a simulation period of days.

Regular contacts
Regular contacts emerge when agents execute their daily regular activities in physical locations. For example, regular contacts can emerge among students who are in the same school location. When two students are each executing a school activity at the same location and the periods of the two activities overlap, these two students are considered to have a regular contact in this model. More strictly, we divided each location into sub-locations. For example, classrooms are the sub-locations of a school location. Hence, a student can only have regular contacts with other students who are in the same classroom.
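The overlap rule can be made concrete with a short Python sketch. It assumes each executed activity is logged as a visit with a sub-location identifier and a start and end time. The `Visit` record and the identifier format are hypothetical and not taken from the paper's implementation.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Visit(NamedTuple):
    agent_id: int
    sub_location: str   # e.g. "school_12/classroom_3" (hypothetical naming)
    start: float        # hours since midnight
    end: float


def regular_contacts(visits):
    """Return {(agent_a, agent_b): overlap_hours} for co-located visits."""
    by_place = defaultdict(list)
    for v in visits:
        by_place[v.sub_location].append(v)

    contacts = defaultdict(float)
    for group in by_place.values():
        for a, b in combinations(group, 2):
            if a.agent_id == b.agent_id:
                continue
            # Two visits are in contact only if their time intervals overlap.
            overlap = min(a.end, b.end) - max(a.start, b.start)
            if overlap > 0:
                pair = (min(a.agent_id, b.agent_id), max(a.agent_id, b.agent_id))
                contacts[pair] += overlap
    return dict(contacts)
```

The pairwise comparison is quadratic in the number of visits per sub-location, which is one practical reason for keeping sub-locations small.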
In addition to the household of each agent, the school (in the form of an ID) is predefined for every student, as is the workplace for every worker. The other locations, for activities like shopping and sports, are chosen dynamically according to the nearest-location algorithms described in Section .

From the execution of the simulation model, the number of people in several typical location types on a simulated weekday is shown in Figure , where the time of day ( : -: ) is on the x-axis. The 'others' item in the figure represents all the other location types according to Table . From Figure , it can be seen that in this model the largest part of the population spends the daytime of a simulated weekday at the workplace.
As an example, the statistical results of the hourly number of people in the house location for ten replications are presented in Figure , where the % confidence interval is drawn at each sample point (each hour).
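A per-hour confidence interval over replications can be computed as in the following Python sketch. It assumes a Student-t interval, which the text does not state explicitly, and it takes the critical value as an argument because the confidence level and the software used for the figure are not given here. The function name is hypothetical.

```python
import math
import statistics


def mean_confidence_interval(samples, t_crit):
    """Return (mean, lower, upper) for one sample point across replications.

    t_crit is the two-sided Student-t critical value for len(samples) - 1
    degrees of freedom at the chosen confidence level, supplied by the caller.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two replications")
    mean = statistics.mean(samples)
    half_width = t_crit * statistics.stdev(samples) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width
```

For ten replications there are nine degrees of freedom. For example, a two-sided 95% interval would use a critical value of about 2.262.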
Since the whole population in this research is modeled with four social roles (baby, worker, student and retired), the hourly results for agents with different roles in the house location are presented as an example in Figure . Because the results for baby agents are the same in the weekday and weekend experiments, babies are excluded from Figure .
Due to the design of the activities in a pattern, the duration of stay in different types of locations varies among agents even when they use the same activity pattern. To verify this design, the average duration of agents' stay in different locations in the weekday experiment is presented in Table . From Table , we can find that the longest duration of stay occurs in households, followed by work or study places.
. ) and the duration in different places is simply categorized into home, out-of-home and travel, we recorded the duration for workers in different locations separately and made a comparison in Table , where the duration in travel is excluded.

It is not difficult to find the causal relationship between the designed day patterns in
From Table , we can find that the relative error of the average duration of staying In-Home (equivalent to household in this research) between TRA and the experiment is relatively high ( . %) compared to that of the average duration of staying Out-of-home ( . %). This difference can be caused by many factors, such as the season of the survey, the monotonicity of the surveyed neighborhood and the incompleteness of our designed activity patterns. As our interest in this research is in a new agent-based modeling method, we accept this error, while noting that more surveys on human behavior patterns in Beijing are required in future research.
Due to the inclusion of public transportation, agents can have travel contacts, which is considered one novel contribution of this research. The patterns of agents' contacts during commuting are therefore discussed in the following section.

Travel contacts
Travel contacts emerge from the inclusion of the public transportation component in this model. We observed the number of people in the public transportation infrastructure components, such as metro stops, metro trains, buses and bus stops, during a working day. As an example, Figure shows how the numbers of agents with different social roles in the bus location change during a weekday. In this research, stops and metro trains are divided into several sub-locations to represent platforms or train compartments, and agents have travel contacts when they are in the same sub-location at the same time.
As we described before, the duration of a travel activity by bus/metro is decided by the simulation model and depends on several factors, such as the travel distance, the path that the agent chooses (e.g., Dijkstra shortest path) and the waiting queue at the metro stops.
Validation of a model with a wide range of parameters would be very difficult (Stocker et al. ). Thus, this simulation study shifts the focus to validation using several travel statistics. To validate the results of the public transportation component of the model, the average travel volumes on a weekday by bus and by metro are compared to the historical traffic statistics report in (Guo & Li ) in Table . The reason for adopting the traffic statistics report in is to keep this research consistent, as the generated population data is based on the census data of .
From the comparison in Table , we can find that the relative errors between simulation results and the historical traffic statistics are within %. Several factors are responsible for the differences, and one of the crucial ones is that the data collected in the report (Guo & Li ) only covers part of Beijing city (within the th Ring Road). This difference will increase the total relative errors to %, as the daily travel volume within the th Ring Road only accounts for % of the whole travel volume in Beijing.
Regarding the travel purpose, Table shows the comparison of the main purposes of using public transportation on a weekday. The relative errors are less than %.
From Figure , it can be found that the rush hours for public traveling are from am to am and from pm to pm, which matches the historical traffic statistics (Guo & Li ).
Besides travel volume and travel purpose, travel duration is used for validation as well.
The relative errors between simulation results and the real data mainly come from the lack of certain activity patterns in the model, which leaves a large amount of travel volume unaccounted for. For example, the model does not include patterns for business people and tourists, who would use public transportation multiple times in one day. These patterns were excluded from the model due to the lack of available data.
In conclusion, we list the missing components of the artificial city model that can easily be improved when the associated data becomes available.
• More refined activity patterns, such as a night-shift worker pattern, a tourist pattern and a business people pattern.
• More rules in the agents' architecture for making decisions. For example, people in reality would consider the price of tickets when choosing a route before traveling, while agents in this research only consider the shortest path.
• More accurate distributions of the starting time, duration and ending time of activities. For example, the departure time to the workplace for workers who are employed by universities should in general be earlier than for those who work in restaurants. For now, the departure time for workers with different types of jobs follows the same distribution in this research.

Social contacts
In this paper, social contacts are defined as the contacts among agents when executing joint social activities. The challenges for modeling these contacts are manifold.
The first is that no friendship social network is predefined in the initial data. All friendship social networks have to be generated before the execution of friendship social activities, based on the algorithms described in Section . For example, part of the friendship relations of an agent are generated among his/her neighbors and colleagues. The reasons for generating friendship social networks dynamically are twofold: first, it is too memory-consuming to store all friends lists for all million agents (up to friends for each agent); and secondly, real human friendship social networks are dynamic and evolve over time.
To make the friendship relations generated by the stochastic method as stable as possible (most friends of an agent remain the same over time), a reproducible random generator was designed using the agent id as the seed. Hence, whenever an agent wants to invite his/her friends to a social activity in the simulation, the dynamically generated friendship relations will mostly remain the same, although no static friends lists are predefined or need to be stored. The slight difference comes from the sequence in which candidates for the friendship calculation are selected from the friends sources, which is on a first come, first served basis.
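The seeding idea can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and plain random sampling stands in for the similarity-based selection. Sorting the candidate pool is an extra assumption made here to remove the order dependence mentioned above.

```python
import random


def friends_for(agent_id, candidate_ids, n_friends):
    """Regenerate the same friends list on demand instead of storing it."""
    rng = random.Random(agent_id)  # the agent id is the seed
    # Sorting makes the result independent of the order in which
    # candidates arrive (an assumption of this sketch).
    pool = sorted(c for c in candidate_ids if c != agent_id)
    return rng.sample(pool, min(n_friends, len(pool)))
```

Because the generator is re-created from the same seed on every call, two calls for the same agent and the same candidate set return the same list, so nothing has to be kept in memory between social activities.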
Another challenge follows from the first: joint social activities are not pre-scheduled for all participants, and only the organizer agent of a joint social activity foresees this activity in its schedule. Because there are no predefined friendship social networks, it is impossible to assign two consistent and semantically matched week patterns to two individual agents before the simulation starts when the two agents only become friends dynamically during the simulation. This is solved by dynamically generating artificial 'Group Agents' to help execute the friendship social activities. When the originator/organizer agent tries to execute a social activity, a helping 'Group Agent' is dynamically generated to take over the task of executing the social activity. First it generates a social network, and then it invites the members of the network to attend the joint social activity. After going through a decision tree that considers several rules and conditions (for example time and distance), each invitee either declines or accepts the invitation. After collecting all the responses, the 'Group Agent' requests all the participants to travel to the social location, where agents can arrive late due to real travel delay caused by the transportation model. The major process of executing a social activity is presented in Figure .
Figure : Execution process of a social activity
The detailed interaction procedure can be described as follows:
1. Before an agent starts to execute the current activity in the activity pattern, it checks whether the next activity is a joint social activity. If so, it checks whether the conditions for organizing it are met. A proposal for the joint social activity is then sent to all involved social network members. It is worth noting that the friendship relations in social networks are only generated in this step and that the agent only schedules a social activity within its current pattern.
2. After receiving a social activity proposal, the attendance possibility is calculated for every agent I_i according to Equation , where N is the total number of agents involved in the planned social activity, I_o is the organizer of this activity and S(I_i, I_j) calculates the link weight between the two agents based on the concept of 'social similarity'. The variables considered are age a, social role s, family role f and the number of friends n. In this research, the 'social similarity' is calculated as a weighted Euclidean distance , where µ represents the weights of the different variables. By setting the weight coefficients {µ_a, µ_s, µ_f, µ_n}, the calculation result S(I_i, I_j) is constrained between and . means they are fully connected while means no relations. A(d, E) calculates the degree of interest of the agent in the activity, where d is the distance between the agent's current location and the proposed activity location, E gives the degree to which the agent is interested in the activity and σ is a corrective coefficient.
3. Each agent compares the attendance possibility with its own attendance threshold t. If the result is negative, the agent sends a decline response to the activity organizer and continues its own schedule. Otherwise, it starts the second stage of decision-making, which is based on a decision tree (see Figure ).
4. Two kinds of decisions can be made by the agents after the decision-tree based process: accept and decline. The decisions are sent to the organizer immediately, and the organizer decides whether to continue the activity after collecting all responses.
5. Social activity organizers negotiate with the other members only once, which is necessary to avoid deadlocks.
6. When the final decision is made, the agents who are willing to join the coming social activity authorize a dynamically generated Functional Entity, the 'Group Agent', to take responsibility for updating states and for moving agents back to their original schedules when the social activity is finished.
For social contacts among family members and colleagues, the execution process of their joint social activities is almost the same as the process in Figure . The difference from friendship social contacts is that the social networks of family members and colleagues are predefined in the initial data.
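The two-stage, single-round negotiation can be sketched in Python as follows. This is a simplified illustration: the attendance score is taken as an already computed input because its equation is not reproduced in this text, the two rules standing in for the decision tree (schedule conflict and maximum distance) are illustrative, and all names and default values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Invitee:
    agent_id: int
    attendance_score: float  # result of the attendance equation, computed elsewhere
    threshold: float         # the agent's own attendance threshold t
    is_free: bool            # no conflicting activity at the proposed time
    distance_km: float       # distance to the proposed activity location


def decide(invitee: Invitee, max_distance_km: float) -> str:
    # Stage 1: equation-based filter against the agent's own threshold.
    if invitee.attendance_score < invitee.threshold:
        return "decline"
    # Stage 2: stand-in for the decision tree (rules are illustrative).
    if not invitee.is_free or invitee.distance_km > max_distance_km:
        return "decline"
    return "accept"


def negotiate_once(invitees: List[Invitee], max_distance_km: float = 10.0,
                   min_participants: int = 1) -> Dict[str, object]:
    """Single negotiation round: collect all responses, then decide."""
    responses = {i.agent_id: decide(i, max_distance_km) for i in invitees}
    participants = [a for a, r in responses.items() if r == "accept"]
    return {
        "responses": responses,
        "participants": participants,
        "go_ahead": len(participants) >= min_participants,
    }
```

Because every invitee answers exactly once and the organizer never re-proposes, the round always terminates, which mirrors the deadlock-avoidance argument of the procedure.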
To evaluate the emergent social contacts, we constructed a model. The parameters in this experiment are initialized using the data from Table . Since the four factors (age, social role, family role and the number of friends) are considered to be equally weighted in generating a friendship link, the corresponding weight coefficients (µ_a, µ_s, µ_f, µ_n) are calculated according to boundary conditions, so that the result S(I_i, I_j) is constrained between and . means they are fully connected while means that they have no relations. The other parameters are initialized as one possible experimental setting, and their sensitivity will not be discussed in this paper.
Based on this initial setting, agents' friends can be generated when 'FriendsSynchronizedActivity' is scheduled during a simulation run. The number of an agent's friends is assigned by the algorithm in Section and follows a power-law distribution (Hamill & Gilbert ). The resulting average number of friends is around , which is not well validated due to the lack of actual data for Beijing.
Together with the family and the classmates/colleagues networks, these friendships form the agents' social networks. However, agents only generate their social networks when they need to execute social activities.
Agents who receive invitations from their friends to attend social activities that are not scheduled in their activity patterns can interact with the organizing agents in order to reach a final decision.
Table shows

Decisions | Equation based Process | Decision Tree based Process
Accept | . | .
Decline | . | .

From Table , it can be found that % of agents decide to decline the invitation after the decision tree process.
Similar to Table , Table shows the average distribution of agents' decisions on a new colleague/classmate social activity. The biggest difference between the figures is that more agents are willing to participate in a colleague/classmate social activity than in a family social activity. This is because colleague/classmate social activities are often scheduled at times when there are no conflicts in the agents' schedules.
Table shows the average distribution of agents' decisions on a new social activity after executing the planning processes. Compared with the other two figures, the unusual aspect of the figure is that fewer agents accept the new proposal. This demonstrates that the composition of members in a friendship network can be
This indicates that people are more willing to plan activities with their families. For colleagues or students, there are two peaks, in the morning and in the afternoon. This is caused by the day patterns, in which working people have to attend meetings in the morning and afternoon and students attend joint sports activities in the afternoon. For the friend activity, it seems that most friends only meet in the evening, to have dinner, go shopping or go to the cinema together. This phenomenon is consistent with the observation that Chinese people prefer a joint dinner as a form of social interaction (HorizonKey ). However, these results cannot be well validated in this paper as no independent data exists at this moment.

Modeling Disease Spread for Validation
To systematically validate the resulting spatial contact network, we implemented a pandemic influenza disease progression model on this artificial city.

Disease model
Pandemic influenza is modeled as contagious in the resulting spatial contact networks. The phase transitions are modeled according to the research in Stroud & Valle ( ). In addition to their disease transition model, a phase called 'Vaccinated' was added in this research, which can be used for policy modeling. The phase transitions and details about the transition times and probabilities are presented in Figure .
An infected agent is contagious from the 'Asymptomatic_Contagious_Early_Stage' phase until the 'Convalescent' phase or one of the end phases 'Dead' or 'IMMUNITY'. However, the contagious probability varies between transition phases. The basic contagious rates of the phases are defined in Table .
Besides the basic contagious rates, the probability of infecting a susceptible person is also strongly related to factors such as the space of the sub-location, the number of infected persons in the same sub-location and the contact duration. Because of this, we added more parameters to the disease progression model. The final contagious rate for a susceptible person i in a sub-location L containing N infected persons is calculated through Equation , where R_j can be found in Table , β is a corrective coefficient for the basic contagious probability, σ_L is a corrective coefficient for the sub-location, S_L is the space of the sub-location (in square meters) and t_ij is the contact duration between persons i and j.
In this research, the corrective coefficients β and σ_L in Equation are both set to . . This simplification is one possible experimental setting, and the sensitivity of this setting will not be discussed in this paper.
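The equation itself is not reproduced in this text, so the following Python sketch uses an assumed functional form purely to illustrate how the listed quantities could interact. It is not the paper's formula. The assumed form is 1 − Π_j (1 − β·R_j)^(σ_L·t_ij / S_L), which increases with the number of infected persons and with contact duration, and decreases with the space of the sub-location.

```python
def infection_probability(rates, durations, beta, sigma_l, space_m2):
    """Hypothetical form: 1 - prod_j (1 - beta * R_j) ** (sigma_L * t_ij / S_L).

    rates     -- basic contagious rate R_j of each infected person j
    durations -- contact duration t_ij with each infected person j
    beta      -- corrective coefficient for the basic contagious probability
    sigma_l   -- corrective coefficient for the sub-location
    space_m2  -- space S_L of the sub-location in square meters
    """
    if space_m2 <= 0:
        raise ValueError("sub-location space must be positive")
    escape = 1.0  # probability of escaping infection from every contact
    for r_j, t_ij in zip(rates, durations):
        p_contact = min(max(beta * r_j, 0.0), 1.0)
        escape *= (1.0 - p_contact) ** (sigma_l * t_ij / space_m2)
    return 1.0 - escape
```

With an empty list of infected persons the function returns 0, and the result always stays within [0, 1], which are the minimal properties any concrete choice of the equation would need.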
With this disease model implemented in the artificial city, a simulation was conducted to test the effect that the modeled spatial contacts have on a disease outbreak. The initial condition for the disease model was that in million people in the population was in the 'Suspect' phase.

Model validation
In the disease spread model, the potential phases are categorized into two types: end phases and transitional phases. Firstly, we present the number of agents in the end phase 'IMMUNITY' in Figure .
One example of the transitional phases is 'Hospitalized', which is presented in Figure .

As
From Figure , we can find that the household (home) is the most likely location type for disease spread among the full population, followed by workplaces, schools and transportation. This result is also consistent with the conclusion in Stroud & Valle ( ). To give a detailed view, the distribution of infected agents with different social roles over the different location categories is presented in Figure .
We also present the distribution of infection sources for different social roles in Figure . From Figure , we can find that the largest share of the infections for a given social role comes from the same social role, except for babies. This can be explained by the fact that students, workers and retired people spend most of their daytime with each other, while babies always stay with their parents. In particular, workers have a higher probability of being infected by their companions than the other social roles. This is because workers form the largest part of the population and spend the day in more enclosed spaces.
[Figure panel: Distribution of source of infection for Student]
) reported that about % of the population will be symptomatic or convalescing at the pandemic peak after days of stable and exponential growth, while we obtain a result of around % of the population convalescing after days. Furthermore, there are also differences in the concrete numerical values of the breakdown of cumulative infections by location type and of the clinical attack rates by age group. Since we believe these differences are related to the artificial city model, and the underlying population data (China vs USA) are different, these indicators will not be validated in this research.

In reality, there was an H N outbreak in Beijing in which lasted more than six months. However, the historical data, including the peak number of 'infections' and the time of the peak, will not be used for validation in this research, for several reasons. First of all, the peak number of reported 'infections' was based on confirmed cases. These cases do not distinguish the disease phases and do not contain detailed personal information. Secondly, the period up to the reported peak was rather long, as a series of interventions were conducted by different authorities in different parts of the city among different social roles between the first case in May and the peak time on October , . Therefore, the simulated results of the extreme situation in this research cannot be validated against this data. Instead, an expert validation process is required as part of future research.

Comparison with Related Works
Model with same disease transition model
Since this research shares the same disease (H N ) model with the research EpiSimS by Stroud & Valle ( ) for prediction in southern California, it is meaningful to compare the model by EpiSimS with the model in this research.
Besides the differences in data and parameters such as the basic contagious rate (R ) and the data source, the major differences are reflected in the choices made in the design and implementation phases.
• Although sublocations are modeled in EpiSimS, the activity locations are organized in a more hierarchical way in this research.
• Weekdays and weekend days are averaged to get a representative day in EpiSimS while they are separately modeled in this research.
• EpiSimS does not capture disease transmission during travel while this research includes a public transportation component for commuting.
• Agents' behavior is based on fixed schedules in EpiSimS, while in this research the activity pattern can be replaced and specified activities (e.g., social activities) in the pattern can be rescheduled.

Model with same data source
Although the population and environmental data originate from the research of Ge et al. ( ), this research is independent, and the way the artificial city model and the epidemic prediction model are designed and implemented is different. To show how the research in this paper is unique and innovative, we made a comparison between this research and the KD-ACP framework (Chen et al. ), which was used to implement an epidemic model based on the same data.
• Agents implemented by KD-ACP behave according to fixed activity schedules in terms of the activity sequence, the activity locations (fixed choices) and duration. That is, agents in KD-ACP do not have decision-making capabilities. This paper models agents in a different way, in which agents have multi-level decision-making capabilities while still staying "simple" and "small" enough for computational efficiency.
• Social networks in KD-ACP are predefined in the initial data; thus, no unscheduled joint social activities can be executed in the simulation. This paper generates social networks for agents dynamically, so that agents can have complex social interactions in order to join unscheduled joint social activities.
• Subway networks are modeled to represent the whole public transportation system in KD-ACP. A lot of effort is required to complete the public transportation networks. In contrast, this paper achieves this task easily by connecting traffic objects (buses and metro trains) with travel activities.
• The disease model is considered to be validated in KD-ACP by two indicators, the infection trend and the basic reproduction number. This paper verifies and validates the model in terms of both people's daily behavior and infection details, which cover more model details.
) whose research trend is on studying some of the interventions for specific interests, the contribution of this paper is mainly the proposal of a way to model relatively complete spatial contacts among a large-scale population, by which policy makers can test multiple interventions for controlling disease spread using one epidemic model. The novelty of this model consists of the following aspects:
• A microscopic public transport system (subways and buses) together with a predicted road traffic system are simulated in an artificial city and are well integrated with the daily activities of the population.
• Social networks can be dynamically generated to execute joint social activities.
• The model is scalable ( million agents) and can still be simulated on a PC.

Conclusions
This paper designed algorithms and approaches to model complete spatial contacts for epidemic prediction in the context of a large-scale artificial city. Firstly, by combining diverse data sets, including generated census-based data, open source maps and activity patterns, an artificial city with a large population was constructed. In this artificial city, each of the million physical locations and . million citizens was modeled. All of these individuals can carry out regular activities, travel around, and join non-predefined social activities by executing their daily activities according to a pattern. With this model, spatial contact networks emerge and can be observed during the execution of the model.

Among these efforts, the activity pattern based design of agents can be considered the foundation for modeling complete spatial contacts for epidemic predictions. With this design, the memory needed to keep the necessary information for millions of agents can be constrained to an acceptable level, while agents can still show diverse behavior in terms of activity locations, activity durations, travel routes and decisions on non-predefined social activities, even when they have the same activity pattern. Through the execution of different types of activities in the agent patterns, the spatial contact networks emerge.
Secondly, to investigate the effect of the emerging spatial contact network on epidemic prediction, a pandemic influenza disease progression model was implemented. The results are consistent with other independent research. We believe this research and the constructed model could be an effective starting point, as the model in this research makes it possible to observe relatively complete spatial contacts for the first time.
Since this research can also be considered as a proof of concept which exemplifies how complete spatial contact networks in a large-scale city with complex social networks can be modeled using an agent-based method, it also indicates potential use in areas such as public transportation systems and city level evacuation planning.
As for future research, two more efforts are required to refine the model. The first is to collect more actual data, for instance to add more optional week/day patterns by surveying people's actual activity patterns, and to survey the distribution of people's number of friends. The other is to improve the simulation performance by distributing the model, as it still takes approximately hours to run one replication.