Computer Science- Python razzy Transport assignment,arcgis pro and python

The folder contains two websites showing assessment examples from 2021 and 2020, three assigned research papers (possible references), and the PTUA summative assessment file, which sets out the requirements.

Sinclair et al. – 2023 – Assessing the socio-demographic representativeness

Applied Geography 158 (2023) 102997

Available online 13 July 2023
0143-6228/© 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Assessing the socio-demographic representativeness of mobile phone
application data

Michael Sinclair a,*, Saeed Maadi a, Qunshan Zhao a, Jinhyun Hong b, Andrea Ghermandi c,
Nick Bailey a

a Urban Big Data Centre, University of Glasgow, Glasgow, UK
b Department of Smart Cities, University of Seoul, South Korea
c Department of Natural Resources and Environmental Management, University of Haifa, Israel

A R T I C L E I N F O

Handling Editor: Y.D. Wei

Keywords:
Mobile phone data
Socio-demographic representativeness
Tamoco
Huq

A B S T R A C T

Emerging forms of mobile phone data generated from the use of mobile phone applications have the potential to
advance scientific research across a range of disciplines. However, there are risks regarding uncertainties in the
socio-demographic representativeness of these data, which may introduce bias and mislead policy
recommendations. This paper addresses the issue directly by developing a novel approach to assessing socio-demographic
representativeness, demonstrating this with two large independent mobile phone application datasets, Huq and
Tamoco, each with three years of data for a large and diverse city-region (Glasgow, Scotland) home to over 1.8
million people. We advance methods for detecting home location by including high-resolution land use data in
the process and test representativeness across multiple dimensions. Our findings offer greater confidence in using
mobile phone app data for research and planning. Both datasets show good representativeness compared to the
known population distribution. Indeed, they achieve better population coverage than the ‘gold standard’ random
sample survey which is the alternative source of data on population mobility in this region. More importantly,
our approach provides an improved benchmark for assessing the quality of similar data sources in the future.

1. Introduction

New forms of mobile phone (MP) data from the use of applications,
or ‘apps’, offer enormous potential as an alternative or complement to
traditional survey data sources to enhance our understanding of human
activity and mobility (Huang et al., 2022; Kang et al., 2020). The huge
volumes of data available from these novel sources as well as the spatial
and temporal details they provide, create unprecedented opportunities
across a wide range of disciplines to advance scientific research. However,
there are critical unanswered questions concerning the
socio-demographic representativeness of these new forms of MP data.
This creates a risk that underlying bias could produce unreliable results
which are then used as the basis for policy (Grantz et al., 2020).
Furthermore, the limited analysis of the issue of socio-demographic
representativeness restricts the progress of applied research seeking to
utilise these novel and emerging forms of MP app data as an alternative or
complement to more traditional data sources.

Traditionally, scientific research has utilized MP data from call detail
records, which track phone locations during potentially billable events
(Calabrese et al., 2013; Grantz et al., 2020; Pappalardo et al., 2021; Ren
& Guan, 2022; Vanhoof et al., 2018b; Wang et al., 2020; Yabe et al.,
2022). More recently, a new form of location data from the use of
GPS-enabled smartphone applications has emerged, which also offers
large data volumes but with much higher spatial accuracy (Berke et al.,
2022; Grantz et al., 2020; Huang et al., 2022; Wang et al., 2020; Yabe
et al., 2020). This mobile phone application (MPA) data, generated and
collected from the use of a wide range of apps, provides point location
information which supports more detailed analysis and opens up the
range of possible analytical applications (Cameron et al., 2020; Heo
et al., 2020; Mears et al., 2021; Sinclair et al., 2021; Yabe et al., 2020).
So far, these have included disaster and pandemic response (Huang
et al., 2022; Kishore et al., 2022; Yabe et al., 2020), nature-based recreation
(Mears et al., 2021; Sinclair et al., 2021) and analyses of human
mobility (Calafiore et al., 2021; Gao et al., 2020; Kang et al., 2020). The
applications of these novel data are in their infancy and their potential
spans a wide range of disciplines.

* Corresponding author.
E-mail address: michael.sinclair@glasgow.ac.uk (M. Sinclair).

Contents lists available at ScienceDirect

Applied Geography

journal homepage: www.elsevier.com/locate/apgeog

https://doi.org/10.1016/j.apgeog.2023.102997
Received 17 February 2023; Received in revised form 20 April 2023; Accepted 11 May 2023


New spatial data forms such as MPA data are frequently contrasted
with traditional survey data as an alternative source for applied research
(Mayer-Schönberger and Cukier, 2013; Savage & Burrows, 2007).
Although household surveys are considered the ‘gold standard’ for
research due to generalizable random samples, they face declining
response rates (Brick & Williams, 2013; Meyer et al., 2015), interviewer
effects, recall error and normative bias (Marsh, 1982). Surveys are
unsuitable for rapid response situations like the Covid-19 pandemic, and
their low sample sizes limit fine-grained spatial/temporal detail.
Additionally, some highly marginalized groups such as the homeless or those
in temporary forms of accommodation may be excluded. In this context,
MPA data may offer advantages as a complement or alternative to
traditional data (boyd & Crawford, 2012). MPA data provide
spatial-temporal detail, often in (near) real-time or with low lag. While
consent is still required, data collection is less burdensome, potentially
reducing non-response bias and including previously excluded groups.
Despite this potential, there are uncertainties about these novel data. A
major one concerns data quality since so much of the data capture and
processing is unavailable to researchers due to commercial concerns.
This raises worries that the data may under-represent marginalized
groups, particularly the ‘digitally excluded’ (boyd & Crawford, 2012).

There are critical questions, therefore, about the quality of MPA data
which need to be addressed before more widespread use is justified in
research (Grantz et al., 2020). Key among them is the question of how
accurately MPA data represents the population of interest, given they
are not the result of a carefully-planned sampling strategy (Ranjan et al.,
2012; Zhao et al., 2016) and are constructed from the use of a wide range of
different applications. In particular, the concern is that inequalities in
access to and use of mobile phones may be reflected in these data. The
resulting research may skew attention and possibly resources towards
already advantaged social groups (Grantz et al., 2020) or fail to
adequately include groups such as the elderly population (Guo et al.,
2019; Lee et al., 2021). Though the question of bias is common to all
forms of MP data, it is perhaps especially relevant to MPA data where
datasets are assembled by commercial intermediaries. These intermediaries
gather data across a wide and diverse set of apps with the
aim of achieving scale and broad representativeness, but their processes are not
transparent: few metrics are provided to evidence representativeness and there
are currently no standards by which the datasets can be evaluated. Despite its
importance, few studies using MPA data explore the topic of
socio-demographic representativeness directly (Huang et al., 2020,
2022).

Assessing representativeness is challenging due to the steps taken by
MP data providers to protect user privacy. MP data are often provided to
researchers as aggregated totals, making it impossible to identify the
characteristics of individuals at all. Where data are provided at the
individual level, such as with MPA data, information is rarely if ever
provided on a user’s personal characteristics, so representativeness
cannot be examined directly (Grantz et al., 2020). Researchers therefore
often use techniques to infer the user’s home location based on location
histories. These home locations allow the geographic distribution of the
sample of MP users to be compared to ‘ground truth’ sources such as
official population statistics (Berke et al., 2022; Calabrese et al., 2013;
Huang et al., 2022; Mao et al., 2015; Phithakkitnukoon et al., 2012;
Wang et al., 2019; Çolak et al., 2015). This process provides a very useful
measure of differential geographic coverage (Yabe et al., 2020) as well
as variations by the socio-demographic status of different areas (Bernabeu-Bautista
et al., 2021; Huang et al., 2020, 2022).
Enriching the data in this way also greatly increases the potential impact
of research.

Different approaches have been adopted to estimate or infer home
locations from MP and other locational data based on the volume of
content generated by a user in space and/or time (Calafiore et al., 2021;
Pappalardo et al., 2021; Sinclair et al., 2020). Since it is rarely possible
to validate home detection algorithms against known home locations for
users (Pappalardo et al., 2021), techniques are designed with the aim of
reducing potential error. To assign a home location at a country or city
level, daily activity counts are generally sufficient (Bojic et al., 2015;
Sinclair et al., 2020). However, inferring socio-demographic information
for users requires predicting home locations for much smaller geographies.
Including the full range of an individual’s daily activities towards
this end could lead to an increase in false predictions as users might
record volumes of data around places designated for work or socialising
(Pappalardo et al., 2021; Vanhoof et al., 2018a). This is especially true
for MPA data where datasets represent a wide range of activities, due to
the mix of apps involved. The main approach to overcome this is to
utilise activity heuristics, by including a time element in the algorithm.
Restricting the analysis to night-time data, based on the assumption that
people are more often at home during this period (Berke et al., 2022;
Bojic et al., 2015; Calabrese et al., 2013; Calafiore et al., 2021; Phithakkitnukoon
et al., 2012; Sinclair et al., 2020; Vanhoof et al., 2018;
Çolak et al., 2015), has been shown to improve results (Pappalardo et al.,
2021).

There are two main limitations with current approaches to assessing
representativeness of these novel data. The first is that, even with ac-
tivity heuristics, problems remain in inferring home locations as many
people spend periods of the night at sites of work, leisure, or transit. This
is especially true for MPA data where the data represent various activ-
ities based on a diversity of apps. The second is that, once home loca-
tions have been inferred, researchers rarely explore representativeness
in a systematic or comprehensive way. In this paper, we address both
issues and hence provide a more appropriate standard for assessing
representativeness of MPA data. First, with home locations, we propose
a novel approach which incorporates high-resolution land use data into
the process. By relying only on data captured within buildings which
have a designated residential use, we greatly reduce the chance of
identifying night-time work, leisure, or transit locations as home loca-
tions. Second, we use these potentially improved home location esti-
mates to examine representativeness using multiple independent
dimensions. These cover the geographical distribution but also socio-
economic and socio-demographic status.

To illustrate our approach, we apply it to an assessment of repre-
sentativeness for two extensive and independent sources of MPA data.
Each contains data from a diverse portfolio of apps covering a wide time
period (three years) for a large and socio-demographically diverse city-
region (Glasgow). First, we apply our home detection approach which
incorporates high-resolution residential land use data into the process.
Second, we compare the distribution of the resulting samples of MPA
users from both data sources to the known population distribution
across three years (2019–2021). Comparisons are made by geographic
location and against two different measures of area socio-demographic
status. One is an official index of area deprivation in Scotland, the
Scottish Index of Multiple Deprivation (SIMD). The other is a commer-
cial socio-demographic classification, CACI’s Acorn consumer classifi-
cation (CACI), which segments areas by analysing a wide range of data
on demographics and consumer behaviour. Third, we compare the re-
sults on representativeness found using our novel home detection
approach against those found using a more conventional approach,
which does not utilise residential land use data, to illustrate the impact
of this innovation. Finally, we compare the distribution of our sample of
MP users to the distribution of the sample of households captured by a
traditional survey which is widely used in the study area for mobility
analysis and transport planning, the Scottish Household Survey (SHS).
Such traditional forms of data are frequently held up as the ‘gold stan-
dard’ against which new forms of data are compared since they are built
round a structured random sample. Comparison against such a sample
provides arguably a fairer test of representativeness.


2. Data and methods

2.1. Study area

The study area is Glasgow city-region, comprising the core city (the
largest in Scotland) and seven surrounding councils (Fig. 1). Glasgow is
home to over 600,000 people, while the wider city-region houses over
1.8 million people. The city-region covers areas or neighbourhoods with
a wide range of socio-demographic circumstances, which makes it
particularly suitable to test for inequalities in sample coverage by socio-
economic status. Fig. 1 also shows the eight council boundaries used for
reporting results as well as the built-up residential areas within each.

2.2. Mobile phone application datasets

The core data for this research are MPA datasets from two private
companies, Huq and Tamoco¹. Both are examples of smartphone GPS
location data (Yabe et al., 2022) which are timestamped point data
generated using MP apps on GPS-enabled smartphones (Table 1). This
type of big data generally offers a higher spatial precision than tradi-
tional sources of MP data such as call detail records which are often
limited to cell tower regions. The data used in this study are confined to
the extent of the study area (Fig. 1) and consist of hundreds of millions of
data points per year. More widely, Huq currently offers data across the UK and
Tamoco across the UK and the United States of America.

The construction and structure of the datasets are similar across both
providers. Each contains data from a range of partner apps on an
informed consent basis, with data limited to users aged 16+. Data is
collected when an app records the time and location of a device based on
the most accurate location sensor available at the time, including GPS,
Bluetooth, cellular tower, Wi-Fi or a combination of sources (Wang &
Chen, 2018). Due to a lack of transparency from the commercial pro-
viders, the specific applications included in the datasets are unknown to
researchers. However, data are pooled from a wide and diverse set of
apps with the aim of achieving scale and broad representativeness. In
one of the years, for example, one provider was collecting data from over
200 unique apps. The data represent timestamped point locations with a
certain degree of error. Each MP device has the personal identifiers
replaced with non-reversible hashed identifiers. This means that data
points from the same user can be linked over time. With Huq, the points
from individual users can be linked over the whole period while Tamoco
resets its hashed identifiers every month. Data volumes are vast and
fluctuate year-to-year (Table 1), reflecting in part the changes in the
apps with whom the intermediaries have contracts. The challenge of
assessing representativeness will therefore always be an on-going
exercise.
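
As an illustration of why a persistent identifier allows points to be linked over the whole period while a monthly reset confines linkage to a single month, the following minimal Python sketch contrasts the two cases. The providers do not disclose how identifiers are hashed, so the salted SHA-256 scheme, the salt values and the function names `pseudonymise`, `persistent_id` and `monthly_id` are assumptions made purely for illustration.

```python
import hashlib
from datetime import date


def pseudonymise(device_id: str, salt: str) -> str:
    """Return a one-way hashed identifier for a device (hypothetical scheme)."""
    return hashlib.sha256(f"{salt}:{device_id}".encode("utf-8")).hexdigest()


def persistent_id(device_id: str) -> str:
    # One salt for the whole period: the same device always maps to the
    # same identifier, so its points can be linked across all months.
    return pseudonymise(device_id, salt="fixed-salt")


def monthly_id(device_id: str, when: date) -> str:
    # The salt changes every calendar month: the same device maps to a
    # different identifier each month, so linkage stops at month boundaries.
    return pseudonymise(device_id, salt=f"{when.year}-{when.month:02d}")
```

Under this sketch, a device observed in both January and February yields one identifier in the persistent case but two unrelated identifiers in the monthly case, which is consistent with Tamoco user counts being reported as monthly averages.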

2.3. Other secondary data sources used in the analysis

Different levels of geographic boundaries are used in the analysis, all
of which are represented visually in Appendix 1. The highest level used
is Council (n = 8) which is also visualised in Fig. 1. The next level is the
Intermediate Zone (n = 417), which nests within councils. We also use
Datazones, which nest within Intermediate Zones and are the key
geography for small area statistics in Scotland. These are the spatial units
used in this paper for home location detection. Datazones are also used
to assign the Scottish Index of Multiple Deprivation measure of
socio-demographic status to mobile devices (see below). Datazones are
designed to have a population of 500–1000 and there are 2336 in the
study area. The finest spatial boundary used is the unit postcode (n =
44,829), which nests within Datazones. These boundaries are used to
assign the second measure of socio-demographic status to users, the
CACI Acorn Consumer Classification (see below).

In comparing the socio-demographic representativeness of MP users
to the population, we use two sources, one public and one private. The
Scottish Index of Multiple Deprivation (SIMD, https://simd.scot/) as-
signs a relative measure of area deprivation to Datazones across Scot-
land. The SIMD combines measures of deprivation across multiple
domains (income, employment, education, health, crime, housing and
access to services) into a scaleless relative ranking. SIMD 2020 is used in
this research. Our analysis assigns MP users an SIMD quintile and
percentile, using national rankings, based on the Datazone where they
are estimated to live. As Glasgow city-region is a relatively deprived
area, there is an over-representation in more deprived quintiles (1 and
2). See the supplementary material for population breakdown by SIMD
groups. The CACI Acorn Consumer Classification (https://www.caci.co.uk/)
is a private socio-demographic data source which segments the UK
population by analysing a wide range of data on demographics and
consumer behaviour. CACI segments unit postcodes into 6 categories, 18
groups and 62 types. The subdivisions are nested, with the 6 categories
broken into between 2 and 4 groups², and the 18 groups broken into
between 3 and 6 types. Our analysis assigns mobile phone users a
category, group and type based on the postcode where they are esti-
mated to live. In this study we use 2020 CACI data. See the supple-
mentary material for population breakdown by CACI groups.
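
The assignment of an SIMD quintile and percentile from a national ranking can be sketched as follows. This is a hedged illustration rather than the code used in the study: the function name `simd_band`, the example Datazone codes and the example ranks are hypothetical, and the national total of 6,976 ranked Datazones (with rank 1 the most deprived) is supplied here as an assumption about SIMD 2020.

```python
# Assumed national number of Datazones ranked in SIMD 2020 (not stated in the text).
N_DATAZONES_SCOTLAND = 6976


def simd_band(rank: int, bands: int, n_zones: int = N_DATAZONES_SCOTLAND) -> int:
    """Map a national rank (1 = most deprived) to a band numbered 1..bands.

    bands=5 gives quintiles and bands=100 gives percentiles.
    """
    if not 1 <= rank <= n_zones:
        raise ValueError("rank must lie between 1 and n_zones")
    # Integer ceiling of rank * bands / n_zones, avoiding floating point.
    return (rank * bands + n_zones - 1) // n_zones


# Hypothetical lookup from Datazone code to national SIMD rank.
simd_rank = {"DZ_A": 350, "DZ_B": 5200}
# Hypothetical home Datazone estimated for each hashed user identifier.
home_datazone = {"user1": "DZ_A", "user2": "DZ_B"}

user_quintile = {u: simd_band(simd_rank[dz], 5) for u, dz in home_datazone.items()}
user_percentile = {u: simd_band(simd_rank[dz], 100) for u, dz in home_datazone.items()}
```

Because the bands are taken from the national ranking rather than recomputed within the study area, a relatively deprived region such as Glasgow city-region will have more than a fifth of its population in quintiles 1 and 2.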

As a key step in the home location detection process, which is
explained in the next section, we make use of high-resolution land use
data from Geomni’s UKBuildings layer. This dataset is a multi-polygon
spatial dataset representing the footprint of all buildings in the UK,
including residential buildings (see Fig. 1). Each building is assigned a
usage, classified into various types. For this study, we use all buildings
with a residential or mixed-residential use. Data from 2020 is used in
this study.

In the final section of the results, we compare the MPA samples to a
traditional survey dataset widely used in social research across Scotland,
the Scottish Household Survey (SHS, http://www.scottishhouseholdsurvey.com/).
The SHS is an annual survey of over 10,000 households,
used as the basis of a range of official statistics. The SHS has a repeat
cross-sectional design with a sample for the Glasgow city-region of N =
3495 in 2019 (the most recent available).³ For Glasgow City, fewer than
half the eligible adults completed a travel diary. Younger adults were
significantly under-represented while those 65+ were over-represented.
Some population groups are excluded by the sample design including
households living on military bases, in communal establishments, in
mobile homes or sites for traveling people, or homeless (Scottish Gov-
ernment, 2020).

1 Information for Huq available at:
https://www.ubdc.ac.uk/data-services/data-catalogue/transport-and-mobility-data/huq-data/;
and Tamoco available at:
https://www.ubdc.ac.uk/data-services/data-catalogue/transport-and-mobility-data/tamoco-data/.

2 This is with the exception of the category/group ‘Not Private Households’
which is not disaggregated between the levels of category and group (and
related to areas which generally do not have a residential population).

3 Households are selected using a random sample stratified by council which
over-represents smaller councils to ensure each achieves a minimum sample
size. A travel survey portion is completed by one randomly-selected adult in
each household (Scottish Government, 2020) and is by definition therefore
skewed towards adults from smaller households. For 2019, the response rate for
households nationally was 63%, with random adults completing the travel
survey in 92% of cases but this varied across the country.

2.4. Home location detection techniques

To compare the distribution of each MPA sample to the population, it
is necessary to estimate the home location for each MP user in the
dataset and this is typically done based on night-time locations, as discussed
in the Introduction. The specific period which constitutes night-time
varies between studies but a window beginning between 19.00 and
22.00 and ending between 05.00 and 09.00 is common (Pappalardo
et al., 2021; Vanhoof et al., 2018). It is rarely possible to verify the ac-
curacy of home location estimates with ‘ground truth’ data. One study
which achieved this found that using night-time data was more accurate
than taking data which covered the whole day (Pappalardo et al., 2021).
Accordingly, this is the approach we build on here using the night-time
period of 20.00 to 06.00. Box 1 explains in more detail how we estimate
home locations using our approach (Method 1) and a more conventional
approach (Method 2).
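
The two home detection methods summarised in Box 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (which uses PostgreSQL and R). The `Point` record and its field names are hypothetical, the `datazone` and `in_residential` attributes are assumed to have been pre-computed by a spatial join against Datazone boundaries and residential building footprints, and ties between Datazones are broken arbitrarily by Datazone code because the text does not specify a rule. The unit postcode step is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Point:
    user: str             # hashed user identifier
    time: datetime        # local timestamp of the datapoint
    accuracy_m: float     # reported location accuracy in metres
    datazone: str         # Datazone containing the point (assumed pre-joined)
    in_residential: bool  # inside a residential or mixed-residential footprint


def evening_of(t: datetime) -> Optional[date]:
    """Return the date on which the 20.00 to 06.00 evening containing t began."""
    if t.hour >= 20:
        return t.date()
    if t.hour < 6:
        # Early-morning points belong to the evening that began the day before.
        return (t - timedelta(days=1)).date()
    return None  # daytime point, not used


def home_datazones(points: Iterable[Point], use_land_use: bool,
                   max_accuracy_m: float = 100.0) -> Dict[str, str]:
    """Method 1 if use_land_use is True, Method 2 otherwise."""
    evenings = defaultdict(set)  # (user, datazone) -> set of active evenings
    for p in points:
        evening = evening_of(p.time)
        if evening is None or p.accuracy_m > max_accuracy_m:
            continue
        if use_land_use and not p.in_residential:
            continue
        evenings[(p.user, p.datazone)].add(evening)

    best = {}  # user -> (number of active evenings, datazone)
    for (user, datazone), days in evenings.items():
        candidate = (len(days), datazone)
        if user not in best or candidate > best[user]:
            best[user] = candidate
    return {user: datazone for user, (_, datazone) in best.items()}


# Hypothetical user: three evenings at a non-residential site "W",
# two evenings inside a residential building in Datazone "H".
points = [
    Point("u", datetime(2020, 1, 1, 22, 0), 10, "W", False),
    Point("u", datetime(2020, 1, 2, 23, 0), 10, "W", False),
    Point("u", datetime(2020, 1, 3, 21, 0), 10, "W", False),
    Point("u", datetime(2020, 1, 4, 1, 0), 10, "H", True),
    Point("u", datetime(2020, 1, 5, 2, 0), 10, "H", True),
    Point("u", datetime(2020, 1, 5, 12, 0), 10, "H", True),   # daytime, ignored
    Point("u", datetime(2020, 1, 6, 22, 0), 500, "H", True),  # poor accuracy, ignored
]
```

In this toy example, `home_datazones(points, use_land_use=True)` assigns the user to "H", whereas `home_datazones(points, use_land_use=False)` assigns the night-time workplace "W", which is the kind of misclassification the land use restriction is intended to reduce.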

2.5. Comparing the representativeness of mobile phone application data

The results from section 2.4 allow us to allocate each unique MP user
to a Council, Intermediate Zone, Datazone, and unit postcode. Using
these we can assign MP users to a SIMD deprivation quintile and
percentile (from Datazone), as well as a CACI Acorn category, group,
and type (from postcode). We assess representativeness in three ways.
For the geographic distribution, we focus initially on the eight council
areas but later report the distribution across the 417 Intermediate Zones.
For variations by deprivation status, we initially examine the distribu-
tion across the five quintiles of the SIMD index but later present results
at the percentile level. Lastly, for variations by socio-demographic sta-
tus, we initially examine the distribution across CACI’s six broadest
categories but later use the 18 groups and the 62 types. Following these
comparisons with the population, we make a further comparison with the
sample of travel diary respondents in the SHS. We make comparisons
across the eight councils using the measure of SIMD deprivation quin-
tile, the latter being the finest spatial disaggregation available on the
publicly-available SHS files.
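
Each of the comparisons described above reduces to contrasting the share of the sample in a category with the share of the population in the same category. A minimal sketch follows; the function names `shares` and `share_gaps` and the toy counts are hypothetical and are not results from the study.

```python
from typing import Dict, Hashable


def shares(counts: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Convert counts per category into percentage shares of the total."""
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in counts.items()}


def share_gaps(sample_counts: Dict[Hashable, float],
               population_counts: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Sample share minus population share, in percentage points.

    Positive values indicate over-representation in the sample and
    negative values indicate under-representation.
    """
    sample = shares(sample_counts)
    population = shares(population_counts)
    return {k: sample.get(k, 0.0) - population[k] for k in population}


# Toy counts by deprivation band (illustrative only).
toy_sample = {"Q1": 30, "Q2": 20, "Q3": 50}
toy_population = {"Q1": 400, "Q2": 200, "Q3": 400}
toy_gaps = share_gaps(toy_sample, toy_population)
```

The same function can be applied unchanged to councils, Intermediate Zones, SIMD quintiles or percentiles, and CACI categories, groups or types, since only the category labels differ.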

2.6. Transparency and reproducibility

The MP datasets used in this analysis can be accessed for research
purposes by application to the Urban Big Data Centre, an Economic and
Social Research Council funded research centre and national data ser-
vice based at the University of Glasgow. Datazone and higher
geographic boundaries are available under Open Government Licence
(http://spatialdata.gov.scot/). Postcode boundaries are freely available
from the Scottish Postcode Directory (National Records of Scotland, n.
d.) under the ‘Public Sector Geospatial Agreement’ which covers
non-commercial use of the data. SIMD data is available under Open
Government licence. CACI data are accessed here under a licence agreed
with CACI for this particular study. Geomni’s UKBuildings layer (Digital
Map Data © The GeoInformation Group Limited (2022), created and
maintained by Geomni, a Verisk company) is accessed under a general
academic license via Digimap (https://digimap.edina.ac.uk/). SHS data
are accessed through the UK Data Service under their standard End User
Licence (Scottish Government & Ipsos MORI, 2021). All analysis is
completed using a combination of PostgreSQL and R programming
language (R Core Team, 2022). The code to process the data and estimate
home location is openly available on GitHub
(https://github.com/sinclairmichael/appliedgeography_representativeness.git).

Fig. 1. Glasgow City-region, council areas and built-up areas
Residential and mixed residential buildings are from Geomni’s UKBuildings layer which is created and maintained by Geomni, a Verisk company (see section on
data sources).

Table 1
Summary of mobile phone application data collections in the study area.

Provider  Measure                   2019     2020     2021
Huq       Unique users              19,399   29,741   25,233
          Datapoints (millions)     21.9     161.8    346.8
          Mean datapoints per user  1129     5440     13,744
Tamoco    Unique users              81,203   85,258   81,136
          Datapoints (millions)     442.5    808.1    471.8
          Mean datapoints per user  5449     9478     5814

Notes: Unique users are based on the number of unique hashed identifiers active
in a given year. The number of Tamoco users is based on the monthly average for
each year since identifiers are reset monthly.
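
The two derived measures reported in Table 1 can be illustrated with a short sketch. The function names and the toy records are hypothetical, and the treatment of months with no records (they are simply not counted) is an assumption, since the notes do not say how such months are handled.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def monthly_average_unique_users(records: Iterable[Tuple[str, str]]) -> float:
    """Average number of distinct hashed IDs per month.

    records holds (hashed_id, "YYYY-MM") pairs for one year. This suits a
    provider whose identifiers are reset monthly, where IDs cannot be
    de-duplicated across months.
    """
    ids_by_month = defaultdict(set)
    for hashed_id, month in records:
        ids_by_month[month].add(hashed_id)
    return sum(len(ids) for ids in ids_by_month.values()) / len(ids_by_month)


def mean_datapoints_per_user(datapoints_millions: float, unique_users: int) -> float:
    """Mean datapoints per user from a total in millions and a user count."""
    return datapoints_millions * 1_000_000 / unique_users


# Toy records: two distinct IDs in January, one in February.
toy_records = [("a", "2020-01"), ("b", "2020-01"), ("a", "2020-01"), ("c", "2020-02")]
```

As a check against Table 1, `mean_datapoints_per_user(21.9, 19399)` rounds to 1129, the value reported for Huq in 2019.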


3. Results

3.1. Estimated home locations of mobile phone application data

Results on home location detection for both methods are presented in
Table 2. Applying the more restrictive home detection approach which
incorporates residential land use data (Method 1), we allocate between
20% and 37% of users with a home location across the years. These users
are responsible for generating over 75% of the annual data in all but one
case (Tamoco in 2019). The more basic and less restrictive home
detection approach (Method 2) leads to a larger number of home location
estimates, covering between 37% and 51% of users. These users are responsible for
generating between 88% and 98% of the data in a given year. In both
cases, the algorithms tend to cut out the long ‘tail’ of users who generate
relatively few datapoints. These users may be relatively inactive or make
infrequent use of the relevant apps, or they may be occasional visitors
from outside the region. Overall, a greater proportion of Huq users are
assigned a home location than Tamoco. This may be due to the persistent
hashed identifier for Huq which provides a picture of movements for
each user over a wider time period whereas data for Tamoco users can
only be linked over one month.

3.2. Representativeness of mobile phone application data

Using our home detection approach which includes residential land
use data (Method 1), Figs. 2 and 3 show three measures of sample
representativeness for the Huq and Tamoco datasets respectively. The
Figures show the distribution geographically by council area (N = 8), by
deprivation quintile (N = 5) and by broad CACI Acorn category (N = 6).
Overall, the Figures show a very close fit between the distribution of
both samples and the population on each measure. There is some vari-
ation from year to year but also a great deal of consistency. In terms of
council areas, the sample share is almost always within 2.5% of the
population share. The exceptions are for Glasgow City where there is
some under-representation across both datasets in one or two years,
although proportionately this is still a modest deviation given Glasgow
is home to more than a third of the city-region population.

A concern that these new forms of data may fail to adequately capture
poorer groups does not appear justified. With both datasets, there is
both under- and over-representation of the most deprived areas (quintile
1) of the SIMD, and the same is true of the ‘Urban Adversity’ category from
CACI, which will cover similar population groups. Where the Tamoco
sample shows a shift towards more affluent areas over time, the opposite
trend is observed with Huq. Both MPA datasets have a very small
number of users in locations identified by CACI as ‘Not Private House-
holds’. These are locations with primarily non-residential uses such as
retail, industry, or transport infrastructure. Despite the label, they are
home to 1.1% of the population and a similar proportion of the two
samples.

Using the CACI classification, we can make comparisons at a finer
level based on 18 groups, focusing on 2020 only for clarity where there
is a direct comparison to population data (Fig. 4). As before, we find a
very good fit to the population across both data sources. While we might
be concerned in advance about risks of under-representation of less
affluent groups and over-representation of more affluent groups, there is
little evidence from our analysis to support this, with the possible

Box 1
Summary of home location detection techniques

Method 1: Using activity heuristics and land use to estimate home location.

Each user’s datapoints (see Table 1) are first limited to those with an accuracy reported as 100m or better. Their data is then further limited to
that which falls within the footprint of a building identified as having residential or mixed-residential use. Using this subset, the home location
for each user is estimated as the Datazone where they record the maximum number of active evenings in the study area during the time period41,
where an evening is considered between 20.00 and 06.00 h. For Huq, the home location is estimated using data across each calendar year. For
Tamoco, home locations have to be estimated monthly because the user ID is re-hashed each month (see section on mobile data sources). To
identify a unit postcode to the home location (which is important to assign the private socio-demographic data to the user), we take the set of
night-time residential datapoints within the home Datazone and identify the postcode which contains the average (geographic centroid).

Method 2: Using activity heuristics only to estimate home location.

As with Method 1, each user’s datapoints are limited to those with a reported accuracy of 100 m or better. This method does not apply the
additional restriction of being within a building of residential or mixed-residential use. For each unique user, as previously, the home location is
the Datazone where they record the most active evenings in the study area during the time period (see footnote 4), where an evening is considered to run from
20:00 to 06:00. Home locations are estimated yearly for Huq and monthly for Tamoco, as with Method 1. A unit postcode is assigned in the same way
as with Method 1, based on the average location (geographic centroid) of the night-time datapoints within the home Datazone.
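
The two heuristics in Box 1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation (which uses PostgreSQL and R). The record fields (`user`, `ts`, `accuracy_m`, `zone`, `residential`), the function names and the sample points are all hypothetical, and treating 06:00 exactly as falling outside the evening window is an assumption. The postcode assignment step is omitted.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def evening_id(ts):
    """Return the date on which the evening containing ts began, or None.

    An evening is assumed to run from 20:00 up to (not including) 06:00.
    """
    if ts.hour >= 20:
        return ts.date()
    if ts.hour < 6:
        return (ts - timedelta(days=1)).date()
    return None


def estimate_home_zones(points, use_land_use=True, max_accuracy_m=100, min_evenings=2):
    """Estimate one home zone per user from evening activity.

    use_land_use=True mimics Method 1 (residential buildings only);
    use_land_use=False mimics Method 2 (activity heuristics only).
    """
    evenings = defaultdict(set)  # (user, zone) -> distinct active evenings
    for p in points:
        if p["accuracy_m"] > max_accuracy_m:
            continue  # drop low-accuracy fixes
        if use_land_use and not p["residential"]:
            continue  # Method 1 keeps residential points only
        ev = evening_id(p["ts"])
        if ev is not None:
            evenings[(p["user"], p["zone"])].add(ev)

    per_user = defaultdict(Counter)
    for (user, zone), evs in evenings.items():
        per_user[user][zone] = len(evs)

    homes = {}
    for user, counts in per_user.items():
        ranked = counts.most_common(2)
        best_zone, best_n = ranked[0]
        if best_n < min_evenings:
            continue  # footnote 4: at least two active evenings required
        if len(ranked) > 1 and ranked[1][1] == best_n:
            continue  # footnote 4: ties leave the user unassigned
        homes[user] = best_zone
    return homes


def _pt(user, ts, zone, residential, accuracy_m=20):
    return {"user": user, "ts": datetime.fromisoformat(ts), "zone": zone,
            "residential": residential, "accuracy_m": accuracy_m}


# Hypothetical data: user "a" sleeps in Z1 but spends more evenings in Z2,
# a non-residential location; user "b" has only one active evening.
points = [
    _pt("a", "2020-03-02 21:00", "Z1", True),
    _pt("a", "2020-03-03 01:00", "Z1", True),   # same evening as the row above
    _pt("a", "2020-03-03 22:30", "Z1", True),
    _pt("a", "2020-03-04 20:15", "Z2", False),
    _pt("a", "2020-03-05 20:15", "Z2", False),
    _pt("a", "2020-03-06 23:00", "Z2", False),
    _pt("a", "2020-03-07 12:00", "Z1", True),   # daytime, ignored
    _pt("a", "2020-03-08 21:00", "Z1", True, accuracy_m=250),  # too inaccurate
    _pt("b", "2020-03-02 22:00", "Z3", True),
]

method1 = estimate_home_zones(points, use_land_use=True)
method2 = estimate_home_zones(points, use_land_use=False)
```

With these made-up points, the land use restriction places user "a" in Z1, whereas the activity-only heuristic places the same user in the non-residential zone Z2, which is the kind of misattribution the residential filter is intended to reduce.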

Table 2
Number of users assigned home location by algorithms without and with land use.

Users and datapoints assigned with home location, by method and year. Method 1: using activity heuristics and land use. Method 2: using activity heuristics only.

| Metric | Dataset | Method 1, 2019 | Method 1, 2020 | Method 1, 2021 | Method 2, 2019 | Method 2, 2020 | Method 2, 2021 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Unique users | Huq | 4,633 (24.0%) | 11,079 (37.3%) | 9,165 (36.3%) | 7,204 (37.1%) | 15,135 (50.9%) | 12,495 (49.5%) |
| Unique users | Tamoco | 21,031 (25.9%) | 20,355 (23.9%) | 16,586 (20.4%) | 36,025 (44.4%) | 34,693 (40.7%) | 29,682 (36.6%) |
| Datapoints (millions) | Huq | 18.8 (85.7%) | 151.7 (93.7%) | 331.0 (95.4%) | 20.9 (91.8%) | 161.8 (97.7%) | 346.8 (98.4%) |
| Datapoints (millions) | Tamoco | 253.1 (57.2%) | 659.9 (81.7%) | 366.8 (77.8%) | 389.3 (88.0%) | 749.0 (92.7%) | 410.3 (87.0%) |

The number of unique users for Tamoco is based on the monthly average for each year since identifiers are reset each month. Values in parentheses are percentages of
the totals shown in Table 1. See Methods section for further details.

exception of the Tamoco coverage of the groups ‘Executive Wealth’ and
‘Difficult Circumstance’. On the contrary, the groups ‘Modest Means’
and ‘Striving Families’ are slightly over-represented in both datasets,
though differences are modest. Another group for which we have particular
concerns about under-representation in MP data is the elderly
population, but we do not find evidence of that here. The proportions of our
sample in the ‘Mature Money’ and ‘Comfortable Seniors’ groups are very close
to those for the population in both datasets. Areas home to groups that are
both older and poorer (‘Poorer Pensioners’) are under-represented
in both datasets, but only slightly; they make up 8.6% of
Huq users and 8.8% of Tamoco users compared with 9.9% of the
population.

Taking the geographic and socio-demographic comparison to a finer
level of disaggregation still, we switch to summarising results using the
correlation between the estimated number of MPA users in an area and
the (adult) population of that area. Table 3 (top panel) shows results for
each dataset for each year, and for: sub-council area geographies (417
Intermediate Zones); percentiles of deprivation on the SIMD (100
groups); and the CACI Acorn types (62 groups). The correlations for each
level are very similar across years and between datasets. For the
geographic comparison across Intermediate Zones, there are moderate
to strong correlations for both Huq (0.58–0.63) and Tamoco
(0.57–0.69). For the comparisons by socio-demographic status, we see
very strong correlations across both datasets and all three years of
analysis with only one correlation below 0.9 (for Huq in 2019 when it
had a far smaller sample size).
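
As a rough illustration of the area-level comparison described above, Pearson's correlation coefficient can be computed between the estimated number of users and the population of each area. The sketch below uses only the Python standard library; the function name `pearson` and the five area counts are hypothetical and are not taken from the study's data.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical adult population and estimated MPA users for five areas.
population = [4200, 5100, 3900, 6100, 4800]
users = [40, 55, 35, 70, 50]

r = pearson(users, population)
```

The same function could be applied to counts grouped by SIMD percentile or CACI type rather than by geographic area, since only the grouping of the counts changes.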

Table 3 (lower panel) takes the analysis a stage further, examining
correlations for the two socio-demographic measures within individual
councils (n = 8), for each dataset and each year. Again, the picture is of a
high degree of representativeness. In 2020, the correlations with SIMD
percentiles within councils range between 0.82 and 0.97 for Huq and
0.87–0.98 for Tamoco. The correlations with CACI types within councils
range between 0.95 and 0.99 for Huq and 0.93–0.98 for Tamoco. For
both datasets, the highest correlations are found in the largest council,
Glasgow City, while the lowest are found in the smallest, Inverclyde.

3.3. Comparing home detection algorithms

We compare our approach (Method 1) with the basic home detection
algorithm (Method 2) by looking at how the sample distributions from
each approach compare with the population distribution. One clear
point of difference is in relation to the proportion of each sample esti-
mated to live in locations identified by CACI as ‘Not Private Households’
(i.e., places where the dominant land use is not residential). Method 2
estimates that 6.3% of Huq users and 5.9% of Tamoco users lived in
these locations in 2020, compared with just 1.1% of the population,
presumably because people who work, socialise or travel in these
locations over a number of evenings are misattributed as living there.
Method 1, by restricting the data to points within buildings with a res-
idential use, obtains a proportion much closer to the expected propor-
tion (1.5% and 1.0% respectively), as we discuss in the previous
subsection.

More broadly, we can examine how correlations between the sam-
ples and the population change when moving from the basic (Method 2)
to the refined approach (Method 1), again focusing on 2020 (Fig. 5).
Here we show changes in the correlation between samples and the
population comparing Method 2 with Method 1. We show this for In-
termediate Zones (geographic), percentiles (SIMD), and types (CACI). A
positive difference indicates that Method 1 (our approach) yields a higher
correlation, i.e., a sample distribution more similar to that of the
population. A higher correlation does not, of course, prove that the more
restrictive approach is more accurate at identifying the true home
locations. We argue, however, that a closer correlation, combined with the
specific improvement in relation to ‘Not Private Household’ locations, is
strongly suggestive of better accuracy. The figures show a higher
correlation in every case when using our approach (Method 1). In gen-
eral, the increases in correlations are greater with the Huq dataset. It is
not clear why this should be the case although it may reflect the mix of
apps from which each company is gathering data; Huq may draw on a
larger proportion related to transport or mobility, for example, leading
the basic algorithm to mis-allocate a greater proportion.

3.4. Comparing mobile phone app samples with household survey samples

To assess the potential bias in MPA data relative to traditional survey
data, we compare the distribution of the two MPA samples with the
distribution of the sample for the main Scottish survey which captures
trip data, the SHS (Fig. 6). More details on this survey can be found in
section 2.3. Neither sample is weighted here. Overall, the MPA samples
have a better coverage of the adult population than the (unweighted)
SHS sample, and with a much larger sample size. The mean absolute
error of SIMD quintile groups across the eight councils is 1.8% for Huq
and 1.3% for Tamoco compared with 2.1% for the SHS. One or both
MPA datasets outperform the SHS in seven of the eight councils. West
Dunbartonshire is the only council where the SHS is the most
representative, although the difference is marginal. Unlike the SHS, MPA data
are not skewed to smaller councils nor to smaller households and, while
we cannot quantify any bias in the MPA data by age, comparisons across
the CACI categories in the previous section (Fig. 4) suggest that they do not
have the kind of underlying imbalance found in the SHS. With the SHS, as with
other household survey data where characteristics such as age and sex of
respondents are known, weights can be applied to produce a sample
which resembles the Scottish population’s age-sex distribution by
council although this does not of course remove all bias from uneven
response rates. With the MPA data, we can only apply weights by
geographic area (e.g., council) and/or by socio-demographic status of
areas (e.g., SIMD quintile). Based on the results here, however, the
weights for MPA data may only need to make small adjustments. Of
course, survey data have many other strengths, containing personal
characteristics along with information on trip purposes and modes, for
example. Nevertheless, the combination of greater representativeness
with scale and their spatial and temporal detail gives these MPA data
great advantages for many applications.
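
The two quantities discussed in this subsection, the mean absolute error of group shares and area-level weights, can be sketched as follows. This is a minimal illustration under stated assumptions: the exact way the paper averages errors across councils is not specified here, so the sketch simply averages absolute differences in shares across groups, and the weights are plain ratios of population share to sample share. All function names and counts are hypothetical.

```python
def shares(counts):
    """Convert a mapping of group -> count into group -> share of the total."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def mean_absolute_error_pct(sample_counts, population_counts):
    """Mean absolute difference between sample and population shares,
    expressed in percentage points and averaged over population groups."""
    s = shares(sample_counts)
    p = shares(population_counts)
    return 100 * sum(abs(s.get(g, 0.0) - p[g]) for g in p) / len(p)


def area_weights(sample_counts, population_counts):
    """Weight per group = population share / sample share.

    Groups with no sample members cannot be weighted and are skipped.
    """
    s = shares(sample_counts)
    p = shares(population_counts)
    return {g: p[g] / s[g] for g in p if s.get(g, 0.0) > 0}


# Hypothetical counts by SIMD quintile for one council.
population_counts = {"Q1": 300, "Q2": 250, "Q3": 200, "Q4": 150, "Q5": 100}
sample_counts = {"Q1": 28, "Q2": 26, "Q3": 20, "Q4": 15, "Q5": 11}

mae = mean_absolute_error_pct(sample_counts, population_counts)
weights = area_weights(sample_counts, population_counts)
```

When sample shares are already close to population shares, as in this made-up example, every weight is close to 1, which is the sense in which weights may only need to make small adjustments.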

4. Discussion

This research offers a novel home location detection methodology for
MPA data and demonstrates a more comprehensive approach to
assessing representativeness across multiple socio-demographic di-
mensions and spatial scales. We demonstrate the value of our approach
in assessing the socio-demographic representativeness of major MPA
data collections from two independent providers over an extensive time
period and for a major city-region. Our advancement in home location
detection using residential land use information improves the fit across
both data providers over all three years at the city-region level, and with
little loss in data volumes. Using public and private measures of socio-
demographic status, we find a good fit between two MPA datasets and
the population. While findings are specific to the datasets examined and
the spatio-temporal context, our approach provides a valuable bench-
mark for future assessments as well as further strong encouragement for
the use of MPA data in social science research and policy applications.

As Grantz et al. (2020) note, researchers must pay constant attention
to the potential biases in MP data from any source. Given that these data
are necessarily de-identified to protect privacy, it will always be difficult
to tackle the question of representativeness at the individual level.
Limited information may be sought on individual user characteristics

4 Home locations are only assigned when users record two or more active
evenings in the Datazones in that period. In cases where more than one po-
tential home location is returned with an equal number of active evenings, the
user is not assigned a home location and removed from further analysis.

Fig. 2. Geographic and socio-demographic comparison of the Huq mobile population across the Glasgow City-region based on the activity heuristics with land use
home detection method.
Colored bars represent the proportion of MPA users in a given year in a category while black bars show the benchmark population. Geographic and public socio-
demographic results are based on the adult population in 2020 while the private socio-demographic comparison is based on the total population in 2020 (adult
population data is not available at the level used for CACI data). Labels are percentage values (population share or, for MPA data, deviation from this). Quintiles are
based on national data and Glasgow city-region has an over-representation of more deprived Datazones. See appendices S1-6 for details on SIMD/CACI popula-
tion data.

Fig. 3. Geographic and socio-demographic comparison of the Tamoco mobile population across the Glasgow City-region based on the activity heuristics with land
use home detection method.
Colored bars represent the proportion of MPA users in a given year in a category while black bars show the benchmark population. Geographic and public socio-
demographic results are based on the adult population in 2020 while the private socio-demographic comparison is based on the total population in 2020 (adult
population data is not available at the level used for CACI data). Labels are percentage values (population share or, for MPA data, deviation from this). See appendices
S1-6 for details on SIMD/CACI population data.

but this increases disclosure risks. The best approach therefore appears
to be exploration of spatial coverage on the basis of estimated home
locations with comparison to ‘ground truth’ sources such as official
population statistics (Berke et al., 2022; Huang et al., 2022; Wang et al.,
2019) as well as area-level measures of socio-demographic status (Ber-
nabeu-Bautista et al., 2021; Huang et al., 2020).

Making judgements on the basis of area-level assessments does risk
an ecological fallacy as we cannot claim to know the individual user
characteristics from the area where they live. It is possible, for example,
that we obtain good coverage of a ‘poor’ neighbourhood by capturing
(atypical) ‘richer’ residents living within these locations. However,
demonstrating even coverage across the spectrum of socio-demographic
areas greatly reduces the potential for such an error. To obtain equal
coverage in richer and poorer neighbourhoods without also having
equal coverage of richer and poorer individuals, we would have to posit
a mechanism whereby the richer residents in poorer neighbourhoods
were more likely to be captured than richer residents in other kinds of
place. Such a mechanism is complex and unwarranted so, following the
principle of “Occam’s razor” and taking the simpler explanation, we can
reasonably conclude that good geographic coverage is a strong indica-
tion of good population coverage.

This study is the first to incorporate high resolution residential land
use data into the process of individual home location detection for MPA
data. Unlike more traditional forms of MP data such as call detail re-
cords, which are usually provided at a more aggregated level, or MP data
from social media which are more fragmented, the high accuracy of
MPA data makes the incorporation of building level residential data
possible. Here we find that estimating home area by restricting data to
that within buildings with a residential use has a positive impact on the
socio-demographic fit of the data. By excluding locations related to

Fig. 4. Huq and Tamoco by the 18 CACI groups in 2020.
Comparison is to the total population in 2020 (adult population data is not available at the level used for CACI data). Labels are percentage of the total population for
each data source.

work, travel and leisure, our approach is more selective in estimating the
home area over a standard technique. While this improvement comes
with the minor caveat that fewer data points are attributed with a home
area, the better socio-demographic fit offers an improved foundation
from which to utilise the enriched data for analysis.

We demonstrate strong representativeness for two samples but only
for one city-region and one period of three years. With these kinds of
data, representativeness will always need to be assessed on an on-going
basis and our methodology is intended to help towards this end.
Nevertheless, while we do not assume that the results in our case study
area necessarily hold in other areas, there is also no reason to suggest
that the Glasgow City Region, with its diverse socio-demographic
composition, should return results which are exceptional. Therefore,
the positive findings in relation to representativeness in this analysis
should act as an encouragement for further research in the UK and
potentially further afield. In this context, Huq currently offers data
across the UK and Tamoco across the UK and the United States of
America. Companies with a similar offering to the data assessed here are
available internationally.

With survey response rates declining in recent years, new forms of
MPA data may offer increasing potential as a complementary or alternative
data source. In our case, the MPA data from two independent
providers offer better coverage of the adult population than the unweighted
data from the ‘gold standard’ household survey while also
providing much greater sample size and coverage over multiple days
(rather than the single day in the survey). Granted, more sophisticated
weights can be applied to survey data to reduce existing biases. Never-
theless, starting from a more unrepresentative sample, these survey
weights must do more work to compensate for sample limitations so the
scope for bias to remain is that much greater. We do not suggest that
MPA data are always better or that they offer a complete replacement for
survey data; they cannot replace the detail on individual characteristics
and attitudes which can be obtained during a survey, for example.
Nevertheless, we would emphasise that concerns about potential bias in
population coverage may have been greatly over-stated.

Despite the potential, there are two major factors which may still
limit wider use of novel MPA data. The first is ethics and the regulatory
requirements designed to protect individual privacy. There are clear
ethical risks to privacy from location-based data with the kinds of spatio-
temporal detail outlined here and regulators, at least in the UK, have
been scrutinising how intermediaries like Huq and Tamoco handle these
(Wakefield, 2021). While researchers’ data management approaches
need to pay equally careful attention to these, as with the current study,
there is clear evidence that social science research use of MPA data does

Table 3
Correlation between MPA users and population across the Glasgow city-region and within councils by geography and socio-
demographic status.

Results show Pearson’s correlation coefficients, which are all significant at the 0.01 level. Results are based on method 1 (see
methods). Geographic results are based on Intermediate Zones. SIMD results are based on Datazones grouped into percentiles.
CACI Acorn results are based on postcodes grouped into ‘types’. Geographic and SIMD comparisons are with the adult population
in 2020 while the CACI comparison is with the total population in 2020. Tamoco results are the correlation with the mean estimated
monthly users in a given year while Huq are based on the total users for a given year. Not all CACI types and SIMD percentiles
are present in each council region, hence variations in the number of spatial units (n).

Fig. 5. Difference in correlation coefficients between number of MPA users and population across Glasgow City-region when using the land use algorithm.
Results based on Pearson’s correlation coefficients for the relevant sample with 2020 populations. The geographic comparison is based on Intermediate Zones (417
regions). For SIMD, percentiles are used. For CACI, the 62 types are used. Tamoco results are based on the mean monthly users for a given year.

Fig. 6. Socio-demographic coverage of MPA data sample sizes compared with unweighted Scottish Household Survey samples across the eight councils of Glasgow
City-region.
Population is the adult population in 2020. Data for the Scottish Household survey (SHS) are for 2019, the latest available. Results for both MPA datasets are based on
our home location algorithm (Method 1) using 2020 data. Labels in the top 8 charts are percentage shares for the population figure and deviations from this for the
others. Figures in the bottom panel show the MPA samples in 2020 and the number of SHS adult surveys in 2019 by council area. Tamoco figures are based on the
mean monthly users in each case.

command widespread public support (see footnote 5). Researchers need to work with
policy makers and regulators to ensure an appropriate balance is struck
between individual rights to privacy and support for research with po-
tential for public benefit. The second is cost, in terms of both license
costs and data maintenance/processing costs. Although the details of
licence agreements are usually confidential, it is commonly known that
MP data in their various forms can be expensive to acquire (Lazer,
2006). This risks creating differentials in access, disadvantaging re-
searchers from less well-funded institutions or countries. Intermediary
data services like the UK’s Urban Big Data Centre have a key role to play
in relation to both, using central funding to secure licences which permit
widespread use by other researchers at no further cost; and providing
secure research facilities to reassure data owners that control will be
maintained over the data and access limited to approved researchers and
projects. Continued funding for such services will be essential to ensure
a level playing field going forwards.

Here, we demonstrate sample representativeness of MPA data but
future work should build on this to assess how well MPA data capture
the movements of their sample (Ranjan et al., 2012; Zhao et al., 2016).
With mobile data, location recording is not regular or scheduled, but
sporadic and related to specific events such as sending or receiving a
message (for call detail records) or using a specific app (for MPA data).
Such event-driven recording may lead the data to capture some types of activity more than others.
This may also vary between social groups so that, for example, while we
may have a proportionate number of older people covered by the
dataset, the ways in which they use apps may still lead to their move-
ments being under- or over-represented. The precise mix of apps on
which each company draws is commercially sensitive information and
therefore not publicly available, but this clearly shapes outcomes as well
as making them liable to change over time. Our findings on population
representativeness should be seen, therefore, as the first stage in a larger
process of evaluation.

Alongside the positive socio-demographic messages from this stage
of our work, there are some limitations to note. First, while our approach
should provide reassurance that we have good representativeness for
social differences which vary over space such as income, age, or
ethnicity, one area of representativeness this approach cannot cover is in
relation to sex. Since men and women are, in general, very evenly spread
geographically, it is not possible to pick up any under- or over-
representation by sex when using an approach based on geographic
variations. Given the widespread tendency for women to be under-
represented in statistical data (Criado-Perez, 2020), this is an impor-
tant priority for future work to address. Second, we can only identify
home locations within the study area covered by the data used here.
Residents from outside the city-region could be among the users for
whom the algorithm fails to generate a home location (and who are hence
omitted) or, if they spend even a few evenings in the city-region, they
might be misattributed to a home location here. Incorporating residen-
tial land use, as well as the limiting of results to those with multiple
evenings in residential space, should limit this error and should perform
better than the previous approaches reliant solely on timing. In future
work, we plan to explore this by extending the geographic scope of the
analysis.

5. Conclusion

New sources of mobile phone application data offer unprecedented
opportunities for applied scientific research. Given the novelty of these
new forms of spatial data and their relative infancy as a data source, the
future potential and direction of study are wide reaching. However, the
ability to realise this potential is somewhat limited by uncertainties
regarding sample representativeness. In this study, we have shown that
mobile phone application data from two independent providers have a
good fit to the population across public and private sources of socio-
demographic data for a large city region which is home to over 1.8
million people. Furthermore, incorporating residential land use data
into the process of home location detection for mobile phone data can
improve its socio-demographic fit. These findings are important for
future research as they present a technique which helps improve the fit
of mobile phone application data while also offering an empirical
foundation upon which to utilise these novel data sources for applied
research.

Author statement

Michael Sinclair (MS), Saeed Maadi (SM), Qunshan Zhao (QZ), Jinhyun
Hong (JH), Andrea Ghermandi (AG) and Nick Bailey (NB).
Conceptualization: MS, SM, QZ, JH, NB. Methodology: MS, QZ, JH, AG,
NB. Software: MS, AG. Validation: MS. Formal analysis: MS. Visualiza-
tion: MS. Writing – Original Draft: MS, QZ, JH, AG, NB. Writing – Review
& Editing: MS, QZ, AG, NB.

Data statement

The mobile phone datasets used in this analysis can be accessed for
research purposes by application to the Urban Big Data Centre, an
Economic and Social Research Council funded research centre and na-
tional data service based at the University of Glasgow. The analysis is
completed using a combination of PostgreSQL and the R programming
language (R Core Team, 2022). The code to process the data and estimate
home location is openly available on GitHub
(https://github.com/sinclairmichael/appliedgeography_representativeness.git).

Acknowledgments

The work was made possible by ESRC’s SDAI funding [ES/
W012979/1] and ESRC’s on-going support for the Urban Big Data
Centre (UBDC) [ES/L011921/1 and ES/S007105/1]. For the use of
SIMD data, Copyright Scottish Government, contains Ordnance Survey
data © Crown copyright and database right (2022). CACI data, ©
1979–2020 CACI Limited. This report shall be used solely for academic,
personal and/or non-commercial purposes. UKBuildings data is Digital
Map Data © The GeoInformation Group Limited (2022), created and
maintained by Geomni, a Verisk company. The authors want to thank
the anonymous reviewers for their insightful comments and suggestions
on an earlier version of this manuscript.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.apgeog.2023.102997.

Appendix 1. Geographic boundaries used in the analysis and/or reporting of results

5 https://www.gov.uk/government/publications/public-dialogue-on-location-data-ethics.

References

Berke, A., Doorley, R., Alonso, L., Arroyo, V., Pons, M., & Larson, K. (2022). Using mobile
phone data to estimate dynamic population changes and improve the understanding
of a pandemic: A case study in Andorra. PLoS One, 17, Article e0264860.
https://doi.org/10.1371/journal.pone.0264860

Bernabeu-Bautista, Á., Serrano-Estrada, L., Perez-Sanchez, V. R., & Martí, P. (2021). The
geography of social media data in urban areas: Representativeness and
complementarity. ISPRS International Journal of Geo-Information, 10, 747.
https://doi.org/10.3390/ijgi10110747

Bojic, I., Massaro, E., Belyi, A., Sobolevsky, S., & Ratti, C. (2015). Choosing the right
home location definition method for the given dataset. In T.-Y. Liu, C. N. Scollon, &
W. Zhu (Eds.), Social informatics, lecture notes in computer science (pp. 194–208).
Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-27433-1_14

boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations for a
cultural, technological, and scholarly phenomenon. Information, Communication &
Society, 15(5), 662–679.

Brick, J. M., & Williams, D. (2013). Explaining rising nonresponse rates in cross-sectional
surveys. The Annals of the American Academy of Political and Social Science, 645(1),
36–59.

Calabrese, F., Diao, M., Di Lorenzo, G., Ferreira, J., & Ratti, C. (2013). Understanding
individual mobility patterns from urban sensing data: A mobile phone trace example.
Transportation Research Part C: Emerging Technologies, 26, 301–313.
https://doi.org/10.1016/j.trc.2012.09.009

Calafiore, A., Murage, N., Nasuto, A., & Rowe, F. (2021). Deriving spatio-temporal
geographies of human mobility from GPS traces. Spat. Data Sci. Symp. 2021 Online.
https://doi.org/10.25436/E26K5F

Cameron, R. W. F., Brindley, P., Mears, M., et al. (2020). Where the wild things are! Do
urban green spaces with greater avian biodiversity promote more positive emotions
in humans? Urban Ecosystems, 23, 301–317.
https://doi.org/10.1007/s11252-020-00929-z

Çolak, S., Alexander, L. P., Alvim, B. G., Mehndiratta, S. R., & González, M. C. (2015).
Analyzing cell phone location data for urban travel: Current methods, limitations,
and opportunities. Transp. Res. Rec. J. Transp. Res. Board, 2526, 126–135.
https://doi.org/10.3141/2526-14

Criado-Perez, C. (2020). Invisible women: Exposing data bias in a world designed for men.
London: Vintage.

Gao, S., Rao, J., Kang, Y., Liang, Y., Kruse, J., Dopfer, D., Sethi, A. K., Mandujano
Reyes, J. F., Yandell, B. S., & Patz, J. A. (2020). Association of mobile phone location
data indications of travel and stay-at-home mandates with COVID-19 infection rates
in the US. JAMA Network Open, 3, Article e2020485.
https://doi.org/10.1001/jamanetworkopen.2020.20485

Grantz, K. H., Meredith, H. R., Cummings, D. A. T., Metcalf, C. J. E., Grenfell, B. T.,
Giles, J. R., Mehta, S., Solomon, S., Labrique, A., Kishore, N., Buckee, C. O., &
Wesolowski, A. (2020). The use of mobile phone data to inform analysis of COVID-
19 pandemic epidemiology. Nature Communications, 11, 4961.
https://doi.org/10.1038/s41467-020-18190-5

Guo, S., Song, C., Pei, T., Liu, Y., Ma, T., Du, Y., Chen, J., Fan, Z., Tang, X., Peng, Y., &
Wang, Y. (2019). Accessibility to urban parks for elderly residents: Perspectives from
mobile phone data. Landscape and Urban Planning, 191, Article 103642.
https://doi.org/10.1016/j.landurbplan.2019.103642

Heo, S., Lim, C. C., & Bell, M. L. (2020). Relationships between Local Green Space and
Human Mobility Patterns during COVID-19 for Maryland and California, USA.
Sustainability, 12(22), 9401. https://doi.org/10.3390/su12229401

Huang, X., Li, Z., Lu, J., Wang, S., Wei, H., & Chen, B. (2020). Time-series clustering for
home dwell time during COVID-19: What can we learn from it? ISPRS International
Journal of Geo-Information, 9, 675. https://doi.org/10.3390/ijgi9110675

Huang, X., Lu, J., Gao, S., Wang, S., Liu, Z., & Wei, H. (2022). Staying at home is a
privilege: Evidence from fine-grained mobile phone location data in the United
States during the COVID-19 pandemic. Annals of the Association of American
Geographers, 112, 286–305. https://doi.org/10.1080/24694452.2021.1904819

Kang, Y., Gao, S., Liang, Y., Li, M., Rao, J., & Kruse, J. (2020). Multiscale dynamic human
mobility flow dataset in the U.S. during the COVID-19 epidemic. Scientific Data, 7,
390. https://doi.org/10.1038/s41597-020-00734-5

Kishore, N., Taylor, A. R., Jacob, P. E., Vembar, N., Cohen, T., Buckee, C. O., &
Menzies, N. A. (2022). Evaluating the reliability of mobility metrics from aggregated
mobile phone data as proxies for SARS-CoV-2 transmission in the USA: A population-
based study. Lancet Digit. Health, 4, e27–e36.
https://doi.org/10.1016/S2589-7500(21)00214-4

Lazer, D. (2006). Global and domestic governance: Modes of interdependence in
regulatory policymaking. European Law Journal, 12, 455–468.
https://doi.org/10.1111/j.1468-0386.2006.00327.x

Lee, K.-S., Eom, J. K., Lee, J., & Ko, S. (2021). Analysis of the activity and travel patterns
of the elderly using mobile phone-based hourly locational trajectory data: Case study
of Gangnam, Korea. Sustainability, 13, 3025. https://doi.org/10.3390/su13063025

Mao, H., Shuai, X., Ahn, Y.-Y., & Bollen, J. (2015). Quantifying socio-economic indicators
in developing countries from mobile phone communication data: Applications to
Côte d’Ivoire. EPJ Data Sci, 4, 15. https://doi.org/10.1140/epjds/s13688-015-0053-1

Marsh, C. (1982). The survey method: contribution of surveys to sociological explanation.
London: Allen & Unwin.

Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how
we live, work, and think. Houghton Mifflin Harcourt.

Mears, M., Brindley, P., Barrows, P., Richardson, M., & Maheswaran, R. (2021). Mapping
urban greenspace use from mobile phone GPS data. PLoS One, 16, Article e0248622.
https://doi.org/10.1371/journal.pone.0248622

Meyer, B. D., Mok, W. K., & Sullivan, J. X. (2015). Household surveys in crisis. Journal of
Economic Perspectives, 29(4), 199–226.

National Records of Scotland. (n.d.). National Records of Scotland | Preserving the past,
Recording the present, Informing the future. https://www.nrscotland.gov.uk/.

Pappalardo, L., Ferres, L., Sacasa, M., Cattuto, C., & Bravo, L. (2021). Evaluation of home
detection algorithms on mobile phone data using individual-level ground truth. EPJ
Data Sci, 10, 29. https://doi.org/10.1140/epjds/s13688-021-00284-9

Phithakkitnukoon, S., Smoreda, Z., & Olivier, P. (2012). Socio-geography of human
mobility: A study using longitudinal mobile phone data. PLoS One, 7, Article e39253.
https://doi.org/10.1371/journal.pone.0039253

Ranjan, G., Zang, H., Zhang, Z.-L., & Bolot, J. (2012). Are call detail records biased for
sampling human mobility? ACM SIGMOBILE – Mobile Computing and Communications
Review, 16, 33–44. https://doi.org/10.1145/2412096.2412101

R Core Team. (2022). R: A Language and Environment for Statistical Computing. Vienna,
Austria: R Foundation for Statistical Computing.

Ren, X., & Guan, C. (2022). Evaluating geographic and social inequity of urban parks in
Shanghai through mobile phone-derived human activities. Urban Forestry & Urban
Greening, 76, 127709.

Savage, M., & Burrows, R. (2007). The coming crisis of empirical sociology. Sociology, 41
(5), 885–899.

Scottish Government, Ipsos MORI. (2021). SHSScottish household survey. 1999-Scottish
Household Survey. https://doi.org/10.5255/UKDA-SN-8775-1, 2019.

Sinclair, M., Mayer, M., Woltering, M., & Ghermandi, A. (2020). Using social media to
estimate visitor provenance and patterns of recreation in Germany’s national parks.
Journal of Environmental Management, 263, Article 110418. https://doi.org/10.1016/
j.jenvman.2020.110418

Sinclair, M., Zhao, Q., Bailey, N., Maadi, S., & Hong, J. (2021). Understanding the use of
greenspace before and during the COVID-19 pandemic by using mobile phone app

data. GIScience 2021 Short Pap. In Proc. 11th int. Conf. Geogr. Inf. Sci. Sept. 27-30
2021. https://doi.org/10.25436/E2D59P. Poznań, Poland (Online).

Vanhoof, M., Reis, F., Ploetz, T., & Smoreda, Z. (2018). Assessing the quality of home
detection from mobile phone data for official statistics. J. Off. Stat., 34, 935–960.
https://doi.org/10.2478/jos-2018-0046

Wakefield, B. J. (2021, October 29). Location data collection firm admits privacy breach.
BBC News. https://www.bbc.co.uk/news/technology-59063766.

Wang, F., & Chen, C. (2018). On data processing required to derive mobility patterns
from passively-generated mobile phone data. Transportation Research Part C:
Emerging Technologies, 87, 58–74. https://doi.org/10.1016/j.trc.2017.12.003

Wang, Y., Li, J., Zhao, X., Feng, G., & Luo, X. (2020). Using mobile phone data for
emergency management: A systematic literature Review. Information Systems
Frontiers, 22, 1539–1559. https://doi.org/10.1007/s10796-020-10057-w

Wang, F., Wang, J., Cao, J., Chen, C., Ban, X., & Jeff). (2019). Extracting trips from multi-
sourced data for mobility pattern analysis: An app-based data example.
Transportation Research Part C: Emerging Technologies, 105, 183–202. https://doi.org/
10.1016/j.trc.2019.05.028

Yabe, T., Jones, N. K. W., Rao, P. S. C., Gonzalez, M. C., & Ukkusuri, S. V. (2022). Mobile
phone location data for disasters: A review from natural hazards and epidemics.
Computers, Environment and Urban Systems, 94, 101777.

Yabe, T., Tsubouchi, K., Fujiwara, N., Sekimoto, Y., & Ukkusuri, S. V. (2020).
Understanding post-disaster population recovery patterns. Journal of The Royal
Society Interface, 17, Article 20190532. https://doi.org/10.1098/rsif.2019.0532

Zhao, Z., Shaw, S.-L., Xu, Y., Lu, F., Chen, J., & Yin, L. (2016). Understanding the bias of
call detail records in human mobility research. International Journal of Geographical
Information Science, 30, 1738–1762. https://doi.org/10.1080/
13658816.2015.1137298


Assessing the socio-demographic representativeness of mobile phone application data

1 Introduction

2 Data and methods

2.1 Study area

2.2 Mobile phone application datasets

2.3 Other secondary data sources used in the analysis

2.4 Home location detection techniques

2.5 Comparing the representativeness of mobile phone application data

2.6 Transparency and reproducibility

3 Results

3.1 Estimated home locations of mobile phone application data

3.2 Representativeness of mobile phone application data

3.3 Comparing home detection algorithms

3.4 Comparing mobile phone app samples with household survey samples

4 Discussion

5 Conclusion

Author statement

Data statement

Acknowledgments

Appendix A Supplementary data

Appendix 1 Geographic boundaries used in the analysis and/or reporting of results

References

Sun et al. – 2022 – Understanding building energy efficiency with admi

Energy & Buildings 273 (2022) 112331


Understanding building energy efficiency with administrative and
emerging urban big data by deep learning in Glasgow

https://doi.org/10.1016/j.enbuild.2022.112331
0378-7788/© 2022 The Author(s). Published by Elsevier B.V.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

* Corresponding authors.
E-mail addresses: maoransu@mit.edu (M. Sun), changyu.han@geo.uzh.ch (C. Han),
zhangfan@mit.edu (F. Zhang), Qunshan.Zhao@glasgow.ac.uk (Q. Zhao).

Maoran Sun a, Changyu Han b, Quan Nie b, Jingying Xu b, Fan Zhang a,*, Qunshan Zhao b,*
a Senseable City Laboratory, MIT 9-216, 77 Massachusetts Avenue, Cambridge, MA 02139 USA
b Urban Big Data Centre, School of Social and Political Sciences, University of Glasgow, Glasgow G12 8RZ, United Kingdom

Article info

Article history:
Received 4 May 2022
Revised 8 July 2022
Accepted 21 July 2022
Available online 26 July 2022

Keywords:
Building energy efficiency
Energy performance certificate
Deep learning
Google street view
SHapley additive explanations

Abstract

With buildings consuming nearly 40% of energy in developed countries, it is important to accurately esti-
mate and understand the building energy efficiency in a city. A better understanding of building energy
efficiency is beneficial for reducing overall household energy use and providing guidance for future hous-
ing improvement and retrofit. In this research, we propose a deep learning-based multi-source data
fusion framework to estimate building energy efficiency. We consider the traditional factors associated
with the building energy efficiency from the Energy Performance Certificate (EPC) for 160,000 properties
(30,000 buildings) in Glasgow, UK (e.g., property structural attributes and morphological attributes), as
well as the Google Street View (GSV) building façade images as a complement. We compare the
performance of our data-fusion framework with models based only on traditional morphological
attributes and with image-only models. The results show that, by including the building façade
images from GSV, the overall model accuracy increases from 79.7% to 86.8%. A further investigation
and explanation of the deep learning model are conducted to understand the relationships between
building features and building energy
efficiency by using SHapley Additive exPlanations (SHAP). Our research demonstrates the potential of
using multi-source data in building energy efficiency prediction with high accuracy and short inference
time. Our paper also helps understand building energy efficiency at the city level to help achieve the net-
zero target by 2050.


1. Introduction

The emergency of climate change and global warming has been recognized globally in both the
Paris Agreement and the Glasgow Climate Pact [1]; 153 countries have collectively listed securing
net-zero emissions among the top missions of COP26 in Glasgow. With the building sector
accounting for nearly 40% of energy consumption in developed countries [2,3], understanding
buildings’ energy usage and improving their energy efficiency are critical for reducing overall
energy use [4]. In fact, accurately predicting and understanding a building’s energy efficiency is
not only important for the wider objectives of global carbon emission targets, but is also related
to individual homeowners’ decision-making on housing retrofit and improvement [5]. Successful
identification of households with low energy efficiency is also beneficial for eliminating fuel
poverty [6]. Numerous efforts have been devoted to predicting energy emissions [7,8], mapping
energy performance [9,10], and connecting energy with real estate markets [11].

Despite its importance, multiple challenges remain for current research methods. Traditional
research on energy analysis and estimation involves engineering calculation, simulation
model-based benchmarking and statistical modelling [12]. Many of the current methods rely on two
types of data that are not readily available or are difficult to obtain. Firstly, human behaviour
data, such as the number of occupants and the heating set point temperature, are often used in
current research. While energy consumption is largely affected by users’ behaviour, such data are
often hard to retrieve without the installation of smart meters or household surveys. Also, for
rented houses, the high turnover of tenants makes it difficult to survey behavioural patterns
frequently. Secondly, these studies often take energy consumption-related indicators into account,
such as CO2 emissions, walls’ solar absorptance, etc. [13]. The difficulty of obtaining such data,
and the extremely high correlation between these data and the prediction objectives, make them
unsuitable for more extensive energy efficiency prediction research.



In this study, we present a more precise and scalable framework for estimating building energy
efficiency ratings at the city scale by using new forms of urban big data and a deep learning
framework. The framework classifies properties’ energy efficiency ratings from building
morphology descriptions and street-level building image data. Morphological attributes have been
widely used in the prediction of building energy consumption and efficiency [14]. They provide
information about a property’s size, material and structural features, and possibly imply how the
property is used. The street-level building image data capture the façade of buildings and also
reflect energy-related information about the property. Architectural elements within the images,
such as windows, doors and balconies, are related to the style, age and structure of the
buildings. Moreover, studies have shown that street view images are able to reveal the
relationships between built environments and socioeconomic environments [15,16]. By combining
building morphology descriptions and street-level building images, this paper aims to achieve a
comprehensive understanding of building energy efficiency.

The contribution of this paper is twofold. First, we design a scalable multi-source data fusion
deep learning framework to predict building energy efficiency ratings from both building
morphology attributes and street-level imagery. The framework is able to perform property-level
estimation based on publicly available datasets. The openly available data, high accuracy and
fine scale of the method make the framework useful for real-world applications and extendable to
other study areas. The incorporation of image data within the framework further improves
prediction accuracy. Second, with the designed framework, we are able to understand the
influential factors for building energy efficiency through explainable AI techniques. The
framework is also able to explain how different building features contribute to the building
energy efficiency estimation. It also helps homeowners and policy makers to estimate the energy
efficiency before a renovation starts. The research results are valuable in providing suggestions
for the facilitation and execution of emission-reduction policies. Given the wide availability of
the data involved, the framework is also applicable to regions beyond the presented site.

2. Background

2.1. Methods for predicting building energy efficiency

In previous research, engineering calculation, simulation-based benchmarking, data-driven
statistical modelling, and artificial intelligence methods have been widely used in building
energy analysis, estimation, and benchmarking [12]. Engineering methodologies use physical laws
to assess building energy and can achieve extremely high accuracy, but they rely upon complex
system details, including mathematics and building dynamics, as well as all building components,
which are not conveniently available to the public over a large area. Simulation-based
benchmarking includes software and computer models that have complex details and can be used for
a variety of applications, but it can be very costly and time-consuming when a large number of
solutions need to be defined [12]. Current developments in computational methods and data make it
possible to use data-driven statistical models that are more efficient than engineering and
simulation-based methods [12]. Compared with other statistical models (multiple linear
regression, support vector machines, decision trees, etc.), artificial neural networks (ANNs)
have been favoured by researchers due to their reliable predictions and their ability to handle
the nonlinearity between the input and output of energy-related data [17,18].

Existing research mainly uses ANN-based methods to understand building energy usage and demand
[17]. [19] combined an ANN with a statistical method to quantify the impact of driving factors on
building energy use and found that the heating/cooling degree days, the building area, the room
number, and the window number are most related to the energy end-use per capita. [14] used the
Levenberg-Marquardt optimization algorithm to update the weights of the hidden neurons because of
its high speed, and obtained a higher prediction accuracy for the heat demand indicator, with
about 95% of entries falling within ± 3 confidence intervals. [20] developed ANN models to predict
primary energy consumption for space cooling/heating, and achieved a high accuracy of more than
95%. Although these previous studies have achieved high accuracy in the task of predicting energy
usage, limited research has been done to understand energy efficiency, which is developed by
well-established mathematical methods to help estimate and improve the efficient use of energy
[21]. [13] made an effort to verify the accuracy of energy performance certificates by refining
ANN models and defining a Neural Energy Performance Index; a small error was found in only 3.6%
of cases. [22] designed an ANN model for predicting heating and cooling loads instead of the
average energy efficiency of the building directly. How to better understand and predict energy
efficiency needs more exploration.

Previous research also usually needs to collect site-specific data such as human behaviour
factors (the number of occupants, the heating setpoint temperature, etc.) and energy
consumption-related indicators (CO2 emissions, walls’ solar absorptance, Global Energy
Performance Index, etc.). These site-specific factors are either uncontrollable in practice, so
that their value for guiding the retrofitting of green energy buildings is limited, or not
readily available. To extend our research to a larger study area and gain a holistic city-level
understanding of building energy efficiency, we do not use these factors. Instead, we choose only
abundant and easier-to-obtain characteristics from the EPC and street view images to describe the
building itself and its basic energy facilities, and use them to accurately estimate energy
efficiency. This approach makes it possible to assess energy efficiency at a city level when data
on energy consumption indicators are insufficient.

2.2. Using street-level imagery for building stock estimation

With the increasing coverage of GSV images and growing computational power, many studies have
been devoted to combining deep learning and GSV for building stock prediction. GSV is a new
source of large-scale urban data that has been widely used in many urban research fields, such as
urban planning and design [23,24], real estate [25], urban morphology [26,27], transportation and
mobility [28], socio-economic studies [29,30], and urban perception [31]. [32] reviewed
approximately 600 papers published between 2005 and 2020 that used street-level imagery as a
research data source, of which two-thirds used GSV as the data source. The widespread use of GSV
is mainly due to its large coverage all over the world (over 200 countries) and its standard data
quality.

The buildings in GSV are labelled with geographical location, style, age, façade material, volume
and scale. With these tags, computer vision algorithms can screen out recognizable image
information through various discriminant methods, thus analysing the city and its architectural
culture. Based on deep learning for GSV image feature extraction, studies on architecture have
developed from being limited to the study of the building itself (e.g., architectural style, age,
façade material, volume, and scale) to the study of the area containing the buildings. Some
studies have investigated certain characteristics of the areas in which buildings are located by
identifying multiple features of buildings and non-architectural factors (e.g., vegetation), such
as street space quality [33], urban aesthetics [34,35], continuity of street architecture [36],
urban canyon geometry [27], and urban architectural landscape characteristics [37,38]. The
relationship between buildings and energy has also received much scholarly attention based on the
application of GSV images and deep learning. Using GSV images of Victoria, Australia, [30]
estimated the year of construction of buildings and constructed a dataset of relevant attributes,
thus providing key information for the energy demand and retrofitting of buildings. [39] used GSV
and machine learning to predict building features relevant to energy retrofitting (i.e., building
type and suitability for additional façade insulation).

3. Study area and data

3.1. Study area

We take Glasgow city in Scotland as the study area. Fig. 1 presents the footprints of domestic
buildings with EPC data in our study area. According to the Köppen-Geiger classification, the
climate of Glasgow is “Cfb, Marine West Coast Climate” [40]. Indoor comfort will also be affected
by global warming [41]. Besides the climate change issues in Glasgow, its status as one of the
largest cities in the UK, the diversity of building styles, the historical city development, the
ambitious goal to achieve a net-zero target by 2045, and the well-organized public Scottish EPC
dataset make it an ideal site for building energy efficiency studies.

3.2. Data

This study incorporates multi-source data from the Scottish EPC
data [42], UKBuildings dataset from EDINA Geomni Digimap Ser-
vice [43], and GSV images for estimation of building energy
efficiency.

1 https://developers.google.com/maps/documentation/streetview/overview.

3.2.1. EPC data
In the UK, a building must have an EPC issued within the past 10 years when it is newly
constructed or is to be sold or rented, except in a very few special cases [44]. An EPC includes
information on the energy efficiency of buildings. It records specific information such as the
size and layout of the building, how it has been constructed and the way it is insulated, heated,
ventilated, and lit. Based on these records, the EPC evaluators use a UK government calculation
methodology to estimate the monthly energy usage and CO2 emissions of buildings and generate the
“energy efficiency rating” of the building from A to G, with A being the best, which helps to
indicate how much may need to be paid in fuel costs.

We select 168,410 EPC records for domestic buildings in Glasgow from October 2012 to March 2021.
The data are requested by setting the range of zip codes in Scotland, and the addresses of the
records (containing street name, street number and zip code) are converted into coordinates using
the Google Geocoding API. After filtering out the records whose coordinates fall outside the
study area, we obtain 165,318 records. This loss of nearly two percent may be due to inaccurate
or incomplete address information in the EPC dataset. The specific domestic building locations
are shown in Fig. 1. After deleting the features that have overlapping descriptions or are
obtained by calculation, the numerical features in the EPC that we use to predict energy
efficiency are shown in Table 1. They include the building construction factors and facilities.
Details about the categorical features are presented in the Supplementary material.

3.2.2. UK buildings
The UK Buildings dataset provides 2D building footprints across Great Britain for residential,
non-residential and mixed-use properties in towns with a population of more than 10,000 [43]. We
use this dataset as a geo-referenced dataset to match each EPC record with a street view image
that describes its appearance.

3.2.3. Street view images
As digital representations of built environments, street view images are a valuable resource for
understanding and analyzing architecture and cities. With growing coverage and more providers,
street view image services have covered more than half the world’s population [45]. They provide
a more intuitive and human-perspective view than other data sources. Among all the services, GSV
is one of the most popular sources. We obtain images from the GSV service through its Application
Programming Interface (API) with our customized parameters (see footnote 1). The detailed
calculation of the parameters is presented in Section 4.1. We request GSV images for 165,318
properties and download more than 550,000 street view images for 157,222 properties for further
analysis. For each property, historical GSV images are also obtained to enlarge the dataset. The
dates of the images range from 2008 to 2021. Considering that GSV images captured in consecutive
years might be too similar and lead to overfitting of our model, we filter the images so that the
capture years of the historical images of the same property are at least two years apart. As a
result, we link 368,769 GSV images with the EPC records.
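
The two-year gap filter on historical images can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `filter_capture_years` and the greedy rule of keeping the earliest capture year first are assumptions, since the text only states that the retained images of one property are at least two years apart.

```python
def filter_capture_years(years, min_gap=2):
    """Keep capture years of one property that are at least `min_gap` years apart.

    Assumption: years are processed from oldest to newest and a year is kept
    only if it is far enough from the previously kept year.
    """
    kept = []
    for year in sorted(set(years)):
        if not kept or year - kept[-1] >= min_gap:
            kept.append(year)
    return kept


# Hypothetical capture years for a single property.
example_years = [2008, 2009, 2010, 2012, 2015, 2016]
kept_years = filter_capture_years(example_years)
```

A different selection rule (for example, preferring the most recent image) would satisfy the same constraint while keeping a different subset of images.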

4. Methodology

To demonstrate how our framework works, this section presents the workflow of our methodology:
1) Street view image collection, 2) Model design, 3) Model evaluation, and 4) Model
interpretation. The nomenclature and abbreviations are shown in Table 2. The methodology code can
be found on the project GitHub repository
(https://github.com/MaoranSun/buildingEnergyEfficiency).

4.1. Street view image collection

GSV images are available for download via the Google API. However, the default view points along
the direction of the street. For this analysis, an image facing the building façade is required.
The GSV API allows us to pass customized parameters including heading (the direction in which the
camera is pointing), field of view (FOV, the zoom level of the camera), and pitch (the vertical
angle of the camera relative to the street view vehicle). Here, we present an algorithm for
calculating these parameters.

Heading represents the direction in which the camera is pointing. More specifically, the
parameter is the angle α between Vector North and Vector SC, as shown in Fig. 2. Equation (1)
shows the calculation of the heading parameter from Point S $(x_s, y_s)$, Point C $(x_c, y_c)$
and Vector North. Note that angle α is the clockwise rotation angle from Vector North to
Vector SC.

$$
\alpha =
\begin{cases}
\arccos \dfrac{V_n \cdot V_{sc}}{|V_n|\,|V_{sc}|}, & \text{if } x_c - x_s > 0 \\[2ex]
360 - \arccos \dfrac{V_n \cdot V_{sc}}{|V_n|\,|V_{sc}|}, & \text{otherwise}
\end{cases}
\tag{1}
$$

Pitch controls the vertical direction of the camera. It is calculated from the angle γ between
Vector SC and Vector ST. Equation (2) shows the calculation of the pitch angle.

$$
\lambda = \arccos\left( \frac{|V_{CT}|}{V_{SC}} \right) \times 0.5 \tag{2}
$$


Fig. 1. Study area.

Table 1
Numerical EPC features used in the analysis.

| Features | Mean | Standard deviation | Range |
|---|---|---|---|
| Building construction factors | | | |
| Total floor area (m²) | 76.64 | 34.80 | [15.00, 708.96] |
| Average height of the lowest storey of the dwelling (m) | 2.63 | 0.39 | [0, 6.37] |
| Facilities | | | |
| Number of open fireplaces | 0.08 | 0.36 | [0, 7] |
| Percentage of low energy lighting (%) | 49.40 | 40.42 | [0, 100] |
| Sample size | 165,318 | | |

Table 2
Nomenclature and abbreviations.

| Abbreviation | | Notation | |
|---|---|---|---|
| EPC | Energy Performance Certificate | Point C | Center of bottom edge of the requested facade |
| GSV | Google Street View | Point T | Center of top edge of the requested facade |
| SHAP | SHapley Additive exPlanations | Point S | Request point for GSV download |
| FOV | Field of View | α | Horizontal angle of the camera |
| API | Application Programming Interface | β | Vertical angle of the camera |
| DenseNet | Dense Convolutional Network | Vn | North vector |


FOV is based on the width of the façade and the distance between the building façade and the
requested point. This can be measured with angle β, which is the angle between
$\overrightarrow{V_{SF1}}$ and $\overrightarrow{V_{SF2}}$, as illustrated in Fig. 2. As the GSV
API accepts values between 0 and 120, we pass the value of 120 or the calculated angle, whichever
is smaller.
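
The heading and FOV calculations described in this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, points are assumed to be given in a planar projected coordinate system with x pointing east and y pointing north, and F1 and F2 are taken to be the two horizontal end points of the façade. The result of the "otherwise" branch is wrapped into [0, 360), which is an added convenience. Pitch is left out of the sketch.

```python
import math


def _angle_between(u, v):
    """Unsigned angle in degrees between two 2D vectors."""
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0:
        raise ValueError("zero-length vector")
    cos_angle = (u[0] * v[0] + u[1] * v[1]) / norm
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def heading_deg(s, c):
    """Clockwise angle from north to the vector S -> C, following Equation (1)."""
    v_sc = (c[0] - s[0], c[1] - s[1])
    angle = _angle_between((0.0, 1.0), v_sc)  # (0, 1) is the north vector
    if v_sc[0] > 0:  # x_c - x_s > 0: façade lies to the east of the camera
        return angle
    return (360.0 - angle) % 360.0  # wrap 360 to 0 (assumption, see lead-in)


def fov_deg(s, f1, f2, max_fov=120.0):
    """Angle subtended at S by the façade end points, capped at the API limit."""
    v_sf1 = (f1[0] - s[0], f1[1] - s[1])
    v_sf2 = (f2[0] - s[0], f2[1] - s[1])
    return min(max_fov, _angle_between(v_sf1, v_sf2))


# Hypothetical request point and façade geometry in metres.
heading = heading_deg(s=(0.0, 0.0), c=(-10.0, 0.0))             # façade due west
fov = fov_deg(s=(0.0, 0.0), f1=(-5.0, 5.0), f2=(5.0, 5.0))      # right angle
```

Real coordinates would first need to be projected from latitude and longitude, because the planar dot product is only meaningful in a metric coordinate system.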

4.2. Multi-branch deep learning model design

With building façades and energy performance descriptive data, we design a deep convolutional
neural network for classifying energy efficiency. This model aims at learning energy efficiency
information simultaneously from both the façade image and the property’s morphological and
structural attributes. As shown in Fig. 3, the network mainly consists of four stages: input,
feature extraction, feature fusion and output. In the input and feature extraction stages, the
network runs in two parallel branches: an image branch and a descriptive feature branch. The
image branch takes building façade photos as input, uses a Dense Convolutional Network (DenseNet)
as its backbone and outputs a 1024-dimensional deep feature for feature fusion. Four dense blocks
are used in the image branch. For the descriptive feature branch, we build a simple
fully-connected neural network with four hidden layers, which yields a 256-dimensional deep
feature. In the feature fusion stage, the 1024-dimensional feature from the image branch and the
256-dimensional feature from the descriptive feature branch are concatenated into a
1280-dimensional deep feature. This concatenated deep feature is then passed into a
fully-connected neural network with two hidden layers, which makes the final classification of
the property energy efficiency categories.

Fig. 2. Parameters for retrieving building façade from GSV images.

Fig. 3. Deep learning model architecture. The model takes image and descriptive features as input, processes two branches simultaneously and makes a final prediction.
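
The feature fusion and output stages can be illustrated with a small, dependency-free sketch. It is not the authors' model: the two branch outputs are replaced by random vectors of the stated sizes (1024 and 256), the hidden-layer widths (128 and 32) and the random weights are placeholders chosen for illustration, and all function names are hypothetical. The seven output classes correspond to the ratings A to G.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Randomly initialised weights and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, bias, relu=True):
    """Fully connected layer y = Wx + b, optionally followed by ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, v) for v in out] if relu else out


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(image_feat, desc_feat, head):
    """Concatenate both branch features and run the classification head."""
    fused = image_feat + desc_feat  # 1024 + 256 = 1280 dimensions
    h = fused
    for i, (weights, bias) in enumerate(head):
        is_last = i == len(head) - 1
        h = dense(h, weights, bias, relu=not is_last)
    return fused, softmax(h)


rng = random.Random(0)
image_feat = [rng.random() for _ in range(1024)]  # stand-in for the DenseNet output
desc_feat = [rng.random() for _ in range(256)]    # stand-in for the descriptive branch
sizes = [1280, 128, 32, 7]                        # two hidden layers (assumed widths), 7 classes
head = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
fused, probs = fuse_and_classify(image_feat, desc_feat, head)
```

In practice such a head would be built and trained in a deep learning library together with both branches; the sketch only shows how the dimensions fit together.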

For model initialization, we apply different strategies for the image and descriptive feature
branches. The image branch is initialized with weights pre-trained on the ImageNet dataset [46].
ImageNet is a widely used dataset for detecting common objects such as vehicles, buildings and
street signs. The pre-trained weights are able to extract information about objects and scenes
within the images. This initialization strategy helps the network converge faster and requires
less training time. For the descriptive feature branch, we apply random initialization because of
the small number of parameters involved and the simple structure. Both characteristics make the
branch easier to train.

4.3. Evaluation of result

For the main model, the dataset is split into three parts: 70% is used for training, 15% for
validation and 15% for testing the model. The model is evaluated with several metrics on the test
dataset. The evaluation aims at providing details about the performance and where
misclassification happens. Firstly, the model is assessed with the numerical metrics of
Precision, Recall and F1-score. In addition, we present a confusion matrix showing the detailed
performance for all classes and the corresponding numerical metrics. Precision is the ratio of
correctly predicted positive samples to all the predicted positive samples, as shown in equation
(3); Recall is the ratio of correctly predicted positive samples to all the samples in the actual
class, as shown in equation (4); F1-score is the harmonic mean of precision and recall, as shown
in equation (5). Besides the main model, we also apply 10-fold cross-validation on the full
dataset to test the stability of our method by averaging the total accuracy from the 10 models.

$$
\text{Precision} = \frac{TP}{TP + FP} \tag{3}
$$

$$
\text{Recall} = \frac{TP}{TP + FN} \tag{4}
$$

$$
F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}
$$
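
Equations (3) to (5) can be computed per rating class in a one-vs-rest fashion, as sketched below. The function name `per_class_metrics` and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch rather than something stated in the text.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} treating each label as 'positive'."""
    metrics = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics


# Hypothetical true and predicted ratings for six properties.
y_true = ["C", "C", "D", "D", "D", "E"]
y_pred = ["C", "D", "D", "D", "E", "E"]
scores = per_class_metrics(y_true, y_pred, labels=["C", "D", "E"])
```

For class C in this toy example, one of the two true C properties is recovered and no other property is labelled C, so precision is 1 and recall is 0.5.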

Secondly, we map the prediction results to explore the spatial
distribution of the model performance, as adjacent buildings
often share similar styles, ages and façade materials. It is also
important to identify poorly performing areas spatially for further
improvement. Though the prediction is made at the property level,
we aggregate the prediction results to 150-meter grids for two
reasons. Firstly, policies are usually developed and implemented
at a higher spatial scale than the property level. Secondly, it is
more intuitive and better conveys the spatial distribution of
results. The performance of a grid is evaluated as the ratio of
correctly classified samples to the total number of samples
associated with the grid.
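The grid aggregation can be sketched as below. This is a hedged illustration only: the function name `grid_accuracy`, the tuple layout of the samples and the assumption that coordinates are in metres in a projected coordinate system are not taken from the paper.

```python
from collections import defaultdict


def grid_accuracy(samples, cell_size=150.0):
    """Aggregate property-level results to square grid cells.

    samples: iterable of (x, y, true_label, predicted_label), with x and y
    in metres (an assumption of this sketch).
    Returns {(col, row): share of correctly classified samples in the cell}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for x, y, truth, predicted in samples:
        cell = (int(x // cell_size), int(y // cell_size))
        total[cell] += 1
        if truth == predicted:
            correct[cell] += 1
    return {cell: correct[cell] / total[cell] for cell in total}


# Made-up properties: two fall in cell (0, 0), one each in (1, 0) and (2, 0).
cells = grid_accuracy([
    (10, 10, "C", "C"),
    (140, 20, "C", "D"),
    (160, 10, "D", "D"),
    (300.5, 149.9, "E", "E"),
])
```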

4.4. Model comparison and interpretation

Besides the multi-branch model, we also predict the building
energy efficiency rating from images alone and from descriptive
features alone. Like the image branch, the image classification
model is adapted from the DenseNet architecture with four dense
blocks, and yields a final prediction for seven rating classes. The
descriptive features are fed into a simple neural network with four
hidden layers. To make sure the results are comparable and to avoid
randomness during experiments, we keep the same split of training,
validation and test datasets across models.

To better understand and improve building energy efficiency, we
explore which property attributes contribute to the model
decisions. The interpretability of deep learning has been widely
studied recently. We apply the methods proposed by [47] to the
multi-branch and descriptive models. For the multi-branch model,
we mainly focus on the most decisive regions within the image. For
the descriptive model, we explore which descriptive attributes are
more important for improving building energy efficiency. We
calculate SHAP values (see footnote 2) for each pixel within the
images and for each attribute of the properties. SHAP is a method
based on game theory that is used to increase the transparency and
interpretability of our model. More specifically, SHAP values
measure the contribution of each factor to the final prediction,
with larger values pushing towards the predicted class and smaller
values contributing to other possible predictions.
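To make the game-theoretic idea concrete, the sketch below computes exact Shapley values for a tiny hypothetical scoring function by averaging each feature's marginal contribution over all orderings. The function `toy_score`, its feature names and its coefficients are invented for illustration; the SHAP library itself typically relies on approximations or model-specific algorithms, because exhaustive enumeration is only feasible for a handful of features.

```python
from itertools import permutations


def shapley_values(value, features):
    """Exact Shapley values: mean marginal contribution over all orderings.

    value: function mapping a frozenset of present features to a model output.
    """
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for ordering in orderings:
        present = frozenset()
        for f in ordering:
            with_f = present | {f}
            totals[f] += value(with_f) - value(present)
            present = with_f
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_score(present):
    """Hypothetical score that rises when insulation is missing."""
    score = 0.0
    if "roof_no_insulation" in present:
        score += 2.0
    if "wall_no_insulation" in present:
        score += 1.0
    if {"roof_no_insulation", "wall_no_insulation"} <= present:
        score += 1.0  # interaction term, shared equally by both features
    if "double_glazing" in present:
        score -= 0.5
    return score


features = ["roof_no_insulation", "wall_no_insulation", "double_glazing"]
phi = shapley_values(toy_score, features)
```

The contributions sum to the difference between the score with all features present and the score with none, which is the additivity property that makes SHAP values convenient for attributing a prediction to its inputs.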

5. Results

5.1. Model performance

With the data and methodology above, we implement the model on
an Ubuntu platform with four GeForce RTX 2080 Ti GPUs, using
Python and the PyTorch framework. The model is trained with a
learning rate of 0.005 and a batch size of 100. Training takes
13.26 h and 45 epochs, using the training (70%) and validation
(15%) datasets. As shown in Fig. 4, the validation accuracy becomes
stable after the 37th epoch. The final model is evaluated on the
held-out test data, with an inference throughput of 133 samples per
second. We evaluate the final model with overall precision, recall
and F1-score by category, and with the spatial distribution of the
prediction accuracies.

2 https://github.com/slundberg/shap.

5.1.1. Overall performance
The final model achieves an overall accuracy of 86.8% on the test
set. We also test the model performance with 10-fold
cross-validation on the full dataset; the results show that our
model achieves a mean accuracy of 86.4%. To explore the detailed
performance by class, we present a confusion matrix containing
normalized performance, recall, precision and F1-scores for each
category. As shown in Table 3, the top-left to bottom-right
diagonal shows the percentage of correctly predicted samples, while
the off-diagonal cells show the percentage of misclassified
samples. The results show that the model achieves more than 80%
accuracy for most classes. For Grade A properties, the model
achieves 69% accuracy, as shown in Table 3. This is due to the
extremely small number of samples from class A compared with the
other categories. It is also worth noting that most of the
confusion happens between adjacent classes. For classes with
accuracy under 80%, Grade F samples are often misclassified as
Grade E, and Grade A samples are mostly misclassified as Grade B
and C.

5.1.2. Spatial distribution of prediction
To further understand the framework performance and the
characteristics of energy efficiency, we plot the errors on a map
to explore the spatial distribution of prediction error. Fig. 5
shows the prediction accuracy aggregated into 150-meter grids. The
performance of each grid is measured by its precision, defined here
as the number of correctly classified samples divided by the total
number of samples associated with the grid. The color indicates the
accuracy, with blue representing high accuracy and magenta low
accuracy. As shown in the map, the majority of grids achieve a
precision of over 80%. Few grids have low precision and no evident
spatial pattern is shown, which indicates that no significant
spatial autocorrelation exists in the model’s outputs (Global
Moran’s I = 0.14).
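For readers unfamiliar with the statistic, Global Moran's I can be sketched as below for values arranged on a regular grid. The paper does not state which spatial weights were used, so the binary rook-contiguity weights (cells sharing an edge) and the function name `morans_i` are assumptions of this example.

```python
def morans_i(grid):
    """Global Moran's I for a 2-D grid of values with rook contiguity.

    grid: list of equal-length rows; None marks cells without data.
    Weights are binary: 1 for cells sharing an edge, 0 otherwise.
    Undefined (division by zero) if all values are equal or no cells touch.
    """
    cells = {(r, c): v
             for r, row in enumerate(grid)
             for c, v in enumerate(row) if v is not None}
    n = len(cells)
    mean = sum(cells.values()) / n
    dev = {cell: v - mean for cell, v in cells.items()}

    cross = 0.0   # sum of w_ij * dev_i * dev_j over ordered neighbour pairs
    weights = 0   # sum of all w_ij
    for (r, c), d in dev.items():
        for neighbour in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if neighbour in dev:
                cross += d * dev[neighbour]
                weights += 1

    variance_sum = sum(d * d for d in dev.values())
    return (n / weights) * (cross / variance_sum)


# Alternating values are perfectly dispersed; two solid blocks are clustered.
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
blocks = [[0, 0, 1, 1] for _ in range(4)]
```

Values near zero, such as the 0.14 reported above, indicate weak spatial clustering, whereas the checkerboard and block patterns in the sketch sit at the dispersed and clustered extremes.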

5.2. Model comparison

As an exploration, we trained two individual models predicting
the building energy efficiency rating from façade images and from
descriptive features, respectively. The two models are then
evaluated with the same metrics as the feature-fusion model.
Table 4 presents the accuracy, precision, recall and F1-score of
the different models. As shown in the table, the image feature
model and the descriptive feature model achieve accuracies of 57.2%
and 79.5%, respectively. It is not surprising that the image
feature model performs the worst, as the image data capture only
the external appearance of the building and this information is
insufficient for energy efficiency prediction. The results also
show that the feature-fusion model achieves the highest accuracy.
Furthermore, we present the confusion matrix heatmap for each model
in Fig. 6 to inspect the breakdown of performance by category. The
heatmap shows that the feature-fusion model not only has the best
overall accuracy, but also has a more balanced performance across
classes. Similar to the feature-fusion model, the traditional
descriptive feature model has a very low accuracy for Grade A.
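The precision, recall and F1-score reported for the multi-branch model in Table 4 (82.1%, 79.2% and 80.6%) sit below its 86.8% accuracy. The paper does not state how the per-class metrics were averaged; the sketch below checks the assumption that they are unweighted (macro) averages of the rounded per-class values in Table 3.

```python
# Per-class values as reported in Table 3, ordered A to G.
precision = [0.69, 0.88, 0.91, 0.82, 0.80, 0.78, 0.87]
recall = [0.58, 0.85, 0.92, 0.82, 0.78, 0.75, 0.84]
f1 = [0.63, 0.87, 0.91, 0.82, 0.79, 0.76, 0.86]


def macro_average(values):
    """Unweighted mean over classes, so rare classes count as much as common ones."""
    return sum(values) / len(values)


macro_precision = macro_average(precision)
macro_recall = macro_average(recall)
macro_f1 = macro_average(f1)
```

Under this assumption the averages come out at roughly 0.821, 0.791 and 0.806, close to the Table 4 figures; the small gap in recall is consistent with rounding of the per-class values. Because macro-averaging weights the small Grade A class as heavily as the large Grade C class, it would also explain why these scores are lower than the overall accuracy.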

5.3. Model interpretation

We calculate SHAP values for the multi-branch model and the
descriptive model with the SHAP Python library developed by [47].
Fig. 7 shows the most important features in the descriptive model.
We use KernelExplainer from the SHAP library to explore the impact
of each feature on building energy efficiency. Color represents the
original value of the feature and the X position shows the SHAP
value of the feature. Since we preprocess the data with one-hot
encoding, in most cases pink means True while blue represents
False. The plot is ordered by the absolute


Fig. 4. Model training curve (green dash line: validation accuracy; blue line: model accuracy). (For interpretation of the references to color in this figure legend, the reader is
referred to the web version of this article.)

Table 3
Confusion matrix for energy efficiency rating classification.

                                   Prediction
Ground Truth                       A         B         C         D         E         F         G
A                                  69.23%    0.17%     0.01%     0.01%     0%        0%        0%
B                                  11.54%    88.41%    1.72%     0.11%     0.05%     0%        0%
C                                  7.69%     11.2%     90.89%    11.85%    0.63%     0.08%     0%
D                                  0%        0.2%      7.19%     82.05%    15.36%    1.68%     0%
E                                  0%        0.02%     0.17%     5.87%     80.06%    15.44%    0.25%
F                                  11.54%    0%        0.01%     0.11%     3.86%     77.77%    13.18%
G                                  0%        0%        0%        0.01%     0.05%     5.03%     86.57%
Recall                             0.58      0.85      0.92      0.82      0.78      0.75      0.84
Precision                          0.69      0.88      0.91      0.82      0.8       0.78      0.87
F1-score                           0.63      0.87      0.91      0.82      0.79      0.76      0.86
Sample Number (368,769 in total)   171       23,779    190,876   110,222   34,555    6,797     2,369


mean of each feature’s SHAP values, which can be treated as a
proxy for feature importance. Features with high importance are
ranked at the top of the figure. SHAP values on the X-axis
represent the feature’s contribution to energy efficiency, with a
larger value meaning that the feature contributes to lower building
energy efficiency. Take the feature Roof: Pitched, no insulation as
an example: pink dots (meaning the property has a pitched,
non-insulated roof) have high SHAP values, thus contributing to low
energy efficiency. The result aligns with intuition in many ways.
‘Insulation’ plays an important role in energy efficiency. Roofs
and walls with ‘no insulation’ have a negative impact on energy
efficiency, while insulated walls improve it. Furthermore, the plot
makes it possible to compare the contribution of each feature and
so identify useful elements for energy efficiency improvement. For
example, older houses are associated with low efficiency: a
construction year before 1919 has a more negative impact on energy
efficiency than a construction age band of 1930–1949.
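The ordering used in the summary plot, by mean absolute SHAP value per feature, can be sketched as follows. The function name `rank_by_mean_abs_shap`, the feature names and the SHAP values are made up for illustration and are not results from the paper.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Order features by the mean absolute SHAP value across samples.

    shap_rows: one list of SHAP values per sample, aligned with feature_names.
    """
    n = len(shap_rows)
    importance = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values for three samples and three one-hot features.
names = ["roof_pitched_no_insulation", "wall_insulated", "built_1930_1949"]
rows = [
    [0.8, -0.3, 0.10],
    [-0.1, -0.4, 0.05],
    [0.9, 0.2, -0.10],
]
ranking = rank_by_mean_abs_shap(rows, names)
```

Taking the absolute value before averaging matters: a feature that pushes some predictions up and others down would otherwise cancel out and look unimportant.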

Fig. 8 shows the informative regions within the building façade
images. We select random samples from the dataset and calculate
the SHAP value for each pixel within the images. Pink dots
represent areas with high SHAP values, which are important for the
final decisions. As shown in the plot, most pink dots are
distributed around structural elements of the building façades,
such as windows and doors. This reveals that the model makes
decisions based on meaningful areas of the building façades, rather
than paying attention to random parts.

6. Discussion

6.1. Building energy efficiency estimation in the era of big data

With the arrival of the big data era, more and more data have
been generated and made publicly available. This research
demonstrates the potential of utilizing publicly available
administrative data to estimate building energy efficiency. These
data provide extra information for building energy studies and fill
the gaps in traditional energy efficiency estimation methods.

Fig. 5. Prediction results aggregated to 150-meter grids.

Table 4
Model performance comparison.

Metric      Image model   Descriptive feature model   Multi-branch model
Accuracy    57.2%         79.5%                       86.8%
Precision   40.7%         66.2%                       82.1%
Recall      31.2%         57.5%                       79.2%
F1-Score    32.0%         61.0%                       80.6%

Fig. 6. Confusion matrix heatmap of image feature model, descriptive feature model and feature fusion model.

Fig. 7. Feature Importance in descriptive attributes. The features from top to bottom have decreasing feature importance.

This paper recognizes that GSV is informative not only for
attributes such as building age and style, which are directly
related to the visual aspects of buildings, but can also extend our
understanding of intrinsic building characteristics such as energy
efficiency. As an urban big data source, GSV provides extra
information in addition to the traditional building morphological
attributes, and achieves a more accurate and holistic description
of buildings. With the combination of GSV and EPC, we are able to
achieve high performance for property-level building energy
efficiency estimation and prediction. Furthermore, with the growing
body of research on the interpretability of deep learning, we are
able to find the meaningful building morphological attributes and
the decisive parts of building façades for energy efficiency
estimation. With the wider enforcement and coverage of EPC data and
street view images, this approach shows greater potential for
future work and can be extended to other cities and countries.

6.2. Policy implications

Predicting and understanding energy efficiency is crucial to
policy making and implementation. As the building sector is the
largest energy consumer [48], improving building energy efficiency
is effective for greenhouse gas emission reduction, climate change
mitigation, and carbon-neutral policies. Accurate estimations and a
comprehensive understanding of property-level energy efficiency
ratings are beneficial for regulation and policy making. It has
been shown that energy efficiency ratings are related to the fuel
poverty problem [6]. Furthermore, such ratings are critical to the
implementation of policies. For example, the UK government aims to
upgrade as many private rented properties as possible to EPC Band C
by 2030 [49]. For property owners, it is important to understand
not only current ratings, but also how renovation will affect the
energy efficiency of their properties so that they can be legally
listed. For policy execution, it also helps to identify the
properties that require further renovation and improvement.

The proposed analytical framework in this research is beneficial
to both policy making and implementation. The high accuracy of the
method means the framework could be applied in practice and has the
potential to be extended to other cities. The fine-scale,
property-level prediction makes it possible for homeowners to
better understand their ratings. Besides, because of the fine scale
of the method, it is easier for city administrations to aggregate
the results and set goals for different spatial units (e.g.
neighborhoods) for better execution, particularly in deprived
neighborhoods. Furthermore, our framework takes detailed inputs
about the properties, such as window and wall descriptions. This
helps homeowners to estimate the final energy efficiency before a
renovation starts.

6.3. Limitations and future work

Limitations do exist in this study and could be addressed in
future studies. The main limitation of this work is the data
quality of the EPC dataset. It has been widely discussed that some
uncertainties exist in the EPC dataset in terms of the gap between
estimated and actual energy performance [50,51]. With the growing
coverage of EPC, European countries have established standards for
quality assurance. In future work, the uncertainty of the dataset
can be gradually minimized.

Secondly, the detailed attributes of the EPC dataset also constrain
the wide application of our framework. Most of the descriptions of
properties from EPC cannot be obtained from other data sources. It
is very difficult to make good predictions for buildings outside
the EPC database using only general building information (e.g.
building height, age). For future work, we plan to explore the
balance between data availability and prediction accuracy, so that
the framework can be extended more broadly to domestic buildings
that currently have no EPC. At present, the coverage of EPC data is
about 50% in England, Wales, and Scotland [52], and a large number
of properties that have GSV do not have EPC data. We can use data
from other open datasets instead of EPC as the traditional features
in the framework to predict fine-grained city-level or larger-scale
energy efficiency and gain insight into energy efficiency
differences among regions.

Fig. 8. Feature Importance in image branch of multi-branch model. Pink dots represent areas important for final decision. (For interpretation of the references to color in this
figure legend, the reader is referred to the web version of this article.)

7. Conclusion

Improving building energy efficiency is key to the global carbon
emission reduction task. Accurately predicting and understanding
building energy efficiency is beneficial for better utilizing and
saving energy in the building sector. This paper proposes a
feature-fusion framework for building energy efficiency prediction
with publicly available data. The framework combines EPC data on
building descriptive factors with street-level imagery, and we
extract and fuse the features from both data sources for the final
estimation. The framework is implemented for the city of Glasgow,
UK to demonstrate its feasibility. Results show that our framework
correctly classifies 86.8% of samples from the test set. Compared
with the image-only model and the traditional descriptive factors
model, our feature-fusion framework achieves the highest accuracy
and has a more balanced performance across different ratings. The
explainable AI tool indicates that insulation around open
structures such as windows and doors is a key factor influencing
energy efficiency.

Our research contributes to building energy studies in two ways.
First, by incorporating street view images into the energy
efficiency estimation task, we achieve higher accuracy than with
traditional building attribute features alone. Second, our method
is able to identify the building features that matter for improving
building energy efficiency, which will be useful for housing
retrofit in the near future. This study also provides insights into
the potential of applying deep learning to the research of building
attributes with new forms of urban big data. By introducing street
view images to building energy studies as a visual representation
and proxy for building ages, styles and façade materials, we verify
that GSV is able to provide another layer for building stock
attribute prediction. This research has the potential to help urban
planners and policy makers target specific ‘energy efficiency
deprived’ neighborhoods and provides extra evidence to better
tackle fuel poverty problems.

Data availability

This study brought together existing research data obtained upon
request and subject to licence restrictions from a number of differ-
ent sources. Full details of how these data were obtained are avail-
able in the documentation available at the MaoranSun/
buildingEnergyEfficiency: Building Energy Efficiency Glasgow repo-
sitory at https://doi.org/10.5281/zenodo.6913572.

Declaration of Competing Interest

The authors declare that they have no known competing finan-
cial interests or personal relationships that could have appeared
to influence the work reported in this paper.

Acknowledgements

This work was made possible by the ESRC’s on-going support
for the Urban Big Data Centre (UBDC) [ES/L011921/1 and ES/
S007105/1] and the Urban Studies Research Incentivization Fund-
ing from the University of Glasgow. This work was also supported
by the National Natural Science Foundation of China under Grant
41901321. The authors want to thank the anonymous reviewers
for their insightful comments and suggestions on an earlier version
of this manuscript.

Appendix A. Supplementary data

Supplementary data to this article can be found online at
https://doi.org/10.1016/j.enbuild.2022.112331.

References

[1] ‘‘Glasgow Climate Pact | UNFCCC,” Nov. 2021. https://unfccc.int/documents/
310475 (accessed Apr. 19, 2022).

[2] P. Waide and M. D. GERUNDINO, ‘‘International standards to develop and
promote energy efficiency and renewable energy sources,” Prep. G8 Plan Action
IEA Inf. Pap. Retrieved, p. 2011, 2007.

[3] L. Pérez-Lombard, J. Ortiz, C. Pout, A review on buildings energy consumption
information, Energy Build. 40 (3) (2008) 394–398.

[4] P.G. Taylor, O.L. d’Ortigue, M. Francoeur, N. Trudeau, Final energy use in IEA
countries: The role of energy efficiency, Energy Policy 38 (11) (2010) 6463–
6474.

[5] O. Pasichnyi, J. Wallin, F. Levihn, H. Shahrokni, O. Kordas, Energy performance
certificates — New opportunities for data-enabled urban energy policy
instruments?, Energy Policy 127 (Apr 2019) 486–499, https://doi.org/
10.1016/j.enpol.2018.11.051.

[6] F. Belaïd, Exposure and risk to fuel poverty in France: Examining the extent of
the fuel precariousness and its salient determinants, Energy Policy 114 (2018)
189–200.

[7] K.G. Droutsa, S. Kontoyiannidis, E.G. Dascalaki, C.A. Balaras, Mapping the
energy performance of hellenic residential buildings from EPC (energy
performance certificate) data, Energy 98 (Mar. 2016) 284–295, https://doi.
org/10.1016/j.energy.2015.12.137.

[8] R. Gupta, M. Gregg, Assessing energy use and overheating risk in net zero
energy dwellings in UK, Energy Build. 158 (2018) 897–905.

[9] A. Abela, M. Hoxley, P. McGrath, S. Goodhew, An investigation of the
appropriateness of current methodologies for energy certification of
Mediterranean housing, Energy Build. 130 (2016) 210–218.

[10] M. Österbring, L. Thuvander, É. Mata, H. Wallbaum, Stakeholder specific multi-
scale spatial representation of urban building-stocks, ISPRS Int. J. Geo-Inf. 7 (5)
(2018) 173.

[11] F. Fuerst, P. McAllister, The impact of Energy Performance Certificates on the
rental and capital values of commercial property assets, Energy Policy 39 (10)
(2011) 6608–6614.

[12] S. Seyedzadeh, F.P. Rahimian, I. Glesk, M. Roper, Machine learning for
estimation of building energy consumption and performance: a review, Vis.
Eng. 6 (1) (Dec. 2018) 5, https://doi.org/10.1186/s40327-018-0064-7.

[13] C. Buratti, M. Barbanera, D. Palladino, An original tool for checking energy
performance and certification of buildings by means of Artificial Neural
Networks, Appl. Energy 120 (May 2014) 125–132, https://doi.org/10.1016/j.
apenergy.2014.01.053.

[14] F. Khayatian, L. Sarto, G. Dall’O’, Application of neural networks for evaluating
energy performance certificates of residential buildings, Energy and Buildings
125 (2016) 45–54.

[15] T. Gebru et al., Using deep learning and Google Street View to estimate the
demographic makeup of neighborhoods across the United States, Proc. Natl.
Acad. Sci. 114 (50) (2017) 13108–13113.

[16] F. Zhang, B. Zhou, L. Liu, Y.u. Liu, H.H. Fung, H. Lin, C. Ratti, Measuring human
perceptions of a large-scale urban region using machine learning, Landsc.
Urban Plan. 180 (2018) 148–160.

[17] R. Kumar, R.K. Aggarwal, J.D. Sharma, Energy analysis of a building using
artificial neural network: A review, Energy Build. 65 (Oct. 2013) 352–358,
https://doi.org/10.1016/j.enbuild.2013.06.007.

[18] S.R. Mohandes, X. Zhang, A. Mahdiyar, A comprehensive review on the
application of artificial neural networks in building energy analysis,
Neurocomputing 340 (May 2019) 55–75, https://doi.org/10.1016/j.
neucom.2019.02.040.

[19] L. Wang, E.W.M. Lee, S.A. Hussian, A.C.Y. Yuen, W. Feng, Quantitative impact
analysis of driving factors on annual residential building energy end-use
combining machine learning and stochastic methods, Appl. Energy 299 (Oct.
2021), https://doi.org/10.1016/j.apenergy.2021.117303 117303.

[20] F. Ascione, N. Bianco, C. De Stasio, G.M. Mauro, G.P. Vanoli, Artificial neural
networks to predict energy performance and retrofit scenarios for any member
of a building category: A novel approach, Energy 118 (Jan. 2017) 999–1017,
https://doi.org/10.1016/j.energy.2016.10.126.

[21] W. Chung, Review of building energy-use performance benchmarking
methodologies, Appl. Energy 88 (5) (May 2011) 1470–1479, https://doi.org/
10.1016/j.apenergy.2010.11.022.

[22] A. J. Khalil, A. M. Barhoom, B. S. Abu-Nasser, M. M. Musleh, and S. S. Abu-Naser,
‘‘Energy Efficiency Prediction using Artificial Neural Network,” Int. J. Acad.
Pedagog. Res. IJAPR, vol. 3, no. 9, 2019, Accessed: Jun. 15, 2022. [Online].
Available: https://philpapers.org/rec/KHAEEP.

[23] J. Yang, L. Zhao, J. Mcbride, P. Gong, Can you see green? Assessing the visibility
of urban forests in cities, Landsc. Urban Plan. 91 (2) (Jun. 2009) 97–104,
https://doi.org/10.1016/j.landurbplan.2008.12.004.

[24] Z. Tang, Y. Ye, Z. Jiang, C. Fu, R. Huang, D. Yao, A data-informed analytical
approach to human-scale greenway planning: Integrating multi-sourced
urban data with machine learning algorithms, Urban For. Urban Green. 56
(Dec. 2020), https://doi.org/10.1016/j.ufug.2020.126871 126871.

[25] E.B. Johnson, A. Tidwell, S.V. Villupuram, Valuing Curb Appeal, J. Real Estate
Finance Econ. 60 (1–2) (Feb. 2020) 111–133, https://doi.org/10.1007/s11146-
019-09713-z.

[26] A. Middel, J. Lukasczyk, S. Zakrzewski, M. Arnold, R. Maciejewski, Urban form
and composition of street canyons: A human-centric big data and deep
learning approach, Landsc. Urban Plan. 183 (Mar. 2019) 122–132, https://doi.
org/10.1016/j.landurbplan.2018.12.001.

[27] C.-B. Hu, F. Zhang, F.-Y. Gong, C. Ratti, X. Li, Classification and mapping of
urban canyon geometry using Google Street View images and deep multitask
learning, Build. Environ. 167 (Jan. 2020), https://doi.org/10.1016/j.
buildenv.2019.106424 106424.

[28] J. Hong, D. McArthur, V. Raturi, Did Safe Cycling Infrastructure Still Matter
During a COVID-19 Lockdown?, Sustainability 12 (20) (Oct 2020) 8672,
https://doi.org/10.3390/su12208672.

[29] E. Glaeser, S. D. Kominers, M. Luca, and N. Naik, ‘‘Big Data and Big Cities: The
Promises and Limitations of Improved Measures of Urban Life,” National
Bureau of Economic Research, Cambridge, MA, w21778, Dec. 2015. doi:
10.3386/w21778.

[30] Y. Li, Y. Chen, A. Rajabifard, K. Khoshelham, and M. Aleksandrov, ‘‘Estimating
building age from google street view images using deep learning,” in 10th
international conference on geographic information science (GIScience 2018),
Dagstuhl, Germany, 2018, vol. 114, p. 40:1-40:7. doi: 10.4230/LIPIcs.
GISCIENCE.2018.40.

[31] Z. Gong, Q. Ma, C. Kan, Q. Qi, Classifying Street Spaces with Street View Images
for a Spatial Indicator of Urban Functions, Sustainability 11 (22) (Nov. 2019)
6424, https://doi.org/10.3390/su11226424.

[32] F. Biljecki, K. Ito, Street view imagery in urban analytics and GIS: A review,
Landsc. Urban Plan. 215 (Nov. 2021), https://doi.org/10.1016/
j.landurbplan.2021.104217 104217.

[33] J. Tang, Y. Long, Measuring visual quality of street space and its temporal
variation: Methodology and its application in the Hutong area in Beijing,
Landsc. Urban Plan. 191 (Nov. 2019), https://doi.org/10.1016/
j.landurbplan.2018.09.015 103436.

[34] D. Quercia, N. K. O’Hare, and H. Cramer, ‘‘Aesthetic capital: what makes london
look beautiful, quiet, and happy?,” in Proceedings of the 17th ACM conference on
Computer supported cooperative work & social computing, Baltimore Maryland
USA, Feb. 2014, pp. 945–955. doi: 10.1145/2531602.2531613.

[35] C.I. Seresinhe, T. Preis, H.S. Moat, Using deep learning to quantify the beauty of
outdoor places, R. Soc. Open Sci. 4 (7) (Jul. 2017), https://doi.org/10.1098/
rsos.170170 170170.

[36] L. Liu, E.A. Silva, C. Wu, H. Wang, A machine learning-based method for the
large-scale evaluation of the qualities of the urban environment, Comput.
Environ. Urban Syst. 65 (Sep. 2017) 113–125, https://doi.org/10.1016/
j.compenvurbsys.2017.06.003.


[37] C. Doersch, S. Singh, A. Gupta, J. Sivic, A.A. Efros, What makes Paris look like
Paris?, ACM Trans Graph. 31 (4) (Aug. 2012) 1–9, https://doi.org/10.1145/
2185520.2185597.

[38] Y. Yoshimura, B. Cai, Z. Wang, and C. Ratti, ‘‘Deep Learning Architect:
Classification for Architectural Design through the Eye of Artificial
Intelligence,” 2018, doi: 10.48550/ARXIV.1812.01714.

[39] J. von Platten, C. Sandels, K. Jörgensson, V. Karlsson, M. Mangold, K. Mjörnell,
Using Machine Learning to Enrich Building Databases—Methods for Tailored
Energy Retrofits, Energies 13 (10) (May 2020) 2574, https://doi.org/10.3390/
en13102574.

[40] D. Chen, H.W. Chen, Using the Köppen classification to quantify climate
variation and change: An example for 1901–2010, Environ. Dev. 6 (2013) 69–
79.

[41] P.M. Congedo, C. Baglivo, A.K. Seyhan, R. Marchetti, Worldwide dynamic
predictive analysis of building performance under long-term climate change
conditions, J. Build. Eng. 42 (2021) 103057.

[42] ‘‘Domestic Energy Performance Certificates – Dataset to Q2 2021,” Jul. 2021.
https://statistics.gov.scot/data/domestic-energy-performance-certificates
(accessed Apr. 20, 2022).

[43] ‘‘Geomni,” Mar. 2021. https://digimap.edina.ac.uk/geomni (accessed Apr. 20,
2022).

[44] ‘‘Energy Performance Certificates: introduction – gov.scot,” Apr. 2016. https://
www.gov.scot/publications/energy-performance-certificates-introduction/?
utm_source=redirect&utm_medium=shorturl&utm_campaign=epc (accessed
Jan. 25, 2022).

[45] R. Goel, L.M.T. Garcia, A. Goodman, R. Johnson, R. Aldred, M. Murugesan, S.
Brage, K. Bhalla, J. Woodcock, M. Srinivasan, Estimating city-level travel
patterns using street imagery: A case study of using Google Street View in
Britain, PloS One 13 (5) (2018) e0196521.

[46] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A
large-scale hierarchical image database, in: 2009 IEEE Conference on Computer
Vision and Pattern Recognition, 2009, pp. 248–255.

[47] S. M. Lundberg and S.-I. Lee, ‘‘A unified approach to interpreting model
predictions,” Adv. Neural Inf. Process. Syst., vol. 30, 2017.

[48] O.T. Masoso, L.J. Grobler, The dark side of occupants’ behaviour on building
energy use, Energy Build. 42 (2) (2010) 173–177.

[49] “Domestic private rented property: minimum energy efficiency standard –
landlord guidance,” GOV.UK, May 2020.
https://www.gov.uk/guidance/domestic-private-rented-property-minimum-energy-efficiency-standard-landlord-guidance
(accessed Apr. 19, 2022).

[50] L. Tronchin, K. Fabbri, Energy Performance Certificate of building and
confidence interval in assessment: An Italian case study, Energy Policy 48
(2012) 176–184.

[51] B. Coyne, E. Denny, Mind the energy performance gap: testing the accuracy of
building energy performance certificates in ireland, Energy Effic. 14 (6) (2021)
1–28.

[52] “Energy efficiency of housing in England and Wales – Office for National
Statistics,” Sep. 2020.
https://www.ons.gov.uk/peoplepopulationandcommunity/housing/articles/energyefficiencyofhousinginenglandandwales/2020-09-23#coverage-of-energy-performance-certificate-data
(accessed Apr. 19, 2022).



Li et al. – 2024 – Understanding urban traffic flows in response to C

Cities 154 (2024) 105381

Available online 26 August 2024
0264-2751/© 2024 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Understanding urban traffic flows in response to COVID-19 pandemic with
emerging urban big data in Glasgow

Yue Li a, Qunshan Zhao a,*, Mingshu Wang b,**

a Urban Big Data Centre, School of Social and Political Sciences, University of Glasgow, Glasgow, UK
b School of Geographical and Earth Sciences, University of Glasgow, Glasgow, UK

A R T I C L E I N F O

Keywords:
Traffic flows
Urban big data
COVID-19
Spatial Durbin model

A B S T R A C T

Urban traffic analysis has played an important role in urban development, providing insights for urban planning,
traffic management, and resource allocation. Meanwhile, the global pandemic of COVID-19 has significantly
changed people’s travel behaviour in urban areas. This research uses the spatial Durbin model to understand the
relationship between traffic flows, urban infrastructure, and socio-demographic indicators before, during, and
after pandemic periods. We include factors such as road characteristics, socio-demographics, surrounding built
environments (land use and nearby points of interest), and the emerging urban big data source of Google Street
View images to understand their influences on time series traffic flows. Taking the city of Glasgow as the case
study, we have found that areas with more young and white dwellers are associated with more traffic flows,
while natural green spaces are associated with fewer traffic flows. Major roads between cities and towns also
show heavier traffic flows. Besides, the application of Google Street View images in this research has revealed the
heterogeneous effects of green space on urban traffic flows, as the magnitudes of their effects vary by distance.
We also detect that the spatial dependence between adjacent neighbourhoods among the traffic flows and
associated urban parameters is variable during the four COVID-19 periods. With the influence of COVID-19, there
has been a significant decrease in long-distance travel. The noticeable change in travel behaviour presents a
valuable opportunity to encourage active travel in the near future.

1. Introduction

In most cities, the transportation system has been dominated by
motor vehicles with the supplement of other sustainable transport
modes (Transport, 2022). However, travel behaviour has changed
significantly since the COVID-19 pandemic (Aloi et al., 2020; Bucsky,
2020; Hadjidemetriou et al., 2020; Parr et al., 2020; Saladié et al., 2020;
Tian et al., 2021). The coronavirus disease 2019 (COVID-19) pandemic
is a global threat with the transmission of SARS-COV-2, leading to
escalating health, economic and social challenges. To contain the
COVID-19 pandemic, the government has adopted different social
distancing measures, especially the restrictions on human mobility in
urban areas (e.g., Hadjidemetriou et al., 2020; Zhu et al., 2023). The
reduction in traffic flows impacted by this pandemic has been seen in
cities all over the world (Parr et al., 2020; Tian et al., 2021), with >50 %
reduction in many cities (Aloi et al., 2020; Bucsky, 2020; Hadjideme-
triou et al., 2020; Parr et al., 2020; Saladié et al., 2020; Tian et al., 2021).

With the change in travel modes during the pandemic, urban traffic
analysis has played an important role in providing insights for urban
planning, traffic management, and resource allocation. Moreover, analysing
travel behaviour before, during, and after COVID-19 not only
contributes to data-driven public health and governmental decision-making
but also helps enhance community responses to future
pandemics.

As a crucial component of the complex urban system, urban traffic
analysis has drawn the attention of researchers and planners for decades
(Batty, 2008). Meanwhile, the increasing development of Intelligent
Transportation Systems with various urban sensing technologies (Buch
et al., 2011) has produced a variety of traffic-related data to monitor
urban traffic conditions in high spatiotemporal resolution. Many of the
studies applied mobile device data (Jiang, Song, et al., 2021; Kupfer
et al., 2021), smart card data (Mützel & Scheiner, 2022; Zhou, Liu, et al.,
2021), and road detector data (Gao & Levinson, 2022; Liu & Stern,
2021), to explore the spatial and temporal evolution of human mobility

* Correspondence to: Q. Zhao, 7 Lilybank Gardens, University of Glasgow, Glasgow G12 8RZ, UK.
** Corresponding author.

E-mail addresses: 2672496L@student.gla.ac.uk (Y. Li), Qunshan.Zhao@glasgow.ac.uk (Q. Zhao), Mingshu.Wang@glasgow.ac.uk (M. Wang).

Contents lists available at ScienceDirect

Cities

journal homepage: www.elsevier.com/locate/cities

https://doi.org/10.1016/j.cities.2024.105381
Received 7 July 2023; Received in revised form 10 August 2024; Accepted 10 August 2024


patterns in both the pre-COVID-19 and COVID-19 periods. However, most of them explored only
the pandemic’s impact, with limited aspects considered for the
built environment and socio-demographics. Those existing studies
overlooked the integrated influence of COVID-19 and other surrounding
environments on urban traffic flows.

Numerous research studies have examined the effects of the COVID-
19 pandemic on travel behaviours and patterns by leveraging traffic
flow data (Aloi et al., 2020; Gao & Levinson, 2022; Liu & Stern, 2021;
Parr et al., 2020; Tian et al., 2021). This new form of data, primarily
acquired through road detectors and cameras embedded in Intelligent
Transportation Systems, has been extensively utilised. However, most
investigations focused on analysing changes in traffic flows during a
limited time, typically spanning a few months (Aloi et al., 2020; Parr
et al., 2020; Tian et al., 2021). Only limited studies have employed
traffic flow data collected over several years (Gao & Levinson, 2022; Liu
& Stern, 2021). Regrettably, the post-COVID-19 changes in traffic flows
have been largely overlooked. In this study, we aim to address this gap
by utilising comprehensive time series traffic flow data to examine the
dynamics of traffic flows before, during, and after the COVID-19
pandemic.

The overarching goal of this research is to understand the quantitative
relationship between urban physical and social elements (i.e.,
built environment, socio-demographics) and traffic dynamics by using
new forms of urban big data and a spatial econometric model in Glasgow.
The contribution of this paper is threefold. First, it bridges the research
gap between the topics of COVID-19 mobility patterns and influential
factors of urban traffic flows by exploring the heterogeneous and linear
relationship between the influential urban factors and traffic flows by
different COVID-19 stages. Second, it applies the long time series traffic
flow data spanning multiple years in Glasgow, which has been sparsely
investigated in previous research conducted in the United Kingdom. It
also conducts a detailed comparative analysis of traffic flow changes at a
high temporal granularity, delineated into four distinct stages: ‘Before
COVID-19’, ‘1st lockdown’, ‘2nd lockdown’, and ‘Post COVID-19’.
Third, it highlights the distance-sensitive heterogeneous effects of the
green space on urban traffic flows by applying Google Street View im-
ages as the emerging urban big data. It also reveals the patterns of spatial
dependence on traffic flows and urban factors at different stages of the
COVID-19 pandemic. The research outputs will help city planners and
policymakers understand what physical and social factors will influence
the traffic flows for urban planning and resource allocation. It is valu-
able in data-driven governmental decision-making and helps enhance
community responses to the future pandemic.

2. Background

2.1. Influential factors of urban traffic flows

As a crucial component of the complex urban system, urban traffic
analysis has drawn the attention of researchers and planners for decades
(Batty, 2008). Numerous studies have been conducted on the impact of
urban factors on traffic dynamics, which can be summarised as the socio-
demographics and built environment factors. Several studies have
investigated the influence of socio-demographics on daily travel be-
haviours in cities via quantitative analysis using survey datasets with
information such as age, race, gender, income, and educational level
(Ma et al., 2014; Schoenau & Müller, 2017; Wang & Mu, 2018; Zhou,
Yuan, et al., 2021). Existing research found that residents with higher
incomes, better employment, and more children are more likely to travel
by car (Klinger & Lanzendorf, 2016). Limited research has investigated
how socio-demographics influence sensor-measured urban traffic flows.

In addition to socio-demographics, numerous studies have explored the
influence of the urban built environment on urban traffic flows. Within the
built environment, road characteristics received the most attention as
important factors in early traffic flow research (He & Zhao, 2013; Irawan
et al., 2010). Recent research found that longer roads and a
lower frequency of intersections are conducive to motor vehicle traffic in cities
(Cubells et al., 2023; Yokoo & Levinson, 2019).

In recent years, the development of sensing technologies and
crowdsourcing platforms has produced a variety of urban big data (Pan
et al., 2016), including land use data from satellite imagery, Points of
Interest (POIs) from OpenStreetMap, and street-level images from
street-level vehicle scanning. POIs give the precise positions of urban function
points (Nian et al., 2020; Xu et al., 2019), which has been proven to have
a strong correlation with travel behaviours (Bao et al., 2015; Gong et al.,
2016; Jiang, Song, et al., 2021; Yue et al., 2017). Specifically, areas with
entertainment and consumption functions are more likely to generate
more traffic flows (Nian et al., 2020; Xu et al., 2019). However, the
attraction of consumption POIs on travel behaviour significantly
decreased, while the impact on residential areas increased during the
pandemic (Nian et al., 2020). Urban land use typically refers to the land
surface modified by human activities in urban areas (Ellis, 2013; Liu
et al., 2021). Similar to POIs, several studies connect land use with urban
travel behaviours (Bandeira et al., 2011; Jiang, Huang and Li, 2021; Lee
& Holme, 2015; Liu et al., 2021; Wang & Debbage, 2021). Most focus on
traffic prediction via land use (Bandeira et al., 2011; Lee & Holme,
2015), and some inferred urban land use from human mobility data (Liu
et al., 2021; Pan et al., 2013). Street-level imagery is a novel source of
large-scale urban data that provides panoramic information along the
streets (Goel et al., 2018; Ibrahim et al., 2020). Google Street View
(GSV) is one of the most common sources of street imagery, widely used
to identify human mobility patterns in cities (Bartzokas-Tsiompras et al.,
2021; Goel et al., 2018; Wang et al., 2022). Although the urban built
environment has been used to explore its influences on urban traffic
flows, few studies explore the quantitative relationship between POIs,
land use, GSV, and traffic flows in cities. This research addresses this gap
through a quantitative analysis of the urban environment using
emerging urban big data.

2.2. Spatial models on urban traffic analytics

Regression analysis is a statistical technique for modelling and
investigating the relationships between a dependent variable and one or
more independent variables (Montgomery, 2015). Linear regression has
been widely used in various transportation research, analysing the
relationship between human travel behaviours and various urban fac-
tors (road characteristics, socio-demographics, surrounding built envi-
ronments, etc.) (Boarnet et al., 2008; Ozbil et al., 2011, 2019; Vance &
Iovanna, 2007; Xu et al., 2017).

Although linear regression has been widely used to understand the
relationships between urban traffic flows and associated influential
factors, the explanation of the generation of urban traffic flows via linear
regression models is limited due to the neglect of potential spatial
dependence in different locations (Koenig, 1999). For instance, the
traffic flows of two adjacent road links are more similar than those of two
unconnected roads. The spatial dependence can be quantified using spatial
autocorrelation indices from the spatial regression models, which
incorporate the spatial correlation into the traditional regression
framework. Various researchers have applied spatial regression models
to capture the spatial autocorrelation between the urban environment
and traffic behaviour. Wang et al. (2020) compared the performance of
the spatial error model (SEM), spatial autoregressive model (SAM), and


spatial Durbin model (SDM) on the relationship between Uber accessi-
bility and road network structure. The results show that SDM is favoured
over SAM, and SEM reveals the worst performance. Rhee et al. (2016)
identified that the performance of the SEM was better than the SAM in
quantifying the traffic crash frequency with road length and speed limit,
while the geographically weighted regression model provided valuable
insights about localisation effects. Considering the spatial features of
urban traffic behaviour, spatial regression models have also been
used to measure how pedestrian behaviour relates to demographic and urban
environmental factors (Ha & Thill, 2011; LaScala et al., 2000).
The social, economic, and transportation hubs were considered spatial
variables that impact freight traffic generation (Novak et al., 2011),
while weather events affected the spatial distribution of freight truck
flows (Akter et al., 2020).

Fig. 1. The spatial distribution of average daily traffic flows from August 9, 2019, to October 16, 2021, in Glasgow.

Fig. 2. Data cleaning flowchart.


3. Study area and data

Glasgow (Fig. 1) is the most populous city and has the largest
economy in Scotland (BBC News, 2017; Statista, 2019). Glasgow is the
focal point of Scotland’s road network, with tens of thousands of resi-
dents commuting daily to the city (Department for Transport, 2021).
Cars are Glasgow’s most popular mode of transport, and two-thirds of
people use them for journeys around the city (Scotland’s Census, 2022).
Table 2 explains the data sources for this study, including traffic flow
data, urban land use data, socio-demographic data, POI data, road
characteristics data, and the GSV images in Glasgow.

3.1. Urban traffic flows

This study collects traffic flow data from August 9, 2019, to October
16, 2021, by road detectors within Glasgow City. The road detectors
include various above- and below-ground traffic sensors to record the
traffic flows at specific points. The traffic sensors count only the
number of motor vehicles and cannot discriminate between vehicle types.
We group the traffic data into four stages based on COVID-19 lockdown
periods published by the Scottish government and available dates of
traffic data (Hale et al., 2021; Institute for Government, 2021), including
traffic flows ‘Before COVID-19’ from August 9, 2019, to October 16,
2019, ‘1st Lockdown’ from March 24, 2020, to May 28, 2020, ‘2nd
Lockdown’ from January 6, 2021, to March 13, 2021, and ‘Post COVID-
19’ from August 9, 2021, to October 16, 2021. Both lockdown periods

Fig. 3. The spatial distribution of traffic flows in four COVID-19 periods in Glasgow.


are under the C6 level of OxCGRT Indicators, which refers to the policy
of Stay-at-home Requirements (Hale et al., 2021). During the study
period, 1033 sites of traffic flows were recorded, from which 530 valid
sites have been used in this research, located from the main road
(motorway) to the fifth-class road (local road), with a time interval of 15
min.

The data cleaning workflow, shown in Fig. 2, filters
the traffic flow data with spatial and temporal constraints (Li
et al., 2024). Since automatic traffic sensors collect the raw traffic flow data,
inspecting and cleaning the data before any analysis is necessary.
In the first step, we extract the coordinates of each recorded site and
compare them to the boundary of the Glasgow City Council (GCC) area. Sites outside
this area are beyond the scope of the research and are not considered
in the next step (59 sites). The second step filters the traffic data
on the temporal scale. The study period in this research is from August 9,
2019, to October 16, 2021, 800 days overall. The records of 69 sites in
Glasgow start later than August 2019 or end before October 2021, so these
sites are removed. The third step of the cleaning process focuses on the record
and error percentage of the real-time data. GCC has applied an interpolation method
for traffic flows when no data is returned. Of the 905 remaining sites in
Glasgow, 48 have >18 % interpolated data and are removed; the
rest have fewer than 1 %. The fourth step considers the spatial relationship
between the traffic flows and the road network. Since sensors record
the traffic flows in Glasgow along the road, the
Euclidean distance between each site and the nearest road should be
considered. In Glasgow, >99 % of detectors are located within 15 m of
roads. Only 9 sites are more than 20 m away, so we have removed them
from our analysis. Lastly, after performing all the above spatial and
temporal inspections, we can filter the traffic flow data numerically.
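
As an illustration only, the site-level filters described above can be sketched in Python. The record type `Site`, its field names, and the toy values are hypothetical and are not taken from the authors' code; only the study dates and the two thresholds (18 % interpolated data, 20 m from the nearest road) follow the text.

```python
from dataclasses import dataclass
from datetime import date

STUDY_START = date(2019, 8, 9)
STUDY_END = date(2021, 10, 16)


@dataclass
class Site:
    site_id: str
    inside_gcc: bool           # hypothetical flag: site lies within the GCC boundary
    first_record: date
    last_record: date
    interpolated_share: float  # fraction of records filled by interpolation
    dist_to_road_m: float      # Euclidean distance to the nearest road, in metres


def passes_filters(s: Site) -> bool:
    """Apply the spatial, temporal, interpolation, and road-distance checks in turn."""
    covers_period = s.first_record <= STUDY_START and s.last_record >= STUDY_END
    return (
        s.inside_gcc
        and covers_period
        and s.interpolated_share <= 0.18
        and s.dist_to_road_m <= 20.0
    )


# Toy records (invented values): only site "A" satisfies every check.
sites = [
    Site("A", True, date(2019, 8, 1), date(2021, 10, 31), 0.005, 6.0),
    Site("B", False, date(2019, 8, 1), date(2021, 10, 31), 0.005, 6.0),
    Site("C", True, date(2019, 9, 1), date(2021, 10, 31), 0.005, 6.0),
    Site("D", True, date(2019, 8, 1), date(2021, 10, 31), 0.25, 4.0),
    Site("E", True, date(2019, 8, 1), date(2021, 10, 31), 0.0, 35.0),
]
kept = [s.site_id for s in sites if passes_filters(s)]
```

The site counts reported in the text are consistent with this sequence: 1033 − 59 − 69 = 905 sites enter the interpolation check, and 905 − 48 − 9 = 848 sites remain before the numerical filtering.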

Consecutive zero values are the most frequent data issue in this time
series study. In this research, a consecutive zero value means that no
vehicles pass the detector for hours or days. First, we remove the 189
sites whose records show >90 % zero traffic flow during the study
period. Then, for each recorded site, the recorded dates with >24 h of
consecutive zeros are eliminated, assuming that at least one motor
vehicle should pass by each day in Glasgow urban areas.
Recorded dates without traffic flow for more than one day are treated as incorrect
and are not considered in this study. Of the 659 remaining sites, 530 are
selected, with at most 754 of the 800 days remaining
for any site. Overall, we retained 530 sites after the data-cleaning process
and calculated the daily traffic flows for each site. Fig. 3
shows the spatial variation of the daily average traffic flows
across the four COVID-19 stages.
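
The zero-run check can be sketched as follows, assuming one count per 15-min interval (96 per day). The function names are hypothetical, and dropping every calendar day touched by a long run is one possible reading of the rule above, not a confirmed detail of the authors' procedure.

```python
from typing import List, Set

INTERVALS_PER_DAY = 96  # 24 h of 15-minute counts


def days_with_long_zero_runs(counts: List[int],
                             max_run: int = INTERVALS_PER_DAY) -> Set[int]:
    """Return the indices of days touched by a zero run longer than max_run intervals."""
    bad_days: Set[int] = set()
    run_start = None
    # A trailing non-zero sentinel closes a run that reaches the end of the series.
    for i, value in enumerate(counts + [1]):
        if value == 0:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > max_run:
                first_day = run_start // INTERVALS_PER_DAY
                last_day = (i - 1) // INTERVALS_PER_DAY
                bad_days.update(range(first_day, last_day + 1))
            run_start = None
    return bad_days


def zero_share(counts: List[int]) -> float:
    """Fraction of intervals with no recorded vehicles (used for the >90 % site rule)."""
    return sum(1 for v in counts if v == 0) / len(counts)


# Toy series: one normal day, 36 h of zeros, then 36 h of normal counts.
series = [5] * 96 + [0] * 144 + [5] * 144
flagged = days_with_long_zero_runs(series)
```

In this toy series the zero run covers all of day 1 and half of day 2, so both days are flagged, while a run of exactly 24 h would not be.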

3.2. Distribution of traffic flows

In this research, we aggregate the raw traffic flows to daily totals. Using
these data, we compute the average daily traffic flows for each
recorded site, aiming to understand the spatial correlation between
traffic flows and the built environment. The statistical
distribution of average daily traffic flows in Glasgow over the different periods
is shown in Fig. 4. During the regular period before the pandemic,
the average daily traffic flow is approximately 1460 vehicles per
recorded site. Following the outbreak of COVID-19 and the first lockdown in
March 2020, the average traffic flow slumped to 788. The total daily
average traffic flow recorded by the
530 detectors during the first lockdown is 417,250, which is only 56.8 %
of the daily traffic flow in the regular period. The decline eased
during the second lockdown, with the average daily traffic flow rising to
Fig. 4. Histogram of daily average traffic flows.

Table 1
Comparison of traffic flows between different COVID-19 stages and the period
before COVID-19.

Zone  1st Lockdown      2nd Lockdown      2nd Lockdown      Post COVID-19
      compared to pre   compared to pre   compared to 1st   compared to pre
      COVID-19 (%)      COVID-19 (%)      Lockdown (%)      COVID-19 (%)
G1    −44.7             −38.4             17.5              −6.6
G11   −46.9             −33.2             31.1              −6.0
G12   −48.7             −25.4             57.7              −11.0
G13   −49.2             −22.8             56.0              5.2
G14   −48.3             −28.4             43.0              −11.3
G2    −30.4             −39.6             −8.3              −15.9
G20   −50.1             −23.3             59.1              0.1
G21   −43.1             −24.0             33.2              −9.1
G22   −43.0             −21.7             39.1              −4.1
G3    −49.2             −38.1             32.4              −13.3
G31   −46.3             −28.9             39.8              −11.9
G32   −27.5             −23.6             8.2               −6.8
G33   −45.1             −21.4             45.9              −0.4
G4    −47.6             −37.8             26.5              −16.5
G40   −48.7             −30.5             38.8              −8.4
G41   −52.2             −27.8             55.1              2.1
G42   −49.0             −25.9             50.6              −6.0
G43   −55.0             −31.1             54.8              −14.3
G44   −49.7             −24.0             53.9              −8.0
G45   −34.7             −8.1              40.6              13.1
G5    −46.2             −30.6             38.0              −9.2
G51   −37.2             −21.0             31.1              0.3
G52   −49.0             −28.9             42.7              −1.0
G53   −40.5             −23.6             31.0              −3.4


around 1024, a 30 % recovery from the first lockdown period.
During the post-COVID-19 period, the average daily traffic flow grew to
1334 in Glasgow. Although the daily traffic flows kept increasing after
the end of the first lockdown, the daily average traffic flow in the
post-pandemic period remains approximately 10 % below the regular level
before October 2019.
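
The recovery and the remaining gap quoted above can be checked directly from the mean daily flows per site reported in Table 2. This is a small arithmetic sketch; the dictionary and function names are illustrative only.

```python
# Mean average daily traffic flows per site, copied from Table 2.
mean_flow = {
    "Before COVID-19": 1459.223,
    "1st Lockdown": 787.265,
    "2nd Lockdown": 1024.469,
    "Post COVID-19": 1333.983,
}


def pct_change(new: float, base: float) -> float:
    """Percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


# Recovery of the second lockdown relative to the first lockdown.
recovery = pct_change(mean_flow["2nd Lockdown"], mean_flow["1st Lockdown"])
# Remaining gap of the post-COVID-19 period relative to the regular period.
gap = pct_change(mean_flow["Post COVID-19"], mean_flow["Before COVID-19"])
```

With these inputs the recovery is about +30 % and the remaining gap about −8.6 %, in line with the rounded figures in the text.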

The spatial distribution of daily average traffic flows throughout the
study period is shown in Fig. 1. The map shows
considerable variation in the average daily traffic flow across
Glasgow, with a clear west-east divide.
The number of motor vehicles travelling in the western part of
Glasgow is generally higher than that in the east. In the eastern part of
Fig. 5. Spatial distribution of land cover types and road links in Glasgow.

Fig. 6. Spatial distribution of GSV and POIs within traffic sensor buffer (200 m).


Glasgow, the daily traffic flows are higher towards the city centre and
lower in peripheral areas. However, relatively low traffic flows are
observed in the G3 and G40 areas. This could be attributed to the fact that
the G3 area is largely residential, encompassing many local flats and
student accommodations. Furthermore, Kelvingrove Park, a substantial
green space within the G3 area, may also contribute to the low traffic
flows. Similarly, the G40 area is a suburban district in Glasgow
characterised by several residential neighbourhoods and the historic
Glasgow Green Park, which may explain its low traffic flows. In the south of
Glasgow, the G53, G43, and G45 areas show low traffic flows,
owing to their predominantly residential character and large green
spaces, such as Cathkin Braes Country Park and Pollok Country Park.

We also compare the spatial distribution of traffic flow across four
pandemic stages. Fig. 3(1) and (2) show that the spatial distribution of
daily average traffic flows during the first lockdown and the regular
period changed dramatically. Table 1 shows a noticeable reduction in
the daily average traffic flows during the first lockdown compared to the
period before COVID-19, nearly halving across most areas of Glasgow.
Table 1 further reveals that during the second lockdown, there was an
increase in daily traffic flows across nearly all areas of Glasgow
compared to the first lockdown, except for the G2 area. This phenom-
enon could be attributed to the ‘City Centre Interventions’ implemented
by the Glasgow City Council in the second half of 2020, such as road
closures, which consequently restricted vehicular accessibility within
the G2 area. Both Fig. 3(4) and Table 1 indicate that in most areas of
Glasgow, the number of motor vehicles travelling each day almost re-
bounds to the level of the regular period when the government lifted the
restriction.

3.3. Independent variables

Independent variables considered in this research include land use,
socio-demographics, Point of Interest, Google Street Views, and road
characteristics. The land use dataset provides information on Functional
Urban Areas in Glasgow in 2018. There are 27 types of Functional Urban
Areas across the GCC area. Based on their functionality and the syn-
dromes of human activities, we aggregate them into six groups (Fig. 5):
urban residential areas; green urban areas; natural areas; industrial,
commercial, public, military, and private units; roads and railways; and
others. The satellite image classification is at 10 m resolution by using
Sentinel-2 data. The socio-demographics information is provided by the
most recent Scottish census at the Output Area level in 2011 (Scotland’s
Census, 2021). There are 5486 Output Areas in the GCC area, and we
select the important census data based on the previous research (Ma
et al., 2014; Schoenau & Müller, 2017; Zhou, Yuan, et al., 2021),
including median age, male percentage, white percentage, and college
degree percentage (2011 Census, 2013). We obtain 28,703 POIs in Glasgow from
Digimap and the Ordnance Survey, categorising them into
seven groups: retail; accommodation, eating and drinking; attractions;
sports and entertainment; transport; education and health; and public
infrastructure. Road network data from Digimap provide details of road
types, names, lengths, and widths. We use the information on road types
(Fig. 5) and lengths to measure the road characteristics near each
traffic sensor. GSV is a novel data source that provides an intuitive,
human-perspective street view for understanding the built environment
of cities (Sun et al., 2022). We obtain the GSV images from
the Street View Static Application Programming Interface (API), and

Table 2
Summary statistics of variables (with a 200-m buffer).

  Category                                                         Mean      Std. dev  Min      Max

Dependent variable
Average daily traffic flows. Data source: Glasgow City Council (GCC) (GCC, 2022)
  Before COVID-19                                                  1459.223  803.606   43.710   5901.232
  1st Lockdown                                                     787.265   513.501   35.154   5197.569
  2nd Lockdown                                                     1024.469  601.138   66.938   4362.308
  Post COVID-19                                                    1333.983  733.496   139.029  5145.044

Independent variable
Road link. Data source: EDINA Digimap Service (OS MasterMap Highways Network, 2019)
  Motorway (km/sq.km)                                              0.700     2.639     0.000    19.165
  Major road (km/sq.km)                                            4.109     3.443     0.000    16.576
  Secondary road (km/sq.km)                                        4.195     2.995     0.000    14.309
  Local road (km/sq.km)                                            12.714    5.050     1.516    26.756
POI. Data source: EDINA Digimap Service (Points of Interest, 2021)
  Quantity                                                         72.987    82.868    4.000    586.000
  Public Infrastructure (%)                                        0.157     0.106     0.000    0.875
  Education and health (%)                                         0.066     0.056     0.000    0.333
  Transport (%)                                                    0.154     0.137     0.000    0.882
  Retail (%)                                                       0.126     0.100     0.000    0.600
  Sport and entertainment (%)                                      0.042     0.050     0.000    0.353
  Accommodation, eating, and drinking (%)                          0.113     0.084     0.000    0.500
  Attractions (%)                                                  0.024     0.047     0.000    0.444
Land use. Data source: Urban Atlas (Urban Atlas, 2018)
  Industrial, commercial, public, military, and private units (%)  0.311     0.211     0.000    0.961
  Roads and railways (%)                                           0.181     0.084     0.028    0.475
  Urban residential areas (%)                                      0.389     0.231     0.000    0.857
  Natural areas (Water included) (%)                               0.014     0.046     0.000    0.307
  Green urban areas (%)                                            0.094     0.118     0.000    0.691
  Other (%)                                                        0.010     0.046     0.000    0.538
Socio-demographic. Data source: National Records of Scotland (NRS) (Scotland’s Census, 2011)
  White (%)                                                        0.857     0.089     0.538    0.992
  Mean age                                                         37.525    5.424     22.797   59.562
  Male (%)                                                         0.511     0.060     0.394    0.797
  College degree (%)                                               0.280     0.161     0.034    0.627
GSV. Data source: Google Maps Platform (Google Developers, 2022)
  Road (%)                                                         0.293     0.050     0.123    0.423
  Building (%)                                                     0.208     0.167     0.000    0.583
  Vegetation (%)                                                   0.126     0.119     0.000    0.553
  Car (%)                                                          0.036     0.037     0.000    0.201


images of each site are recorded from 4 perspectives: 0°, 90°, 180°, and
270° (Fig. 6). The latest updated images are selected for analysis, with
dates no earlier than 2019. We apply the pre-trained DeepLab model
from TensorFlow to perform semantic segmentation on GSV images (Yu
et al., 2020). The outputs demonstrate the pixel percentage of 19 typical
cityscape objects, from which the most relevant categories, such as
roads, buildings, vegetation, and cars, are considered in this study.
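
A minimal sketch of the pixel-percentage step is given below. It assumes that the segmentation output is an integer label mask using the standard 19-class Cityscapes train IDs (road = 0, building = 2, vegetation = 8, car = 13) and that the four headings of a site are averaged; both are assumptions, since the text does not state the label encoding or the aggregation rule. The toy mask is invented.

```python
import numpy as np

# Assumed Cityscapes train IDs for the four categories retained in the study.
CLASS_IDS = {"road": 0, "building": 2, "vegetation": 8, "car": 13}


def pixel_shares(mask: np.ndarray) -> dict:
    """Share of pixels in one label mask that belongs to each selected class."""
    total = mask.size
    return {name: float(np.count_nonzero(mask == cid)) / total
            for name, cid in CLASS_IDS.items()}


def site_shares(masks: list) -> dict:
    """Average the per-image shares over all headings of one site (an assumed rule)."""
    per_image = [pixel_shares(m) for m in masks]
    return {name: sum(p[name] for p in per_image) / len(per_image)
            for name in CLASS_IDS}


# Toy 2 x 4 mask: three road pixels, one building, one vegetation, one car, two sky (10).
toy_mask = np.array([[0, 0, 2, 8],
                     [0, 13, 10, 10]])
shares = pixel_shares(toy_mask)
```

For the toy mask, road accounts for 3 of 8 pixels and each of the other three classes for 1 of 8.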

To align the spatial resolution of the independent variables, we
create buffer zones from each traffic sensor and calculate the percentage
or density of each variable within the buffer area (Fig. 6). For example,
the density of road link represents the total length of roads per unit area
within the buffer zone; the percentage of land use type is weighted by
the share of each spatial area within the buffer area; and the percentage
of POI is calculated via the number of each POI category divided by the
total number of POI within the buffer zone. We calculate the percentage
of socio-demographic data in two steps. First, each census Output Area
located near a recorded site is segmented by the buffer zone.
We calculate the socio-demographics (including the total population) of
each segmented area separately by weighting by the share of the
Output Area that falls within the buffer ($q_{ij} = OA_{ij} \times w_{ij}$). Then, we add the
socio-demographics of the segmented areas together and divide them by
the total population of the buffer zone to get the percentage of
socio-demographics ($P_j = \sum \frac{q_{ij}}{q_{i,\mathrm{pop}}}$). Concerning the Modifiable Areal Unit
Problem (MAUP), we test different buffer radii varying from 100 m to
400 m with a step of 100 m. The detailed comparison between buffer
sizes is presented in the results section. Variable selection is also performed
to avoid the multi-collinearity problem. We calculated the
variance inflation factor (VIF), and variables with a VIF greater than five were excluded
from the analysis (Akinwande et al., 2015). Table 2 summarises the
descriptive statistics for all retained variables (VIF ≤ 5) with a 200-m buffer.
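
The two-step area weighting of census data can be illustrated with a short sketch. It assumes that population is spread uniformly within each Output Area, so that the area share inside the buffer can serve as the weight; the function name and the toy counts are hypothetical.

```python
from typing import List, Tuple


def buffer_share(pieces: List[Tuple[float, float, float]]) -> float:
    """Area-weighted share of a population group within one buffer.

    Each piece is (group_count, total_population, area_weight), where area_weight
    is the fraction of the Output Area's surface that falls inside the buffer.
    """
    # Step 1: weight each Output Area's counts by its area share in the buffer.
    group = sum(count * weight for count, _, weight in pieces)
    population = sum(pop * weight for _, pop, weight in pieces)
    # Step 2: divide the weighted group total by the weighted population total.
    return group / population


# Toy buffer overlapping two Output Areas (invented numbers):
# OA 1 has 80 of 100 residents in the group and lies 50 % inside the buffer;
# OA 2 has 150 of 200 residents in the group and lies 25 % inside the buffer.
share = buffer_share([(80, 100, 0.5), (150, 200, 0.25)])
```

Here the weighted group count is 40 + 37.5 = 77.5 and the weighted population is 50 + 50 = 100, giving a share of 0.775.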

4. Methodology

This section provides a detailed description of our methodology, and
the steps for model selection are: (1) Conduct the linear regression
analysis; (2) Construct the spatial weight matrix; (3) Perform the global
Moran’s I test and Lagrange Multiplier test for linear regression re-
siduals; (4) Employ the spatial econometric models with spatial weight
matrix.
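
For step (3), the global Moran's I statistic has the standard form I = (n / S0) · (zᵀWz) / (zᵀz), where z holds the mean-centred values and S0 is the sum of all weights. The sketch below computes only the statistic on toy data; the significance test and the Lagrange Multiplier test are not shown, and the chain-shaped weight matrix is a made-up example rather than the matrix used in the study.

```python
import numpy as np


def morans_i(values, w) -> float:
    """Global Moran's I of `values` under the spatial weight matrix `w`."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()                      # mean-centre the values
    w = np.asarray(w, dtype=float)
    s0 = w.sum()                          # sum of all weights
    return float((len(z) / s0) * (z @ w @ z) / (z @ z))


# Toy example: four sites on a line, each linked to its immediate neighbours.
chain = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
i_trend = morans_i([1.0, 2.0, 3.0, 4.0], chain)          # smooth gradient
i_alternating = morans_i([1.0, -1.0, 1.0, -1.0], chain)  # neighbours differ
```

A smooth gradient gives a positive value (similar neighbours), whereas the alternating pattern gives a negative value (dissimilar neighbours).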

4.1. Spatial weight matrix

To measure the spatial dependence of urban traffic flows, it is
necessary to construct the geographic relationship of each recorded site.
The spatial weight matrix can represent the magnitude of the spatial
relationships based on the locations of the traffic flows and those of their
neighbours (Rey et al., 2020; Wang et al., 2020). The general expression
of the spatial weight matrix is shown below (Nian et al., 2020):

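$$
W = \begin{pmatrix}
w_{11} & w_{12} & \cdots & w_{1m} \\
w_{21} & w_{22} & \cdots & w_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
w_{n1} & w_{n2} & \cdots & w_{nm}
\end{pmatrix} \tag{1}
$$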

The n × m spatial weight matrix quantifies the spatial correlation
between different recorded sites, and wij represents the degree of spatial
correlation between the recorded site i and its neighbour j (Gelb,
2022). In this study, we construct the spatial weight matrix based on the
k-nearest neighbours (k-NN) algorithm. Under k-NN, the traffic
flow of recorded site i is assumed to be influenced by its k nearest observations of
traffic flows. Therefore, for spatial autocorrelation between the traffic
flows, the elements of the spatial weight matrix are defined as:

$$
w_{ij} = \begin{cases}
1, & \text{if } d_{ij} \le d_{ik} \\
0, & \text{if } d_{ij} > d_{ik}
\end{cases} \tag{2}
$$

where dij indicates the Euclidean distance between site i and site j, dik
represents the Euclidean distance between site i and the farthest site k
among the k-nearest neighbours.
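
The rule in Eq. (2) can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the function name `knn_weights` and the coordinates are hypothetical, the matrix is taken to be square (every site is both a row and a column), the diagonal is set to zero so that a site is not its own neighbour, and ties at the k-th distance are all kept.

```python
import math


def knn_weights(coords, k):
    """Binary k-NN spatial weight matrix following Eq. (2).

    w[i][j] = 1 if d_ij <= d_ik, where d_ik is the distance from site i to
    its k-th nearest neighbour, and 0 otherwise. The diagonal stays 0
    (an assumption; Eq. (2) does not state it).
    """
    n = len(coords)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of sites")
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        # Euclidean distances from site i to every other site, nearest first.
        dists = sorted((math.dist(coords[i], coords[j]), j)
                       for j in range(n) if j != i)
        d_ik = dists[k - 1][0]
        for d_ij, j in dists:
            if d_ij <= d_ik:
                w[i][j] = 1
    return w


# Hypothetical sensor coordinates along a road (not from the paper).
sites = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
for row in knn_weights(sites, k=2):
    print(row)
```

In this toy layout the last site lists the second site as a neighbour, but not the reverse, which shows that a k-NN weight matrix need not be symmetric.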

4.2. Spatial econometric models

As mentioned in the literature review, spatial dependence has been
widely considered in the research of urban travel behaviours. Therefore,
the spatial autocorrelation analysis should be performed in the model
selection procedure to evaluate whether there is a relationship between
daily average traffic flows and geographical locations. In this study, we
estimate three spatial econometric models to analyse the association
between urban parameters and average daily traffic flows in Glasgow
(Anselin et al., 2013). We start with a brief introduction of the linear
regression model, which the spatial econometric models extend.
According to Montgomery et al. (2012), a linear regression model can be
defined as:

Y = Xβ+ ε (3)

where Y is a vector of the dependent variable, X is a matrix of
independent variables, β represents the vector of the regression
coefficients, and ε is the vector of random errors. The linear regression
model assumes that the errors are uncorrelated with one another and with
the independent variables.

According to LeSage and Pace (2010), a spatial error model (SEM)
assumes the error terms across different spatial units are correlated,
which violates the assumption of uncorrelated error terms in a linear
regression model. A spatial autoregressive model (SAM) assumes auto-
correlation is present in the dependent variables. The spatial Durbin
model (SDM) is a SAM that assumes autocorrelation may be present in
one or more independent variables and the dependent variable.

The basic form of an SEM is:

Y = Xβ+ μ (4)

μ = λWμ+ ε (5)

The basic form of a SAM is:

Y = ρWY+Xβ+ ε (6)

The basic form of an SDM is:

Y = ρWY+Xβ+WXθ + ε (7)

where W is the spatial weight matrix derived by the k-NN method. Wμ
captures the correlated interaction effects, that is, unobserved urban
environmental characteristics shared by adjacent neighbourhoods that
affect traffic flows in similar ways, with the coefficient λ (Wang et
al., 2020). WY describes the spatial interaction between adjacent
neighbourhoods in the traffic flows Y, with a spatial lag coefficient ρ;
these are also named endogenous interaction effects (Elhorst, 2014). WX
represents the exogenous interaction effects, which are the spatial
interaction effects (the spillover effects of a spatial model) of the
urban parameters X among adjacent neighbourhoods, with the coefficient θ.
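
To make the WY and WX terms of Eqs. (6) and (7) concrete, the sketch below row-standardises a binary k-NN matrix and computes spatial lags as neighbour averages. Row-standardisation, the separate intercept `beta0`, the single covariate, the function names and the toy values are all assumptions for illustration; the text does not state whether W was standardised.

```python
def row_standardise(w):
    """Scale each row of W to sum to 1, so a spatial lag is a neighbour average."""
    result = []
    for row in w:
        total = sum(row)
        result.append([value / total if total else 0.0 for value in row])
    return result


def spatial_lag(w, values):
    """Matrix-vector product W @ values."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, values)) for row in w]


def sdm_systematic_part(w, y, x, rho, beta0, beta, theta):
    """rho*WY + beta0 + X*beta + WX*theta for one covariate (Eq. 7 without the error).

    Setting theta = 0 gives the systematic part of the SAM in Eq. (6).
    """
    wy = spatial_lag(w, y)
    wx = spatial_lag(w, x)
    return [rho * wy_i + beta0 + beta * x_i + theta * wx_i
            for wy_i, x_i, wx_i in zip(wy, x, wx)]


# Hypothetical 2-nearest-neighbour matrix for four sites (not from the paper).
W = row_standardise([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 0]])
log_flows = [6.0, 5.0, 4.0, 2.0]     # hypothetical Y
pct_natural = [0.1, 0.3, 0.2, 0.6]   # hypothetical X

print(spatial_lag(W, log_flows))     # WY: average log flow of each site's neighbours
print(spatial_lag(W, pct_natural))   # WX: average covariate of each site's neighbours
```

With row-standardised weights, ρ scales the average outcome of a site's neighbours and θ scales the average covariate of those neighbours, which is one common way to read the endogenous and exogenous interaction effects described above.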

In the selection procedure of models, global Moran’s I is a measure to
quantify the spatial autocorrelation in traffic flow residuals of linear
regression. The formula can be defined as follows (Draper & Smith,
1998):

$$
I = \frac{n}{S} \, \frac{\epsilon^{T} W \epsilon}{\epsilon^{T} \epsilon} \tag{8}
$$

The drawback of global Moran’s I is that it does not reveal the type of
spatial autocorrelation. The Lagrange Multiplier (LM) test is designed to
test which type of spatial regression model is most appropriate for the
traffic flow data. The expression of the LM test for SEM is:

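$$
LM_{Error} = \left( \frac{n \, \epsilon^{T} W \epsilon}{\epsilon^{T} \epsilon} \right)^{2} \left[ \mathrm{tr}\left( W^{T} W + W^{2} \right) \right]^{-1} \tag{9}
$$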

LM test for SAM takes the form:

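$$
LM_{Lag} = \left( \frac{n \, \epsilon^{T} W Y}{\epsilon^{T} \epsilon} \right)^{2} \left[ \frac{n \, (W X \hat{\beta})^{T} M (W X \hat{\beta})}{\epsilon^{T} \epsilon} + \mathrm{tr}\left( W^{T} W + W^{2} \right) \right]^{-1} \tag{10}
$$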

where n is the number of observations of traffic flows, S is the sum of
spatial weights in W, and ϵ is the vector of residuals obtained by OLS
estimation of the linear regression model. β̂ refers to the estimated
parameters of the linear regression model, M is the OLS residual
projection matrix, $M = I_n - X (X^{T} X)^{-1} X^{T}$ with $I_n$ the
n × n identity matrix, I is the value of Moran’s I, and tr is the matrix
trace operator.
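
The residual diagnostics can be sketched as follows. The code computes global Moran’s I as in Eq. (8) and an LM-Error statistic in the standard form (n ϵᵀWϵ / ϵᵀϵ)² / tr(WᵀW + W²), which is an assumption about how Eq. (9) is meant to be read. The p-value helper uses the chi-square distribution with one degree of freedom, the usual reference distribution for these LM tests. The function names, toy weights and residuals are hypothetical.

```python
import math


def _quadratic_form(resid, w):
    """epsilon^T W epsilon for a residual vector and a square weight matrix."""
    n = len(resid)
    return sum(resid[i] * w[i][j] * resid[j] for i in range(n) for j in range(n))


def morans_i(resid, w):
    """Global Moran's I of residuals, Eq. (8): (n / S) * (e'We) / (e'e)."""
    n = len(resid)
    s = sum(sum(row) for row in w)   # S: sum of all spatial weights
    return (n / s) * _quadratic_form(resid, w) / sum(e * e for e in resid)


def lm_error(resid, w):
    """LM-Error statistic: (n * e'We / e'e)^2 / tr(W'W + W^2)."""
    n = len(resid)
    # tr(W'W) = sum of w_ij^2 and tr(W^2) = sum of w_ij * w_ji.
    trace = sum(w[i][j] ** 2 + w[i][j] * w[j][i]
                for i in range(n) for j in range(n))
    ratio = n * _quadratic_form(resid, w) / sum(e * e for e in resid)
    return ratio ** 2 / trace


def chi2_1df_pvalue(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical chain of four sites with zero-mean residuals (not from the paper).
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
residuals = [1.0, 1.0, -1.0, -1.0]

print(morans_i(residuals, W))
print(lm_error(residuals, W))
print(chi2_1df_pvalue(7.756))   # LM-Error statistic reported for 'Before COVID-19' in Table 3
```

The last line gives a p-value of roughly 0.005, in line with the value reported alongside that statistic in Table 3.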

In addition, the Akaike information criterion (AIC) was used in the
model selection. The AIC is a mathematical method that can be used to
evaluate the relative quality of a collection of models for a given set of
data (McElreath, 2018; Stoica & Selen, 2004; Taddy, 2019). Specifically,
AIC compares the performance of different models and determines
which models best fit the dataset (Zajic, 2019). The AIC value of a
model is defined as:

AIC = 2k − 2ln(L̂) (11)

where k is the number of estimated parameters in the model and L̂ is the
maximised value of the likelihood function of the model. In statistics,
AIC estimates the relative amount of information lost by a given model.
Therefore, the best-fit model, according to AIC, is the one with the
lowest AIC value, which loses the least information (Bevans, 2020).
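
As a small worked example of Eq. (11), the sketch below recomputes the AIC from a log-likelihood and, in the other direction, recovers the parameter count implied by a reported AIC. The two input numbers are the values reported for the SDM in the 'Before COVID-19' stage in Table 4. The parameter count of 66 is inferred here by inverting Eq. (11) and is not stated in the text, and the function names are hypothetical.

```python
def aic(k, log_likelihood):
    """Eq. (11): AIC = 2k - 2 ln(L-hat), given the maximised log-likelihood."""
    return 2 * k - 2 * log_likelihood


def implied_k(aic_value, log_likelihood):
    """Invert Eq. (11) to get the parameter count implied by a reported AIC."""
    return (aic_value + 2 * log_likelihood) / 2


log_lik_sdm = -412.046   # Table 4, SDM, Before COVID-19
aic_sdm = 956.092        # Table 4, SDM, Before COVID-19

k_sdm = round(implied_k(aic_sdm, log_lik_sdm))   # inferred, not reported in the text
print(k_sdm, round(aic(k_sdm, log_lik_sdm), 3))
```

Because the penalty term 2k grows with the number of estimated parameters, two models with similar log-likelihoods can still differ noticeably in AIC.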

5. Results

Before applying the model selection method proposed by Anselin
et al. (2013), we first select the model from a theoretical perspective.
Traffic flows on geographically close roads are often more similar
than those on more widely separated road links. Besides, urban
parameters, such as the type of land use, POI, and socio-demographics,
are highly likely to be located near similar parameters. For example,
cinemas are often situated near shopping centres, as both are places of
recreation, and people from the same ethnic group tend to dwell in
adjacent neighbourhoods. Considering the potential endogenous and
exogenous spatial interaction effects mentioned above, the SDM is
preferred. Then, we follow the standard procedures of model selection.
An OLS-based linear regression analysis is conducted on the relationship
between daily average traffic flows and the urban characteristics within
four different buffer sizes. To meet the normality assumption of the
linear regression model, we apply the log transformation to the daily
average traffic flows and all the independent variables in percentage.

Fig. 7. Boxplot of global Moran’s I test on OLS residuals (64 models).

Fig. 8. Boxplot of P value from Lagrange multiplier (LM) diagnostics.

Fig. 9. Boxplot of two model parameters – Pseudo R-squared and Log-Likelihood.

5.1. Results of global Moran’s I and Lagrange multiplier test

By substituting the spatial weight matrix based on the nearest
neighbours into the OLS regression, the statistical results of the global
Moran’s I test are obtained to test the spatial dependence of models. We
calculate the global Moran’s I for the Ordinary Least Squares residuals of
daily average traffic flows during four pandemic stages. Fig. 7 presents
the distribution of global Moran’s I value and the corresponding P-value
for different combinations of buffer size and k-NN. All the P-values are
lower than 0.05, suggesting the existence of spatial autocorrelation.
Therefore, spatial econometric models are preferred to avoid biased
estimations via OLS.

Fig. 8c presents the results of Lagrange multiplier diagnostics among
four periods for different combinations of buffer size and four spatial
weight matrices based on k-NN. The buffer size ranges from 100 m to
400 m, with a 100-m step. The LM diagnostics include four tests
(Anselin, 2005; Wang et al., 2020): the standard LM test for the
error dependence (LM-Error), the standard LM test for an unobserved
spatially lagged dependent variable (LM-Lag), the robust LM test for the
error dependence based on the possible presence of an unobserved
lagged dependent variable (RLM-Error), and the robust LM test for an
unobserved spatially lagged dependent variable based on the possible
presence of error dependence (RLM-Lag). The robust versions of the LM diagnostics are conducted only when the standard versions are significant (p < 0.05). According to Fig. 8, both the LM-Error and LM-Lag statistics are highly significant, with the latter slightly more so. In this case, the robust forms of the diagnostics need to be considered. Here, only the p-values of the RLM-Lag statistics are all <0.05. Although some results of RLM-Error are also significant, the p-value of RLM-Lag is much lower than that of RLM-Error, which indicates that the preferred model lies between the SAM and the SDM (Anselin, 2005). Furthermore, the values of both log-likelihood and Pseudo-R² in Fig. 9 indicate that the SDM is favoured over the SAM. Therefore, this study focuses on the results provided by the SDM.

5.2. Results of spatial econometric models

With the 200 m buffer from the traffic sensors and the four nearest neighbours¹ of the spatial weight matrix, we quantify the relationship between traffic flows, urban infrastructure, and socio-demographic indicators at four pandemic stages in Glasgow. According to the results of the RLM-Lag test in Table 3, the spatial dependence between adjacent neighbourhoods among the traffic flows and urban parameters varies across the four COVID-19 periods. The RLM-Lag statistic drops from 17.754 to 10.299 over the first two stages, suggesting that the spatial dependence weakened with the pandemic outbreak. However, the statistic rises to 23.185 in the second lockdown, suggesting that the spatial dependence during the second lockdown is stronger than before. Besides, both the values of RLM-Lag and AIC (Table 4) reveal that the variables after COVID-19 show the most substantial spatial dependence.

Table 5 presents the significant results of the SDM between urban parameters and daily average traffic flows. We can observe minor differences in the significance of the estimated coefficients during the four pandemic periods.
The density of major roads is significantly positively related to the daily average traffic flows before and during the lockdowns but becomes insignificant after COVID-19. In this study, a major road refers to a road intended to provide principal and large-scale transport links within or between cities and towns (Ordnance Survey, 2021). The insignificant result for traffic flows between distant destinations supports the idea that people have reduced their long-distance travel after the COVID-19 lockdown. This could be due to the increased adoption of remote work, online shopping, and virtual social activities, which reduces the necessity for physical travel.

Furthermore, from the cityscape objects extracted from GSV, we observe that the relationship between the percentage of vegetation cover along the roads and the number of vehicles passing by was only significant before the pandemic. Comparing the different significance levels of the two vegetation-related characteristics (Pct. of natural areas (land use) and Pct. of vegetation (from GSV)), we identify that the green space within the surrounding area has a significant impact on urban traffic flows, while that in the immediate nearby location is insignificant after the outbreak of COVID-19.

From the regression coefficients of urban characteristics from the SDM, a negative relationship exists between the natural areas and the daily average traffic flows, indicating that fewer vehicles travel to the natural green spaces than to other areas. The mean age of residents in Glasgow ranges from 23 to 60, and the areas with more young and white dwellers are associated with more traffic flows. The phenomenon may result from most youngsters living in areas with high population density, like the city centre, which attracts many motor vehicles. Furthermore, the major roads connecting cities and towns consistently carry heavy traffic flows before and during COVID-19.
The values of the regression coefficients remain stable among the four stages, indicating the consistency of the model results.

Table 3
Results of Lagrange multiplier (LM) diagnostics.

| Test | Before COVID-19 Stat. | Before COVID-19 P | 1st Lockdown Stat. | 1st Lockdown P | 2nd Lockdown Stat. | 2nd Lockdown P | Post COVID-19 Stat. | Post COVID-19 P |
|---|---|---|---|---|---|---|---|---|
| LM-Error | 7.756 | 0.005 | 9.327 | 0.002 | 30.531 | <0.001 | 16.980 | <0.001 |
| LM-Lag | 15.905 | <0.001 | 15.601 | <0.001 | 46.174 | <0.001 | 31.304 | <0.001 |
| RLM-Error | 9.605 | 0.002 | 4.026 | 0.045 | 7.541 | 0.006 | 14.653 | 0.001 |
| RLM-Lag | 17.754 | <0.001 | 10.299 | 0.001 | 23.185 | <0.001 | 28.977 | <0.001 |

¹ Due to the limited space, a detailed selection of spatial weight matrix and buffer size can be found in the Appendices.

6. Discussions

In this study, we extend previous work by exploring the relationship between urban parameters and traffic flows across the COVID-19 pandemic in Glasgow. First, it provides knowledge of how the built environment and socio-demographics matter in the daily average traffic flows, complementing previous studies that consider the pandemic outbreak as the sole influence on travel behaviours (Aloi et al., 2020; Bucsky, 2020; Hadjidemetriou et al., 2020; Parr et al., 2020). Second, we employ long time series traffic flow data spanning multiple years in Glasgow. In the United Kingdom, limited research has utilised this emerging form of data due to its inherent challenges, characterised by disorganisation, incompleteness, and errors. This can be attributed to the automated collection of data from road detectors. In this research, we construct a detailed data cleaning process for raw traffic counts from traffic sensors, which supports reproducibility, replicability, and extensibility to other study areas. Additionally, our study conducts a detailed comparative analysis of traffic flow changes at a high temporal granularity, delineated into four distinct stages: ‘Before COVID-19’, ‘1st Lockdown’, ‘2nd Lockdown’, and ‘Post COVID-19’.
This approach stands in contrast to prior research, which only focused on comparing traffic flow differences between the pre-pandemic and during-pandemic periods (Aloi et al., 2020; Gao & Levinson, 2022; Liu & Stern, 2021; Parr et al., 2020; Tian et al., 2021). Third, by leveraging the emerging urban big data source of Google Street View images, our study highlights the heterogeneous effects of green space on urban traffic flow, specifically emphasising its sensitivity to distance variations. It also identifies how different stages of COVID-19 contribute to the spatial dependence of traffic flow and urban parameters. The research outputs will help city planners understand what physical and social factors influence traffic flows, informing urban planning and resource allocation.

We have found that the areas with more young and white dwellers are associated with more traffic flows, while areas covered with natural green space are associated with fewer traffic flows. This is not surprising, as youngsters may prefer to live in the busy city centre for leisure activities and easy access to different amenities, and white dwellers are more likely to travel by car, given higher income and better employment (Klinger & Lanzendorf, 2016). Major roads between cities and towns also show heavier traffic flows, as tens of thousands of residents commute daily to Glasgow (Department for Transport, 2021). These findings inform where cities should prioritise investing in public transport infrastructure and promoting active travel.

Besides, the impact of the young and white dwellers on urban traffic flows increases gradually, and the impact of land cover types like natural areas decreases due to COVID-19. We also detect that the spatial dependence between adjacent neighbourhoods among the traffic flows and urban parameters increases significantly after the pandemic.
The results are consistent with previous research showing that travel demand is more linked to community life after COVID-19 (Nian et al., 2020). Unlike other studies, we find, using the novel data source of Google Street View images, heterogeneous effects of green space on urban traffic flows, as the magnitudes of the effects vary by distance. We identify that green space within a surrounding area has a more significant impact on urban traffic flows than that in the immediate nearby location, which suggests that car travel is distance-oriented rather than attracted by the environment at a single location.

In summary, this research bridges the gap between the literature on quantitative analysis of urban built environments and the understanding of influential factors affecting urban traffic flows. By examining the time series of traffic data encompassing four distinct stages of the COVID-19 pandemic, we explore the heterogeneous and linear relationship between these influential urban factors and traffic flow patterns. This research provides a comprehensive analysis integrating insights from the COVID-19 mobility pattern literature with the broader context of urban traffic flows. The outputs will offer valuable insights to city planners and policymakers regarding the physical and social factors influencing traffic flows, which can inform effective urban planning and resource allocation strategies. Specifically, these insights can help to support and enhance government policies such as Glasgow's Low Emission Zone (Glasgow City Council, 2018), which came into force on June 1, 2023. This knowledge can inform the design and implementation of targeted interventions within the Low Emission Zone, including optimising traffic management systems, promoting alternative modes of transportation, and strategically locating infrastructure to mitigate congestion and pollution.
The integration of long-term traffic flow data and analysis of multiple years' worth of data from Glasgow further strengthens the reliability and robustness of the research findings, ensuring their applicability to real-world policy decisions.

Table 4
Results of model parameters.

| Stage | Model | Log-Likelihood | Pseudo-R² | AIC |
|---|---|---|---|---|
| Before COVID-19 | SDM | −412.046 | 0.252 | 956.092 |
| Before COVID-19 | SAM | −425.466 | 0.216 | 918.933 |
| Before COVID-19 | SEM | −425.697 | 0.194 | 917.394 |
| 1st Lockdown | SDM | −415.382 | 0.250 | 962.764 |
| 1st Lockdown | SAM | −434.369 | 0.198 | 936.738 |
| 1st Lockdown | SEM | −435.908 | 0.175 | 937.816 |
| 2nd Lockdown | SDM | −401.066 | 0.305 | 934.133 |
| 2nd Lockdown | SAM | −412.415 | 0.256 | 910.831 |
| 2nd Lockdown | SEM | −422.124 | 0.192 | 910.248 |
| Post COVID-19 | SDM | −378.388 | 0.277 | 888.776 |
| Post COVID-19 | SAM | −397.935 | 0.229 | 863.871 |
| Post COVID-19 | SEM | −398.933 | 0.177 | 863.866 |

Several limitations exist in this study, which could also pave the path for future work. First, we mix all the types of motor vehicles in the traffic flow analysis, such as private cars and public transport. Ideally, information considering the differences in traffic flows between public and private transport will provide a more comprehensive view of the traffic analysis for COVID-19. In the future, we plan to extend the current work to active travel with further data support, for example, how the traffic flows of cyclists and pedestrians change during COVID-19 compared to motor vehicles. Second, we focus on the spatial impact of variables on the average daily traffic flows in four different periods and overlook the temporal variation of traffic flows. Future work can conduct time-series analysis, including the traffic flow changes under different weather conditions, at different times of day, on weekdays vs. weekends, and across holidays and seasons. We also intend to undertake a temporal analysis of traffic flows to scrutinise and comprehend patterns during peak periods.
Third, we can use our current analysis framework to evaluate the usefulness of implementing the new transport policy. The newly established low-emission zone has been in operation since June 1, 2023. We can evaluate the traffic flow changes within and outside the low-emission zone and estimate the new policy's effectiveness.

7. Conclusion

Understanding urban traffic flows in high spatiotemporal resolution has been an immediate agenda for creating net-zero carbon cities. Taking the city of Glasgow as an example, we have found that not only the socio-demographics, such as the age and ethnic group of people, are associated with traffic flows, but also the land cover types, such as green space and major roads, are related to the traffic flows. Meanwhile, heterogeneous effects of green space on urban traffic flows exist, as the magnitudes of the effects vary by distance. Furthermore, we have explored the variation of spatial dependence between adjacent neighbourhoods among the traffic flows and the urban parameters in response to COVID-19. With the influence of COVID-19, there has been a significant decrease in long-distance travel. The noticeable change in travel behaviour presents a valuable opportunity to encourage active travel and achieve a net-zero carbon target in the near future (Calafiore et al., 2022).

CRediT authorship contribution statement

Yue Li: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Methodology, Formal analysis, Data curation, Conceptualization. Qunshan Zhao: Writing – review & editing, Supervision, Resources, Project administration, Methodology, Funding acquisition, Conceptualization. Mingshu Wang: Writing – review & editing, Supervision, Project administration, Methodology, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data availability

This study integrated existing research data obtained upon request from various sources. Full details about the data acquisition can be found in the documentation available at the GitHub repository: https://github.com/YueLi-0816/trafficFlowAnalysis.

Acknowledgement

The first author is funded by the China Scholarship Council (CSC) from the Ministry of Education of P.R. China. Dr Qunshan Zhao has received the ESRC’s ongoing support for the Urban Big Data Centre (UBDC) [ES/L011921/1 and ES/S007105/1], and Royal Society International Exchange Scheme [IEC\NSFC\223042]. The authors thank the anonymous reviewers for their insightful comments and suggestions on an earlier version of this manuscript.

Table 5
Results of the relationship between urban parameters and daily average traffic flows of SDM.

| Y = Log(flows) | Before COVID-19 Beta | Before COVID-19 P | 1st Lockdown Beta | 1st Lockdown P | 2nd Lockdown Beta | 2nd Lockdown P | Post COVID-19 Beta | Post COVID-19 P |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 6.751*** | 0.000 | 5.939*** | 0.000 | 5.416*** | 0.000 | 5.938*** | 0.000 |
| Log(Pct. of Natural areas) | −0.061** | 0.009 | −0.088*** | 0.000 | −0.065** | 0.004 | −0.053* | 0.015 |
| Mean age | −0.027* | 0.038 | −0.031* | 0.019 | −0.031* | 0.015 | −0.036** | 0.003 |
| Log(Pct. of White) | 1.325* | 0.031 | 1.264* | 0.040 | 1.540* | 0.010 | 1.602** | 0.003 |
| Major Road (km/sq.km) | 0.037* | 0.044 | 0.038* | 0.039 | 0.039* | 0.027 | 0.031 | 0.062 |
| Log(Pct. of Vegetation) | −0.039* | 0.036 | −0.017 | 0.358 | −0.010 | 0.569 | −0.022 | 0.207 |

*, **, and *** indicate significance at the 0.05, 0.01, and 0.001 levels, respectively.

Appendix A

A.1. KNN

Fig. A1 presents the Akaike information criterion (AIC) value distribution for different buffer sizes with four spatial weight matrices based on k-NN. According to the model selection results above, we select the buffer and spatial weight matrix based on the SDM. According to Fig. A1, the trend of the AIC value among different buffer sizes is consistent across the four COVID-19 stages. Regardless of the spatial weight matrix selection, the AIC scores reach the lowest point at the 200-m buffer area for the SDM. The line charts in Fig. A2 also indicate that the SDM with a 200-m buffer area for independent variables performs better than the rest. Besides, according to Fig. A2, the AIC value of each SDM is almost the same across the range of k-NN, among which the model applying the spatial weight matrix with 4-nearest neighbours generates the relatively highest quality. Hence, we select the 200-m buffer and the spatial weight matrix with 4-nearest neighbours in this study.

Fig. A1. The Akaike information criterion (AIC) of the spatial Durbin model (SDM) among different buffer sizes and different KNN (K = 3, 4, 5, 6) weight matrices.

Fig. A2. The Akaike information criterion (AIC) of the spatial Durbin model (SDM) among different KNN weight matrices by the buffer size.

A.2. Street environment

Our research uses three types of street environment-related data: land use data, POI data, and GSV images. There is only one version of land use data and GSV images throughout the study period from 2019 to 2021, while the POI data are updated every three months. We selected and analysed four versions of POI data during four COVID-19 stages. Specifically, we applied ANOVA (analysis of variance) to test the difference between the four versions of POI data. ANOVA is a statistical method used to test the statistical differences between the means of three or more independent groups. In this study, we applied ANOVA to test the differences between four groups of POI data, each corresponding to a specific stage of COVID-19. There are seven categories of POI data, and we performed an individual ANOVA for each category.
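
The per-category test can be sketched as a one-way ANOVA F statistic, the ratio of the between-group mean square to the within-group mean square. The function name and the POI counts below are hypothetical and serve only to show the arithmetic, with one group per COVID-19 stage.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA: between-group over within-group mean square."""
    k = len(groups)                              # number of groups (stages)
    n = sum(len(g) for g in groups)              # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares, weighted by group size.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares around each group's own mean.
    ssw = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    msb = ssb / (k - 1)
    msw = ssw / (n - k)
    return msb / msw


# Hypothetical POI counts in one category for three buffers at each of four stages.
poi_counts = [
    [12.0, 15.0, 11.0],   # Before COVID-19
    [12.0, 14.0, 11.0],   # 1st Lockdown
    [13.0, 15.0, 10.0],   # 2nd Lockdown
    [12.0, 15.0, 12.0],   # Post COVID-19
]
print(one_way_anova_f(poi_counts))
```

An F statistic close to zero, as in this toy case, means the group means differ very little relative to the spread within each group.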
The formula for ANOVA is represented as follows: F = MSB MSW where MSB stands for the mean sum of squares of POI numbers across the four stages of COVID-19, representing the between-group variance. MSW is the mean sum of squares of POI numbers within each COVID-19 stage, representing the within-group variance. The F statistic is used to test the hypothesis of whether there are significant differences among the group means. Table A1 demonstrates that all the p-values are higher than 0.05, meaning there is not significantly difference among the four COVID-19 stages. In this case, we assume there was no obvious street environment change in our study regions during the study periods. Table A1 ANOVA test results of POI data. POI Category F Statistic P-value Public Infrastructure 0.577 0.629 Education and Health 0.070 0.975 Transport 0.983 0.399 Retail 0.044 0.987 Sport and Entertainment 1.249 0.290 Accommodation, Eating and Drinking 0.219 0.882 Attractions 0.086 0.967 Appendix B. Supplementary data Supplementary data to this article can be found online at https://doi.org/10.1016/j.cities.2024.105381. References 2011 Census. (2013). Counts of Output Area by council area, 2001 and 2011 censuses. https://www.scotlandscensus.gov.uk/. Akinwande, M. O., Dikko, H. G., & Samson, A. (2015). Variance inflation factor: As a condition for the inclusion of suppressor variable(s) in regression analysis. Open Journal of Statistics, 5(7), Article 7. https://doi.org/10.4236/ojs.2015.57075 Akter, T., Mitra, S. K., Hernandez, S., & Corro-Diaz, K. (2020). A spatial panel regression model to measure the effect of weather events on freight truck traffic. Transportmetrica A: Transport Science, 16(3), 910–929. https://doi.org/10.1080/ 23249935.2020.1719552 Aloi, A., Alonso, B., Benavente, J., Cordera, R., Echániz, E., González, F., … Sañudo, R. (2020). Effects of the COVID-19 lockdown on urban mobility: Empirical evidence from the City of Santander (Spain). Sustainability, 12(9), Article 9. 
https://doi.org/ 10.3390/su12093870 Anselin, L. (2005). Exploring spatial data with GeoDa: A workbook. Center for Spatially Integrated Social Science. Anselin, L., Florax, R., & Rey, S. J. (2013). Advances in spatial econometrics: Methodology, Tools and Applications. Springer Science & Business Media. Bandeira, J. M., Coelho, M. C., Sá, M. E., Tavares, R., & Borrego, C. (2011). Impact of land use on urban mobility patterns, emissions and air quality in a Portuguese medium-sized city. Science of the Total Environment, 409(6), 1154–1163. https://doi. org/10.1016/j.scitotenv.2010.12.008 Bao, J., Zheng, Y., Wilkie, D., & Mokbel, M. (2015). Recommendations in location-based social networks: A survey. GeoInformatica, 19(3), 525–565. https://doi.org/ 10.1007/s10707-014-0220-8 Bartzokas-Tsiompras, A., Photis, Y. N., Tsagkis, P., & Panagiotopoulos, G. (2021). Microscale walkability indicators for fifty-nine European central urban areas: An open-access tabular dataset and a geospatial web-based platform. Data in Brief, 36, Article 107048. https://doi.org/10.1016/j.dib.2021.107048 Batty, M. (2008). The size, scale, and shape of cities. Science, 319(5864), 769–771. https://doi.org/10.1126/science.1151419 BBC News. (2017). ONS: Glasgow remains Scotland's biggest city economy. BBC News. December 21 https://www.bbc.com/news/uk-scotland-scotland-business -42443811. Bevans, R. (2020). Akaike Information Criterion. Scribbr. https://www.scribbr.com/statis tics/akaike-information-criterion/. Boarnet, M. G., Greenwald, M., & McMillan, T. E. (2008). Walking, urban design, and health: Toward a cost-benefit analysis framework. Journal of Planning Education and Research, 27(3), 341–358. https://doi.org/10.1177/0739456X07311073 Buch, N., Velastin, S. A., & Orwell, J. (2011). A review of computer vision techniques for the analysis of urban traffic. IEEE Transactions on Intelligent Transportation Systems, 12 (3), 920–939. https://doi.org/10.1109/TITS.2011.2119372 Bucsky, P. (2020). 
Modal share changes due to COVID-19: The case of Budapest. Transportation Research Interdisciplinary Perspectives, 8, Article 100141. https://doi. org/10.1016/j.trip.2020.100141 Calafiore, A., Dunning, R., Nurse, A., & Singleton, A. (2022). The 20-minute city: An equity analysis of Liverpool City region. Transportation Research Part D: Transport and Environment, 102, Article 103111. https://doi.org/10.1016/j.trd.2021.103111 Cubells, J., Miralles-Guasch, C., & Marquet, O. (2023). Gendered travel behaviour in micromobility? Travel speed and route choice through the lens of intersecting identities. Journal of Transport Geography, 106, Article 103502. https://doi.org/ 10.1016/j.jtrangeo.2022.103502 Department for Transport. (2021). Road traffic statistics - local authority: Glasgow City. https://roadtraffic.dft.gov.uk/local-authorities/3. Draper, N. R., & Smith, H. (1998). Applied regression analysis. John Wiley & Sons, Incorporated. http://ebookcentral.proquest.com/lib/gla/detail.action? docID=1775203. Elhorst, J. P. (2014). Spatial Panel Data Models. In J. P. Elhorst (Ed.), Spatial econometrics: From cross-sectional data to spatial panels (pp. 37–93). Springer. https:// doi.org/10.1007/978-3-642-40340-8_3. Ellis, E. (2013). Land-use and land-cover change—The encyclopedia of earth. https:// editors.eol.org/eoearth/wiki/Land-use_and_land-cover_change. Gao, Y., & Levinson, D. (2022). A bifurcation of the peak: New patterns of traffic peaking during the COVID-19 era. Transportation. https://doi.org/10.1007/s11116-022- 10329-1 GCC, M. A. A. (2022). Home. Microsoft Azure API Management-Developer Portal. https://gcc.developer.azure-api.net/. Gelb, J. (2022). Spatial weight matrices. https://cran.r-project.org/web/packages/spNet work/vignettes/SpatialWeightMatrices.html. Glasgow City Council. (2018, July 20). Glasgow's Low Emission Zone (LEZ) [Accounts]. 1619. https://www.glasgow.gov.uk/LEZ. Goel, R., Garcia, L. M. 
T., Goodman, A., Johnson, R., Aldred, R., Murugesan, M., … Woodcock, J. (2018). Estimating city-level travel patterns using street imagery: A case study of using Google Street View in Britain. PLoS One, 13(5), Article e0196521. https://doi.org/10.1371/journal.pone.0196521 Gong, L., Liu, X., Wu, L., & Liu, Y. (2016). Inferring trip purposes and uncovering travel patterns from taxi trajectory data. Cartography and Geographic Information Science, 43 (2), 103–114. https://doi.org/10.1080/15230406.2015.1014424 Google Developers. (2022). Google maps platform documentation | street view static API. Google Developers. https://developers.google.com/maps/documentation/stree tview. Ha, H.-H., & Thill, J.-C. (2011). Analysis of traffic hazard intensity: A spatial epidemiology case study of urban pedestrians. Computers, Environment and Urban Systems, 35(3), 230–240. https://doi.org/10.1016/j.compenvurbsys.2010.12.004 Hadjidemetriou, G. M., Sasidharan, M., Kouyialis, G., & Parlikad, A. K. (2020). The impact of government measures and human mobility trend on COVID-19 related deaths in the UK. Transportation Research Interdisciplinary Perspectives, 6, Article 100167. https://doi.org/10.1016/j.trip.2020.100167 Hale, T., Angrist, N., Goldszmidt, R., Kira, B., Petherick, A., Phillips, T., Webster, S., Cameron-Blake, E., Hallas, L., Majumdar, S., & Tatlow, H. (2021). A global panel database of pandemic policies (Oxford COVID-19 Government Response Tracker). Nature Human Behaviour, 5(4), 529–538. https://doi.org/10.1038/s41562-021- 01079-8 He, N., & Zhao, S. (2013). Discussion on influencing factors of free-flow travel time in road traffic impedance function. Procedia - Social and Behavioral Sciences, 96, 90–97. https://doi.org/10.1016/j.sbspro.2013.08.013 Y. Li et al. 
PTUA-Summative assessment-v1

Programming Tools for Urban Analytics Summative Assessment

Dr Qunshan Zhao, Professor in Urban Analytics

1. Overview
The aim of this course is to familiarise students with programming tools which allow them to
access, collect, manage, analyse, visualise, and understand urban big data efficiently and
effectively. It will cover techniques required to extract data from online Application
Programming Interfaces (APIs), set up a database for holding data in a way which enables
efficient analysis, and use statistical/machine learning and visualisation tools. It will cover
best practices in relation to coding (Python, SQL/NoSQL queries), collaborating on coding
projects (Unix Shell, Git, and GitHub), and reproducibility of analyses (Jupyter Lab/Notebook).

Students will undertake a data science project which requires them to demonstrate the skills
which they have acquired during the course. You will be required to submit a single Jupyter
Notebook and an HTML file generated from the Jupyter Notebook, including all your Python
code and a markdown-style written report, with a maximum of 3,000 words (not including
code and references). You will need to submit your final assignment through Moodle. Please
note that the maximum file submission size in Moodle is 100 MB.
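
One way to keep an eye on the word limit is to count the words in the notebook's markdown cells, since a Jupyter Notebook file is JSON with a list of cells. The following is a minimal sketch under stated assumptions: the function name `count_markdown_words` and the in-memory example notebook are hypothetical, and the count includes everything in markdown cells, so a reference list written in markdown would have to be subtracted by hand.

```python
import json
import re


def count_markdown_words(notebook_json: str) -> int:
    """Count words in the markdown cells of a notebook given as a JSON string."""
    notebook = json.loads(notebook_json)
    total = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue  # code cells do not count towards the word limit
        source = cell.get("source", "")
        # A cell's source is stored either as one string or as a list of lines
        text = "".join(source) if isinstance(source, list) else source
        # A word is a run of word characters, optionally joined by ' or -
        total += len(re.findall(r"\w+(?:['-]\w+)*", text))
    return total


# Tiny in-memory notebook, made up purely for illustration
example = json.dumps({
    "cells": [
        {"cell_type": "markdown",
         "source": ["# Research question\n", "How did flows change?"]},
        {"cell_type": "code", "source": "print('not counted')"},
    ]
})
print(count_markdown_words(example))
```

For a real submission the JSON string would come from reading the `.ipynb` file from disk; how the markers actually count words is not stated in the brief, so treat the result as an estimate.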

The assignment is due by noon on the 31st March 2025.

2. Content and Format
The final data science report should broadly follow the style of a quantitative journal article,
with the exception that you should focus on the data analysis and explanation of your data
analysis. It is not necessary to include a detailed literature review, though you may choose to
cite papers to support some of your choices e.g., your research question, your choice of
variable, the assumptions you make, and so on. The data science report should outline what
your research question is and what data you will use to address it. You will analyse your data
using the tools and packages you have learnt in the classroom, though using extra Python
packages to achieve your project goals is looked on favourably. Your analysis should include:

• Research questions and project objectives with the support of academic literature

• Data collection methods, either through API, online scraping, or explaining the data
sources

• Understanding your data (data types, summary statistics, data visualisation, etc.)

• Data cleaning (missing values, outliers, date/time transformation, data errors, etc.)

• Feature engineering (categorical variables to dummy variables,
normalisation/standardisation, feature combination, etc.)

• Data analysis (time series analysis, machine learning, spatial analysis, advanced
regression analysis, etc.)

• A summary of your findings and suggestions from your data analysis
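
The cleaning and feature engineering steps in the list above can be sketched as follows. This is an illustration only: the records, field names (`timestamp`, `flow`, `road_type`) and helper names are hypothetical, and it uses only the Python standard library so that it runs anywhere. It drops rows with a missing value, parses the date/time field, creates dummy variables for a categorical field, and standardises a numeric field.

```python
from datetime import datetime
from statistics import mean, pstdev

# Hypothetical raw records; field names and values are made up for illustration
raw = [
    {"timestamp": "2021-03-01 08:00", "flow": "120", "road_type": "A Road"},
    {"timestamp": "2021-03-01 09:00", "flow": "", "road_type": "Local Road"},
    {"timestamp": "2021-03-01 10:00", "flow": "95", "road_type": "A Road"},
    {"timestamp": "2021-03-01 11:00", "flow": "101", "road_type": "Local Road"},
]


def clean(records):
    """Drop rows with a missing flow value and convert strings to proper types."""
    cleaned = []
    for row in records:
        if row["flow"].strip() == "":
            continue  # missing value: dropped here, imputation is another option
        cleaned.append({
            "timestamp": datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M"),
            "flow": float(row["flow"]),
            "road_type": row["road_type"],
        })
    return cleaned


def engineer(records):
    """Add an hour feature, dummy variables and a standardised (z-score) flow."""
    flows = [r["flow"] for r in records]
    mu, sigma = mean(flows), pstdev(flows)
    categories = sorted({r["road_type"] for r in records})
    features = []
    for r in records:
        row = {
            "hour": r["timestamp"].hour,
            # Guard against division by zero when all flows are identical
            "flow_z": (r["flow"] - mu) / sigma if sigma else 0.0,
        }
        for cat in categories:
            row[f"road_type_{cat}"] = int(r["road_type"] == cat)
        features.append(row)
    return features


rows = engineer(clean(raw))
for row in rows:
    print(row)
```

Whether to drop or impute missing values, and which variables to standardise, depends on the dataset and the research question; the choices here are placeholders.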

In terms of the selection of datasets for your final report, it is highly recommended that you
use new forms of data to address urban-related questions. It is acceptable to use existing,
analysis-ready data from Kaggle or other data service platforms, but you will still need to
include a data cleaning section to demonstrate your knowledge of this topic.

Students are required to keep to within an additional 10 percent of the word limit given for
an assignment; there are penalties on assignments that are longer than this. Submissions
that go 10-14% over the word limit on an assessment will be subject to a 1-point deduction;
15-19% over, a 2-point deduction; 20-24% over, a 3-point deduction; and 25% or more over
will be awarded a fail (zero) and required to resubmit as a second attempt.
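
The penalty bands can be checked with a few lines of arithmetic. The sketch below is an assumption-laden reading of the rule, not an official calculator: the function name `word_limit_penalty` is hypothetical, the bands are taken literally as starting at 10%, 15%, 20% and 25% over the limit, and an overrun of exactly 10% is treated as falling in the first penalty band, although the brief is ambiguous on that boundary.

```python
def word_limit_penalty(word_count: int, limit: int = 3000) -> str:
    """Return the penalty band for a given word count (assumed band edges)."""
    over = word_count - limit
    # Compare over/limit against each threshold using integers only,
    # which avoids floating-point rounding at the band edges
    if over * 100 < 10 * limit:
        return "no penalty"
    if over * 100 < 15 * limit:
        return "1-point deduction"
    if over * 100 < 20 * limit:
        return "2-point deduction"
    if over * 100 < 25 * limit:
        return "3-point deduction"
    return "fail (zero), resubmit as second attempt"


for words in (3000, 3299, 3300, 3450, 3600, 3750):
    print(words, word_limit_penalty(words))
```

With the 3,000-word limit of this assignment, the assumed band edges fall at 3,300, 3,450, 3,600 and 3,750 words.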

3. Marking
As usual, the purpose of the assignment is to give students the opportunity to demonstrate
their learning in relation to the course’s intended learning outcomes. The outcomes of this
course are:

• Write code according to best practice and produce tidy data

• Collaborate effectively with other analysts using appropriate tools

• Produce documentation for their work which makes the processes behind analyses
transparent and reproducible.

• Set up, connect to and query a simple relational and non-relational database using
Python

• Retrieve and analyze data from an Application Programming Interface (API)

• Perform basic machine learning tasks
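
As an illustration of the database outcome in the list above, the following minimal sketch sets up, fills and queries a relational database from Python using the standard-library `sqlite3` module. The table name, columns and values are hypothetical, the database lives in memory only, and the non-relational case is not covered here.

```python
import sqlite3

# In-memory SQLite database; table and column names are made up for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traffic (site_id TEXT, hour INTEGER, flow INTEGER)")

# Parameterised inserts keep the data separate from the SQL text
conn.executemany(
    "INSERT INTO traffic VALUES (?, ?, ?)",
    [("S1", 8, 120), ("S1", 9, 100), ("S2", 8, 40), ("S2", 9, 60)],
)
conn.commit()

# Average flow per site, sorted so that the output order is deterministic
rows = conn.execute(
    "SELECT site_id, AVG(flow) FROM traffic GROUP BY site_id ORDER BY site_id"
).fetchall()
conn.close()

for site_id, avg_flow in rows:
    print(site_id, avg_flow)
```

The same pattern of connect, execute, fetch and close carries over to other relational back ends, though the connection call and SQL dialect will differ.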

For more information about assessment at the University of Glasgow, please consult the
Code of Assessment. Please take time to familiarize yourself with the university’s policy on
plagiarism and AI usage.

4. Getting help
Time is available during the final lecture sessions for you to ask any questions related
to your final assessment. An essay proposal or outline (maximum one page,
formative assessment) is due on the 17th February if you would like to get feedback from
your instructor. This should identify the data science project you want to do for this course,
and outline the objectives, the programming tools you want to use, and the expected outcomes.
You can also post any questions on the assignment forum on Moodle or email the instructor
if you have specific questions.

https://www.gla.ac.uk/myglasgow/apg/policies/uniregs/regulations2024-25/feesandgeneral/assessmentandacademicappeals/reg16/

https://www.gla.ac.uk/myglasgow/sld/plagiarism/

https://www.gla.ac.uk/myglasgow/sld/ai/students/

https://github.com/rekavonnak/declining_cities_cluster_analysis

https://github.com/FeliksWang/Analysis-of-popular-attractions-in-Glasgow-based-on-Tripadvisor-data

All lecture and lab contents are posted on GitHub; let me know if they are needed and I will send you the GitHub link.
