Small area model-based estimators using big data
sources
Monica Pratesi1, Stefano Marchetti2, Dino Pedreschi3, Fosca Giannotti4, Nicola Salvati5, Filomena Maggino6
1,2,5 Department of Economics and Management, University of Pisa
3 Department of Informatics, University of Pisa
4 National Research Council of Italy
6 Department of Statistics and Informatics, University of Florence
NTTS conference, Bruxelles, 4-7 March 2013
Outline
1. Motivation
2. Social Mining
3. Small Area Estimation
4. Model-Based Estimators of Poverty Indicators
5. Using Big Data in Small Area Estimation
Part I
Motivation
The timely, accurate monitoring of social indicators, such as poverty
or inequality, at a fine-grained spatial and temporal scale is a
challenging task for official statistics, yet it is a crucial tool for
understanding social phenomena and for policy making
Statistical methods and social mining from “big data” represent
today a concrete opportunity for understanding social complexity
The aim is to improve our ability to measure, monitor and, possibly,
predict social performance, well-being, deprivation, poverty, exclusion,
inequality, and so on, at a fine-grained spatial and temporal scale
To succeed, we need a statistical framework that allows us to make
inference with big data
Part II
Social Mining
Social Mining aims to provide the analytical methods and associated
algorithms needed to understand human behaviour by means of
automated discovery of the underlying patterns, rules and profiles from
the massive datasets of human activity records produced by social sensing
We live in a time of unprecedented opportunities for sensing, storing and
analyzing (micro)data at mass scale, recording human activities in
extreme detail:
Shopping patterns and lifestyle
Relationships and social ties
Desires, opinions, sentiments
Movements
Figure: What happens in 60 seconds?
Big Data and New Questions to Ask
Figure: Text analysis of Tweets in the USA over 24 hours
Big Data as Proxy of Human Mobility
Figure: Impact of systematic mobility on access patterns
Figure: Discovering individual systematic movements (panels: Work-Home and Home-Work)
Figure: GPS data make it possible to identify mobility borders that
differ from administrative borders
Part III
Model-Based Estimators of Poverty
Indicators
Introduction to Small Area Estimation
Population of interest (or target population): population for which
the survey is designed
→direct estimators should be reliable for the target population
Domain: a sub-population of the population of interest; domains may or
may not be planned in the survey design
Geographic areas (e.g. Regions, Provinces, Municipalities, Health
Service Area)
Socio-demographic groups (e.g. Sex, Age, Race within a large
geographic area)
Other sub-populations (e.g. the set of firms belonging to an industry
subdivision)
→we don’t know the reliability of direct estimators for the domains
that have not been planned in the survey design
Often direct estimators are not reliable for some domains of interest
In these cases we have two choices:
oversampling in those domains
applying statistical techniques that allow for reliable estimates in
those domains
Small Domain or Small Area
Geographical area or domain where direct estimators do not reach a
minimum level of precision
Small Area Estimator (SAE)
An estimator designed to obtain reliable estimates in a Small Area
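One common way to make "a minimum level of precision" operational is the coefficient of variation (CV) of the direct estimate. The sketch below is only an illustration and not a procedure taken from the slides: the 20% threshold, the function name `flag_small_areas` and all numbers are assumptions chosen for the example.

```python
import math

def flag_small_areas(estimates, variances, max_cv=0.20):
    """Return the domains whose direct estimate has a CV above max_cv.

    estimates: dict domain -> direct estimate (assumed positive)
    variances: dict domain -> estimated sampling variance of that estimate
    max_cv:    hypothetical precision threshold (an assumption, not a standard)
    """
    flagged = []
    for domain, est in estimates.items():
        cv = math.sqrt(variances[domain]) / est
        if cv > max_cv:
            flagged.append(domain)
    return flagged

# Made-up direct estimates of a poverty rate and their sampling variances
direct = {"A": 0.20, "B": 0.15, "C": 0.30}
var = {"A": 0.0004, "B": 0.0025, "C": 0.0009}
small = flag_small_areas(direct, var)
print(small)
```

Domains returned by the function would be the candidates for oversampling or for a small area estimator.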
The modern approach to small area estimation is based on model-based
methods
Statistical models link the variable of interest with covariate
information
unit level models → when auxiliary variables are known for out-of-sample
units
area level models → when auxiliary variables are known only at area
level
Unit level models are based on mixed models or M-quantile models
Area level models are based on the Fay and Herriot model
Small Area Estimation: Unit Level Model
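Let $y_{ij}$ be the target variable for unit j in area i
Let $x_{ij}$ be the auxiliary vector of p auxiliary variables for unit j in
area i
Mixed effect model
$$y_{ij} = x_{ij}^T \beta + u_i + \varepsilon_{ij}, \qquad u_i \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$$
M-quantile models
$$Q(y_{ij} \mid x_{ij})_{q_{ij}} = x_{ij}^T \beta(q_{ij}) \;\rightarrow\; y_{ij} = x_{ij}^T \beta(\theta_i) + \varepsilon_{ij}$$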
Parameters are estimated with restricted maximum likelihood (mixed
model) and iterative weighted least squares (M-quantile model), respectively
Using these models makes estimates more accurate because the between-area
variation is taken into account
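To make the role of between-area variation concrete, consider the standard best linear unbiased predictor of the area effect under the mixed effect model. This is textbook material rather than something stated on the slide: $n_i$ (sample size in area i), $\bar{y}_i$ and $\bar{x}_i$ (sample means in area i) and $\gamma_i$ are notation introduced here, and the variance components are treated as known.

$$\hat{u}_i = \gamma_i \left( \bar{y}_i - \bar{x}_i^T \beta \right), \qquad \gamma_i = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2 / n_i}$$

When $n_i$ is small or $\sigma_u^2$ is small relative to $\sigma_\varepsilon^2$, $\gamma_i$ is close to 0 and the prediction relies mostly on the regression part; when $n_i$ is large, $\gamma_i$ approaches 1 and the area's own data dominate.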
Small Area Estimation: Fay-Herriot Model
Building on mixed models, Fay and Herriot modeled direct estimates with
area level auxiliary variables
Let $\theta_i$ be a target parameter and $\hat{\theta}_i$ its direct estimate in area i
Let $x_i$ be the auxiliary vector of p auxiliary variables for area i
$$\hat{\theta}_i = \theta_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma_{\varepsilon_i}^2)$$
$$\theta_i = x_i^T \beta + u_i, \qquad u_i \sim N(0, \sigma_u^2)$$
The Fay and Herriot model is then
$$\hat{\theta}_i = x_i^T \beta + u_i + \varepsilon_i$$
Model parameters can be estimated by maximum likelihood methods
(the $\sigma_{\varepsilon_i}^2$ are considered known)
Using the auxiliary information improves the accuracy of the estimates
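Under the area level model, the best predictor of the target parameter is a weighted average of the direct estimate and the regression value, with more weight on the direct estimate when its sampling variance is small. The following is a minimal sketch of that shrinkage step, assuming that the regression coefficients and the area-effect variance have already been estimated; the function name, argument names and numbers are hypothetical.

```python
def fay_herriot_predict(direct, sampling_var, synthetic, sigma2_u):
    """Shrink each direct estimate toward its regression (synthetic) value.

    direct:       list of direct estimates (theta-hat_i)
    sampling_var: list of known sampling variances (sigma^2_eps_i)
    synthetic:    list of x_i^T beta values (beta taken as given here)
    sigma2_u:     variance of the area effects u_i (taken as given here)
    """
    predictions = []
    for est, var_e, syn in zip(direct, sampling_var, synthetic):
        gamma = sigma2_u / (sigma2_u + var_e)  # weight on the direct estimate
        predictions.append(gamma * est + (1.0 - gamma) * syn)
    return predictions

# Made-up numbers: two areas with the same synthetic value
preds = fay_herriot_predict(
    direct=[0.30, 0.10],
    sampling_var=[0.01, 0.03],
    synthetic=[0.20, 0.20],
    sigma2_u=0.01,
)
print(preds)
```

The second area has the larger sampling variance, so its prediction is pulled more strongly toward the synthetic value. In practice the coefficients and the area-effect variance are estimated from the data, which adds uncertainty that this sketch ignores.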
Model-Based Estimators of Poverty Indicators
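Denoting by t the poverty line, different poverty measures are defined by
using (Foster et al., 1984)
$$F_{\alpha,ij} = \left( \frac{t - y_{ij}}{t} \right)^{\alpha} I(y_{ij} \le t), \qquad i = 1, \ldots, N$$
The population distribution function in small area j can be decomposed
as follows
$$F_{\alpha,j} = N_j^{-1} \left[ \sum_{i \in s_j} F_{\alpha,ij} + \sum_{i \in r_j} F_{\alpha,ij} \right]$$
where $s_j$ and $r_j$ are the sampled and non-sampled units of area j
(on this slide i indexes units and j indexes areas)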
Setting α = 0 defines the Head Count Ratio whereas setting α = 1
defines the Poverty Gap. HCR and PG can be estimated with unit level
models.
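As an illustration of the Foster et al. (1984) measures, the sketch below computes HCR and PG for a fully observed list of incomes. The function name, incomes and poverty line are made up; in small area estimation the non-sampled part of each area is not observed and would be predicted with a unit level model.

```python
def fgt(incomes, poverty_line, alpha):
    """Foster-Greer-Thorbecke measure F_alpha for a list of incomes."""
    total = 0.0
    for y in incomes:
        if y <= poverty_line:  # indicator I(y <= t)
            gap = (poverty_line - y) / poverty_line
            # alpha = 0 counts every poor unit once, including y == t
            total += 1.0 if alpha == 0 else gap ** alpha
    return total / len(incomes)

incomes = [500, 800, 1000, 1500, 2000]  # made-up incomes for one area
t = 1000                                # made-up poverty line
hcr = fgt(incomes, t, alpha=0)  # Head Count Ratio
pg = fgt(incomes, t, alpha=1)   # Poverty Gap
print(hcr, pg)
```

Three of the five units are at or below the line, so the Head Count Ratio is 3/5, while the Poverty Gap averages the relative shortfalls 0.5, 0.2 and 0 over all five units.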
Using the HCR and PG estimators we can estimate 13 of the 19
Laeken indicators
The mean squared error of these estimators can be estimated by
bootstrap techniques (Molina and Rao, 2010; Marchetti et al., 2012)
The use of Fay and Herriot models for binary or count outcomes is
currently under development
Well-specified models and the availability of auxiliary
variables are key to the use of small area estimation
methods
Part IV
Using Big Data in Small Area Estimation
Small Area Estimation and Big Data
Our aim is to use the huge source of data coming from human
activities - the big data - to make accurate inference at a small area
level.
We identified three possible approaches:
1. Use big data as covariates in small area models
2. Use survey data to remove self-selection bias from estimates
obtained using big data
3. Use big data to validate small area estimates
Use Big Data as Covariate in Small Area Models
Big data often provide unit level data
The outcome variable has to be linked to the auxiliary variables in order to
use unit level data in a small area model
Due to technical challenges and legal restrictions, it is not feasible at this
stage to obtain unit level big data that can be linked with
administrative archives, census data or survey data
Big data can be aggregated at area level and then used in an area
level model
$$\hat{\theta}_i = d_i^T \beta + u_i + \varepsilon_i$$
where $d_i$ is a vector of p variables gathered from big data sources
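The aggregation step can be sketched as follows: unit level records are averaged within each area to obtain one covariate vector per area. This is a minimal illustration, not the authors' pipeline; the function name, field names, area labels and values are all invented.

```python
from collections import defaultdict

def area_means(records, area_key, value_keys):
    """Average unit-level records within each area.

    records:    iterable of dicts, one per big-data unit
    area_key:   name of the field holding the area identifier
    value_keys: names of the numeric fields to average
    Returns a dict area -> list of means, i.e. one vector d_i per area.
    """
    sums = defaultdict(lambda: [0.0] * len(value_keys))
    counts = defaultdict(int)
    for rec in records:
        area = rec[area_key]
        counts[area] += 1
        for k, key in enumerate(value_keys):
            sums[area][k] += rec[key]
    return {a: [s / counts[a] for s in sums[a]] for a in sums}

# Hypothetical mobility records (field names are invented for illustration)
records = [
    {"area": "Pisa", "trips_per_day": 4.0, "km_per_day": 20.0},
    {"area": "Pisa", "trips_per_day": 2.0, "km_per_day": 10.0},
    {"area": "Lucca", "trips_per_day": 3.0, "km_per_day": 30.0},
]
d = area_means(records, "area", ["trips_per_day", "km_per_day"])
print(d)
```

Only the area identifier is needed for this step, so no record-level linkage with survey or administrative data is required.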
Use Survey Data to Remove Self-Selection Bias from
Estimates Obtained Using Big Data
An option is to use big data directly to measure poverty and social
exclusion
It is realistic to expect that big data are not representative of the
whole population of interest (self-selection problem)
Using a high-quality survey we can check for differences in the distribution of
common variables between big data and survey data
If there are no common variables, we can use data known to be correlated
to check for differences in the distribution
Given these differences, we can compute weights that reduce the
bias due to the self-selection of the big data
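The slides do not specify how the weights are computed. One simple possibility, shown here only as an assumption, is a post-stratification style weight: for each category of a common variable, divide the share estimated from the survey by the share observed in the big data. The function name, group labels and shares are hypothetical.

```python
from collections import Counter

def selection_weights(big_data_groups, survey_shares):
    """Weight each big-data unit by survey share / big-data share of its group.

    big_data_groups: list with the group label of each big-data unit
    survey_shares:   dict group -> population share estimated from the survey
    """
    counts = Counter(big_data_groups)
    n = len(big_data_groups)
    ratio = {g: survey_shares[g] / (counts[g] / n) for g in counts}
    return [ratio[g] for g in big_data_groups]

# Made-up example: young people are over-represented in the big data
groups = ["young", "young", "young", "old"]
shares = {"young": 0.5, "old": 0.5}
w = selection_weights(groups, shares)
print(w)
```

Such weights only correct for self-selection that is explained by the chosen variable; selection on unobserved characteristics remains.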
Use Big Data to Validate Small Area Estimates
Poverty and deprivation measures obtained from big data can be
compared with similar measures obtained from survey data
If big data estimates and survey data estimates agree, there is
doubly checked evidence of the level of poverty and deprivation
If they disagree, further investigation is needed
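A possible way to screen areas for agreement is sketched below. The decision rule (the big data estimate lies within `z` survey standard errors of the survey-based estimate) is an assumption made for this example and does not come from the slides; the function name and all numbers are invented.

```python
def compare_estimates(big_data_est, survey_est, survey_se, z=2.0):
    """Label each area 'agree' or 'investigate'.

    An area agrees when the big-data estimate lies within z survey
    standard errors of the survey-based estimate (hypothetical rule).
    """
    result = {}
    for area, bd in big_data_est.items():
        diff = abs(bd - survey_est[area])
        result[area] = "agree" if diff <= z * survey_se[area] else "investigate"
    return result

# Made-up poverty rates for two areas
bd = {"A": 0.21, "B": 0.35}
sv = {"A": 0.20, "B": 0.20}
se = {"A": 0.02, "B": 0.03}
status = compare_estimates(bd, sv, se)
print(status)
```

The rule ignores the uncertainty of the big data estimate itself, so it should be read as a screening device rather than a formal test.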
Essential Bibliography
Breckling, J. and Chambers, R. (1988). M-quantiles. Biometrika 75, 761-771.
Chambers, R. and Tzavidis, N. (2006). M-quantile models for small area estimation. Biometrika 93(2), 255-268.
Foster, J., Greer, J. and Thorbecke, E. (1984). A class of decomposable poverty measures. Econometrica 52, 761-766.
Giannotti, F. and Pedreschi, D. (2008). Mobility, Data Mining and Privacy. Springer.
Giannotti, F., Pedreschi, D., Pentland, A., Lukowicz, P., Kossmann, D., Crowley, J. and Helbing, D. (2012). A planetary nervous system for social mining and collective awareness. European Physical Journal - Special Topics 214, 49-75.
Marchetti, S., Tzavidis, N. and Pratesi, M. (2012). Non-parametric bootstrap mean squared error estimation for M-quantile estimators of small area averages, quantiles and poverty indicators. Computational Statistics and Data Analysis 56(10), 2889-2902.
Molina, I. and Rao, J. (2010). Small area estimation of poverty indicators. The Canadian Journal of Statistics.