Data Mining and the Efficacy of
Government Policy
Daniel Ray Lewis
Data Mining
Data Mining in Developing Countries
Welfare Programs and Energy Subsidies
Theory Based Economic Optimization
Modern Development Economics
Data Mining and Government Policy
Modern companies understand their
customers very well
(Target and pregnancy story)
Governments, especially in developing countries,
understand their clients far less well
Asian Subsidies for Energy (as a % of total government revenues)
Country             Petroleum  Electricity  Natural gas  Coal   Total
Bangladesh          7.56       22.12        13.45        0      43.13
Bhutan              1.39       n.a.         n.a.         n.a.   1.39
Brunei Darussalam   3.77       1.57         0            0      5.34
Cambodia            0          n.a.         n.a.         n.a.   0
China               0          0.68         n.a.         n.a.   0.68
India               6.75       1.72         0.9          0      9.37
Indonesia           14.51      3.69         0            0      18.2
Laos P.D.R.         0          n.a.         n.a.         n.a.   0
Malaysia            5.67       1.49         1.41         0      8.57
Myanmar             9.35       n.a.         n.a.         n.a.   9.35
Pakistan            1.02       10.23        19.89        0      31.14
Philippines         0          0            0            0      0
Singapore           0          n.a.         n.a.         n.a.   0
Sri Lanka           7.99       3.26         0            0      11.25
Thailand            0.66       7.24         0.61         1.08   9.59
Governments spend an enormous sum on aid
programs, most especially on subsidizing energy
Source: IMF 2013 (data generally from 2011)
This money could be much better spent if
governments understood their clients better
– but is it possible?
• Where is the money spent? We don’t know
• Who receives the benefit? We don’t know
• Is there any corruption? We don’t know
• Is the benefit worth the cost? We don’t know
• Do people like us/vote for us? We don’t know
Redistributive Taxation
Are Energy Subsidies worthwhile?
• Fundamental Theorems of Welfare Economics
• T1: Any Walrasian equilibrium leads to a Pareto
efficient allocation of resources.
• T2: Any Pareto efficient outcome that we could
desire, can be achieved by enacting a lump-sum
wealth redistribution.
• Advice from developed countries generally centers
on removing any price distortion as inefficient, on
the view that changes in price reduce utility and
that the same welfare outcome could be achieved
through taxation.
But…Who are the poor?
How can we persuade the wealthy to help?
• 60 percent of Thai persons do not receive a
paycheck – How does someone prove that he is
poor?
• Alatas, Banerjee, et al. (2012), in an AER paper,
describe 3 ways to identify the poor in Indonesian
villages. Although all are successful, all three
methods are expensive and difficult to achieve.
• India invested enormous energy and resources into
identifying its poor, but the data are 20 years old
and so no longer valid.
• Developing countries cannot always force the
wealthy to contribute as much as desired.
• Subsidies are okay if they are appropriately targeted
Modern Development Economics
Behavioral Economics uses known foibles
of human behavior to solve poverty issues
• Low Hanging Fruit
– Chlorine in drinking water
– Cheap mosquito nets
– Paying people to study
Data Mining can Support
Behavioral Economics Objectives
by better Understanding Clients.
Behavioral Economic Solutions related to Energy
• Information (Prius Dashboard): information about real-time
energy use due to driving style.
• Salience: smart electricity meters in the home remind
consumers about usage
• Social Approval: comparison with neighbors works better than
environment or price
Randomized Controlled Trials (RCT) look at
responses of closely matched populations
to economic interventions
Data Mining can Substitute for
Randomized Controlled Trials
when some Regions experience a
policy and similar Regions do not
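One common way to exploit this kind of regional contrast is a difference-in-differences comparison. The slides do not say which estimator is used, so the Python sketch below is only an illustration of the idea; the function name and the four outcome values are hypothetical.

```python
def diff_in_diff(treated_before: float, treated_after: float,
                 control_before: float, control_after: float) -> float:
    """Change in the policy regions minus the change in the comparison regions."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Hypothetical regional averages of some outcome (invented for illustration).
effect = diff_in_diff(treated_before=40.0, treated_after=55.0,
                      control_before=42.0, control_after=47.0)
print(effect)
```

The comparison is only as good as the match between regions: it assumes that, without the policy, both groups of regions would have moved in parallel.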
• THREE OBJECTIVES
• Objective 1: To determine Who receives the benefit
from government policies (within a province or area,
income range, profession, etc.)
• Objective 2: To determine the effect of government
policies on regional GDP
• Objective 3: To determine the effects of government
policies on voting patterns
• Relevant Government Policies
– First Car Policy
– Jam Nam Khaaw (rice pledging)
– Free Electricity
– Subsidized Diesel
– Gasohol
• Strategy is to use Data Mining to better
understand and better target these policies.
• Begin by building a time series of variables
Other papers that use big data to
research Economic Policies
• Big microdata for population research
– S. Ruggles - Demography, 2014
– Discusses using big data for analysis of populations
• Big-data applications in the government sector
– GH Kim, S Trimi, JH Chung - Communications of the ACM, 2014
– Looks at ways big data can be used to inform government policy
• The Role of Information in Perception of Fossil Fuel Subsidy Reform: Evidence from Indonesia
– R Pradiptyo, A Wirotomo, A Adisasmita… - Available at SSRN 2015
– Looks at perceptions of energy reform based on demographic and
economic variables
Data Used
• SES Data
– 2006 – 570 variables, 44,918 sample size
– 2007 – 566 variables, 43,055 sample size
– 2008 – 385 variables, 44,969 sample size
– 2009 – 595 variables, 43,844 sample size
– 2010 – 392 variables, 44,273 sample size
– 2011 – 545 variables, 42,192 sample size
– 2012 – 411 variables, 43,762 sample size
• NESDB
– 19 years - 77 provinces – 23 subcategories
• Electoral Commission of Thailand
– 2007 – 76 provinces – 41 parties
– 2011 – 77 provinces – 40 parties
• Altogether a bit over 20 million observations
Pseudo-panel and Sample Size Issues
• The SES is a survey, not true panel data.
• There is, however, stability in the means of most data.
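From the central limit theorem, the sample mean is approximately distributed as
$\bar{x} \sim N(\mu, \sigma^2 / n)$,
so its standard error is $\sigma / \sqrt{n}$. With a sample size of 400,
$\sqrt{n} = 20$, and unless the variance is very large, the
mean remains nearly the same from sample to sample,
as observed empirically.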
• As the SES survey consists of about 44,000
observations, it is generally possible to divide each
survey into as many as 100 pieces and still get stable
means.
• The “pieces” could be provinces, professions, months,
regions, or a combination of these or other variables.
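The sample-size argument above can be checked with a little arithmetic. The Python sketch below is an illustration, not the author's code; the standard deviation of 1.0 is a hypothetical value chosen only to show how the standard error scales with cell size.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def cell_size(total_obs: int, pieces: int) -> int:
    """Average observations per cell when a survey is split evenly into `pieces`."""
    return total_obs // pieces


SIGMA = 1.0          # hypothetical standard deviation, for illustration only
SURVEY_OBS = 44_000  # approximate SES sample size quoted in the slides

n_per_cell = cell_size(SURVEY_OBS, 100)
se_400 = standard_error(SIGMA, 400)
se_cell = standard_error(SIGMA, n_per_cell)
print(n_per_cell, se_400, se_cell)
```

Splitting into 100 equal pieces leaves slightly more than 400 observations per cell, so each cell mean is at least as precise as in the 400-observation case. Real cells such as provinces are not equal in size, so small cells need a separate check.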
In the expenditure database I employ
three versions of each variable
• A) The average expenditure on a good averaged
over all possible households even if they don’t buy
• B) The average expenditure on a good averaged
over only those households that do buy
• C) The share of all possible households that do buy
the product.
• All 3 are needed to answer different questions.
General form of the equations
Version A) $variable_i = \frac{\sum_{j=1}^{m} \sum_{k=1}^{n} x_{ijk}}{pq}$
Version B) $variable_i = \frac{\sum_{j=1}^{m} \sum_{k=1}^{n} x_{ijk}}{mn}$
Share = Version C) $variable_i = \frac{mn}{pq} \times 100$
• p = possible incidences of the j variable
• q = possible incidences of the k variable
• m = actual incidences of the j variable
• n = actual incidences of the k variable, all within the sample for var i.
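A minimal Python sketch of the three versions for a single cell of households follows. It simplifies the notation above by treating $pq$ as the number of households in the cell and $mn$ as the number with positive expenditure; the function names and the five-household cell are hypothetical.

```python
from typing import Optional, Sequence


def version_a(spend: Sequence[float]) -> float:
    """Mean expenditure over all households in the cell, zeros included."""
    return sum(spend) / len(spend)


def version_b(spend: Sequence[float]) -> Optional[float]:
    """Mean expenditure over purchasing households only (None if nobody buys)."""
    buyers = [x for x in spend if x > 0]
    return sum(buyers) / len(buyers) if buyers else None


def version_c(spend: Sequence[float]) -> float:
    """Percentage of households in the cell with positive expenditure."""
    return 100.0 * sum(1 for x in spend if x > 0) / len(spend)


# Hypothetical cell: five households, two of which bought the good this month.
cell = [0.0, 300.0, 0.0, 300.0, 0.0]
a, b, c = version_a(cell), version_b(cell), version_c(cell)
print(a, b, c)
```

The three are linked by A = B × C / 100, so any two determine the third. Keeping all three is convenient because each answers a different question directly.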
Some common divisions of the data
• by month (12) and expenditure decile (10)
• by province (77)
• by month (12) and region (5)
• by month (12) and profession (7)
• … don’t know others yet.
• Maybe median works better than mean?
Results
[Figure: Percent of Households Receiving Free Electricity by Income. Time axis 2006m1 to 2012m1; vertical axis 0 to 100 percent; series for the poorest decile and deciles 2 to 6. Source: SES various years]
[Figure: Percent of Households Paying for Electricity by Income. Time axis 2006m1 to 2012m1; vertical axis 20 to 100 percent; series for the poorest decile and deciles 2 to 6.]
Analysis
• In mid-2008 the Samak government initiated a lifeline policy
giving free electricity to households that use less than 90 kWh
per month. In mid-2012 the Yingluk government reduced the
lifeline amount to 50 kWh.
• The lifeline policy initially resulted in 80% of the poorest
decile households getting free electricity. But 60% of
the 6th decile households (wealthier than average) also
got free electricity.
• The modified policy (50 kWh) strongly reduces false
positives, so that very few households from decile 6 get free
electricity, but at the cost that only half of the targeted
decile 1 group now receive benefits.
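The trade-off between the two thresholds can be tabulated as coverage by decile. The Python sketch below shows the mechanics only: the ten households are invented, and the function name and tuple layout are assumptions rather than anything taken from the SES.

```python
from typing import Dict, List, Tuple

Household = Tuple[int, float]  # (income decile, monthly electricity use in kWh)


def coverage_by_decile(households: List[Household],
                       threshold_kwh: float) -> Dict[int, float]:
    """Percent of households in each decile using less than the lifeline threshold."""
    totals: Dict[int, int] = {}
    eligible: Dict[int, int] = {}
    for decile, kwh in households:
        totals[decile] = totals.get(decile, 0) + 1
        if kwh < threshold_kwh:
            eligible[decile] = eligible.get(decile, 0) + 1
    return {d: 100.0 * eligible.get(d, 0) / totals[d] for d in sorted(totals)}


# Hypothetical households, invented purely to show the mechanics.
sample = [(1, 30), (1, 45), (1, 60), (1, 85), (1, 120),
          (6, 40), (6, 70), (6, 80), (6, 110), (6, 150)]

at_90 = coverage_by_decile(sample, 90)
at_50 = coverage_by_decile(sample, 50)
print(at_90, at_50)
```

Coverage in decile 1 measures how much of the target group is reached, while coverage in decile 6 measures leakage to households that were not the target. Lowering the threshold moves both numbers down together.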
[Figure: Percent of Households Using Gasohol by Income. Time axis 2007m1 to 2012m1; vertical axis 0 to 40 percent; series for deciles 5 to 9 and the wealthiest decile. Source: SES various years]
Analysis
• Subsidizing gasohol was primarily a subsidy for
the wealthiest decile and for producers until the
policy changed in 2013 to require gas stations to
replace benzine 91 with gasohol.
• There was a widespread belief that gasohol
would damage motorcycle engines, so most
people did not switch until required to do so,
despite a small price advantage.
• In 2013 there should be a sharp spike in
gasohol usage; the 2013 data are needed to confirm it.
Average LPG use by decile
[Figure 1: Monthly use of LPG by Income Deciles. Time axis 2007m1 to 2012m1; vertical axis 0 to 150; series for the poorest decile, deciles 2, 4 and 6, and the wealthiest decile.]
[Figure 2: Monthly use of LPG by Income Deciles. Time axis 2007m1 to 2012m1; vertical axis 150 to 350; series for the poorest decile, deciles 2, 4 and 6, and the wealthiest decile.]
Analysis
• First graph shows the amount spent on average by
each expenditure decile (variable type A)
• Second graph shows the amount spent only by
those in the decile with positive expenditure on LPG
during the month. (variable type B)
• Interpretation – a tank of LPG costs about 300 baht,
no matter what your income decile, but wealthier
deciles buy those tanks more frequently.
[Figure: First Car Policy - Share of Decile Owning a Car. Time axis 2007m1 to 2012m1; vertical axis 0 to 60 percent; series for deciles 5, 7, 8 and 9 and the wealthiest decile. Source: SES various years]
Analysis
• Starting in 2012, the government reduced the tax on
the first car a person bought, with the intention of
extending car ownership to a greater share of the
population who previously could not afford to buy one.
• (The policy was also meant to help car manufacturers, who
were badly hurt by the flood a year earlier.)
• Interpretation: This graph shows little evidence that
car ownership increased among poorer deciles – in
fact, the opposite seems to be the case.
Other Data Mining Projects
• School Bus Use
• Condom Use
• Who Donates to Charity
• At Risk Households (student won Setathat for this)
• Political Analysis
Decision Tree – Who uses the School Bus
Basic version – not too useful
Decision Tree – Who uses the School Bus
Better Version
Analysis
• Decision trees are a useful tool to help behavioral
economics discover exploitable relationships
• Although it may not be so easy to link school bus
use with iron purchases
• The use of the internet may make scheduling
more efficient, or marketing more profitable.
Politics – Who votes for Red parties?
Unexpected ties to regional GDP?
Politics – Are some groups Unrepresented?
People who voted for parties that have no
representation in the government.
Analysis
• Analysis of political affiliation allows us to see
which parties benefit from promoting which
sectors, suggesting directions for new
economic policies
• Ties to regional GDP allow us to track whether
past policies are cost effective.
• Voting records may help indicate whether the
policy worked as public relations
Objective - Who Uses LPG?
59% of Households in Thailand used LPG as primary
cooking fuel in 2009
Original Probit model focused on whether poor
people are primary recipients
Model needs to predict both whether households
1) Do use LPG 2) Don’t use LPG
$CookLPG = \beta_0 + \beta_1 BelowPov + \beta_2 HighInc + \beta_3 Rural + \beta_4 HHSize + \beta_5 ElectricLT90 + \varepsilon$
Original Probit Results
Predictions (compare to Naïve = 59%)
Predict Do use LPG: True = 76%
Predict Don’t use LPG: True = 62%
Weighted average correct = 69%
Marginals - We are measuring the effect of switching a binomial
value from 0 to 1. Imagine we start at probability = .59
Variable        dy/dx    Std. Err.
belowpov        -.246    .014***
high_inc        .178     .007***
rural           -.042    .006***
hhsize          .035     .002***
electricLT90    -.246    .006***
e.g. if free electricity (electricLT90 = 1), Pr(CookLPG) = .59 - .246 = .344
Predictive Accuracy –
Can we Improve on these results?
• Naïve - No model 59% Accurate
• Probit – Deductive 69% Accurate
• Our Goal – 80% Accurate
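The comparison above rests on two numbers: the accuracy of a no-model guess and the accuracy of the model's predictions. The Python sketch below shows one way to compute both; the function names and the ten-household example are hypothetical, and it does not reproduce the weighting behind the slides' "weighted average correct" figures.

```python
from typing import Sequence


def naive_accuracy(actual: Sequence[int]) -> float:
    """Accuracy of always predicting the majority class (no model)."""
    share_ones = sum(actual) / len(actual)
    return max(share_ones, 1.0 - share_ones)


def accuracy(actual: Sequence[int], predicted: Sequence[int]) -> float:
    """Share of households whose predicted class matches the observed class."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)


# Hypothetical data: 1 = household cooks with LPG, 0 = it does not.
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

baseline = naive_accuracy(actual)
model = accuracy(actual, predicted)
print(baseline, model)
```

A model is only worth keeping if its accuracy clearly exceeds the naive baseline, which is why the slides quote 59% alongside every result.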
Data Mining
• 1) To improve the basic model
• 2) To discover nonlinear results
• 3) To search for relationships between variables
• Setup:
• Discard variables with small sample size n<500
after experimenting with Monte Carlo simulations
$dLPG = \alpha + \beta \cdot var$
Candidate variables (Demographics, Household, Expenditure) → Use LPG for Cooking
Is R2 > 5%?
R2 = .0785
Approximately 400 variables, n = 43,844
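The screening step can be sketched as a loop that fits one single-regressor model per candidate and keeps the candidates whose R2 clears the cutoff. This Python version is a sketch under assumptions: the slides do not show the author's loop, the variable names `var_a` to `var_c` are hypothetical, and for a single regressor the OLS R2 is computed as the squared correlation.

```python
from typing import Dict, List, Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """R2 of a one-regressor OLS fit, which equals the squared correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column explains nothing
    return sxy * sxy / (sxx * syy)


def screen(candidates: Dict[str, Sequence[float]], y: Sequence[float],
           min_r2: float = 0.05) -> List[str]:
    """Names of candidate variables whose single-regressor R2 exceeds min_r2."""
    return [name for name, x in candidates.items() if r_squared(x, y) > min_r2]


# Hypothetical data: eight households and three candidate variables.
uses_lpg = [1, 1, 1, 0, 0, 0, 1, 0]
candidates = {
    "var_a": [1, 1, 1, 0, 0, 0, 1, 0],  # moves with LPG use
    "var_b": [1, 0, 1, 0, 1, 0, 0, 1],  # unrelated to LPG use
    "var_c": [1, 1, 1, 1, 1, 1, 1, 1],  # constant
}
kept = screen(candidates, uses_lpg)
print(kept)
```

The 5% default mirrors the cutoff on the slide. With roughly 400 candidates, a loop of this kind is cheap even at the full sample size.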
$dLPG = \alpha + \beta_1 \cdot Inc + \beta_2 \cdot var$
Income + candidate variables (Demographics, Household, Expenditure) → LPG for Cooking
Is R2 > 10.85%?
Income is correlated with everything. After adding
income (R2 = 7.85%), which variables are still important?
$dLPG = \alpha + \beta \cdot i.var$
Categorical variables (Profession, Education, Location, FinWealth, House Type) → Use LPG for Cooking
Is R2 > 10%?
About 40 variables in the SES are categorical
Non-linear options
• Some traditional options:
• Natural logarithms
• Quadratic functions
• More creative alternatives:
• Use loops to vary exponents
• Deciles: I used this since it allows for non-monotonic
relationships
– Create deciles for each variable and regress
$dLPG = \alpha + \beta \cdot i.decile'var'$
Candidate variables (Demographics, Household, Expenditure), each split into deciles D1-D10 → Use LPG for Cooking
Is R2 > 15%?
Need high R2 since relatively significant
Example of non-linear relationship that
deciles can best capture
graph bar (mean) dgas, over(decilepc) ///
    ytitle("Share of Households using LPG") ///
    title("Percentage of Households that use LPG by Per Capita Decile") ///
    legend(off)
[Figure: Percentage of Households that use LPG by Per Capita Decile. Bar chart; horizontal axis deciles 1 to 10; vertical axis share of households, 0 to .8.]
Poor households cannot afford LPG, while wealthy households find LPG is an inferior
substitute to electricity for cooking
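To see why decile dummies can capture a relationship that a straight line misses, the Python sketch below compares the R2 of a linear fit with the R2 of a regression on decile dummies, whose fitted values are simply the decile means. The data are hypothetical (an inverted-U pattern across 20 households) and the helper names are illustrative.

```python
from typing import Dict, List, Sequence


def decile_bins(values: Sequence[float], bins: int = 10) -> List[int]:
    """Rank-based bins 1..bins of nearly equal size; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order):
        out[i] = rank * bins // len(values) + 1
    return out


def r2_linear(x: Sequence[float], y: Sequence[float]) -> float:
    """R2 of a straight-line fit (squared correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def r2_dummies(groups: Sequence[int], y: Sequence[float]) -> float:
    """R2 from regressing y on a full set of group dummies (fitted value = group mean)."""
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for g, v in zip(groups, y):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    mean_y = sum(y) / len(y)
    sst = sum((v - mean_y) ** 2 for v in y)
    ssr = sum((v - sums[g] / counts[g]) ** 2 for g, v in zip(groups, y))
    return 1.0 - ssr / sst


# Hypothetical inverted-U: only middle-income households use LPG.
income = list(range(1, 21))
uses_lpg = [1 if 7 <= inc <= 14 else 0 for inc in income]

r2_lin = r2_linear(income, uses_lpg)
r2_dec = r2_dummies(decile_bins(income), uses_lpg)
print(round(r2_lin, 3), round(r2_dec, 3))
```

In this toy data the straight line explains essentially none of the variation, while the decile dummies explain all of it. Real survey data will be far less clean, but the direction of the comparison is the point.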
Revised Probit Results
Using Training Data 2009
Predictions (compare to Naïve = 59% and Original Probit = 69%)
Predict Do use LPG: True = 84%
Predict Don’t use LPG: True = 77%
Weighted average correct = 80.4%
INDEPENDENT TEST DATA
Using Test Data 2011
Predictions (compare to Naïve = 61% and Original Probit = 68%)
Predict Do use LPG: True = 84%
Predict Don’t use LPG: True = 75%
Weighted average correct = 79.3%
Final Probit Model
$CookLPG = \beta_0 + \beta_1 BelowPov + \beta_2 HighInc + \beta_3 Rural + \beta_4 HHSize + \beta_5 ElectricLT90$ (original model)
$\quad + \beta_6 WashMachine + \beta_7 PipeWater + \beta_8 HouseType + \beta_9 FinWealth + \beta_{10} Province$
$\quad + \beta_{11} Fish + \beta_{12} Grains + \beta_{13} Fruits + \beta_{14} Vegetables + \varepsilon$
Analysis
• Loops for predictive models (developed in my
first essay) can be used to improve predictive
power.
• In this study of LPG usage, predictive power of
the model went from 69% to 80% when
switching from a traditional deductive
approach to a data mining approach.
Small Data vs. Big Data
Problem of the Past: Finding statistically reliable estimates with few data points
Problem of the Future: Finding efficient ways to extract information from large data sets
Advantage of the Past: Easy to understand data; Deductive Reasoning Possible
Advantage of the Future: Lots of data adds new information; Inductive Reasoning adds value
• By searching for stories and interactions we didn’t think of, or
know about, e.g. cluster analysis, forgotten variables….
• Hypothesis/Objective: Big Data will allow us to improve the
accuracy of our Probit model from 69% to 80%
Lessons and Challenges of Big Data
• Statistical significance (t-stat) is not very useful, since
almost everything is significant with big data; a minimum
R2 is needed instead.
• Adjusted R2 is not useful as a test for discarding variables,
since it will barely decrease; whether a variable adds to
predictability is the better test.
• Non-linear relationships are likely, but traditional
solutions are ad hoc. Non-parametric solutions such as
deciles may be a substitute.
• Over-fitting is a danger, so using separate data for
designing and testing model is important.
• Measurement errors in original data due to poor survey
design are a concern that cannot be solved by increasing
sample size. Increased care is needed in collecting data.
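The first lesson can be made concrete with the identity that links the slope's t-statistic to R2 in a single-regressor OLS: $t = \sqrt{R^2 (n - 2) / (1 - R^2)}$. The Python sketch below applies it at two sample sizes; the R2 of 0.001 is a hypothetical value for a nearly useless variable.

```python
import math


def t_stat_from_r2(r2: float, n: int) -> float:
    """Absolute t-statistic of the slope in a one-regressor OLS with this R2 and n."""
    return math.sqrt(r2 * (n - 2) / (1.0 - r2))


N_SES_2009 = 43_844  # sample size quoted in the slides
TINY_R2 = 0.001      # hypothetical: the variable explains 0.1% of the variance

t_small_sample = t_stat_from_r2(TINY_R2, 400)
t_big_sample = t_stat_from_r2(TINY_R2, N_SES_2009)
print(round(t_small_sample, 2), round(t_big_sample, 2))
```

The same negligible explanatory power is insignificant at n = 400 but passes the usual 1.96 threshold comfortably at the full survey size, which is why a minimum R2 is the more informative screen.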