Data Mining and the Efficacy of Government Policy
by Daniel Ray Lewis

• Data Mining
• Data Mining in Developing Countries
• Welfare Programs and Energy Subsidies
• Theory Based Economic Optimization
• Modern Development Economics

Data Mining and Government Policy
• Modern companies understand their customers very well (Target and pregnancy story)
• Governments, especially governments in developing countries, understand clients much less so

Asian Subsidies for Energy (as a % of total government revenues)

Country            Petroleum  Electricity  Natural gas  Coal   Total
Bangladesh              7.56        22.12        13.45     0   43.13
Bhutan                  1.39         n.a.         n.a.  n.a.    1.39
Brunei Darussalam       3.77         1.57            0     0    5.34
Cambodia                   0         n.a.         n.a.  n.a.       0
China                      0         0.68         n.a.  n.a.    0.68
India                   6.75         1.72          0.9     0    9.37
Indonesia              14.51         3.69            0     0    18.2
Laos P.D.R.                0         n.a.         n.a.  n.a.       0
Malaysia                5.67         1.49         1.41     0    8.57
Myanmar                 9.35         n.a.         n.a.  n.a.    9.35
Pakistan                1.02        10.23        19.89     0   31.14
Philippines                0            0            0     0       0
Singapore                  0         n.a.         n.a.  n.a.       0
Sri Lanka               7.99         3.26            0     0   11.25
Thailand                0.66         7.24         0.61  1.08    9.59

Source: IMF 2013 (data generally from 2011)

Governments spend an enormous sum on aid programs, most especially on subsidizing energy

This money could be much better spent if governments understood their clients better – but is it possible?
• Where is the money spent? We don’t know
• Who receives the benefit? We don’t know
• Is there any corruption? We don’t know
• Is the benefit worth the cost? We don’t know
• Do people like us/vote for us? We don’t know

Redistributive Taxation
Are Energy Subsidies worthwhile?
• Fundamental Theorems of Welfare Economics
• T1: Any Walrasian equilibrium leads to a Pareto efficient allocation of resources.
• T2: Any Pareto efficient outcome that we could desire, can be achieved by enacting a lump-sum wealth redistribution.
• Advice from developed countries generally centers on removing any distortion to prices as inefficient: changes in price are held to reduce utility, and the same welfare outcome could be achieved through taxation.

But…Who are the poor? How can we persuade the wealthy to help?
• 60 percent of Thai persons do not receive a paycheck. How does someone prove that he is poor?
• Alatas, Banerjee, et al. (2012), in an AER paper, describe 3 ways to identify the poor in Indonesian villages. Although all are successful, all three methods are expensive and difficult to implement.
• India invested enormous energy and resources into identifying its poor, but the data are 20 years old and so no longer valid.
• Developing countries cannot always force the wealthy to contribute as much as desired.
• Subsidies are okay if they are appropriately targeted

Modern Development Economics
Behavioral Economics uses known foibles of human behavior to solve poverty issues
• Low Hanging Fruit
  – Chlorine in drinking water
  – Cheap mosquito nets
  – Paying people to study
Data Mining can support Behavioral Economics objectives by better understanding clients.

Behavioral Economic Solutions related to Energy
• Information (Prius Dashboard): information about real-time energy use due to driving style.
• Salience: smart electricity meters in the home remind consumers about usage.
• Social Approval: comparison with neighbors works better than environment or price.

Randomized Controlled Trials (RCT) look at responses of closely matched populations to economic interventions.
Data Mining can substitute for Randomized Controlled Trials when some regions experience a policy and similar regions do not.

Data Mining and the Efficacy of Government Policy
• THREE OBJECTIVES
• Objective 1: To determine who receives the benefit from government policies (within a province or area, income range, profession, etc.)
• Objective 2: To determine the effect of government policies on regional GDP
• Objective 3: To determine the effects of government policies on voting patterns

Data Mining and the Efficacy of Government Policy
• Relevant Government Policies
  – First Car Policy
  – Jam Nam Khaaw (Rice pledging :-)
  – Free Electricity
  – Subsidized Diesel
  – Gasohol
• The strategy is to use Data Mining to better understand and better target these policies.
• Begin by building a time series of variables

Other papers that use big data to research Economic Policies
• Big microdata for population research
  – S. Ruggles, Demography, 2014
  – Discusses using big data for analysis of populations
• Big-data applications in the government sector
  – GH Kim, S Trimi, JH Chung, Communications of the ACM, 2014
  – Looks at ways big data can be used to inform government policy
• The Role of Information in Perception of Fossil-Fuel Subsidy Reform: Evidence from Indonesia
  – R Pradiptyo, A Wirotomo, A Adisasmita…, Available at SSRN, 2015
  – Looks at perceptions of energy reform based on demographic and economic variables

Data Used
• SES Data
  – 2006: 570 variables, 44,918 sample size
  – 2007: 566 variables, 43,055 sample size
  – 2008: 385 variables, 44,969 sample size
  – 2009: 595 variables, 43,844 sample size
  – 2010: 392 variables, 44,273 sample size
  – 2011: 545 variables, 42,192 sample size
  – 2012: 411 variables, 43,762 sample size
• NESDB
  – 19 years, 77 provinces, 23 subcategories
• Electoral Commission of Thailand
  – 2007: 76 provinces, 41 parties
  – 2011: 77 provinces, 40 parties
• Altogether a bit over 20 million observations

Pseudo-panel and Sample Size Issues
• The SES is a survey, not true panel data.
• There is, however, stability in the means of most data. From the central limit theorem, the standard error of a sample mean is approximately σ/√n, so that, with a sample size of 400, √n = 20, and unless the variance is very large, the mean remains nearly the same from sample to sample, as observed empirically.
• As the SES survey consists of about 44,000 observations, it is generally possible to divide each survey into as many as 100 pieces and still get stable means.
• The “pieces” could be provinces, professions, months, regions, or a combination of these or other variables.

In the expenditure database I employ three versions of each variable
• A) The average expenditure on a good, averaged over all possible households even if they don’t buy
• B) The average expenditure on a good, averaged over only those households that do buy
• C) The share of all possible households that do buy the product.
• All 3 are needed to answer different questions.

General form of the equations

Version A) $\text{variable}_i = \dfrac{\sum_{j=1}^{m} \sum_{k=1}^{n} x_{ijk}}{pq}$

Version B) $\text{variable}_i = \dfrac{\sum_{j=1}^{m} \sum_{k=1}^{n} x_{ijk}}{mn}$

Version C) $\text{Share} = \text{variable}_i = \dfrac{mn}{pq} \times 100$

• p = possible incidences of the j variable
• q = possible incidences of the k variable
• m = actual incidences of the j variable
• n = actual incidences of the k variable
all within the sample for var i.

Some common divisions of the data
• by month (12) and expenditure decile (10)
• by province (77)
• by month (12) and region (5)
• by month (12) and profession (7)
• … don’t know others yet.
• Maybe median works better than mean?

Results

[Figure: Percent of Households Receiving Free Electricity by Income, monthly, 2006m1 to 2012m1; series: poorest, decile 2, decile 3, decile 4, decile 5, decile 6. Source: SES various years]

[Figure: Percent of Households Paying for Electricity by Income, monthly, 2006m1 to 2012m1; series: poorest, decile 2, decile 3, decile 4, decile 5, decile 6]

Analysis
• The Samak government initiated a lifeline policy in mid-2008 to give free electricity to those who use less than 90 kWh per month. The Yingluk government reduced the lifeline amount to 50 kWh in mid-2012.
• The lifeline policy initially results in 80% of the poorest decile households getting free electricity.
But 60% of the 6th decile households (wealthier than average) also get free electricity.
• The modified policy (50 kWh) strongly reduces false positives, so that very few from decile 6 get free electricity, but at the cost that only half of the targeted decile 1 group now receive benefits.

[Figure: Percent of Households Using Gasohol by Income, monthly, 2007m1 to 2012m1; series: decile 5, decile 6, decile 7, decile 8, decile 9, wealthiest. Source: SES various years]

Analysis
• Subsidizing gasohol was primarily a subsidy for the wealthiest decile and producers until the policy changed in 2013 to require gas stations to replace benzene 91 with gasohol.
• There was a widespread belief that gasohol would damage motorcycle engines, and so most persons did not switch until required to do so, despite a small price advantage.
• In 2013 there should be a massive upward spike in gasohol usage; waiting for 2013 data to see it.

Average LPG use by decile
[Figure: Monthly use of LPG by Income Deciles, 2007m1 to 2012m1, y-axis 0 to 150; series: poorest, decile 2, decile 4, decile 6, wealthiest]

Average LPG use by decile
[Figure: Monthly use of LPG by Income Deciles, 2007m1 to 2012m1, y-axis 150 to 350; series: poorest, decile 2, decile 4, decile 6, wealthiest]

Analysis
• The first graph shows the amount spent on average by each expenditure decile (variable type A).
• The second graph shows the amount spent only by those in the decile with positive expenditure on LPG during the month (variable type B).
• Interpretation: a tank of LPG costs about 300 baht, no matter what your income decile, but wealthier deciles buy those tanks more frequently.
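
The difference between variable types A, B and C in the LPG graphs can be illustrated with a minimal Python sketch. The function name `expenditure_versions` and the ten-household cell are hypothetical and made up for illustration; they are not taken from the SES data or from the author's code.

```python
from statistics import mean


def expenditure_versions(spending):
    """Return versions A, B and C for one cell (e.g. one decile in one month).

    spending: household expenditures on a good, with 0 for non-buyers.
    """
    buyers = [x for x in spending if x > 0]
    version_a = sum(spending) / len(spending)      # A: mean over all households
    version_b = mean(buyers) if buyers else 0.0    # B: mean over buyers only
    version_c = 100 * len(buyers) / len(spending)  # C: share of households buying (%)
    return version_a, version_b, version_c


# Hypothetical cell: 10 households, 4 of which bought one 300-baht LPG tank.
cell = [300, 0, 0, 300, 0, 0, 300, 0, 0, 300]
a, b, c = expenditure_versions(cell)
print(a, b, c)
```

In this made-up cell, version A equals version B times version C divided by 100. That identity is why a flat type B series combined with a rising type A series points to more frequent purchases rather than more expensive ones.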
[Figure: First Car Policy - Share of Decile Owning a Car, monthly, 2007m1 to 2012m1; series: decile 5, decile 7, decile 8, decile 9, wealthiest. Source: SES various years]

Analysis
• Starting in 2012, the government reduced the tax on the first car a person bought, with the intention of extending car ownership to a greater share of the population who previously could not afford to buy.
• (The policy was also meant to help car manufacturers, who were badly hurt by the flood a year earlier.)
• Interpretation: This graph shows little evidence that car ownership increased among poorer deciles; in fact, the opposite seems to be the case.

Other Data Mining Projects
• School Bus Use
• Condom Use
• Who Donates to Charity
• At Risk Households (student won Setathat for this)
• Political Analysis

Decision Tree – Who uses the School Bus
Basic version – not too useful

Decision Tree – Who uses the School Bus
Better Version

Analysis
• Decision trees are a useful tool to help behavioral economics discover exploitable relationships.
• Although it may not be so easy to link school bus use with iron purchases.
• The use of the internet may make scheduling more efficient, or marketing more profitable.

Politics – Who votes for Red parties? Unexpected ties to regional GDP?

Politics – Are some groups Unrepresented?
People who voted for parties that have no representation in the government.

Analysis
• Analysis of political affiliation allows us to see which parties benefit from promoting which sectors, suggesting directions for new economic policies.
• Ties to regional GDP allow us to track whether past policies are cost effective.
• Voting records may help indicate whether the policy worked as public relations.

Objective - Who Uses LPG?
59% of households in Thailand used LPG as primary cooking fuel in 2009.
The original Probit model focused on whether poor people are the primary recipients.
The model needs to predict both whether households
1) Do use LPG
2) Don’t use LPG

$\text{CookLPG} = \beta_0 + \beta_1 \text{BelowPov} + \beta_2 \text{HighInc} + \beta_3 \text{Rural} + \beta_4 \text{HHSize} + \beta_5 \text{ElectricLT90}$

Original Probit Results
Predictions (compare to Naïve = 59%)
• Predict Do use LPG: True = 76%
• Predict Don’t use LPG: True = 62%
• Weighted average correct = 69%

Marginals - We are measuring the effect of switching a binary value from 0 to 1. Imagine we start at probability = .59

               dy/dx   Std. Err.
belowpov       -.246   .014***
high_inc        .178   .007***
rural          -.042   .006***
hhsize          .035   .002***
electricLT90   -.246   .006***

e.g. if free electricity (electricLT90 = 1), Pr(CookLPG) = .59 - .246 = .344

Predictive Accuracy – Can we Improve on these results?
• Naïve - No model: 59% Accurate
• Probit – Deductive: 69% Accurate
• Our Goal: 80% Accurate

Data Mining
• 1) To improve the basic model
• 2) To discover nonlinear results
• 3) To search for relationships between variables
• Setup:
• Discard variables with small sample size (n < 500) after experimenting with Monte Carlo simulations

$dLPG = \alpha + \beta \cdot \texttt{`var'}$
[Diagram: Demographics, Household and Expenditure variables, each regressed on Use LPG for Cooking]
Is R2 > 5%? (R2 = .0785)
Approximately 400 variables, n = 43,844

$dLPG = \alpha + \beta_1 \cdot Inc + \beta_2 \cdot \texttt{`var'}$
[Diagram: Income plus each Demographics, Household and Expenditure variable, regressed on Use LPG for Cooking]
Is R2 > 10.85%?
Income is correlated with everything. After adding income (R2 = 7.85%), which variables are still important?

$dLPG = \alpha + \beta \cdot \texttt{i.`var'}$
[Diagram: Profession, Education, Location, FinWealth and House Type, each regressed on Use LPG for Cooking]
Is R2 > 10%?
About 40 variables in the SES are categorical

Non-linear options
• Some traditional options:
  – Natural logarithms
  – Quadratic functions
• More creative alternatives:
  – Use loops to vary exponents
  – Deciles: I used this since it allows for non-monotonic relationships
  – Create deciles for each variable and regress

$dLPG = \alpha + \beta \cdot \texttt{i.decile`var'}$
[Diagram: Demographics, Household and Expenditure variables, each split into deciles D1-D10 and regressed on Use LPG for Cooking]
Is R2 > 15%?
Need high R2 since relatively significant

Example of non-linear relationship that deciles can best capture

graph bar (mean) dgas, over(decilepc) ytitle(Share of Households using LPG) title(Percentage of Households that use LPG by Per Capita Decile) legend(off)

[Figure: bar chart, Percentage of Households that use LPG by Per Capita Decile; x-axis deciles 1 to 10, y-axis 0 to .8]

Poor households cannot afford LPG, while wealthy households find LPG is an inferior substitute to electricity for cooking.

Revised Probit Results
Using Training Data 2009
Predictions (compare to Naïve = 59% and Original Probit = 69%)
• Predict Do use LPG: True = 84%
• Predict Don’t use LPG: True = 77%
• Weighted average correct = 80.4%

INDEPENDENT TEST DATA
Using Test Data 2011
Predictions (compare to Naïve = 61% and Original Probit = 68%)
• Predict Do use LPG: True = 84%
• Predict Don’t use LPG: True = 75%
• Weighted average correct = 79.3%

Final Probit Model

$\text{CookLPG} = \underbrace{\beta_0 + \beta_1 \text{BelowPov} + \beta_2 \text{HighInc} + \beta_3 \text{Rural} + \beta_4 \text{HHSize} + \beta_5 \text{ElectricLT90}}_{\text{Original Model}} + \beta_6 \text{WashMachine} + \beta_7 \text{PipeWater} + \beta_8 \text{HouseType} + \beta_9 \text{FinWealth} + \beta_{10} \text{Province} + \beta_{11} \text{Fish} + \beta_{12} \text{Grains} + \beta_{13} \text{Fruits} + \beta_{14} \text{Vegetables}$

Analysis
• Loops for predictive models (developed in my first essay) can be used to improve predictive power.
• In this study of LPG usage, predictive power of the model went from 69% to 80% when switching from a traditional deductive approach to a data mining approach.

Small Data vs. Big Data

Problem of the Past: Finding statistically reliable estimates with few data points
Problem of the Future: Finding efficient ways to extract information from large data sets
Advantage of the Past: Easy to understand data; Deductive Reasoning possible
Advantage of the Future: Lots of data adds new information; Inductive Reasoning adds value
• By searching for stories and interactions we didn’t think of, or know about, e.g. cluster analysis, forgotten variables….
• Hypothesis/Objective: Big Data will allow us to improve the accuracy of our Probit model from 69% to 80%

Lessons and Challenges of Big Data
• Statistical significance (t-stat) is not very useful since almost everything is significant with big data; a minimum R2 is needed instead.
• Alternative-R2 is not useful as a test for discarding variables since it will barely decrease; the test should be whether variables add to predictability instead.
• Non-linear relationships are likely, but traditional solutions are ad hoc. Non-parametric solutions such as deciles may be a substitute.
• Over-fitting is a danger, so using separate data for designing and testing the model is important.
• Measurement errors in the original data due to poor survey design are a concern that cannot be solved by increasing sample size. Increased care is needed in collecting data.
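
The decile screening idea from the slides (regress the LPG dummy on decile dummies of each candidate variable and keep those above a minimum R2) can be sketched in Python. This is an illustration under stated assumptions, not the author's Stata loop: the function names, the 1,000 made-up households and the `unrelated` variable are all hypothetical. It relies on the fact that an OLS regression on a full set of dummies has the group means as fitted values.

```python
def decile_of(values):
    """Assign each observation a decile 1..10 by rank (ties keep input order)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    deciles = [0] * len(values)
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // len(values) + 1
    return deciles


def r2_on_dummies(y, groups):
    """R2 of an OLS regression of y on a full set of group dummies.

    With only dummies on the right-hand side, the fitted value for each
    observation is its group mean, so no matrix algebra is needed.
    """
    overall = sum(y) / len(y)
    sums, counts = {}, {}
    for yi, g in zip(y, groups):
        sums[g] = sums.get(g, 0.0) + yi
        counts[g] = counts.get(g, 0) + 1
    ssr = sum((yi - sums[g] / counts[g]) ** 2 for yi, g in zip(y, groups))
    sst = sum((yi - overall) ** 2 for yi in y)
    return 1 - ssr / sst


def r2_linear(y, x):
    """R2 of a simple linear regression of y on x (squared correlation)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def screen(y, candidates, threshold=0.15):
    """Keep candidates whose decile dummies explain more than `threshold` of y."""
    kept = {}
    for name, values in candidates.items():
        r2 = r2_on_dummies(y, decile_of(values))
        if r2 > threshold:
            kept[name] = r2
    return kept


# Stylized, made-up data: 1,000 households ranked by income, where only the
# middle of the distribution (deciles 3 to 8) cooks with LPG.
income = list(range(1000))
dlpg = [1 if 200 <= v < 800 else 0 for v in income]
unrelated = [v % 10 for v in income]  # hypothetical variable with no link to LPG

kept = screen(dlpg, {"income": income, "unrelated": unrelated})
print(sorted(kept))             # variables that pass the 15% screen
print(r2_linear(dlpg, income))  # a straight-line fit misses the inverted U
```

The example is deliberately extreme: the inverted-U pattern is invisible to a linear term but fully captured by decile dummies. With real survey data the R2 values would be far lower, and the selected variables would still need to be checked on separate test data, as the over-fitting bullet above warns.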