Data Mining as Exploratory Data Analysis

Zachary M. Jones & Fridolin Linder, Pennsylvania State University
http://github.com/zmjones/datamining/

Summary

A great deal of data in political science is observational and constitutes an apparent population (especially in comparative politics and international relations). Because social systems are complex, causal inference is difficult (many confounders, poor measurement, endogeneity), and randomization is by nature rare: inference about population quantities or causality is often not supportable!

We suggest that in such circumstances it can be fruitful to conduct data analysis in an exploratory manner. Instead of estimating the parameters of a (restrictive) assumed parametric model and giving them a causal interpretation, potentially interesting patterns can be learned from the data using statistical learning algorithms. Exploratory data analysis using statistical learning can support future theoretical and empirical work by identifying variables that predict the outcome of interest well (see Hill Jr. and Jones (2014)), possible interactions of arbitrary dimension, and nonlinearities in marginal or joint relationships with the outcome of interest.

In the paper (online soon) we discuss the epistemology of statistical modeling in political science and the different types of statistical analysis (inferential, predictive, exploratory), argue that exploratory data analysis is common but not acknowledged as such, and show the advantages of using statistical learning for exploratory analysis with random forests in two empirical examples: human rights/state repression and government duration.

Figure panels: Marginal Dependence (Figure 2), Marginal Pairwise Dependence (Figure 3).

CART/Random Forests

Classification and regression trees (CART) are a regression method that relies on indicator functions as bases.
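To make the "indicator functions as bases" description concrete, here is a minimal sketch of a regression tree on a single predictor. It is not the implementation used on this poster: the function names (`fit_tree`, `predict`), the squared-error split criterion, the `min_leaf` rule, and the toy data are all assumptions chosen for illustration. The fitted tree is stored as a list of regions, and the prediction is a sum of region constants multiplied by indicators of region membership.

```python
from statistics import mean

def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def fit_tree(xs, ys, depth=1, min_leaf=2):
    """Recursively split one predictor; return regions as (low, high, constant)."""
    def grow(lo, hi, pts, d):
        ys_here = [y for _, y in pts]
        best = None
        if d > 0:
            values = sorted({x for x, _ in pts})
            # Candidate thresholds: midpoints between adjacent distinct x values.
            for a, b in zip(values, values[1:]):
                t = (a + b) / 2
                left = [y for x, y in pts if x <= t]
                right = [y for x, y in pts if x > t]
                if len(left) < min_leaf or len(right) < min_leaf:
                    continue
                cost = sse(left) + sse(right)
                if best is None or cost < best[0]:
                    best = (cost, t)
        # Stop when no admissible split reduces the squared error.
        if best is None or best[0] >= sse(ys_here):
            return [(lo, hi, mean(ys_here))]
        t = best[1]
        return (grow(lo, t, [p for p in pts if p[0] <= t], d - 1)
                + grow(t, hi, [p for p in pts if p[0] > t], d - 1))
    return grow(float("-inf"), float("inf"), list(zip(xs, ys)), depth)

def predict(regions, x):
    """f(x) = sum over regions of constant * 1(low < x <= high)."""
    return sum(c * (1 if lo < x <= hi else 0) for lo, hi, c in regions)

# Toy data (made up for illustration): a step from about 1 to about 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.8, 1.0, 5.0, 5.2, 4.8, 5.0]
regions = fit_tree(xs, ys, depth=1)
print(regions)
print(predict(regions, 2), predict(regions, 7))
```

Because the regions partition the real line, exactly one indicator is nonzero for any `x`, so the sum simply returns that region's mean. Real CART implementations handle many predictors, categorical splits, and other stopping rules.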
Random forests, developed by Breiman (2001), are ensembles of CART in which each tree is fit to a subsample or bootstrap sample of the data and a random subset of variables is considered at each splitting node. We use the implementation of Hothorn, Hornik and Zeileis (2006) and present variable importance and partial dependence plots as tools to interpret the fitted models. See Strobl, Malley and Tutz (2009) for an introduction to CART and random forests.

Importance

Figure 1: The average decrease in the proportion of observations correctly classified that results from permuting the variable indicated on the y-axis, computed over repeated samples from the posterior of the latent measure of state repression developed by Fariss (2014).

Figure 2: The average predicted value of a random draw from the posterior of the latent measure of state repression developed by Fariss (2014) at each value of the explanatory variable indicated by the grey bar at the top, averaged across the values of all other explanatory variables.

Figure 3: The average predicted value of the mean of the latent measure of repression developed by Fariss (2014) at each value of a pair of explanatory variables, averaged over the values of all other explanatory variables. The pairs displayed are a subset of the 528 possible pairs. In the grey label at the top of each plot, the x-axis is the first variable and the y-axis the second. Red indicates a pairwise relationship with more respect for human rights.

References

Breiman, Leo. 2001. “Random forests.” Machine Learning 45(1):5–32.

Fariss, Christopher J. 2014. “Respect for Human Rights has Improved Over Time: Modeling the Changing Standard of Accountability.” American Political Science Review pp. 1–22.

Hill Jr., Daniel W. and Zachary M. Jones. 2014. “An Empirical Evaluation of Explanations for State Repression.” American Political Science Review 108(3):1–27.

Hothorn, Torsten, Kurt Hornik and Achim Zeileis. 2006. “Unbiased recursive partitioning: A conditional inference framework.” Journal of Computational and Graphical Statistics 15(3).

Strobl, Carolin, James Malley and Gerhard Tutz. 2009. “An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests.” Psychological Methods 14(4):323.
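The two interpretation tools described in the captions for Figures 1 and 2, permutation importance and partial dependence, can be sketched in a few lines. This is an illustration under stated assumptions, not the analysis behind the figures: the fitted forest is replaced by a hand-written stand-in classifier (`rule`), the data are simulated, the function names are hypothetical, and the repeated sampling from the posterior of the latent repression measure is omitted.

```python
import random

def permutation_importance(predict, X, y, column, n_repeats=20, seed=0):
    """Mean drop in the share correctly classified after shuffling one column."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in X]
        rng.shuffle(shuffled)
        # Replace only the chosen column; all other columns stay as observed.
        permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(X, shuffled)]
        drops.append(baseline - accuracy(permuted))
    return sum(drops) / n_repeats

def partial_dependence(predict, X, column, grid):
    """Average prediction with `column` fixed at each grid value, over observed rows."""
    averages = []
    for v in grid:
        preds = [predict(r[:column] + [v] + r[column + 1:]) for r in X]
        averages.append(sum(preds) / len(preds))
    return averages

def rule(row):
    """Stand-in for a fitted model (assumption): uses column 0, ignores column 1."""
    return 1 if row[0] > 0.5 else 0

# Simulated data (assumption): two uniform predictors, labels generated by `rule`.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [rule(r) for r in X]

importance_signal = permutation_importance(rule, X, y, column=0)
importance_noise = permutation_importance(rule, X, y, column=1)
pd_signal = partial_dependence(rule, X, column=0, grid=[0.0, 1.0])
pd_noise = partial_dependence(rule, X, column=1, grid=[0.0, 1.0])
print(importance_signal, importance_noise)
print(pd_signal, pd_noise)
```

Shuffling a variable the model ignores leaves accuracy unchanged, so its importance is zero, and its partial dependence is flat across the grid. The pairwise version in Figure 3 follows the same recipe with two columns fixed at a time.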