Data Mining as Exploratory Data Analysis
Zachary M. Jones & Fridolin Linder, Pennsylvania State University
http://github.com/zmjones/datamining/
Summary
A great deal of data in political science is observational and constitutes an apparent population (especially
in comparative politics and international relations). Because of the complexity of social systems, causal inference is difficult (many confounders, poor measurement, endogeneity), and randomization by nature is rare:
inference about population quantities or about causal effects is often not supportable!
We suggest that in such circumstances it can be fruitful to conduct data analysis in an exploratory
manner. Instead of estimating the parameters of a (restrictive) assumed parametric model and giving
them a causal interpretation, potentially interesting patterns can be learned from the data using statistical
learning algorithms.
Exploratory data analysis using statistical learning can support future theoretical and empirical work by
identifying variables that predict the outcome of interest well (see Hill Jr. and Jones 2014), possible
interactions of arbitrary dimension, and nonlinearities in marginal or joint relationships with the outcome.
In the paper (online soon) we discuss the epistemology of statistical modeling in political science and different
types of statistical analysis (inferential, predictive, exploratory), argue that exploratory data analysis is
common but rarely acknowledged as such, and demonstrate the advantages of statistical learning for exploratory
analysis using random forests in two empirical examples: human rights/state repression and government
duration.
CART/Random Forests
Classification and regression trees (CART) are a regression method that uses indicator functions as basis
functions. Random forests (Breiman 2001) are ensembles of CARTs in which each tree is grown on a subsample
or bootstrap sample of the data and a random subset of variables is considered at each split. We use the
conditional inference implementation of Hothorn, Hornik and Zeileis (2006) and present variable importance
and partial dependence plots as tools for interpreting the fitted models. See Strobl, Malley and Tutz (2009)
for an introduction to CART and random forests.
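The paper fits conditional inference forests with the R implementation of Hothorn, Hornik and Zeileis (2006). As a rough sketch of the general workflow only, the snippet below fits an ordinary random forest regression with scikit-learn on simulated data; the variable names and data-generating process are invented for illustration and are not from the paper.

    # Minimal sketch: fit a random forest regression on simulated data.
    # The paper uses conditional inference forests (R package of Hothorn et al. 2006);
    # scikit-learn's RandomForestRegressor is substituted here purely for illustration.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    n = 1000

    # Hypothetical covariates standing in for the country-year explanatory variables.
    X = pd.DataFrame({
        "gdp_per_capita": rng.lognormal(8, 1, n),
        "population": rng.lognormal(15, 1.5, n),
        "civil_war": rng.binomial(1, 0.1, n),
        "judicial_independence": rng.uniform(0, 1, n),
    })

    # Simulated outcome standing in for the latent repression measure (Fariss 2014).
    y = (0.5 * np.log(X["gdp_per_capita"]) - 1.5 * X["civil_war"]
         + 2.0 * X["judicial_independence"] + rng.normal(0, 0.5, n))

    # Each tree is grown on a bootstrap sample with a random subset of variables
    # considered at every split (Breiman 2001).
    forest = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                                   oob_score=True, random_state=0)
    forest.fit(X, y)
    print("out-of-bag R^2:", round(forest.oob_score_, 3))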
Marginal Dependence
Figure 2: The average predicted value of a random draw from the posterior of the latent measure of state repression developed
by Fariss (2014) at each value of the explanatory variable indicated by the grey bar at the top, averaged across the values of all
other explanatory variables.
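Figure 2 is a partial dependence plot: the focal variable is swept over a grid while all other covariates are held at their observed values, and the model's predictions are averaged at each grid point. A minimal sketch of that computation, continuing the simulated example above (all names hypothetical):

    # Partial dependence of the predicted outcome on one variable: vary it over a
    # grid, hold every other covariate at its observed value, and average the
    # predictions at each grid point. Uses `forest` and `X` from the sketch above.
    import numpy as np

    def partial_dependence(model, X, variable, grid_size=20):
        grid = np.linspace(X[variable].min(), X[variable].max(), grid_size)
        averaged = []
        for value in grid:
            X_mod = X.copy()
            X_mod[variable] = value  # set the focal variable everywhere
            averaged.append(model.predict(X_mod).mean())
        return grid, np.array(averaged)

    grid, pd_curve = partial_dependence(forest, X, "judicial_independence")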
Importance
Figure 1: The average decrease in the proportion of observations correctly classified that results from permuting the variable
indicated on the y-axis, computed by repeatedly sampling from the posterior of the latent measure of state repression developed
by Fariss (2014).
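The importance measure in Figure 1 is permutation importance: a variable's values are shuffled, breaking its association with the outcome, and the resulting drop in predictive accuracy is recorded. The sketch below illustrates the idea on the simulated regression example (using R^2 rather than the classification accuracy reported in the figure, and omitting the averaging over posterior draws of the repression measure):

    # Permutation importance: shuffle one column at a time and record how much
    # predictive accuracy drops relative to the unpermuted data.
    # Uses `forest`, `X`, and `y` from the first sketch.
    import numpy as np

    def permutation_importance(model, X, y, n_repeats=10, seed=0):
        rng = np.random.default_rng(seed)
        baseline = model.score(X, y)  # R^2 for a regression forest
        drops = {}
        for variable in X.columns:
            scores = []
            for _ in range(n_repeats):
                X_perm = X.copy()
                X_perm[variable] = rng.permutation(X_perm[variable].values)
                scores.append(model.score(X_perm, y))
            drops[variable] = baseline - np.mean(scores)
        return drops

    print(permutation_importance(forest, X, y))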
Marginal Pairwise Dependence
Figure 3: The average predicted value of the mean of the latent measure of repression developed by Fariss (2014) at each
value of a pair of explanatory variables, averaged over the values of all other explanatory variables. The pairs displayed are a
subset of the 528 possible pairs. In the grey label at the top of each plot, the first variable is on the x-axis and the second on
the y-axis. Red indicates a pairwise relationship associated with more respect for human rights.
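The pairwise plots in Figure 3 apply the same logic to a joint grid over two variables, which can surface interactions that the marginal plots miss. A sketch, again continuing the simulated example:

    # Two-way partial dependence: evaluate the averaged prediction over a joint
    # grid of two variables, holding all other covariates at their observed values.
    # Uses `forest` and `X` from the first sketch.
    import numpy as np

    def pairwise_partial_dependence(model, X, var1, var2, grid_size=10):
        g1 = np.linspace(X[var1].min(), X[var1].max(), grid_size)
        g2 = np.linspace(X[var2].min(), X[var2].max(), grid_size)
        surface = np.empty((grid_size, grid_size))
        for i, v1 in enumerate(g1):
            for j, v2 in enumerate(g2):
                X_mod = X.copy()
                X_mod[var1] = v1
                X_mod[var2] = v2
                surface[i, j] = model.predict(X_mod).mean()
        return g1, g2, surface  # the surface is what the heatmaps display

    g1, g2, surface = pairwise_partial_dependence(forest, X, "gdp_per_capita", "civil_war")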
References
Breiman, Leo. 2001. "Random Forests." Machine Learning 45(1):5–32.
Fariss, Christopher J. 2014. "Respect for Human Rights has Improved Over Time: Modeling the Changing
Standard of Accountability." American Political Science Review pp. 1–22.
Hill Jr., Daniel W. and Zachary M. Jones. 2014. "An Empirical Evaluation of Explanations for State
Repression." American Political Science Review 108(3):1–27.
Hothorn, Torsten, Kurt Hornik and Achim Zeileis. 2006. "Unbiased Recursive Partitioning: A Conditional
Inference Framework." Journal of Computational and Graphical Statistics 15(3).
Strobl, Carolin, James Malley and Gerhard Tutz. 2009. "An Introduction to Recursive Partitioning: Rationale,
Application, and Characteristics of Classification and Regression Trees, Bagging, and Random Forests."
Psychological Methods 14(4):323.