STA 490H1S
Initial Examination of Data

Alison L. Gibbs
Department of Statistics
University of Toronto
Winter 2011

Course mantra

It’s OK not to know. Expressing ignorance is encouraged.
It’s not OK to not have a willingness to learn.

Initial Examination of Data

Purpose:
- Understand the structure of the data. Types of variables:
  - Quantitative: continuous or discrete
  - Categorical: nominal, ordinal (e.g., Likert scales or binned quantitative), binary
- Check the quality of the data.
  - Find errors (data cleaning).
  - Check for credibility, consistency, completeness.
  - Identify potential outliers.
  - Are there missing observations?
- Clear up any problems.
- Get ideas for more sophisticated analyses.
- Check whether the assumptions of more sophisticated analyses seem reasonable.

IDA

- Should be motivated by original research questions.
- Avoid data dredging. (Look long enough and you’ll find some meaningless pattern.)
- Trivial? Requires judgment and common sense.

Types of Missing Data

1. Missing Completely At Random (MCAR): The probability that a data value is missing does not depend on the missing value, nor on the values of any other variables.
2. Missing At Random (MAR): The probability that a data value is missing, conditional on the values of the other variables for the observation, is not related to the missing value.
3. Informative / Non-ignorable (NMAR): Difficult to deal with.

Tools for IDA

- 5 number summary (for all data and for subsets).
- Other summary statistics, e.g., mean and s.d.
- Histograms / stem-and-leaf plots.
- Frequency tables (1- and 2-way) for categorical variables.
- Scatterplots.
- Correlations.
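
Several of the basic tools for IDA can be tried on a small table of records. The following is a minimal sketch in Python using only the standard library. The toy `height`, `weight`, `group`, and `sex` values are hypothetical, missing observations are encoded as `None` (an assumption; the slides do not prescribe an encoding), and the helper names are illustrative rather than taken from the course.

```python
import math
import statistics
from collections import Counter

# Hypothetical toy data; None marks a missing observation (an assumed encoding).
height = [150, 160, None, 170, 180, 190]
weight = [50, 60, 65, None, 80, 90]
group = ["a", "b", "a", None, "a", "b"]
sex = ["f", "m", "f", "m", "m", "f"]


def n_missing(values):
    """Count the missing observations in one variable."""
    return sum(v is None for v in values)


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) of the observed values."""
    xs = sorted(v for v in values if v is not None)
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return xs[0], q1, med, q3, xs[-1]


def pearson(xs, ys):
    """Pearson correlation using only pairs where both values are observed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


observed = [h for h in height if h is not None]
# 1-way and 2-way frequency tables for the categorical variables.
group_table = Counter(g for g in group if g is not None)
group_by_sex = Counter((g, s) for g, s in zip(group, sex) if g is not None)

print("missing heights:", n_missing(height))
print("five-number summary:", five_number_summary(height))
print("mean, s.d.:", statistics.mean(observed), statistics.stdev(observed))
print("1-way table:", dict(group_table))
print("2-way table:", dict(group_by_sex))
print("correlation:", pearson(height, weight))
```

Quartile conventions differ between packages, so Q1 and Q3 from this sketch may not match other software exactly. Dropping incomplete pairs before computing the correlation is only one possible choice, and whether it is reasonable depends on the type of missingness.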

Some More Sophisticated Tools for IDA

- Kernel Density Estimation
  - Smoothed function to estimate the density function.
  - Amount of smoothing controlled by the bandwidth.
  - Non-parametric (that is, doesn’t make an assumption about the distribution).
- LOWESS (LOESS): Locally Weighted Scatterplot Smoothing
  - Idea: fit a simple polynomial using regression on small ranges of the independent variable, and smoothly join up the pieces.
  - Amount of smoothing controlled by a smoothing parameter.

For Thursday:

- Hand in your meeting summary to your TA advisor.
- Be ready for a discussion about a plan for the project:
  - Data cleaning / IDA
  - What methods of analysis might be appropriate.

For next class (Tuesday, February 1)

Read Chapters 11 and 12 in Chatfield. Bring the text to class.
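
As an illustration of the two smoothing tools named on the "Some More Sophisticated Tools for IDA" slide, here is a rough sketch in plain Python. It rests on several assumptions that do not come from the slides: a Gaussian kernel for the density estimate, a local straight-line fit with tricube weights for LOWESS, and no robustness iterations. The function names `gaussian_kde` and `lowess_at` and the parameter `frac` are hypothetical.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function x -> estimated density at x (Gaussian kernel assumed)."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


def lowess_at(x0, xs, ys, frac=0.5):
    """Locally weighted straight-line fit evaluated at x0.

    frac is the smoothing parameter: the fraction of points in each local fit.
    Simplified sketch: tricube weights, no robustness iterations.
    """
    n = len(xs)
    k = min(n, max(3, math.ceil(frac * n)))
    # The distance to the k-th nearest point sets the width of the local window.
    h = sorted(abs(x - x0) for x in xs)[k - 1]
    near = [y for x, y in zip(xs, ys) if abs(x - x0) <= h]
    if h == 0:
        return sum(near) / len(near)
    w = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
    sw = sum(w)
    if sw == 0:
        return sum(near) / len(near)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    if sxx == 0:
        return my  # only one distinct x in the window: fall back to a mean
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    return my + (sxy / sxx) * (x0 - mx)


# Hypothetical data: a small bandwidth gives a bumpier density estimate,
# a large one gives a flatter estimate.
sample = [1.0, 2.0, 3.0]
narrow = gaussian_kde(sample, bandwidth=0.5)
wide = gaussian_kde(sample, bandwidth=2.0)
print(narrow(2.0), wide(2.0))

# Evaluating the local fit on a grid traces out the smoothed curve.
xs = [float(i) for i in range(10)]
ys = [2 * x + 1 for x in xs]
print([lowess_at(x, xs, ys, frac=0.5) for x in xs])
```

In this sketch a larger `bandwidth` or `frac` means more smoothing. Evaluating `lowess_at` at many nearby points is what joins the local fits into one smooth curve.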