STA 490H1S
Initial Examination of Data
Alison L. Gibbs
Department of Statistics
University of Toronto
Winter 2011
Gibbs
STA 490H1S
Course mantra
It’s OK not to know.
Expressing ignorance is encouraged.
It’s not OK to not have a willingness to learn.

Initial Examination of Data
Purpose:
- Understand the structure of the data.
  Types of variables:
  - Quantitative: continuous or discrete
  - Categorical: nominal, ordinal (e.g., Likert scales or binned quantitative), binary
- Check the quality of the data.
  - Find errors (data cleaning). Check for credibility, consistency, completeness.
  - Identify potential outliers.
  - Are there missing observations?
- Clear up any problems.
- Get ideas for more sophisticated analyses.
- Check whether the assumptions of more sophisticated analyses seem reasonable.
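
The quality checks listed above can be sketched in a few lines of code. The records, variable names, and allowed codes below are hypothetical, and the 1.5 * IQR fence is only one common rule of thumb for flagging potential outliers, not a method prescribed by the course.

```python
import statistics

# Hypothetical survey records; None marks a missing observation.
records = [
    {"age": 27, "sex": "F", "satisfaction": 4},
    {"age": 29, "sex": "M", "satisfaction": 5},
    {"age": 31, "sex": "F", "satisfaction": None},
    {"age": 34, "sex": "M", "satisfaction": 2},
    {"age": 36, "sex": "F", "satisfaction": 3},
    {"age": 38, "sex": "f", "satisfaction": 4},    # inconsistent category code
    {"age": 41, "sex": "M", "satisfaction": 9},    # outside the 1-5 Likert scale
    {"age": 45, "sex": "F", "satisfaction": 1},
    {"age": 290, "sex": "M", "satisfaction": 3},   # not credible, perhaps a typo
    {"age": None, "sex": "F", "satisfaction": 5},
]

def column(name):
    """All values recorded for one variable, including missing ones."""
    return [record[name] for record in records]

def missing_count(values):
    """Completeness check: how many observations are missing?"""
    return sum(value is None for value in values)

def invalid_codes(values, allowed):
    """Consistency check: observed values that are not legal codes."""
    return [v for v in values if v is not None and v not in allowed]

def iqr_outliers(values):
    """Flag potential outliers beyond the 1.5 * IQR fences (a rule of thumb)."""
    observed = sorted(v for v in values if v is not None)
    q1, _, q3 = statistics.quantiles(observed, n=4)
    spread = q3 - q1
    return [v for v in observed if v < q1 - 1.5 * spread or v > q3 + 1.5 * spread]

for name in ("age", "sex", "satisfaction"):
    print(name, "missing:", missing_count(column(name)))
print("bad sex codes:", invalid_codes(column("sex"), {"F", "M"}))
print("bad satisfaction scores:", invalid_codes(column("satisfaction"), {1, 2, 3, 4, 5}))
print("potential age outliers:", iqr_outliers(column("age")))
```

A flagged value is only a candidate for follow-up. Whether it is an error still has to be settled by going back to the data source.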

IDA
- Should be motivated by original research questions.
- Avoid data dredging. (Look long enough and you’ll find some meaningless pattern.)
- Trivial? Requires judgment and common sense.
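
A small simulation may make the data-dredging warning concrete. In this sketch every variable is pure noise, yet screening enough of them against the response still turns up correlations that look noteworthy. The sample size, number of variables, and seed are arbitrary assumptions. The cut-off 0.361 is approximately the two-sided 5% critical value of |r| for n = 30.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(490)       # fixed seed so the run is reproducible
n_obs, n_vars = 30, 200        # arbitrary choices for illustration

# A response and 200 candidate predictors, all independent standard normal noise.
response = [rng.gauss(0, 1) for _ in range(n_obs)]
correlations = [
    pearson(response, [rng.gauss(0, 1) for _ in range(n_obs)])
    for _ in range(n_vars)
]

strongest = max(correlations, key=abs)
n_flagged = sum(abs(r) > 0.361 for r in correlations)

print("strongest correlation found:", round(strongest, 3))
print("noise variables passing the 5% cut-off:", n_flagged, "of", n_vars)
```

About 5% of the noise variables are expected to pass the cut-off by chance alone, which is why the search should start from the research questions rather than from the patterns.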

Types of Missing Data
1. Missing Completely At Random (MCAR)
   The probability that a data value is missing does not depend on the missing value, nor on the values of any other variables.
2. Missing At Random (MAR)
   The probability that a data value is missing, conditional on the values of the other variables for the observation, is not related to the missing value.
3. Informative / Non-ignorable (NMAR)
   The probability that a data value is missing depends on the missing value itself, even after conditioning on the other variables. Difficult to deal with.
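
The three mechanisms can be compared in a simulation. This is a sketch under assumed settings: x is always observed, y = x + noise loses values, and the missingness probabilities (0.3, 0.6, 0.1) are made up for illustration.

```python
import random
import statistics

rng = random.Random(2011)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]       # fully observed variable
y = [xi + rng.gauss(0, 1) for xi in x]        # variable that will lose values

def delete(values, p_missing):
    """Replace each value by None with its own probability of being missing."""
    return [None if rng.random() < p else v for v, p in zip(values, p_missing)]

def observed_mean(values):
    """Mean of the values that were not deleted."""
    return statistics.fmean([v for v in values if v is not None])

y_mcar = delete(y, [0.3] * n)                              # ignores x and y
y_mar = delete(y, [0.6 if xi > 0 else 0.1 for xi in x])    # depends on observed x only
y_nmar = delete(y, [0.6 if yi > 0 else 0.1 for yi in y])   # depends on y itself

print("mean of y, nothing missing:", round(statistics.fmean(y), 3))
print("observed mean under MCAR:  ", round(observed_mean(y_mcar), 3))
print("observed mean under MAR:   ", round(observed_mean(y_mar), 3))
print("observed mean under NMAR:  ", round(observed_mean(y_nmar), 3))

# Under MAR, conditioning on x removes the distortion: within the stratum
# x > 0, the observed y values behave like a random sample of that stratum.
stratum_all = [yi for xi, yi in zip(x, y) if xi > 0]
stratum_seen = [yo for xi, yo in zip(x, y_mar) if xi > 0 and yo is not None]
print("stratum x > 0, all y vs observed y:",
      round(statistics.fmean(stratum_all), 3),
      round(statistics.fmean(stratum_seen), 3))
```

Under MCAR the observed mean stays close to the full-data mean. Under MAR the raw observed mean is pulled down because y is related to x, but the comparison within a stratum of x agrees. Under NMAR no observed variable explains the missingness, so this kind of adjustment is not available.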

Tools for IDA
- 5-number summary (for all data and for subsets).
- Other summary statistics, e.g., mean and s.d.
- Histograms / stem-and-leaf plots.
- Frequency tables (1- and 2-way) for categorical variables.
- Scatterplots.
- Correlations.
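
The numerical tools in this list can be sketched with the Python standard library. The measurements and group labels below are hypothetical. Note that quartile conventions differ between textbooks and software, so the quartiles reported by another package may not match these exactly.

```python
import math
import statistics
from collections import Counter

# Hypothetical measurements for nine subjects.
heights = [150, 155, 160, 162, 165, 168, 170, 175, 180]   # cm
weights = [52, 58, 57, 63, 66, 64, 72, 75, 80]            # kg
group = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]
smoker = ["no", "yes", "no", "no", "yes", "no", "no", "yes", "no"]

def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum.

    Uses the "inclusive" method of statistics.quantiles; other quartile
    conventions give slightly different answers.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# 5-number summary for all data and for subsets.
print("all heights:", five_number_summary(heights))
for g in sorted(set(group)):
    subset = [h for h, label in zip(heights, group) if label == g]
    print("group", g, "heights:", five_number_summary(subset))

# Other summary statistics.
print("mean:", statistics.mean(heights), "s.d.:", round(statistics.stdev(heights), 2))

# 1-way and 2-way frequency tables for categorical variables.
print("1-way table:", dict(Counter(group)))
print("2-way table:", dict(Counter(zip(group, smoker))))

# Correlation between two quantitative variables.
print("correlation(height, weight):", round(pearson(heights, weights), 3))
```

The summaries complement the plots rather than replace them: a single correlation coefficient, for example, says nothing about curvature or outliers that a scatterplot would show.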

Some More Sophisticated Tools for IDA
- Kernel Density Estimation
  - Smoothed function to estimate the density function.
  - Amount of smoothing controlled by the bandwidth.
  - Non-parametric (that is, doesn’t make an assumption about the distribution).
- LOWESS (LOESS): Locally Weighted Scatterplot Smoothing
  - Idea: fit a simple polynomial using regression on small ranges of the independent variable, and smoothly join up the pieces.
  - Amount of smoothing controlled by a smoothing parameter.
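
Both ideas can be sketched from scratch. The code below assumes a Gaussian kernel for the density estimate, and for the smoother it uses a straight-line local fit with tricube weights. It is a simplified illustration that omits the robustness iterations of full LOWESS; the function names, the data, and the parameter values are all hypothetical.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return the density estimate f(x) = (1 / (n h)) * sum of K((x - x_i) / h),
    where K is the standard normal density and h is the bandwidth."""
    scale = len(data) * bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / scale
    return density

def local_linear_smooth(xs, ys, x0, span):
    """Fitted value at x0 from a weighted straight-line fit to the nearest points.

    `span` is the fraction of the data used in each local fit (the smoothing
    parameter). Simplified: no robustness iterations as in full LOWESS.
    """
    k = max(2, math.ceil(span * len(xs)))
    nearest = sorted(zip(xs, ys), key=lambda point: abs(point[0] - x0))[:k]
    d_max = max(abs(x - x0) for x, _ in nearest) or 1.0
    # Tricube weights: close to 1 near x0, falling to 0 at distance d_max.
    weights = [(1 - (abs(x - x0) / d_max) ** 3) ** 3 for x, _ in nearest]
    total = sum(weights)
    if total == 0:  # every neighbour sits exactly at distance d_max
        return sum(y for _, y in nearest) / k
    mean_x = sum(w * x for w, (x, _) in zip(weights, nearest)) / total
    mean_y = sum(w * y for w, (_, y) in zip(weights, nearest)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, (x, _) in zip(weights, nearest))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, (x, y) in zip(weights, nearest))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (x0 - mean_x)

# Bandwidth controls the smoothing: a small h gives sharp bumps at the data,
# a large h spreads the same probability mass into one flat hump.
sample = [1.0, 2.0, 2.5, 4.0]            # hypothetical observations
narrow = gaussian_kde(sample, bandwidth=0.2)
wide = gaussian_kde(sample, bandwidth=2.0)
print("density at 2.0 with h = 0.2:", round(narrow(2.0), 3))
print("density at 2.0 with h = 2.0:", round(wide(2.0), 3))

# On exactly linear data the local fit recovers the line y = 2x + 1.
xs = [float(i) for i in range(10)]
ys = [2 * x + 1 for x in xs]
print("smoothed value at 2.5:", local_linear_smooth(xs, ys, 2.5, span=0.5))
```

Evaluating `local_linear_smooth` over a grid of x0 values traces out the smooth curve. A larger `span` uses more points in each local fit and so gives a smoother, less wiggly curve.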
Course mantra
It’s OK not to know.
Expressing ignorance is encouraged.
It’s not OK to not have a willingness to learn.

For Thursday:
- Hand in your meeting summary to your TA advisor.
- Be ready for a discussion about a plan for the project:
  - Data cleaning / IDA
  - What methods of analysis might be appropriate.
For next class (Tuesday, February 1)
Read Chapters 11 and 12 in Chatfield.
Bring the text to class.