MATH-138
Elementary Statistics
Emily C. Francis
Howard Community College
Unit 1 Lecture Slides
What is Statistics?
Statistics is the science of:
Collecting data
Analyzing data
Drawing conclusions and making decisions
as a result of the data analysis. This is
referred to as “statistical inference”.
What is a “Statistic”?
A statistic is a function of the data
Data -> [function] -> statistic
For example, suppose we have a data set of
height of students. Taking the average, or
mean, of heights is a function. Thus the
mean height is a statistic of the data.
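As a minimal sketch of the "Data -> [function] -> statistic" idea, the snippet below treats the mean as an ordinary function applied to a data set. The height values are made up for illustration and do not come from the course.

```python
# Hypothetical student heights in inches (made-up values for illustration).
heights = [62, 65, 67, 70, 71]

def mean(data):
    """A statistic: a function that maps a data set to a single number."""
    return sum(data) / len(data)

mean_height = mean(heights)   # 335 / 5
print(mean_height)            # 67.0
```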
Phases in Statistical Analysis
Data Collection: The process of collecting
data (samples) via surveys, observational
studies, and/or designed experiments
Data analysis: Graphing and
summarizing key features of the data to
discover major patterns in the data
Statistical Inference: Drawing inferences
(conclusions) and making decisions based
on the data
Population vs. Sample
For a given statistical inquiry:
The population consists of all items of
interest (people, places, companies, etc.)
A sample is a (hopefully representative and
random) subset of the population
A numerical value/characteristic of a
population is called a parameter. These
are usually unknown.
A numerical value/characteristic of a
sample is called a statistic
Components of a Data Set
Cases: people, places, companies,
colleges, etc.
Variables: characteristics/measurements
of each individual case
Variable Types
Categorical variables
Have values that are described by words
Represent categories
Can be represented with #’s (the actual #’s
assigned are irrelevant). The #’s have no
units and no mathematical operations can be
performed on these #’s.
Quantitative variables
Have numerical values and units
Displaying & Describing
Categorical Data
No mathematical operations can be
performed on categorical data
Categorical data can only be counted and
then described/displayed using:
Frequency (and relative frequency) tables
Bar charts
Pie charts
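Because categorical data can only be counted, a frequency table is just a tally per category. The sketch below builds frequency and relative frequency tables with Python's standard library; the class-standing data are hypothetical.

```python
from collections import Counter

# Hypothetical categorical data: class standing of eight students.
standing = ["Freshman", "Sophomore", "Freshman", "Junior",
            "Freshman", "Sophomore", "Senior", "Freshman"]

frequency = Counter(standing)          # count of cases in each category
n = len(standing)
relative_frequency = {category: count / n
                      for category, count in frequency.items()}

# Print the table, most common category first.
for category, count in frequency.most_common():
    print(f"{category:<10} {count:>3} {relative_frequency[category]:>6.3f}")
```

The relative frequencies always add up to 1, which is a quick check on the table.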
Contingency Tables
A contingency table shows how cases
are distributed along each variable,
contingent on the value of another variable
Marginal and conditional distributions
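One possible way to tabulate two categorical variables at once is sketched below. The variables (class standing and car ownership) and all counts are hypothetical. Marginal distributions come from the row and column totals; a conditional distribution restricts attention to one row.

```python
from collections import Counter

# Hypothetical paired categorical observations: (class standing, owns a car).
cases = [("Freshman", "yes"), ("Freshman", "no"), ("Freshman", "no"),
         ("Senior", "yes"), ("Senior", "yes"), ("Senior", "no")]

cell_counts = Counter(cases)                          # body of the contingency table
row_totals = Counter(group for group, _ in cases)     # marginal counts for standing
col_totals = Counter(car for _, car in cases)         # marginal counts for car ownership

# Conditional distribution of car ownership, given standing == "Senior".
senior_conditional = {car: cell_counts[("Senior", car)] / row_totals["Senior"]
                      for car in col_totals}
print(senior_conditional)
```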
Displaying & Summarizing
Quantitative Data
A frequency (& relative frequency)
distribution is an excellent initial data analysis
tool
A histogram is a visual representation of a
frequency distribution. A relative frequency
histogram is a visual representation of a relative
frequency distribution
Dotplot
Describing a Distribution
Shape
Center
Spread
Distribution Shape
“Modality”
Symmetry
Outliers
Measures of Center
Median
Mean
Median
The median of a variable is the midpoint of the
sorted data values
For odd n, the median equals the middle data
value
For even n, the median equals the average of
the middle two values
Is useful when the variable of interest has a
skewed distribution and/or has outliers (it is not
sensitive to these outliers)
Does not have to be a data value
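The odd-n and even-n rules above can be sketched as a short function. The example values are made up; the last call shows that a single extreme value barely moves the median.

```python
def median(values):
    """Return the midpoint of the sorted data values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                            # odd n: the middle value
    return (ordered[middle - 1] + ordered[middle]) / 2    # even n: average of the middle two

odd_example = median([7, 1, 5])             # sorted: 1, 5, 7
even_example = median([7, 1, 5, 3])         # sorted: 1, 3, 5, 7
outlier_example = median([7, 1, 5, 3000])   # sorted: 1, 5, 7, 3000
print(odd_example, even_example, outlier_example)   # 5 4.0 6.0
```

Note that 4.0 and 6.0 are not data values, which matches the last point on the slide.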
Mean
The mean is the sum of all the data values
divided by the # of data values:
$\bar{Y} = \frac{\sum y}{n}$
Treats all values equally and can therefore be
influenced by outliers
Does not have to be a data value
Deviations of the data values from the mean
always sum to zero
Is useful when the variable of interest is
symmetric with no outliers
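Two of the properties above can be checked numerically. This is a small sketch on made-up values: the deviations from the mean sum to zero, and adding one outlier pulls the mean far from the bulk of the data.

```python
data = [2, 4, 6, 8]                               # hypothetical values
mean = sum(data) / len(data)                      # 20 / 4 = 5.0

deviations = [y - mean for y in data]             # -3.0, -1.0, 1.0, 3.0
total_deviation = sum(deviations)                 # sums to zero

with_outlier = data + [100]                       # add a single extreme value
mean_with_outlier = sum(with_outlier) / len(with_outlier)   # 120 / 5 = 24.0

print(mean, total_deviation, mean_with_outlier)   # 5.0 0.0 24.0
```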
Distribution Shape (Contd.)
Symmetric data:
Mean is approx. equal to the median
Tails of the distribution are balanced
Skewed left data:
Mean<Median
Long tail of distribution “points” left
A few low values, but most data on right
Skewed right data:
Median<Mean
Long tail of distribution “points” right
A few high values, but most data on left
Five-Number Summary
Max
Q3
Median
Q1
Min
Measures of Spread
Range
Interquartile range (IQR)
Variance
Standard deviation
Range & IQR
The range is the difference between the
maximum and minimum data values
IQR = Q3 – Q1
The IQR is useful when the variable of
interest has a skewed distribution and/or
has outliers (it is not sensitive to these
outliers)
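The sketch below compares the range and the IQR on made-up data, with and without an outlier. Quartile conventions differ between textbooks and calculators; this example assumes Q1 and Q3 are the medians of the lower and upper halves of the sorted data, leaving out the middle value when n is odd. That convention is an assumption here, not something the slides specify.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                # values below the median position
    upper = ordered[half + (n % 2):]      # values above the median position
    return median(lower), median(upper)

def data_range(values):
    """Maximum minus minimum."""
    return max(values) - min(values)

def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1

data = [1, 3, 5, 7, 9, 11, 13]
with_outlier = [1, 3, 5, 7, 9, 11, 130]   # largest value replaced by an outlier

print(data_range(data), iqr(data))                    # 12 8
print(data_range(with_outlier), iqr(with_outlier))    # 129 8
```

The range jumps from 12 to 129, while the IQR stays at 8.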
Variance
The variance is basically the average of
the squared deviations from the mean:
$s^2 = \frac{\sum (y - \bar{Y})^2}{n - 1}$
The units of this statistic are in squared
units of the original data values
Standard Deviation
The SD is the square root of the variance:
$s = \sqrt{\frac{\sum (y - \bar{Y})^2}{n - 1}}$
Is a single # that helps us understand how
spread out the data is
Units of measurement are the same as the
original data
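A step-by-step sketch of the variance and standard deviation calculations on made-up values follows. It divides by n - 1, as in the formulas above, and compares the results with Python's `statistics.variance` and `statistics.stdev`, which also use n - 1.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical values
n = len(data)
mean = sum(data) / n                       # 40 / 8 = 5.0

squared_deviations = [(y - mean) ** 2 for y in data]   # 9, 1, 1, 1, 0, 0, 4, 16
variance = sum(squared_deviations) / (n - 1)           # 32 / 7, in squared units
sd = math.sqrt(variance)                               # back in the original units

print(variance, statistics.variance(data))
print(sd, statistics.stdev(data))

# If every data value is equal, there is no variation.
constant_sd = statistics.stdev([3, 3, 3])
```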
Standard Deviation (Contd.)
The standard deviation (and variance)
statistics are never negative
If every data value is equal, then there is
no variation, and hence SD=Var=0
Is useful when the variable of interest is
symmetric with no outliers
Boxplots
A Boxplot is a graphical display of the
five-number summary
The procedure to construct a boxplot can
be found on pgs. 90-91 of the text
Standardized Variables: Z Scores
To “standardize” a variable, calculate each
observation’s distance from the mean in
units of the standard deviation. That is,
define variable Z as:
y Y
z
s
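Standardizing can be sketched in a few lines. The heights below are hypothetical; each z score is the observation's distance from the mean, measured in standard deviations.

```python
import statistics

heights = [62, 65, 67, 70, 71]             # hypothetical heights in inches
mean = statistics.mean(heights)            # 67
sd = statistics.stdev(heights)             # sample SD (divides by n - 1)

z_scores = [(y - mean) / sd for y in heights]
print([round(z, 2) for z in z_scores])
```

A value equal to the mean gets z = 0, values below the mean get negative z scores, and the z scores themselves sum to zero.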
Normal Models
A normal model:
Is symmetric and “bell” shaped
Is commonly used to model many things in
the business and physical worlds
Is defined by 2 parameters, μ (the mean)
and σ (the standard deviation)
Its distribution peaks at μ
A normal distribution with mean=0 and std.
dev.=1 is called “standard”
The 68-95-99.7 Rule
For data from a NORMAL model:
~68% will lie within 1 std. dev. of the mean
~95% will lie within 2 std. dev’s of the mean
~99.7% (virtually all the data) will lie within 3
std. dev’s of the mean
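The three percentages in the rule can be reproduced from the standard normal model. This sketch uses `statistics.NormalDist` from Python's standard library; the rule itself is the rounded version of these values.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of the model lying within k standard deviations of the mean.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, proportion in within.items():
    print(f"within {k} SD: {proportion:.4f}")
```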
Normalcdf & Invnorm
If you are given a value (or values) and you want a
percentage under the normal model, you
use “normalcdf” on your calculator:
normalcdf(left value, right value, mean, std. dev.)
If you are given a percentage under the
normal model and you want a value, you
use “invnorm” on your calculator:
invnorm(percentage, mean, std. dev.)
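For readers without the calculator, the two functions can be approximated in Python. This is a sketch, not the calculator's implementation: the function names simply mirror the calculator keys, and it assumes that the percentage passed to invnorm is the area to the LEFT of the value, entered as a proportion (for example 0.975). The model with mean 100 and std. dev. 15 is hypothetical.

```python
from statistics import NormalDist

def normalcdf(left, right, mean=0.0, sd=1.0):
    """Proportion of the normal model between left and right."""
    model = NormalDist(mean, sd)
    return model.cdf(right) - model.cdf(left)

def invnorm(area_to_left, mean=0.0, sd=1.0):
    """Value with the given proportion of the model to its left (assumed convention)."""
    return NormalDist(mean, sd).inv_cdf(area_to_left)

# Hypothetical model: mean 100, std. dev. 15.
share_between = normalcdf(85, 115, 100, 15)   # within 1 SD of the mean
cutoff = invnorm(0.975)                       # standard normal, about 1.96
middle = invnorm(0.5, 100, 15)                # half the model lies below the mean

print(round(share_between, 4), round(cutoff, 2), middle)
```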
Scatter Plots
A scatter plot shows n pairs of bivariate
data observations on an X-Y graph
A scatter plot is usually the starting point
for bivariate data analysis
We create scatter plots to investigate the
relationship between two variables:
Direction
Form
Strength
Correlation
In our discussion of correlation (and
regression), we will be talking about paired
sample data
A correlation exists between 2 variables when
one of them is related to the other in some way
The linear correlation coefficient, r, measures
the strength of the LINEAR relationship between
two variables
Before you calculate r, the following should hold:
Quantitative variables condition
“Straight Enough” condition
Outlier condition
Correlation Properties
The value of r is always between -1 and 1,
inclusive. That is, -1<=r<=1.
The value of r is not affected by which variable
is labeled x and which is labeled y
r measures the strength of a linear relationship.
It is not designed to measure the strength of a
relationship that is not linear.
Correlation is sensitive to outliers
Correlation does not imply causality!
Correlation does not measure slope
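Several of these properties can be checked numerically. The slides do not give a formula for r, so this sketch uses the standard Pearson formula; all data values are made up.

```python
import math

def correlation(xs, ys):
    """Linear correlation coefficient r for paired sample data (Pearson formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_xy = correlation(x, y)
r_yx = correlation(y, x)                             # swap the roles of x and y
r_steeper = correlation(x, [10 * v for v in y])      # slope is 10 times larger
r_curved = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])   # y = x squared

print(round(r_xy, 4), round(r_yx, 4), round(r_steeper, 4), r_curved)
```

Swapping x and y or rescaling y leaves r unchanged, while the perfectly curved relationship y = x² over a symmetric range gives r = 0 even though the variables are clearly related.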
Regression
If 2 variables have a “significant” linear
correlation, it is appropriate to estimate their
exact linear relationship – regression does this
A regression estimates a and b so that the linear
relationship between x and y can be expressed
as:
ŷ = ax + b
Note that ŷ is the PREDICTED value of y – thus,
you can use this equation to predict values of y
for given values of x (though not all values of x)
The residual for any data point is: y − ŷ
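The slides do not say how a and b are estimated. The sketch below assumes the usual least-squares fit, with a as the slope and b as the intercept, and then computes predicted values and residuals on made-up data.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y-hat = a*x + b (assumed method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx                 # slope
    b = mean_y - a * mean_x       # intercept
    return a, b

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)                                        # about 0.6 and 2.2
predicted = [a * xi + b for xi in x]                         # y-hat for each x
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]    # y minus y-hat

print(round(a, 3), round(b, 3))
print([round(e, 2) for e in residuals])
```

With a least-squares line that includes an intercept, the residuals sum to zero (up to rounding).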
Regression (Contd.)
When predicting a value of y based on
some given value of x, do the following:
If there is NOT a linear correlation, the best
predicted y-value is the sample average of y
If there IS a linear correlation, the best
predicted y-value is found using the
regression equation
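The two-case rule above can be sketched as a single function. The slides do not say how to decide whether a linear correlation is "significant", so the flag `has_linear_correlation` is a hypothetical, caller-supplied input. The fit assumes least squares, and the data are made up.

```python
def predict_y(x_new, xs, ys, has_linear_correlation):
    """Best predicted y for x_new, following the two-case rule.

    has_linear_correlation is a hypothetical flag supplied by the caller.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    if not has_linear_correlation:
        return mean_y                      # no linear correlation: sample average of y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx                          # slope (least-squares assumption)
    b = mean_y - a * mean_x                # intercept
    return a * x_new + b                   # regression equation

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

with_correlation = predict_y(3.5, x, y, has_linear_correlation=True)
without_correlation = predict_y(3.5, x, y, has_linear_correlation=False)
print(round(with_correlation, 2), without_correlation)   # 4.3 4.0
```

The example predicts at x = 3.5, which lies inside the range of the observed x values, in keeping with the earlier caution that the equation should not be used for all values of x.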