Learning the Language of
the Statistician
• The following slides contain many of the symbols we
will be using in this class. These are the symbols we
will be using in formulas. While I do not require you
to memorize all of the formulas, it is important that
you know what these symbols mean. You will be
expected to memorize a few of the simpler formulas
for the departmental final.
• To do responsible research, you must assimilate,
integrate and apply. This PowerPoint presentation
concentrates on assimilating this basic information.
                          Sample      Sampling Distribution        Population
--------------------------------------------------------------------------------------------------------
Individual Score          yi                                       yi
Sample Size               n                                        N
Mean                      ȳ                                        µ (Mu)
Standard Deviation        s = √s²     σ/√n, estimated by s/√n      σ (Sigma)
Variance                  s²                                       σ²
Sum                       ∑                                        ∑
Proportion                p                                        π (Pi)
Hypothesized Mean         ȳ0                                       µ0
Hypothesized Proportion   p0                                       π0
Stating Hypotheses with Symbols
• One Sample Hypothesis Test for a Proportion
o Null hypothesis
• p = π The sample proportion is the same as the population proportion.
o Research hypothesis
• p ≠ π The sample proportion is NOT the same as the population
proportion. If you have a theory, you can use a one-tailed test and
indicate that it is greater or less than the population proportion.
• One Sample Hypothesis Test for a Mean
o Null hypothesis
• ȳ = µ The sample mean is the same as the population mean.
o Research hypothesis
• ȳ ≠ µ The sample mean is not the same as the population mean. If you
have a theory, you can use a one-tailed test and indicate that it is greater
or less than the population mean.
Stating Hypotheses with Symbols
• Chi Square
o Null hypothesis
• H0: E = O. The expected values equal the observed values.
• The dependent variable is NOT contingent on the independent variable
in the population.
o Research hypothesis
• H1: E ≠ O. The expected values do not equal the observed values.
• The dependent variable is contingent on the independent
variable in the population.
NOTE: For an Elaborated Chi Square you simply state that E = O for all of the
independent/dependent combinations for the null hypothesis. For the
research hypothesis you state that E ≠ O for at least one of the combinations.
You would actually test each dependent/independent combination
separately.
Stating Hypotheses with Symbols
• One-Way Anova - with 2 groups
o Null hypothesis
• H0: µ1 = µ2. The Means are equal, OR the Mean of Group 1 is the same as
the Mean of Group 2 in the population.
o Research hypothesis
• Two Tailed (the one the computer uses)
• H1: µ1 ≠ µ2. The Means are not equal, OR the Mean of Group 1 is not the
same as the Mean of Group 2 in the population.
• One Tailed (state a direction)
• H1: µ1 < µ2, or µ1 > µ2. The Mean of Group 1 is lower than the Mean of
Group 2 in the population, or the Mean of Group 1 is higher than the
Mean of Group 2 in the population.
Stating Hypotheses with Symbols
• One-Way Anova - with more than 2 groups*
o Null hypothesis
• H0: µ1 = µ2 = … = µk. The Means of all the groups are equal.
o Research hypothesis
• Two Tailed (the one the computer uses)
• H1: µ1 ≠ µ2 … µk. The Means are not all equal. The Mean of at least one
group is not equal to the Mean of at least one other group.
o * This is still bi-variate. You don't have more variables, only more categories
in the categorical variable.
Stating Hypotheses with Symbols
• Bi-Variate Regression
o Null hypothesis
• H0: β1 = 0. The regression slope is not different from 0 in the population.
• There is no relationship between the independent and dependent variables
in the population.
o Research hypothesis
• H1: β1 ≠ 0. The slope is different from 0 in the population.
• There is a relationship between the independent and dependent variables in
the population.
• Multi-Variate Regression
o Null hypothesis
• H0: β1 = … = βk = 0. None of the regression slopes is different from 0 in the population.
• There is no relationship between the independent variables and the dependent
variable in the population.
o Research hypothesis
• H1: at least one of β1 … βk ≠ 0. At least one of the slopes is different from 0 in the population.
• There is a relationship between the dependent variable and at least one
of the independent variables in the population.
Matching Variables with Types of Analysis
• Chi-square (2 categorical variables)
o type of car you drive by gender
o race by political preference
o race by eye color
o gender by YES/NO questions
• Anova (1 categorical and 1 continuous variable)
o gender by yearly income
o gender by score on self esteem index
o race by yearly income
o political preference by yearly income
o age by whether or not you have children
• Bi-Variate Regression (2 continuous variables)
o yearly income by years of education
o years married by marital satisfaction (scale score)
o age by number of children
• Multiple Regression (continuous/dummy independent and continuous dependent)
o number of dates per year by yearly income, age, height, gender (dummy variable)
o poverty rates by sex ratio, percent single headed household, percent employed
Statistics That Do Not Use Hypotheses
• Confidence Intervals
o We generally do not state a hypothesis for a Confidence Interval.
Confidence Intervals are used to estimate a population mean or
proportion based on a sample mean or proportion. Opinion polls
use Confidence Intervals to predict election results, etc.
• Pearson Correlation (correlation coefficient or r)
o We generally do not associate Pearson Correlation Matrices with
hypotheses. We generally use Pearson Correlation Matrices for
diagnostic purposes and to assess the strength of bi-variate relationships.
Equations/Formulas Z Tests
• Z scores
o $Z = \frac{y_i - \mu}{\sigma}$
o Where yi = individual's score
o µ = population mean
o σ = population standard deviation
o Information needed
• Population mean and standard deviation
o Example of when we would use this
• If you knew an individual's SAT/ACT score, you could
determine what percentile they scored in (e.g., the 95th percentile)
• OR if you know what percentile they are in, you can
determine their score.
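The two uses of the Z formula described above can be sketched in a few lines of Python. This is only an illustration: it assumes the scores are normally distributed, and the population mean of 500 and standard deviation of 100 are made-up values, not figures from any real test.

```python
from statistics import NormalDist


def z_score(y_i, mu, sigma):
    """Z = (y_i - mu) / sigma."""
    return (y_i - mu) / sigma


def percentile_from_score(y_i, mu, sigma):
    """Percent of a normal population scoring at or below y_i."""
    return NormalDist().cdf(z_score(y_i, mu, sigma)) * 100


def score_from_percentile(percentile, mu, sigma):
    """Invert the Z formula: y_i = mu + Z * sigma."""
    z = NormalDist().inv_cdf(percentile / 100)
    return mu + z * sigma


# Hypothetical population: mean 500, standard deviation 100.
z = z_score(650, mu=500, sigma=100)
pct = percentile_from_score(650, mu=500, sigma=100)
score_95 = score_from_percentile(95, mu=500, sigma=100)
print(f"Z = {z:.2f}, percentile = {pct:.1f}, 95th percentile score = {score_95:.1f}")
```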
Equations for Inferential Statistics
• Summary Statistics
o Mean
• $\bar{y} = \frac{\sum y_i}{n}$
o Median
• Order the values and count up to position $\frac{n+1}{2}$
o Variance
• $s^2 = \frac{\sum (y_i - \bar{y})^2}{n-1}$
o Standard Deviation
• $s = \sqrt{s^2}$
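As a rough check on the summary formulas, here is a small Python sketch that computes each one by hand. The scores are invented for illustration, and averaging the two middle values when (n+1)/2 is not a whole number is the usual convention, assumed here rather than stated on the slide.

```python
from math import sqrt


def mean(ys):
    """y-bar = sum of the scores divided by n."""
    return sum(ys) / len(ys)


def median(ys):
    """Order the values and count up to position (n + 1) / 2."""
    ordered = sorted(ys)
    position = (len(ordered) + 1) / 2      # 1-based position
    lower = ordered[int(position) - 1]
    if position == int(position):
        return lower
    # Assumed convention: average the two values around a half position.
    return (lower + ordered[int(position)]) / 2


def variance(ys):
    """s^2 = sum of squared deviations from the mean, divided by n - 1."""
    y_bar = mean(ys)
    return sum((y - y_bar) ** 2 for y in ys) / (len(ys) - 1)


def standard_deviation(ys):
    """s = square root of the variance."""
    return sqrt(variance(ys))


scores = [2, 4, 4, 5, 7, 8]   # made-up data
print(mean(scores), median(scores), variance(scores), standard_deviation(scores))
```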
Inferring a Population Mean or Proportion Based on
Sample Mean or Proportion
• The following slides focus on how to estimate a
population mean or proportion if we ONLY have a
random sample.
• In these cases we estimate one point in the
population (e.g., the mean IQ of USU students)
• BUT we build a confidence interval around this
single point, generally a 95% confidence interval
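The slides do not spell out the interval formula, so the sketch below assumes the usual large-sample form: the point estimate plus or minus 1.96 standard errors for a 95% interval. The function names and the IQ scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_confidence_interval(ys, z=1.96):
    """Approximate 95% interval: y-bar +/- z * s / sqrt(n) (assumed form)."""
    standard_error = stdev(ys) / sqrt(len(ys))
    y_bar = mean(ys)
    return y_bar - z * standard_error, y_bar + z * standard_error


def proportion_confidence_interval(p, n, z=1.96):
    """Approximate 95% interval: p +/- z * sqrt(p * (1 - p) / n) (assumed form)."""
    standard_error = sqrt(p * (1 - p) / n)
    return p - z * standard_error, p + z * standard_error


iq_sample = [100, 110, 90, 105, 95]           # made-up sample
mean_low, mean_high = mean_confidence_interval(iq_sample)
poll_low, poll_high = proportion_confidence_interval(p=0.5, n=400)
print(f"mean: {mean_low:.1f} to {mean_high:.1f}")
print(f"proportion: {poll_low:.3f} to {poll_high:.3f}")
```

With only five cases a t multiplier would normally replace 1.96; the tiny sample is used here only to keep the example readable.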
A One or Large Sample Hypothesis Test
• In the following slides we compare a sample mean
or proportion with a population mean or proportion.
• We want to know if our sample mean or proportion
is different from the population mean or proportion
• The population mean or proportion could actually
be a mean/proportion that is specified by a theory
or by past research (rather than a number
computed from a population data set)
Equations/Formulas for One Sample Hypotheses Tests
• The equations are outlined in red
• What do the symbols mean
• One sample hypothesis test for Proportion
o p = proportion in the sample
o π0 = proportion or hypothesized proportion in the population
o n = sample size
o Z = computed statistic
• One sample hypothesis test for Mean
o ȳ = mean in the sample
o µ0 = mean or hypothesized mean in the population
o n = sample size
o sȳ = standard error, an estimate of the standard deviation of the sampling distribution
o s/√n = computation for estimating the standard error: the standard deviation
of the sample divided by the square root of the sample size
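The outlined equations themselves do not appear in this transcript. The Python sketch below assumes the standard textbook forms that match the symbols listed: Z = (p - π0) / √(π0(1 - π0)/n) for a proportion, and (ȳ - µ0) / (s/√n) for a mean. The function names and the example numbers are hypothetical.

```python
from math import sqrt


def one_sample_z_for_proportion(p, pi_0, n):
    """Assumed form: Z = (p - pi_0) / sqrt(pi_0 * (1 - pi_0) / n)."""
    standard_error = sqrt(pi_0 * (1 - pi_0) / n)
    return (p - pi_0) / standard_error


def one_sample_t_for_mean(y_bar, mu_0, s, n):
    """Assumed form: (y_bar - mu_0) / (s / sqrt(n)), using the standard error s / sqrt(n)."""
    standard_error = s / sqrt(n)
    return (y_bar - mu_0) / standard_error


# Made-up example: 60% of 100 sampled cases versus a hypothesized 50%.
z_stat = one_sample_z_for_proportion(p=0.6, pi_0=0.5, n=100)
# Made-up example: sample mean 105 (s = 15, n = 36) versus a hypothesized mean of 100.
t_stat = one_sample_t_for_mean(y_bar=105, mu_0=100, s=15, n=36)
print(f"Z = {z_stat:.2f}, t = {t_stat:.2f}")
```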
Symbols for Statistics that Infer the Relationship in the
Sample to the Population
Statistic     Symbol(s)    Interpretation
--------------------------------------------------------------------------------------------------------
Chi Square    X²           Chi Square statistic
Regression    β            beta: slope in population
              b            slope in sample
              a            alpha: intercept or constant in prediction formula
              X1…Xk        value of the X variables
              Ŷ            y-hat or predicted Y
              ȳ            Y bar or the mean of Y
Anova         µ            Mu or mean in population
              yi - ȳ       deviation of an individual score from the mean
Chi-Square Equation
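The equation for this slide is missing from the transcript. A minimal Python sketch, assuming the standard Pearson form X² = ∑(O - E)²/E with each expected count E computed as row total times column total divided by the grand total, looks like this. The counts in the table are invented.

```python
def chi_square(observed):
    """Return (X2, degrees of freedom) for a table of observed counts.

    Assumes the standard form: sum over cells of (O - E)^2 / E,
    where E = row total * column total / grand total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Made-up 2 x 2 table, e.g. gender by a YES/NO question.
x2, df = chi_square([[10, 20], [20, 10]])
print(f"X2 = {x2:.3f} with {df} degree(s) of freedom")
```

When every observed count equals its expected count (E = O in every cell), the statistic is 0, which is the situation the null hypothesis describes.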
Equations/Formulas for Inferential Statistics
• Pearson Correlation Coefficient and R2
o Formula
• $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
• R2 = r squared
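The correlation formula can be followed term by term in a short Python sketch. The paired values are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """r = sum of cross-products of deviations / sqrt(product of the sums of squared deviations)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    cross_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return cross_products / sqrt(ss_x * ss_y)


xs = [1, 2, 3, 4, 5]   # made-up X values
ys = [2, 4, 5, 4, 5]   # made-up Y values
r = pearson_r(xs, ys)
r_squared = r ** 2
print(f"r = {r:.4f}, R2 = {r_squared:.2f}")
```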
• Multiple Regression
o Prediction Equation
• Ŷ = a + b1x1 + b2x2 + b3x3 + …
• Ŷ = predicted score for the dependent variable
• a = intercept or constant
• b = slope or parameter estimate for an independent variable: the change in Y
for every 1-unit increase in X
• x = value of the X variables, taken from the codebook
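Plugging values into the prediction equation is a one-line sum. In the Python sketch below the intercept, slopes, and X values are hypothetical numbers chosen for easy arithmetic; in practice they would come from regression output and the codebook.

```python
def predict(a, slopes, x_values):
    """Y-hat = a + b1*x1 + b2*x2 + ... for one case."""
    if len(slopes) != len(x_values):
        raise ValueError("need one X value per slope")
    return a + sum(b * x for b, x in zip(slopes, x_values))


# Hypothetical estimates: intercept 10 and slopes for three independent variables.
a = 10
slopes = [2, 0.5, -3]
x_values = [4, 30, 1]      # one case's values on x1, x2, x3 (x3 is a 0/1 dummy)
y_hat = predict(a, slopes, x_values)
print(f"predicted Y = {y_hat}")
```

Raising x1 by one unit while holding the others fixed changes the prediction by exactly b1, which is the interpretation of a slope given above.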
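Equations/Formulas for Inferential Statistics
• Anova
o Formula
• $TSS = \sum y_{ij}^2 - \frac{G^2}{n}$
• $SSB = \sum \frac{T_i^2}{n_i} - \frac{G^2}{n}$
• $SSW = TSS - SSB$
• $s_B^2 = \frac{SSB}{k-1}$
• $s_W^2 = \frac{SSW}{n-k}$
• $F = \frac{s_B^2}{s_W^2}$ (the F statistic)
o What symbols mean
• TSS = Total Sum of Squares
• SSB = Sum of Squares Between
• SSW = Sum of Squares Within
• G = grand total of all scores
• Ti = total of the scores in group i
• ni = number of cases in group i
• n = total number of cases
• k = number of groups
o df between = k-1
o df within = n-k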
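The Anova computations can be traced step by step in Python. This sketch takes G to be the grand total of all scores and each group total to be the sum of that group's scores; the three groups are made-up data.

```python
def one_way_anova(groups):
    """Compute TSS, SSB, SSW and F from raw scores using the totals formulas."""
    n = sum(len(g) for g in groups)               # total number of cases
    k = len(groups)                               # number of groups
    grand_total = sum(sum(g) for g in groups)     # G
    correction = grand_total ** 2 / n             # G^2 / n
    tss = sum(y ** 2 for g in groups for y in g) - correction
    ssb = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ssw = tss - ssb
    s2_between = ssb / (k - 1)
    s2_within = ssw / (n - k)
    return {
        "TSS": tss,
        "SSB": ssb,
        "SSW": ssw,
        "F": s2_between / s2_within,
        "df_between": k - 1,
        "df_within": n - k,
        "R2": ssb / tss,    # share of TSS explained by group membership
    }


result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])   # made-up scores
print(result)
```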
Anova and Regression
Sums of Squares
• Anova
o TSS = Total Sum of Squares
o SSW = Sum of Squares within each group
o SSB = Sum of Squares between the groups
o SSB/TSS = R square, or the proportion of the total sum of squares that is
explained by group membership
• Regression
o TSS = Total Sum of Squares
o SSM = Sum of Squares Model
o SSE = Sum of Squares Error
Equations/Formulas for Inferential Statistics
• Two Sample T-test
o Formula
• $t = \frac{\bar{y}_1 - \bar{y}_2}{s_{\bar{y}_1 - \bar{y}_2}}$
• The denominator is the estimated standard error of the difference between
the two means. It is computed as follows:
$s_{\bar{y}_1 - \bar{y}_2} = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
• Pooled standard deviation:
$s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}$
o What symbols mean
• t = computed test statistic (compared with the critical value)
• ȳ1 = mean of sample one
• ȳ2 = mean of sample two
• s1 = standard deviation of sample 1 and s2 = standard deviation of sample 2
• n1 = size of sample 1 and n2 = size of sample 2
• Degrees of freedom = df = n1 + n2 - 2
o Uses a T distribution
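The pooled two-sample t computation can be written out in Python as below. The two samples are invented and deliberately small; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def two_sample_t(sample_1, sample_2):
    """Return (t, df) using the pooled standard deviation."""
    n1, n2 = len(sample_1), len(sample_2)
    s1, s2 = stdev(sample_1), stdev(sample_2)
    # Pooled standard deviation s_p.
    s_p = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    # Estimated standard error of the difference between the two means.
    se_difference = s_p * sqrt(1 / n1 + 1 / n2)
    t = (mean(sample_1) - mean(sample_2)) / se_difference
    return t, n1 + n2 - 2


t_stat, df = two_sample_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])   # made-up data
print(f"t = {t_stat:.2f} with df = {df}")
```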
Equations/Formulas for Inferential Statistics
• Mann Whitney
o Focuses on ranks (medians) rather than on means
o Two Groups
o Formula
• $Z = \frac{T_1 - E(T_1)}{\sqrt{Var(T_1)}}$
• $E(T_1) = \frac{n_1 (n+1)}{2}$
• $Var(T_1) = \frac{n_1 n_2}{n} s^2$
• $s^2 = \frac{\sum (Y_i - \bar{y})^2}{n-1}$, where the Yi are the ranks and n = n1 + n2
o Steps
• Rank values from smallest to largest
• Sum ranks in smaller group = T1
• Compute E(T1)
• Compute the variance Var(T1)
o Uses a Z distribution.
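The four steps above translate fairly directly into Python. This sketch assumes tied values share the average of their ranks (a common convention that the slide does not mention) and reads s² as the variance of the ranks. The two groups are made-up data.

```python
from math import sqrt
from statistics import variance


def average_ranks(values):
    """Rank from smallest to largest; ties get the average rank (assumed convention)."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        i = j
    return [rank_of[v] for v in values]


def mann_whitney_z(group_a, group_b):
    """Z = (T1 - E(T1)) / sqrt(Var(T1)), with T1 from the smaller group."""
    smaller, larger = sorted([group_a, group_b], key=len)
    ranks = average_ranks(list(smaller) + list(larger))
    n1, n2 = len(smaller), len(larger)
    n = n1 + n2
    t1 = sum(ranks[:n1])                        # sum of ranks in the smaller group
    expected_t1 = n1 * (n + 1) / 2
    var_t1 = n1 * n2 * variance(ranks) / n      # variance() divides by n - 1
    return (t1 - expected_t1) / sqrt(var_t1)


z_stat = mann_whitney_z([1, 2, 3], [4, 5, 6, 7])   # made-up data
print(f"Z = {z_stat:.4f}")
```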
Equations/Formulas for Inferential Statistics
• Kruskal Wallis
o Focuses on ranks (medians) rather than on means
o More than Two Groups
o Formula
• $X^2 = \frac{12}{n(n+1)} \sum_k \frac{T_k^2}{n_k} - 3(n+1)$
o Tk = total sum of ranks for sample k
o n = total number of cases
o nk = number of cases in sample k
o Uses X2 Distribution
o Degrees of Freedom = k-1 (where k is the number of groups)
o Use when you want to compare more than two groups, and the distribution is
not normal.
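A short Python sketch of the Kruskal Wallis statistic follows. As an assumption not stated on the slide, tied values receive the average of their ranks, and no further tie correction is applied. The three groups are made-up data.

```python
def average_ranks(values):
    """Rank from smallest to largest; ties get the average rank (assumed convention)."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        i = j
    return [rank_of[v] for v in values]


def kruskal_wallis(groups):
    """Return (X2, df) where X2 = 12 / (n (n + 1)) * sum(T_k^2 / n_k) - 3 (n + 1)."""
    pooled = [y for g in groups for y in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total = 0.0
    start = 0
    for g in groups:
        t_k = sum(ranks[start:start + len(g)])   # sum of ranks for this sample
        total += t_k ** 2 / len(g)
        start += len(g)
    statistic = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    return statistic, len(groups) - 1


x2, df = kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])   # made-up data
print(f"X2 = {x2:.2f} with df = {df}")
```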
Equations/Formulas for Inferential Statistics
• Formulas for Sample Size
Sample size (n) =
๐‘‘๐‘‘
.9604
๐‘
(๐‘+1)
d = margin of error (usually .05)
N = population size
.9604 = a constant related to being at least 95% sure
This sample size is large enough that we can be at least
95% sure we can generalize to the population with a
margin of error of .05
• Prepared by Dr. Carol Albrecht