BASIC STATISTICAL CONCEPTS
These are some of the topics that you are required to know for
the Data Mining class. If you feel that you are not familiar with
some of them, please read about them in your basic statistics
textbook, or come and talk to me and I will give you a book to
read.
1. t-test:
1. The one-sample t test statistic is given by

t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

t has approximately a t distribution with n - 1 degrees of freedom.
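For illustration, here is a minimal Python sketch of the one-sample statistic, assuming a small hypothetical sample, a hypothesized mean mu0, and the numpy/scipy libraries:

import numpy as np
from scipy import stats

x = np.array([5.1, 4.9, 6.2, 5.8, 5.5, 6.0, 5.3, 5.7])  # hypothetical sample
mu0 = 5.0                                                # hypothesized mean (assumption)

# t = (xbar - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom
n = len(x)
t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
p = 2 * stats.t.sf(abs(t), df=n - 1)       # two-sided p-value

# Cross-check against scipy's built-in version
res = stats.ttest_1samp(x, popmean=mu0)
print(t, p, res.statistic, res.pvalue)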
2. The two-sample pooled t test statistic is given by

t = \frac{|\bar{x}_1 - \bar{x}_2|}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},

where

s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}

is the pooled estimate of variance. t has approximately a t distribution with n1 + n2 - 2 degrees of freedom. There is also a formula to estimate the degrees of freedom and obtain a better approximation:
df = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{1}{n_1 - 1} \left( \frac{s_1^2}{n_1} \right)^2 + \frac{1}{n_2 - 1} \left( \frac{s_2^2}{n_2} \right)^2}
Finally, a conservative calculation of df's:
df's = the smaller of n1 - 1 and n2 - 1.
The un-pooled version of the statistic is

t = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
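A minimal Python sketch of both two-sample versions, assuming two hypothetical samples; scipy's ttest_ind is used only as a cross-check, with its equal_var flag switching between the pooled and un-pooled statistics:

import numpy as np
from scipy import stats

x1 = np.array([12.1, 11.4, 13.0, 12.7, 11.9, 12.5])      # hypothetical samples
x2 = np.array([10.8, 11.2, 10.5, 11.9, 10.9, 11.5, 11.1])
n1, n2 = len(x1), len(x2)
s1, s2 = x1.std(ddof=1), x2.std(ddof=1)
diff = x1.mean() - x2.mean()   # the handout uses |diff|; the sign only affects the sign of t

# Pooled version: s_p^2 = ((n1-1)s1^2 + (n2-1)s2^2) / (n1 + n2 - 2)
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t_pooled = diff / np.sqrt(sp2 * (1 / n1 + 1 / n2))         # df = n1 + n2 - 2

# Un-pooled version with the better df approximation, plus the conservative df
se2 = s1**2 / n1 + s2**2 / n2
t_unpooled = diff / np.sqrt(se2)
df_approx = se2**2 / ((s1**2 / n1)**2 / (n1 - 1) + (s2**2 / n2)**2 / (n2 - 1))
df_conservative = min(n1 - 1, n2 - 1)

# Cross-check: equal_var=True is the pooled test, equal_var=False the un-pooled one
print(t_pooled, stats.ttest_ind(x1, x2, equal_var=True).statistic)
print(t_unpooled, df_approx, stats.ttest_ind(x1, x2, equal_var=False).statistic)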
2. THE ANALYSIS OF VARIANCE
(ANOVA)
This procedure is used to compare more than two population
means. Suppose that we have I samples from I populations with
means μ1, ..., μI, and suppose we want to compare the I
population means.

H0: μ1 = μ2 = ... = μI
Ha: not all of μ1, μ2, ..., μI are equal; at least two are different.

We have I samples with sizes n1, n2, ..., nI.
We have I sample means x̄1, x̄2, ..., x̄I.
We have I sample standard deviations s1, s2, ..., sI.
The test statistic is F = MSG/MSE, which has an F-distribution with
(I - 1) and (N - I) degrees of freedom, where N = n1 + n2 + ... + nI.
Analysis of Variance Table:

             Df     Sum of Sq   Mean Sq   F-Value        Pr(F)
Treatment    I-1    SSG         MSG       F = MSG/MSE    p
Residuals    N-I    SSE         MSE
Total        N-1    SST
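A minimal Python sketch of the ANOVA computation, assuming three hypothetical samples; scipy's f_oneway is used only to confirm the same F statistic and p-value:

import numpy as np
from scipy import stats

# Hypothetical samples from I = 3 populations
groups = [np.array([4.1, 4.5, 3.9, 4.3]),
          np.array([5.0, 4.8, 5.3, 5.1, 4.9]),
          np.array([4.4, 4.6, 4.2, 4.7])]
I = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

ssg = sum(len(g) * (g.mean() - grand_mean)**2 for g in groups)   # between-group sum of squares
sse = sum(((g - g.mean())**2).sum() for g in groups)             # within-group sum of squares
msg, mse = ssg / (I - 1), sse / (N - I)
F = msg / mse
p = stats.f.sf(F, I - 1, N - I)

print(F, p)
print(stats.f_oneway(*groups))   # same F and p-value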
3. Pearson’s correlation coefficient:
It measures how close the relationship between {Yi1} and {Yi2} is to a straight line.

\hat{\rho} = \frac{s_{12}}{s_1 s_2},

where

s_c^2 = \frac{1}{N} \sum_{i=1}^{N} (Y_{ic} - \bar{Y}_c)^2, \quad c = 1, 2,

and

s_{12} = \frac{1}{N} \sum_{i=1}^{N} (Y_{i1} - \bar{Y}_1)(Y_{i2} - \bar{Y}_2).
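A short Python sketch of this formula on hypothetical Y values; numpy's corrcoef gives the same answer because the 1/N factors cancel in the ratio:

import numpy as np

y1 = np.array([2.0, 3.1, 4.2, 5.0, 6.1])   # hypothetical {Y_i1}
y2 = np.array([1.8, 2.9, 4.5, 4.8, 6.3])   # hypothetical {Y_i2}
N = len(y1)

d1, d2 = y1 - y1.mean(), y2 - y2.mean()
s12 = (d1 * d2).sum() / N                  # covariance, divided by N as in the formula above
s1 = np.sqrt((d1**2).sum() / N)
s2 = np.sqrt((d2**2).sum() / N)
rho_hat = s12 / (s1 * s2)

print(rho_hat, np.corrcoef(y1, y2)[0, 1])  # identical values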
4. SIMPLE AND MULTIPLE REGRESSION
See any textbook or go to www.rci.rutgers.edu/~cabrera/401
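As a quick illustration only, a least-squares fit of a simple linear regression y = b0 + b1 x on hypothetical data; the x and y values below are made up:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # hypothetical predictor
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # hypothetical response

# Least-squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
b0 = y.mean() - b1 * x.mean()

# np.polyfit returns the same coefficients (highest degree first)
b1_np, b0_np = np.polyfit(x, y, deg=1)
print(b0, b1, b0_np, b1_np)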
5. Outlier identification:
Outliers are observations that appear to be inconsistent with
the majority of the data. Three rules are given below (a combined
code sketch follows the list).
1. The z-score rule (Grubbs' test): Calculate a z-score, zi, for
every observation:

z_i = \frac{X_i - \bar{X}}{s},

where X̄ and s are the sample mean and standard deviation. Call Xj
an outlier if |zj| is large, say greater than five.
2. The CV rule: Call the observation Xj furthest from the mean,
X̄, an outlier if the coefficient of variation, CV = s / X̄, exceeds
some prespecified cutoff.
3. The resistant z-score rule: Calculate a resistant z-score:

z_i^* = \frac{X_i - \tilde{X}}{\tilde{s}},

where X̃ and s̃ are the median and the MAD (median absolute
deviation). Call Xj an outlier if |z*j| > 5.
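A combined Python sketch of the three rules on a hypothetical sample; the CV cutoff of 0.2 and the use of the raw (unscaled) MAD are assumptions, since both are left unspecified above:

import numpy as np

x = np.array([9.8, 10.1, 10.4, 9.9, 10.2, 10.0, 17.5])  # hypothetical data, one suspect value

# 1. z-score rule: flag X_i if |z_i| > 5
z = (x - x.mean()) / x.std(ddof=1)
print("z-score outliers:", x[np.abs(z) > 5])

# 2. CV rule: flag the observation furthest from the mean if CV = s / xbar
#    exceeds a prespecified cutoff (0.2 here is an arbitrary assumption)
cv = x.std(ddof=1) / x.mean()
furthest = x[np.argmax(np.abs(x - x.mean()))]
print("CV rule outlier:", furthest if cv > 0.2 else None)

# 3. Resistant z-score rule: replace the mean with the median and s with the MAD
med = np.median(x)
mad = np.median(np.abs(x - med))        # raw MAD, no rescaling assumed
z_star = (x - med) / mad
print("resistant outliers:", x[np.abs(z_star) > 5])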
6. Chi-square test:
            D      Not D
Risk        O11    O12
No Risk     O21    O22
W = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},

where the expected counts are E_{ij} = (row i total)(column j total) / (grand total).
W has a chi-square distribution with df = (nrows - 1)(ncols - 1).