BASIC STATISTICAL CONCEPTS

These are some of the topics that you are required to know for the Data Mining class. If you are not familiar with some of them, please read about them in your basic statistics textbook, or come and talk to me and I will give you a book to read.

1. t-test:

   1. The one-sample t test statistic is given by

   $$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} $$

   where $\mu_0$ is the hypothesized value of the population mean. t has approximately a t distribution with $n - 1$ degrees of freedom.

   2. The two-sample pooled t test statistic is given by

   $$ t = \frac{|\bar{x}_1 - \bar{x}_2|}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$

   where

   $$ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $$

   is the pooled estimate of variance. t has approximately a t distribution with $n_1 + n_2 - 2$ degrees of freedom.

   The un-pooled version of the statistic is

   $$ t = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} $$

   For the un-pooled statistic there is a formula to estimate the degrees of freedom and obtain a better approximation:

   $$ df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2} $$

   Finally, a conservative calculation of the degrees of freedom: df = the smaller of $n_1 - 1$ and $n_2 - 1$.

2. THE ANALYSIS OF VARIANCE (ANOVA)

This procedure is used to compare more than two population means. Suppose that we have I samples from I populations with means $\mu_1, \mu_2, \ldots, \mu_I$, and suppose we want to compare the I population means.

   $H_0$: $\mu_1 = \mu_2 = \cdots = \mu_I$
   $H_a$: not all of $\mu_1, \mu_2, \ldots, \mu_I$ are equal; at least two are different.

We have I samples with sizes $n_1, n_2, \ldots, n_I$, where $N = n_1 + n_2 + \cdots + n_I$ is the total sample size.
We have I sample means: $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_I$
We have I sample standard deviations: $s_1, s_2, \ldots, s_I$

The test statistic is F = MSG/MSE, which has an F-distribution with (I - 1) and (N - I) degrees of freedom.

Analysis of Variance Table:

              Df     Sum of Sq   Mean Sq   F-Value       Pr(F)
   Treatment  I-1    SSG         MSG       F=MSG/MSE     p
   Residuals  N-I    SSE         MSE
   Total      N-1    SST

3. Pearson's correlation coefficient: measures how close the relationship between $\{Y_{i1}\}$ and $\{Y_{i2}\}$ is to linear.

$$ \hat{\rho} = \frac{s_{12}}{s_1 s_2} $$

where, for $c = 1, 2$,

$$ s_c^2 = \frac{\sum_{i=1}^{N} (Y_{ic} - \bar{Y}_c)^2}{N} \quad \text{and} \quad s_{12} = \frac{\sum_{i=1}^{N} (Y_{i1} - \bar{Y}_1)(Y_{i2} - \bar{Y}_2)}{N} $$

4. SIMPLE AND MULTIPLE REGRESSION

See any textbook or go to www.rci.rutgers.edu/~cabrera/401

5. Outlier identification: Outliers are observations that appear to be inconsistent with the majority of the data.

   1. The z-score rule (Grubbs' test): Calculate a z-score, $z_i$, for every observation:

   $$ z_i = \frac{X_i - \bar{X}}{s} $$

   where $\bar{X}$ and $s$ are the mean and standard deviation. Call $X_j$ an outlier if $|z_j|$ is large, say, greater than five.

   2. The CV rule: Call the observation $X_j$ furthest from the mean, $\bar{X}$, an outlier if the coefficient of variation, $CV = s / \bar{X}$, exceeds some prespecified cutoff.

   3. The resistant z-score rule: Calculate a resistant z-score:

   $$ z_i^* = \frac{X_i - \tilde{X}}{\tilde{s}} $$

   where $\tilde{X}$ and $\tilde{s}$ are the median and MAD (median absolute deviation). Call $X_j$ an outlier if $|z_j^*| > 5$.

6. Chi-square test:

              D      Not D
   Risk       O11    O12
   No Risk    O21    O22

$$ W = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$

where $O_{ij}$ are the observed counts and $E_{ij}$ are the expected counts under independence (row total times column total, divided by the grand total). W has approximately a chi-square distribution with df = (nrows - 1)(ncols - 1).
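As a rough illustration of two of the rules above, the resistant z-score rule from topic 5 and the chi-square statistic W from topic 6 can be sketched in plain Python. The function names and the example numbers are made up for illustration and are not part of the course material. The sketch assumes that MAD means the raw median absolute deviation (no scaling constant) and that the expected counts are computed as row total times column total over the grand total.

```python
from statistics import median


def resistant_z_scores(xs):
    """Return (x - median) / MAD for each observation.

    Assumption: MAD is the raw median absolute deviation from the median,
    with no consistency constant applied.
    """
    center = median(xs)
    mad = median([abs(x - center) for x in xs])
    if mad == 0:
        raise ValueError("MAD is zero; resistant z-scores are undefined")
    return [(x - center) / mad for x in xs]


def chi_square_statistic(observed):
    """Return (W, df) for a table of observed counts given as a list of rows.

    Assumption: expected counts are row_total * col_total / grand_total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    w = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total
            w += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return w, df


# Hypothetical data: one value sits far from the rest.
data = [1, 2, 3, 4, 100]
z_star = resistant_z_scores(data)
outliers = [x for x, z in zip(data, z_star) if abs(z) > 5]

# Hypothetical 2x2 table laid out as [[O11, O12], [O21, O22]].
w, df = chi_square_statistic([[10, 20], [30, 40]])
print(outliers, round(w, 4), df)
```

To turn W into a p-value, compare it with the chi-square distribution with df degrees of freedom, using either a table or a statistics package.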