Chapter 1
Descriptive statistics: methods of summarizing data
Inferential statistics: generalizing from sample data
NO FORMULAS

Chapter 2
Variable
Data
   o Univariate
   o Bivariate
   o Multivariate
   o Categorical (qualitative)
   o Numerical (quantitative)
Placebo
Observation
Experiment: a planned intervention
Treatment: the intervention in an experiment
Control group: the group that does not get the treatment
Blind study: the individuals do not know whether they are getting the treatment or a placebo
Double blind: neither the individual nor the person measuring the results knows who is getting the treatment
Confounding variable: one whose impact cannot be distinguished from that of another variable
Extraneous factor: one not currently of interest but thought to affect the outcome
Census: data from the entire population
Bias
   o Selection: part of the population is systematically excluded
   o Measurement/response bias: the wording tends to influence the response
   o Nonresponse: intended individuals do not respond
Sampling
   o Simple random sample (SRS): all items have an equal chance of inclusion
   o Stratified sample: the population is predivided (economic factors, salary, distance from a river, age, etc.); choose an SRS from each group
   o Blocking: creating groups that are similar, to reduce the impact of other variables or to study them further
   o Cluster: choosing an SRS from areas (zip codes; in a big city, some from Chinatown, Little Italy, etc.)
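As a rough illustration of how a simple random sample differs from a stratified sample, the Python sketch below draws both from a small made-up population. The population, the age-group labels, and the helper names `simple_random_sample` and `stratified_sample` are assumptions for illustration only; none of them come from the course materials.

```python
import random
from collections import defaultdict


def simple_random_sample(population, n, seed=None):
    """Draw n items without replacement; every item is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Split the population into strata, then take an SRS from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[stratum_of(item)].append(item)
    # rng.sample raises ValueError if a stratum has fewer than n_per_stratum items
    return {label: rng.sample(members, n_per_stratum)
            for label, members in strata.items()}


# Hypothetical population of 60 people, each stored as (id, age_group)
population = [(i, "under 30" if i % 3 else "30 and over") for i in range(60)]

srs = simple_random_sample(population, 6, seed=1)
by_age = stratified_sample(population, lambda person: person[1], 3, seed=1)

print(srs)
print(by_age)
```

The SRS may happen to contain few or no people from the smaller age group, while the stratified sample always contains three from each group.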
Randomization: methods for assigning individuals to treatments “equally”
NO FORMULAS

Chapter 3
Frequency distribution: a table that displays the categories and the number of responses in each
Relative frequency: the proportion (percent) of responses in a category
Bar chart: categorical data; bars don’t touch
Pie chart: used when the number of categories is small enough to display easily
   o Percent × 360 = degrees in the wedge
Stem and leaf: used for small data sets; retains individual pieces of data
Dot plot: shows patterns; loses individual values in the data
Histogram: numeric data; bars do touch
   o Class width: the span of data values in a group; if not given, use [formula missing in source] to determine the number of classes for the data
Skewed: left (negative) or right (positive) indicates which tail is longer
Cumulative: add the next group to the previous total
Density: relative frequency / class width
Modes
   o Unimodal, bimodal, multimodal
Heavy tail: much data in the tails; the slope from the peak is shallower than that of a normal curve
Light tail: not much data in the tails (possible outliers); a rapid decrease to long tails, or just a more rapid decrease if there are no outliers (pg 77)
Sampling variability: the extent to which samples differ from the population
NO MAJOR FORMULAS other than box and whisker, IQR, outliers

Chapter 4
Mean: average, or balance point on a scale; impacted by outliers
Trimmed mean: order the set, then remove the trim percent from each end
Median: middle piece of data, with equal areas to both sides; not as impacted by outliers
Mode: most frequent data value
Range: difference between the high and low items
Deviations: difference of each piece of data from the mean
Standard deviation: the typical size of a deviation (the square root of the variance)
Variance: standard deviation squared
Box and whisker: shows shape, spread, and center of data; used to show relative normality
   o Outlier: more than 1.5 × IQR beyond the nearest quartile
   o Extreme outlier: more than 3 × IQR beyond the nearest quartile
   o 25% of the data in each part
Empirical Rule: key word is “approximately”
   o Within 1 sd: 68%; 2 sd: 95%; 3 sd: 99.7%
Chebyshev’s Rule: key word is “at least”
Z-score: tells how many sd an item is from the mean
Critical z: tells the percent of data below this
point
FORMULAS: page one of the formula sheet, and the bottom of page two for z-score

Chapter 5
Scatterplot: plots bivariate data
Correlation coefficient (r): ranges from -1 to 1; measures how well the data fit a linear relationship
   o Does not depend on which variable is x and which is y, or on the units of measure
   o -1 to -.8 and .8 to 1: strong
   o -.8 to -.5 and .5 to .8: moderate
   o -.5 to .5: weak
   o CORRELATION DOES NOT IMPLY CAUSATION
Spearman's rank correlation
   o Rank the x’s, rank the y’s, keep the pairs together, and plot the ranks to determine the correlation
Coefficient of determination (r²): the proportion of variation attributed to a linear relationship
Linear regression: y = a + bx
   o Turn on diagnostics (2nd 0 x^-1 diagnostics on enter)
   o Also called the least squares line
   o Interpolation: proper use of the line
   o Extrapolation: may work, but not recommended without stating that it is extrapolation
   o May be used to predict y from x, but not x from y
Residual: actual value minus predicted value (hint: AP class, Actual - Predicted)
Residual plot: plots (x, residual value)
   o Should not have any particular pattern if the linear regression is a good fit
Standard error or deviation of the least squares line: s_e = sqrt(SSResid/(n-2))
   o Notice n-2, not n-1 (hint to remember: there are two variables in a regression line)
Transforming non-linear data: use the chart to help determine how to transform the data, based on which quadrant the shape is following
   o + indicates raising the power on that variable
   o - indicates lowering the power on that variable, including taking the sqrt or the log of the variable
   o [chart of quadrants with + and - signs not reproduced]
r² can be found by taking 1 - SSResid/SSTo when those values are given
The ^ symbol indicates a predicted value
FORMULAS for a and b for the line are on page one of the formula sheet; usually use the calculator

Chapter 6
Sample space: all the possible outcomes of a chance experiment
Event: a collection of outcomes from the sample space
   o A simple event has one outcome
Disjoint or mutually exclusive events: have no outcomes in common
Probability: successful outcomes/all
outcomes
OR: P(E∪F) = P(E) + P(F) - P(E∩F); addition of the probabilities minus the overlap of the events
AND: P(E∩F) = P(E)P(F) for independent events; multiplication of the probabilities
Conditional probability: P(E|F), the probability of E given F
   o Reduces the number of total items by restricting the numerator and denominator to the given condition (i.e., look at one column or row in a chart and divide by the total of that row or column)
   o P(E|F) = P(E∩F)/P(F); can be cross multiplied if convenient to solve

Chapter 7
Independent events: the outcome of one situation does not depend on another
   o P(E|F) = P(E)
   o Therefore independence implies P(E∩F) = P(E)P(F)
Without replacement: the probability of the event changes with each selection
Random variable
Probability distribution
   o For a discrete variable: a histogram of the probabilities of each possible outcome; the sum of all the probabilities must equal 1
   o For a continuous variable: it becomes a smooth curve, the density function; relates back to z-scores and normal curves
Measures of central tendency for a probability distribution
   o μx describes where the probability distribution is centered, i.e., its mean; sometimes called the expected value
   o σx is the standard deviation of the distribution
   o Variance is σx², and SD is the square root of the variance
Measures of central tendency for a linear regression

Chapter 8
Binomial distribution
   o Has exactly two outcomes
   o P(s) = nCs π^s (1-π)^(n-s), where n = number of trials and s = number of successes desired
   o μx = nπ, σx = sqrt(nπ(1-π))
Geometric random variable
   o Sequence of trials ends when there is a success
   o p(x) = (1-π)^(x-1) π, where x is the number of trials until a success
Normal probability plot: plots (normal (z) score, observation)
   o A normal probability plot that is linear suggests normality
   o If r < critical r from the table on 368, then doubt is cast on linearity (probably not on AP test)
Statistic: a quantity computed from a sample
Sampling variability: the differences that occur in statistics based on the sample taken
Sampling distribution: the distribution of a statistic that occurs when many samples are taken
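The binomial and geometric probabilities listed under Chapter 8 can be checked numerically. The sketch below assumes the standard forms P(s) = nCs π^s (1-π)^(n-s), with mean nπ and standard deviation sqrt(nπ(1-π)), and p(x) = (1-π)^(x-1) π. The function names and the coin-flip values are hypothetical, chosen only for illustration, and `math.comb` needs Python 3.8 or later.

```python
from math import comb, sqrt


def binomial_pmf(s, n, pi):
    """Probability of exactly s successes in n independent trials."""
    return comb(n, s) * pi**s * (1 - pi)**(n - s)


def binomial_mean_sd(n, pi):
    """Mean and standard deviation of a binomial random variable."""
    return n * pi, sqrt(n * pi * (1 - pi))


def geometric_pmf(x, pi):
    """Probability that the first success happens on trial x."""
    return (1 - pi)**(x - 1) * pi


# Assumed example: 10 flips of a fair coin (pi = 0.5)
p_three_heads = binomial_pmf(3, 10, 0.5)        # 120 / 1024
mean, sd = binomial_mean_sd(10, 0.5)            # mean is 5.0
p_first_head_on_third = geometric_pmf(3, 0.5)   # 0.5 * 0.5 * 0.5

print(p_three_heads, mean, sd, p_first_head_on_third)
```

Summing `binomial_pmf(s, n, pi)` over s = 0 to n should give 1, which is a quick way to confirm that the probabilities form a valid distribution.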
Sampling distribution of a sample mean: the sample must be an SRS; the population is normal, or the distribution is approximately normal based on a large sample (n > 30); σx̄ = σ/sqrt(n)
Sampling distribution of a proportion: the sample must be an SRS; the distribution is approximately normal when np and n(1-p) > 5 (a liberal test)
   o Use p = .5 when not given

Chapters 9, 10, and 11
Confidence interval (CI): an interval of plausible values for a population characteristic. The confidence level describes how often such intervals capture the true value; i.e., for a 95% CI, about 95 of 100 CIs from the same sample size will contain the true population value (mean or proportion)
   o Bound: the amount added to and subtracted from the statistic
   o CI = statistic ± (critical z or t)(SD)
   o Use t when the SD for the population is not known
   o Standard error: the estimated standard deviation
   o Knowing the bound, you can find the necessary sample size for a specific CI
Null hypothesis (H0): the claim about a population that is initially assumed true
   o Always = to the value
Alternate hypothesis (Ha): the competing claim
   o Can be <, >, or ≠
Hypotheses are always about a population characteristic, μ or π
Types of error
   o Type I: rejecting H0 when it was true; its probability is denoted by α (level of significance)
   o Type II: failing to reject a false H0; its probability is denoted by β
   o To find β:
      o Determine the critical value by reverse calculating a z-score
      o Use the critical value against the assumed value to calculate the new z
      o Look up the critical z
      o Compute the p-value based on the type of test; this is β
   o The smaller the α, the larger the β
Observed significance level (p-value): derived from the test statistic (z or t)
Rejecting or failing to reject
   o P-value < α: reject
   o P-value > α: fail to reject
   o Use this terminology because statistics is about the probability, not the certainty, that something happens
Z gives the percentage below the given value; use 1 minus that for the upper tail
T gives the percentage above the given value, but only has values from 0 to 3, so a negative t has the same percentage below it as the positive t has above it
   o df = n-1
With both, be careful of the question asked
Power: the probability of rejecting a false null
   o Or Power = 1 - probability of not rejecting the false null
   o So Power = 1 - β
Two sample
means or proportion
   o Generally state μd or πd = sample 1 - sample 2
   o H0: μd or πd = 0 or the stated value
   o Ha: >, <, or ≠ 0
   o Determine the p-value and compare to α

Chapter 12
Other formulas for two sample tests are on page 3 of the formula packet
Paired samples: one before and one after a treatment
   o df = n-1
Chi squared testing
   o One column or row array: goodness of fit test
   o Two way table
      o Homogeneity: two separate samples are compared
      o Independence: one sample taken then split in two ways,