AP Statistics: TPS3e Review Notes, Ch 1 - 8

Sec 1.1 - Displaying Distributions with Graphs

Statistics is the science of data. Data sets contain information about individuals (often people, but also animals, things, etc.). The characteristics describing these individuals are variables.
- Categorical variables place individuals into non-numerical categories (gender, color, etc.). Described with bar graphs and pie charts.
- Quantitative variables take numerical values (height in inches, age in years, etc.). Described with dotplots, stemplots, and histograms.
[Example graphs: stemplot, dotplot, histogram.]
Describing graphs: ***Look for a pattern and clear deviations from that pattern.*** Always describe the following four characteristics of the graph, when applicable:
- Center: the value that divides the distribution in half
- Spread: give the smallest and largest values
- Shape: some graphs have simple shapes:
  - Symmetric: the graph looks roughly the same on both sides of the center
  - Skewed right or skewed left (see Sec 1.2)
  - Uniform: not skewed one way or the other
  - Unimodal: one mode ("peak"); bimodal: two modes ("peaks")
- Outliers: observations that lie outside the overall pattern of the distribution
Percentiles: the nth percentile of a distribution is a value such that n percent of the observations fall at or below it.
Ogive: a relative cumulative frequency graph.

Sec 1.2 - Describing and Comparing Distributions

Describing shape: a distribution can be skewed left (the longer tail extends to the left, pulling the mean below the median), skewed right (the reverse of skewed left), or symmetric.
Measuring the center: the most common measure of center is the mean, or average. To find the mean of a set of observations, add the values and divide by the number of observations. The median is the midpoint of a distribution when the numbers are arranged from smallest to largest.
When the number of observations is odd, the median is the center observation. When the number of observations is even, the median is the mean of the two center observations. If the distribution is exactly symmetric, the mean and median are equal.
Measuring the spread:
- Quartiles: the first quartile, Q1, is the median of the observations to the left of the overall median in an ordered list; the third quartile, Q3, is the median of the observations to the right of the overall median.
- The interquartile range (IQR) is the distance between the quartiles: IQR = Q3 - Q1.
- Outliers: an observation can be called an outlier if it falls more than 1.5 x IQR above Q3 or below Q1. Outliers can be shown in a modified boxplot.
- Five-number summary: the minimum, Q1, median, Q3, and maximum of a data set.
- Standard deviation: how far a set of data is from its mean, on average: s = sqrt[ (1/(n-1)) Σ(xi - x̄)² ]. The square of the standard deviation is the variance.
The five-number summary is generally best for describing a skewed distribution; the mean and standard deviation are better for symmetric and nearly symmetric distributions.
Linear transformations change the original variable by adding or multiplying. Multiplying each number in a distribution by b multiplies the measures of center (mean, median) and the measures of spread (standard deviation and IQR) by b. Adding a to each observation adds a to the measures of center but has no effect on the spread.
Sec 2.1-2.2: Density Curves and the Normal Distributions

A density curve:
- is always on or above the horizontal axis
- has an area of exactly 1 underneath it
- the area under the curve over an interval equals the proportion of observations that fall in that interval of values
MEDIAN = equal-areas point (left area = 0.50 = right area). MEAN = balance point (if the curve were made of solid material).
- Symmetric density curve: MEAN = MEDIAN
- Skewed right: MEAN > MEDIAN
- Skewed left: MEAN < MEDIAN
Notation: MEAN: μ, STD DEV: σ.
Normal distributions are symmetric, unimodal, and bell-shaped: N(μ, σ), where μ = population mean and σ = population std dev.
EMPIRICAL RULE (68-95-99.7 rule):
- 68% of the observations fall within σ of the mean
- 95% of the observations fall within 2σ of the mean
- 99.7% of the observations fall within 3σ of the mean
An observation's percentile is the % of the distribution that is at or to the left of the observation.

Sec 3.1: Scatterplots & Correlation

SCATTERPLOT: used when you have two quantitative variables for the same individuals; X = explanatory variable, Y = response variable. Describe:
- FORM: linear or nonlinear (curved)
- DIRECTION: positive association (positive slope) or negative association (negative slope)
- STRENGTH: how closely the data points align with the regression line
- OUTLIERS: points that deviate from the overall pattern. A regression outlier has a large residual (large vertical y error); an influential point changes the LSRL, correlation, slope, and related calculations.

Sec 3.2-3.3: More on Correlation & Least-Squares Regression

Residual = vertical error = Observed - Predicted. By definition, the LSRL of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible. To calculate an LSRL, you must know the means x̄ and ȳ, the standard deviations Sx and Sy, and their correlation r.
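A minimal sketch of building the LSRL from those summary statistics alone, using the slope formula b = r(Sy/Sx) and intercept a = ȳ - b·x̄. All the summary values below are made up for illustration:

```python
# Hypothetical summary statistics for a data set (x, y)
x_bar, y_bar = 5.0, 12.0   # means
s_x, s_y = 2.0, 6.0        # standard deviations
r = 0.8                    # correlation

b = r * (s_y / s_x)        # slope: b = r * (Sy / Sx)
a = y_bar - b * x_bar      # intercept: a = y_bar - b * x_bar

def predict(x):
    """Predicted value y-hat = a + b*x from the LSRL."""
    return a + b * x

print(f"LSRL: y-hat = {a:.2f} + {b:.2f}x")   # y-hat = 0.00 + 2.40x
print(f"prediction at x = 6: {predict(6):.1f}")  # 14.4
print(f"r^2 = {r ** 2:.2f}")  # fraction of variation in y explained: 0.64
```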
Because the y value is a prediction rather than an observed value, we write ŷ (pronounced "y-hat"), making the LSRL equation: ŷ = a + bx. The individual components are:
- Slope: b = r(Sy/Sx)
- Intercept: a = ȳ - b·x̄
The coefficient of determination, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.
A residual is the difference between the observed y value and the y value predicted by the LSRL. A residual plot puts the residuals on the y-axis, with the same x values as the scatterplot. There should be no pattern in the residual plot: the points should be randomly scattered on either side of the x-axis. If there is a pattern, then the data are not well fit by a linear model.
An influential observation is any observation that, if removed, would drastically change the result of a calculation. Outliers in a scatterplot can be influential observations because they can change the LSRL.

Sec 4.2 - Relations in Categorical Data

Categorical variables include sex, race, and occupation; they can also be created by grouping values of a quantitative variable into classes (e.g., age groups, income levels). A two-way table is used to describe two categorical variables.

Years of school completed, by age, 2000 (thousands of persons):

Age group | Did not complete H.S. | Completed H.S. | 1-3 yrs of college | 4+ yrs of college | Total
25 to 34  |  4474                 | 11546          | 10700              | 11066             |  37786
35 to 54  |  9155                 | 26481          | 22618              | 23183             |  81435
55+       | 14224                 | 20060          | 11127              | 10596             |  56008
Total     | 27853                 | 58087          | 44445              | 44845             | 175230

The row totals and column totals in a two-way table give the marginal distributions of the two individual variables. Marginal distribution of education (as proportions): 27853/175230 did not finish H.S., 58087/175230 completed H.S., 44445/175230 with 1-3 yrs of college, 44845/175230 with 4+ yrs of college.
A conditional distribution refers to one categorical group, such as a level of education.
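The marginal and conditional distributions can be read straight off the table; a sketch in Python using the counts above:

```python
# Counts (thousands of persons) from the two-way table in Sec 4.2
table = {
    "25 to 34": {"no HS": 4474,  "HS": 11546, "1-3 college": 10700, "4+ college": 11066},
    "35 to 54": {"no HS": 9155,  "HS": 26481, "1-3 college": 22618, "4+ college": 23183},
    "55+":      {"no HS": 14224, "HS": 20060, "1-3 college": 11127, "4+ college": 10596},
}

grand_total = sum(sum(row.values()) for row in table.values())  # 175230

# Marginal distribution of education: column totals / grand total
for edu in ["no HS", "HS", "1-3 college", "4+ college"]:
    col_total = sum(table[age][edu] for age in table)
    print(f"P({edu}) = {col_total / grand_total:.3f}")

# Conditional distribution: P(1-3 yrs of college | age group) = cell / row total
for age, row in table.items():
    print(f"P(1-3 college | {age}) = {row['1-3 college'] / sum(row.values()):.3f}")
```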
P(1 to 3 yrs of college | 25-34 yr old) = 10700/37786 = .283 = 28.3%
P(1 to 3 yrs of college | 35-54 yr old) = 22618/81435 = .278 = 27.8%
P(1 to 3 yrs of college | 55+ yr old) = 11127/56008 = .199 = 19.9%
We can compare the conditional distributions of various categories using bar graphs.
Simpson's Paradox refers to the reversal of the direction of a comparison or an association when data from several groups are combined into a single group. It is an example of the effect of lurking variables on an observed association.

Sec 4.3 - Cautions about Correlation and Regression

Correlation and regression must be interpreted carefully:
- Plot the data and be sure that the relationship is roughly linear; correlation and regression describe only linear relationships.
- The correlation r and the least-squares regression line are not resistant.
- Correlations based on averages are usually too high when applied to individuals.
- Extrapolation is the use of a regression line for prediction far outside the domain of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate.
- A lurking variable is a variable that is not among the explanatory or response variables in a study, yet may influence the interpretation of relationships among those variables.
- Causation: a change in X causes a change in Y. Ex: SXSW causes heavier traffic. Even a very strong association between two variables is not by itself good evidence of a cause-and-effect link; to establish causation, one must run a designed experiment.
- Common response: X and Y respond in common to changes in an unseen variable. Ex: ice cream sales and shark attacks both increase during summer.
- Two variables are confounded when their effects on a response variable cannot be distinguished from each other. The confounded variables may be either explanatory variables or lurking variables.

Sec 5.1 - Designing Samples

EXPLORATORY ANALYSIS VS.
STATISTICAL INFERENCE
Exploratory analysis provides data about the variables and their relations to each other. Statistical inference produces answers to specific questions that generalize to the whole population, together with confidence levels expressing how certain we are of those answers.
OBSERVATION VS. EXPERIMENT
An observational study observes individuals and measures the wanted information, but does not attempt to influence the results. An experiment imposes a treatment on individuals in order to observe their responses.
POPULATION VS. SAMPLE
Population - the entire group of individuals that we want information about. Sample - the part of the population that we actually examine in order to gather information.
SAMPLE VS. CENSUS
Sample - studying a part of the population to gain information about the whole. Census - attempts to contact every individual in the entire population.
Select types of bias (NOTE: you can never eliminate bias, only limit it):
- Undercoverage: occurs when some groups in the population are left out of the process of choosing the sample
- Nonresponse: occurs when individuals can't be contacted or do not cooperate
- Wording of the question: confusing or leading questions can introduce strong bias, and even minor changes in wording can change a survey's outcome
Sampling designs:
- Simple random sample (SRS): every set of n individuals has an equal chance to be the sample actually selected. To choose an SRS, assign a numerical label to each individual in the population, then use a random digits table or random number generator to select labels.
- Voluntary response samples consist of people who choose to respond.
- Stratified random samples divide the population into groups of similar individuals (strata), choose an SRS within each stratum, and combine the SRSs to form the full sample.
- Multistage samples select successively smaller groups within the population, resulting in samples consisting of clusters of individuals.
- Convenience samples choose the individuals
easiest to reach for the sample.
- Cluster samples divide the population into groups (clusters) that are representative of the population, then take an SRS of the clusters to determine the sample.
In a random digits table: each entry is equally likely to be any of the digits 0 to 9, and entries are independent of each other.
NOTE: Larger random samples give more accurate results than smaller samples.

Sec 5.2 - Designing Experiments

Experimental units are the things that undergo treatments. Experiments can examine several factors at once because they can match and combine different levels of the factors. Ex: a study of aspirin (Factor 1: aspirin or placebo) and beta carotene (Factor 2: beta carotene or placebo) has two factors, each with two levels (yes, no), giving four treatments: aspirin + beta carotene, aspirin + placebo, placebo + beta carotene, placebo + placebo.
Advantages of experiments: they can give good evidence for causation, and they are able to study the specific factors of interest.
Control groups are usually given a sham treatment (usually a placebo).
Principles of experimental design:
- Control the effects of lurking variables by comparing treatments. Comparing is the simplest form of control; it takes care of confounding the effect of a treatment with other influences, such as lurking variables.
- Randomize to reduce bias, leaving the assignment of units to treatments to chance.
- Replicate each treatment on many units to reduce chance variation.
Block design: group experimental units that are similar in some way that is important to the response, then randomize within each block (ex: split by gender before randomization).
Matched pairs - a common form of blocking for comparing just two treatments. Two ways: each subject receives both treatments in random order, or subjects are matched in pairs as closely as possible and one subject in each pair receives each treatment.
The biggest weakness of experiments is lack of realism: the experiment may not imitate real life in the way needed to study the specific condition.
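Randomization of subjects to treatment groups, as described above, can be sketched with a random number generator. The group sizes, labels, and seed below are hypothetical:

```python
import random

random.seed(42)  # seed for reproducibility; the value is arbitrary

subjects = list(range(1, 21))   # 20 subjects labeled 1-20
random.shuffle(subjects)        # random order, like drawing labels from a table

treatment = sorted(subjects[:10])  # first 10 after shuffling get the treatment
control = sorted(subjects[10:])    # remaining 10 get the placebo

print("Treatment group:", treatment)
print("Control group:  ", control)
```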
An example of a block experimental design: subjects are first split into blocks by gender (men, women). Within each block, subjects are randomly assigned to Group 1, Group 2, or Group 3, which receive Therapy 1, 2, or 3 respectively; survival is then compared within each block.
To randomize, use a table of random digits or randInt on the calculator.
A way to prevent bias is to make the experiment double-blind: neither the subject nor the experimenter knows which treatment the subject received.

Sec 6.1 - Simulating Experiments

Three methods for solving questions involving chance:
1. Carry out an experiment many times and calculate the result's relative frequency. Cons: slow, costly, impractical, logistically difficult.
2. Develop a probability model and use it to calculate a theoretical answer (like the ones developed in Ch. 6). Cons: we must know something about the rules of probability, so this may not be feasible.
3. Start with a model that reflects the truth about the experiment, and develop a procedure to imitate (simulate) a number of repetitions of the experiment. Pros: quick, and simplifies difficult problems via the use of a calculator rather than formal mathematical analysis. (This is the method we are exploring.)
Simulation: the imitation of chance behavior, based on a model that accurately reflects the experiment under consideration.
Independence: when one trial does not affect another.
Steps for a simulation:
1. State the problem / describe the experiment.
2. State the assumptions (expected probabilities, independence of trials, etc.).
3. Assign digits to represent outcomes. Note: if you are only using digits up to a certain number (ex. 01-20), you must note that you are skipping the other digits that are drawn (i.e. 00, 21-99).
4. Simulate many repetitions.
5. State your conclusions.
To conduct a simulation: a. use a random number table (state which row you are using), or b.
use the calculator. First seed it (TI-83: any number → rand; TI-89: rand(any number)). Then use randInt(lower bound, upper bound, # of random integers desired) on the TI-83, or tistat.randint(lower bound, upper bound, # of random integers desired) on the TI-89. Note: depending on the experiment, you may or may not be able to include repeated numbers in your simulation.

Sec 6.2-6.3: Probability

Big idea: chance behavior is unpredictable in the short run, but has a regular and predictable pattern in the long run. (Remember flipping a coin a ton of times.)
P(A) = (# outcomes in A) / (# outcomes in sample space)
Rules:
- Trials must be independent: each trial should not affect another one.
- Any probability is a number between 0 and 1.
- All possible outcomes together (the sample space) must have probability 1.
- Complement rule: the probability an event does not occur is 1 minus the probability the event occurs: P(A^c) = 1 - P(A).
- Addition rule: two events A and B are disjoint (mutually exclusive) if they have no outcomes in common and can never happen at the same time. For disjoint events, P(A or B) = P(A∪B) = P(A) + P(B).
- Multiplication rule: if events A and B are independent, then P(A and B) = P(A∩B) = P(A)·P(B).
Vocabulary and definitions:
- Random: outcomes are uncertain, but there is a regular distribution of outcomes in a large number of repetitions.
- Probability: the proportion of times the outcome would occur in a long series of repetitions; long-term relative frequency.
- Sample space S: the set of all possible outcomes.
- Event: an outcome or set of outcomes of a random phenomenon.
- Probability model: a mathematical description of a random phenomenon consisting of a sample space S and a way of assigning probabilities to events.
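The complement, addition, and multiplication rules above can be checked by enumerating a small sample space, such as two fair dice:

```python
from itertools import product
from fractions import Fraction

# Sample space for rolling two fair dice: 36 equally likely outcomes
space = list(product(range(1, 7), repeat=2))

def p(event):
    """P(A) = (# outcomes in A) / (# outcomes in the sample space)."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

def doubles(o):
    return o[0] == o[1]

def sum_seven(o):
    return o[0] + o[1] == 7

# Complement rule: P(A^c) = 1 - P(A)
assert p(lambda o: not doubles(o)) == 1 - p(doubles)

# Addition rule for disjoint events: doubles and a sum of 7 can never both happen
assert p(lambda o: doubles(o) or sum_seven(o)) == p(doubles) + p(sum_seven)

# Multiplication rule for independent events: the two dice are independent
assert p(lambda o: o == (6, 6)) == p(lambda o: o[0] == 6) * p(lambda o: o[1] == 6)

print("P(doubles) =", p(doubles), "; P(sum of 7) =", p(sum_seven))  # both 1/6
```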
Key definitions:
- Intersection: a set that contains the elements shared by two or more given sets.
- Disjoint: when two or more sets have nothing in common.
- Union: a set, every member of which is an element of one or another of two or more given sets.
- Independent: the occurrence of one event does not affect the probability of another occurring.
Key equations:
- General multiplication rule for any two events: P(A and B) = P(A)·P(B | A)
- General addition rule for unions of two events: P(A or B) = P(A) + P(B) - P(A and B)
- Conditional probability: P(B | A) = P(A and B) / P(A)
- Addition rule for disjoint events: if A, B, and C are disjoint, then P(one or more of A, B, C) = P(A) + P(B) + P(C)
- Independence test: A and B are independent if P(B | A) = P(B), or equivalently P(A and B) = P(A)·P(B)
- Bayes's rule: there is a complicated formula for it, but don't worry about it; use the 2x2 table approach that you learned in class.

Sec 7.1 - Discrete and Continuous Random Variables

Random variable: a variable whose value is a numerical outcome of a random phenomenon. For example, if X is the result rolled on a die (1, 2, 3, 4, 5, or 6), then X is a random variable because its value changes every time the die is rolled.
Continuous random variable: takes all values in an interval of numbers. Its probability distribution is described by a density curve; the probability of any event is the area under the density curve and above the values of X that make up the event. Only intervals of values have positive probability. If X is normally distributed with mean μ and standard deviation σ, then the standardized variable is z = (X - μ)/σ.
Discrete random variable: has a countable number of possible values, displayed as:

Value of X:  x1  x2  x3  ...  xk
Probability: p1  p2  p3  ...  pk

The probabilities pi must satisfy two requirements:
1. Every probability pi is a number between 0 and 1.
2. p1 + p2 + ... + pk = 1.
Normal distribution: one type of continuous probability distribution.
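The 2x2 table approach to Bayes-type problems mentioned above can be sketched numerically. All counts and rates below (prevalence, sensitivity, false-positive rate) are hypothetical:

```python
# Hypothetical screening test: 1% prevalence, 90% sensitivity,
# 5% false-positive rate, applied to an imagined table of 10,000 people.
N = 10_000
disease = round(0.01 * N)        # 100 people have the condition
healthy = N - disease            # 9,900 do not

# Fill in the 2x2 table of counts:
#                  test +   test -   total
#   disease           90       10      100
#   no disease       495     9405     9900
pos_and_disease = round(0.90 * disease)   # true positives
pos_and_healthy = round(0.05 * healthy)   # false positives
total_pos = pos_and_disease + pos_and_healthy

# Read the conditional probability straight off the table:
# P(disease | test +) = (disease and +) / (all +)
print(f"P(disease | positive) = {pos_and_disease / total_pos:.3f}")  # 0.154
```

Despite the accurate-sounding test, most positives come from the large healthy group, which is exactly the kind of result the 2x2 table makes easy to see.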
Uniform distribution: represents results spread evenly over an interval; its density curve has height 1 over the interval from 0 to 1. The area under a density curve is 1, and the probability of any event is the area under the density curve and above the event in question.
Designing a simulation:
- TI-89: tistat.randNorm(μ, σ, n) → list1, then SortA list1
- TI-83: randNorm(μ, σ, n) → L1, then SortA(L1)
This creates a sorted list of n values drawn from a normal distribution with the given mean and standard deviation.

Sec 7.2 - Mean and Variance of Discrete Random Variables

The probability distribution of a random variable X, like a distribution of data, has a mean μX and a standard deviation σX. Suppose that X is a discrete random variable with the distribution:

Value of X:  x1  x2  x3  ...  xk
Probability: p1  p2  p3  ...  pk

The mean of a discrete random variable is also called its expected value. To find the mean of X, multiply each value by its probability and add the products:
μX = x1p1 + x2p2 + x3p3 + ... + xkpk = Σ xipi
The variance is the average squared deviation (xi - μX)² of the variable X from its mean μX:
σ²X = (x1 - μX)²p1 + (x2 - μX)²p2 + ... + (xk - μX)²pk = Σ (xi - μX)²pi
The standard deviation σX of X is the square root of the variance.
Law of Large Numbers: the more observations you take, the closer the mean of the observed values gets to the mean μ of the population. This also works for proportions.
Rules for means and variances: if X is a random variable and a and b are fixed numbers, then
μ(a+bX) = a + bμX (the mean of a + bX equals a plus b times the mean of X)
σ²(a+bX) = b²σ²X (the variance of a + bX equals b² times the variance of X)
If X and Y are random variables, then μ(X+Y) = μX + μY and μ(X-Y) = μX - μY.
If X and Y are independent random variables, then σ²(X+Y) = σ²X + σ²Y and σ²(X-Y) = σ²X + σ²Y (the variance of either X + Y or X - Y equals the variance of X plus the variance of Y).
* You can never add standard deviations, only variances.
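The mean and variance formulas above, applied to a small made-up discrete distribution:

```python
# Discrete random variable X with a hypothetical distribution
values = [0, 1, 2, 3]
probs = [0.1, 0.3, 0.4, 0.2]
assert abs(sum(probs) - 1) < 1e-12  # probabilities must sum to 1

# Mean (expected value): mu_X = sum of x_i * p_i
mu = sum(x * p for x, p in zip(values, probs))

# Variance: sigma^2_X = sum of (x_i - mu_X)^2 * p_i
var = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
sd = var ** 0.5  # standard deviation is the square root of the variance

print("mean:", round(mu, 3))      # 0*0.1 + 1*0.3 + 2*0.4 + 3*0.2 = 1.7
print("variance:", round(var, 3)) # 0.81
print("std dev:", round(sd, 3))   # 0.9
```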
Any linear combination of independent normal random variables is normally distributed.

Sec 8.1: The Binomial Distribution

Binomial setting:
- 2 outcomes (success & failure)
- fixed number of observations (n)
- independent observations
- P(success) = p is (about) constant
Binomial distribution: the distribution of the number of successes X, written B(n, p).
Probability distribution function (pdf): assigns a probability to each value of x, where x is a discrete random variable. Under the DISTR menu (2nd VARS): binompdf(n, p, k).
Cumulative distribution function (cdf): the sum of all the probabilities over [0, x]; the probability of at most x successes in n trials. Under the DISTR menu (2nd VARS): binomcdf(n, p, x).
Binomial coefficient: the number of ways to get k successes in n observations:
nCk = n! / [k!(n - k)!]
("n choose k"; on the formula sheet, and under the PRB menu (MATH) as nCr.)
Binomial probability: the probability of exactly k successes in n observations:
P(X = k) = (nCk) p^k (1 - p)^(n-k)
Mean & standard deviation of B(n, p): μ = np and σ = √(npq), where q = 1 - p (these are also on the formula sheet).
Normal approximations: as n increases, the distribution B(n, p) becomes more normal. If np ≥ 10 and n(1-p) ≥ 10, then you can use the approximation N(np, √(npq)).
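The binomial formulas above (pdf, cdf, mean, standard deviation) can be sketched directly; the calculator's binompdf/binomcdf compute the same quantities. The n and p values are made up for illustration:

```python
from math import comb, sqrt

n, p = 20, 0.3  # hypothetical B(20, 0.3)

def binom_pdf(k):
    """P(X = k) = (nCk) * p^k * (1-p)^(n-k), like binompdf(n, p, k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_cdf(k):
    """P(X <= k): sum of the pdf from 0 to k, like binomcdf(n, p, k)."""
    return sum(binom_pdf(i) for i in range(k + 1))

mu = n * p                     # mean: np
sigma = sqrt(n * p * (1 - p))  # std dev: sqrt(np(1-p)) = sqrt(npq)

print(f"P(X = 6)  = {binom_pdf(6):.4f}")
print(f"P(X <= 6) = {binom_cdf(6):.4f}")
print(f"mean = {mu}, std dev = {sigma:.3f}")  # mean = 6.0, std dev ~ 2.049
```

Here np = 6 < 10, so the normal approximation N(np, √(npq)) would not be appropriate for this n and p; the exact pdf/cdf must be used.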