+ Chapter 1: Data Analysis

Statistics is the science of data. Data Analysis is the process of organizing, displaying, summarizing, and asking questions about data.

Definitions:
• Individuals – objects (people, animals, things) described by a set of data
• Variable – any characteristic of an individual
• Categorical Variable – places an individual into one of several groups or categories
• Quantitative Variable – takes numerical values for which it makes sense to find an average

+ Why the distinction is important

You will receive NO credit (really!) on the AP exam if you construct a graph that isn't appropriate for that type of data.

Type of Variable → Appropriate Graph
• Categorical → Pie Chart, Bar Graph
• Quantitative → Dotplot, Stemplot, Histogram

The purpose of a graph is to help us understand the data. After you make a graph, always ask, "What do I see?"

+ Examining the Distribution of a Quantitative Variable

In any graph, look for the overall pattern and for striking departures from that pattern. Describe the overall pattern of a distribution by its:
• Shape
• Center
• Spread
Don't forget your SOCS! Note individual values that fall outside the overall pattern. These departures are called outliers.

+ Displaying Quantitative Data: Construct a Boxplot

Consider our NY travel times data. Construct a boxplot.

Raw data: 10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
Sorted:   5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85

Min = 5, Q1 = 15, M = 22.5, Q3 = 42.5, Max = 85
Recall, 85 is an outlier by the 1.5 × IQR rule.

+ Transformations

Adding, subtracting, multiplying and dividing all the numbers in a data set is called "transforming" the data. ONE OF THE CENTRAL CONCEPTS FROM THIS CHAPTER is knowing how transformations affect measures of center and spread. I WILL TEST ON THIS. OFTEN. A LOT.

(Think about this like a curve on a test. If I add 5 points to everyone's test, how will the class average change? How will the spread of the grades change?)

Effect on mean and standard deviation:
• Adding or subtracting → the mean gets added or subtracted by the same amount; the standard deviation doesn't change.
• Multiplying or dividing → the mean gets multiplied or divided by the same amount; the standard deviation gets multiplied or divided by the same amount.

+ Measuring Position: z-Scores

A z-score tells us how many standard deviations from the mean an observation falls, and in what direction.

Definition: If x is an observation from a distribution that has known mean and standard deviation, the standardized value of x is
    z = (x − mean) / standard deviation
A standardized value is often called a z-score.

Example: Jenny earned a score of 86 on her test. The class mean is 80 and the standard deviation is 6.07. What is her standardized score?
    z = (86 − 80) / 6.07 ≈ 0.99

+ Mean and Median of a Density Curve

• Symmetric: Mean = Median
• Skewed Left: Mean < Median
• Skewed Right: Mean > Median

The median of a density curve is the equal-areas point, where ½ of the area is to the left and ½ of the area is to the right. The mean of a density curve is the balance point, where the curve would balance if it were made of solid material.

+ The Standard Normal Distribution

All Normal distributions are the same if we measure in units of size σ from the mean µ as center.

Definition: The standard Normal distribution is the Normal distribution with mean 0 and standard deviation 1. If a variable x has any Normal distribution N(µ, σ) with mean µ and standard deviation σ, then the standardized variable
    z = (x − µ) / σ
has the standard Normal distribution, N(0, 1).
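The Chapter 1 computations above (the five-number summary and the 1.5 × IQR outlier rule for the NY travel times) can be scripted directly. A minimal sketch: quartiles are computed AP-textbook style as medians of the lower and upper halves, so other tools (e.g. a default percentile function) may give slightly different quartiles.

```python
# Five-number summary and 1.5 x IQR outlier rule for the NY travel times data.
# Quartiles are found textbook-style: medians of the lower and upper halves.

def median(xs):
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def five_number_summary(xs):
    xs = sorted(xs)
    n = len(xs)
    # Exclude the median itself from the halves when n is odd.
    lower, upper = xs[:n // 2], xs[(n + 1) // 2:]
    return min(xs), median(lower), median(xs), median(upper), max(xs)

travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

mn, q1, m, q3, mx = five_number_summary(travel_times)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in travel_times if x < low_fence or x > high_fence]

print(mn, q1, m, q3, mx)      # 5 15.0 22.5 42.5 85
print(low_fence, high_fence)  # -26.25 83.75
print(outliers)               # [85]
```

The fences confirm the slide's claim: 85 > 83.75, so 85 is flagged as an outlier by the 1.5 × IQR rule.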
+ Explanatory and Response Variables

Most statistical studies examine data on more than one variable. In many of these settings, the two variables play different roles.

Definition: A response variable (the dependent variable) measures an outcome of a study. An explanatory variable (the independent variable) may help explain or influence changes in a response variable.

Note: In many studies, the goal is to show that changes in one or more explanatory variables actually cause changes in a response variable. However, other explanatory-response relationships don't involve direct causation.

+ Measuring Linear Association: Correlation

A scatterplot displays the strength, direction, and form of the relationship between two quantitative variables. Linear relationships are important because a straight line is a simple pattern that is quite common. Unfortunately, our eyes are not good judges of how strong a linear relationship is.

Definition: The correlation r measures the strength of the linear relationship between two quantitative variables.
• r is always a number between −1 and 1.
• r > 0 indicates a positive association.
• r < 0 indicates a negative association.
• Values of r near 0 indicate a very weak linear relationship.
• The strength of the linear relationship increases as r moves away from 0 towards −1 or 1.
• The extreme values r = −1 and r = 1 occur only in the case of a perfect linear relationship.

+ Facts about Correlation

How correlation behaves is more important than the details of the formula. Here are some important facts about r.
1. Correlation makes no distinction between explanatory and response variables.
2. r does not change when we change the units of measurement of x, y, or both.
3. The correlation r itself has no unit of measurement.

Cautions:
• Correlation requires that both variables be quantitative.
• Correlation does not describe curved relationships between variables, no matter how strong the relationship is.
• Correlation is not resistant. r is strongly affected by a few outlying observations.
• Correlation is not a complete summary of two-variable data.

+ Correlation and Regression Wisdom

Correlation and regression are powerful tools for describing the relationship between two variables. When you use these tools, be aware of their limitations.
1. The distinction between explanatory and response variables is important in regression.
2. Correlation and regression lines describe only linear relationships.
3. Correlation and least-squares regression lines are not resistant.

Definition: An outlier is an observation that lies outside the overall pattern of the other observations. Points that are outliers in the y direction but not the x direction of a scatterplot have large residuals. Other outliers may not have large residuals.

An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.

+ Sampling at a School Assembly

Describe how you would use the following sampling methods to select 80 students to complete a survey.
(a) Simple Random Sample
(b) Stratified Random Sample
(c) Cluster Sample

+ Example: Observational Study versus Experiment

Definition: An observational study observes individuals and measures variables of interest but does not impose treatment on the individuals. An experiment deliberately imposes some treatment on individuals to measure their responses. In contrast to observational studies, experiments don't just observe individuals or ask them questions. They actively impose some treatment in order to measure the response.
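Returning to correlation for a moment: the claim above that r is not resistant can be illustrated numerically. A short sketch with made-up data (all values here are hypothetical, chosen only so that one outlying point flips the sign of r):

```python
import math

def correlation(xs, ys):
    """Pearson correlation r, computed from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with a moderately strong positive linear relationship.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(correlation(x, y), 4))  # 0.7746

# Add a single point that is an outlier in both x and y:
# r is dragged all the way to a negative value. Correlation is not resistant.
print(round(correlation(x + [10], y + [0]), 4))  # negative
```

Note that r is unitless and symmetric in x and y, matching facts 1–3 above: swapping the two lists, or rescaling either one by a positive constant, leaves the result unchanged.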
The distinction between observational study and experiment is one of the most important in statistics. When our goal is to understand cause and effect, experiments are the only source of fully convincing data.

+ Completely Randomized Comparative Experiment

Experimental Units → Random Assignment → Group 1 receives Treatment 1; Group 2 receives Treatment 2 (Placebo) → Compare Response
• Random Assignment illustrates the principle of RANDOMIZATION.
• Choosing an adequately large sample ensures REPLICATION.
• Having a control group illustrates the principle of CONTROL.

+ Inference for Experiments

In an experiment, researchers usually hope to see a difference in the responses so large that it is unlikely to happen just because of chance variation. We can use the laws of probability, which describe chance behavior, to learn whether the treatment effects are larger than we would expect to see if only chance were operating. If they are, we call them statistically significant.

Definition: An observed effect so large that it would rarely occur by chance is called statistically significant. A statistically significant association in data from a well-designed experiment does imply causation.

+ Basic Rules of Probability

• For any event A, 0 ≤ P(A) ≤ 1.
• If S is the sample space in a probability model, P(S) = 1.
• In the case of equally likely outcomes,
    P(A) = (number of outcomes corresponding to event A) / (total number of outcomes in sample space)
• Complement rule: P(A^C) = 1 − P(A)
• Addition rule for mutually exclusive events: If A and B are mutually exclusive, P(A or B) = P(A) + P(B).

+ Venn Diagrams and Probability

Recall the example on gender and pierced ears. We can use a Venn diagram to display the information and determine probabilities. Define events A: is male and B: has pierced ears.

+ Conditional Probability and Independence

When knowledge that one event has happened does not change the likelihood that another event will happen, we say the two events are independent.

Definition: Two events A and B are independent if the occurrence of one event has no effect on the chance that the other event will happen. In other words, events A and B are independent if P(A | B) = P(A) and P(B | A) = P(B).

Example: Are the events "male" and "left-handed" independent? Justify your answer.
    P(left-handed | male) = 3/23 ≈ 0.13
    P(left-handed) = 7/50 = 0.14
These probabilities are not equal; therefore the events "male" and "left-handed" are not independent.

+ Calculating Conditional Probabilities

General Multiplication Rule: P(A ∩ B) = P(A) · P(B | A)
If we rearrange the terms in the general multiplication rule, we get a formula for the conditional probability:
    P(B | A) = P(A ∩ B) / P(A)

+ Example: Who Reads the Newspaper?

In Section 5.2, we noted that residents of a large apartment complex can be classified based on the events A: reads USA Today and B: reads the New York Times; a Venn diagram describes the residents. What is the probability that a randomly selected resident who reads USA Today also reads the New York Times?
    P(B | A) = P(A ∩ B) / P(A) = 0.05 / 0.40 = 0.125
There is a 12.5% chance that a randomly selected resident who reads USA Today also reads the New York Times.

+ Discrete and Continuous Random Variables

There are two types of random variables: discrete and continuous.
• Discrete random variables have a finite (countable) number of possible values. The table of possible values of x and the associated probabilities is called a PROBABILITY DISTRIBUTION. They usually arise out of COUNTING something. We use a histogram to graph a discrete random variable.
• Continuous random variables take on all values in an interval of numbers. So, there are an infinite number of possible values.
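The newspaper calculation above is a one-liner in code. A sketch: only P(A) = 0.40 and P(A ∩ B) = 0.05 come from the slides; the P(B) value used in the independence check is a hypothetical number added for illustration.

```python
# Conditional probability from the general multiplication rule:
#     P(B | A) = P(A and B) / P(A)
p_A = 0.40        # reads USA Today (from the slides)
p_A_and_B = 0.05  # reads both papers (from the slides)

p_B_given_A = p_A_and_B / p_A
print(p_B_given_A)  # 0.125 -- a 12.5% chance

# Independence check: A and B are independent only if P(B | A) equals P(B).
p_B = 0.25  # hypothetical overall NYT proportion, for illustration only
print(abs(p_B_given_A - p_B) < 1e-9)  # False -> not independent
```

This mirrors the left-handedness example: compare the conditional probability to the unconditional one, and declare independence only when they match.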
They usually arise out of MEASURING something. The graph of a continuous RV is a density curve.

+ Example: Young Women's Heights

Read the example on page 351. Define Y as the height of a randomly chosen young woman. Y is a continuous random variable whose probability distribution is N(64, 2.7). What is the probability that a randomly chosen young woman has height between 68 and 70 inches?
    P(68 ≤ Y ≤ 70) = ?
    z = (68 − 64) / 2.7 = 1.48 and z = (70 − 64) / 2.7 = 2.22
    P(1.48 ≤ Z ≤ 2.22) = P(Z ≤ 2.22) − P(Z ≤ 1.48) = 0.9868 − 0.9306 = 0.0562
There is about a 5.6% chance that a randomly chosen young woman has a height between 68 and 70 inches.

+ Linear Transformations

How does multiplying or dividing by a constant affect a random variable?
Multiplying (or dividing) each value of a random variable by a number b:
• Multiplies (divides) measures of center and location (mean, median, quartiles, percentiles) by b.
• Multiplies (divides) measures of spread (range, IQR, standard deviation) by |b|.
• Does not change the shape of the distribution.
Note: Multiplying a random variable by a constant b multiplies the variance by b².

How does adding or subtracting a constant affect a random variable?
Adding the same number a (which could be negative) to each value of a random variable:
• Adds a to measures of center and location (mean, median, quartiles, percentiles).
• Does not change measures of spread (range, IQR, standard deviation).
• Does not change the shape of the distribution.

+ Combining Random Variables

We can perform a similar investigation to determine what happens when we define a random variable as the difference of two random variables. In summary, we find the following:

Mean of the Difference of Random Variables: For any two random variables X and Y, if D = X − Y, then the expected value of D is
    E(D) = µD = µX − µY
In general, the mean of the difference of several random variables is the difference of their means.
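The Young Women's Heights probability above can be reproduced without Table A, since the standard Normal CDF is expressible through the error function in Python's math library. A quick sketch (the slide's 0.0562 comes from rounding the z-scores to 1.48 and 2.22 before using the table, so the exact answer differs slightly in the fourth decimal):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Y = height of a randomly chosen young woman, Y ~ N(64, 2.7)
p = normal_cdf(70, 64, 2.7) - normal_cdf(68, 64, 2.7)
print(p)  # about 0.056, matching the Table A answer above
```

The same helper standardizes any Normal variable, which is exactly the z = (x − µ)/σ idea: normal_cdf(x, mu, sigma) equals normal_cdf((x − mu)/sigma).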
The order of subtraction is important!

Variance of the Difference of Random Variables: For any two independent random variables X and Y, if D = X − Y, then the variance of D is
    σ²D = σ²X + σ²Y
In general, the variance of the difference of two independent random variables is the SUM of their variances.

+ The Binomial Formula

Now, we'll learn about the basis for the binomial calculations: the binomial formula.
    P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)
• Notice the C(n, k) factor. This is a combinatorial. It is read "n choose k."
• p stands for the probability of success.
• n represents the number of observations.
• k is the value of X of which you're asked to find the probability.

+ Comparison of Binomial to Geometric

Both: each observation has two outcomes (success or failure); the probability of success is the same for each observation; the observations are all independent.
• Binomial: there are a fixed number of trials. So, the random variable is how many successes you get in n trials.
• Geometric: there is a fixed number of successes (1). So, the random variable is how many trials it takes to get one success.

+ Mean & Variance of a Geometric RV

The formula for the mean of a geometric RV is
    µX = 1/p
The formula for the variance of a geometric RV is
    σ²X = (1 − p)/p²
These formulas are NOT given to you on the exam.

+ Summary of Random Variables

We've studied two large categories of RVs: discrete and continuous. Among the discrete RVs, we've studied the binomial and geometric. The graph of a binomial RV can be skewed left, symmetric, or skewed right, depending on the value of p. The graph of a geometric RV is ALWAYS skewed right. Always.
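The binomial formula and the geometric mean/variance formulas above translate directly into code. A sketch; the n = 5, p = 0.25 values are purely illustrative:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)"""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def geometric_mean_var(p):
    """Mean 1/p and variance (1 - p)/p^2 of a geometric random variable."""
    return 1 / p, (1 - p) / p**2

# Illustrative values: n = 5 trials, success probability p = 0.25.
print(binomial_pmf(2, 5, 0.25))  # C(5,2) * 0.25^2 * 0.75^3 = 0.263671875
print(geometric_mean_var(0.25))  # (4.0, 12.0)

# Sanity check: a pmf sums to 1 over all possible values of k.
assert abs(sum(binomial_pmf(k, 5, 0.25) for k in range(6)) - 1) < 1e-12
```

Tabulating binomial_pmf for small and large p also shows the shape claim above: with p = 0.25 the probabilities pile up on the left (right-skewed graph), while the geometric probabilities p(1 − p)^(k−1) decrease for every p, which is why a geometric RV is always skewed right.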
Other discrete RVs can be given to you in the form of a table. Among the continuous RVs, we've studied the normal RVs. To find probabilities of a normal RV, convert to a z-score and use Table A.

+ Parameters and Statistics

As we begin to use sample data to draw conclusions about a wider population, we must be clear about whether a number describes a sample or a population.

Definition: A parameter is a number that describes some characteristic of the population. In statistical practice, the value of a parameter is usually not known because we cannot examine the entire population. A statistic is a number that describes some characteristic of a sample. The value of a statistic can be computed directly from the sample data. We often use a statistic to estimate an unknown parameter.

Remember s and p: statistics come from samples and parameters come from populations. We write µ (the Greek letter mu) for the population mean and x̄ ("x-bar") for the sample mean. We use p to represent a population proportion. The sample proportion p̂ ("p-hat") is used to estimate the unknown parameter p.

+ PCFS for Sampling Distributions

Sample Proportions:
• Parameter: p = the proportion of _____ who …
• Conditions – Random: SRS; Normality: np ≥ 10 and n(1 − p) ≥ 10; Independence: population ≥ 10n
• Formula: z = (statistic − mean) / (std. dev.)
• Sentence

Sample Means:
• Parameter: µ = the mean …
• Conditions – Random: SRS; Normality: population ~ Normal OR "since n > 30, the CLT says the sampling distribution of x̄ is Normal"; Independence: population ≥ 10n
• Formula: z = (x̄ − mean) / (std. dev.)
• Sentence

+ Confidence Intervals – puzzles for math nerds

    statistic ± (critical value) · (standard deviation of statistic)

• Mean: statistic x̄; standard deviation of the statistic σ/√n; standard error of the statistic s/√n
• Proportion: statistic p̂; standard deviation of the statistic √(p(1 − p)/n); standard error of the statistic √(p̂(1 − p̂)/n)

+ Characteristics of the t-distributions

They are similar to the normal distribution. They are symmetric, bell-shaped, and are centered around 0. The t-distributions have more spread than a normal distribution.
They have more area in the tails and less in the center than the normal distribution. That's because using s to estimate σ introduces more variation. As the degrees of freedom increase, the t-distribution more closely resembles the normal curve. As n increases, s becomes a better estimator of σ.

+ Interpreting P-Values

The null hypothesis H0 states the claim that we are seeking evidence against. The probability that measures the strength of the evidence against a null hypothesis is called a P-value.

Definition: The probability, computed assuming H0 is true, that the statistic would take a value as extreme as or more extreme than the one actually observed is called the P-value of the test.

The smaller the P-value, the stronger the evidence against H0 provided by the data. Small P-values are evidence against H0 because they say that the observed result is unlikely to occur when H0 is true. Large P-values fail to give convincing evidence against H0 because they say that the observed result is likely to occur by chance when H0 is true.

+ Types of Errors

Truth about the population vs. conclusion based on the sample:
• Reject H0 when H0 is true → Type I error
• Reject H0 when H0 is false (Ha true) → correct conclusion
• Fail to reject H0 when H0 is true → correct conclusion
• Fail to reject H0 when H0 is false (Ha true) → Type II error

Type I Error: reject H0 when H0 is true. Type II Error: fail to reject H0 when H0 is false.

+ Recall: PCFS Steps for a One Proportion Z Interval

P: p = proportion of _________ who _____________
C: Random: SRS; Normality: np̂ ≥ 10 and n(1 − p̂) ≥ 10; Independence: population ≥ 10n
F: One Prop Z Interval → (_____, _____)
S: We are _____% confident that the interval captures the true proportion of __________ who _________.

+ PCFS for a One Prop Z Test — how many differences can you spot?
One Proportion Z Test steps:
P: p = proportion of _________ who _____________; H0: p = ______; Ha: p ≠ ______ (or <, >)
C: Random: SRS; Normality: np0 ≥ 10 and n(1 − p0) ≥ 10; Independence: population ≥ 10n
F: One Prop Z Test → z = ________, P = ________
S: Since P __ α, we reject / fail to reject H0. We conclude / cannot conclude that __________________________.

+ Cautions: Matched-Pairs t-Test

When conducting a matched-pairs t-test, you have to be careful.
• The parameter you're studying is µD – the mean DIFFERENCE. You also have to write which order you subtracted.
• The boxplot you construct is the distribution of DIFFERENCES.
• The name of the test (what you write in the formula section) is specifically a MATCHED-PAIRS T-TEST.
• Your conclusion should state something about the mean DIFFERENCE in ___.

+ Two Proportion Z Interval

Notice that now there are two of pretty much everything. Two parameters: p1 and p2. Two sample sizes: n1 and n2. Two statistics: p̂1 and p̂2. And the sentence talks about the difference in proportions.

Parameters: p1 = the proportion of ____ who _____; p2 = the proportion of ____ who _____
Conditions:
• Two independent random samples (or random assignment to treatment groups)
• Normality: n1p̂1 ≥ 10, n1(1 − p̂1) ≥ 10, n2p̂2 ≥ 10, and n2(1 − p̂2) ≥ 10
• Independence: Population 1 ≥ 10n1 and Population 2 ≥ 10n2
Formula: Two Prop Z Interval → (______, ______)
Sentence: We are ______% confident that the interval (____, _____) captures the true difference in proportion of _______ and ________ who ______________.

+ Why this gets difficult

Hypothesis tests or confidence intervals in isolation aren't too hard. For example, if you know in your homework you're only constructing Two Prop Z Intervals, that shouldn't be too hard. It's often very hard to tell the difference between matched pairs data and two sample data. On each of the following slides, determine if the data are single sample, matched pairs, or two sample settings.
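The one-proportion PCFS recipes above can be sketched end-to-end in code. Everything numeric here is an illustrative assumption: 280 successes out of n = 500, H0: p = 0.5 with a two-sided Ha, and z* = 1.96 for a 95% interval. Note the deliberate asymmetry the slides ask you to spot: the interval's standard error uses p̂, while the test's Normality check and standard deviation both use p0.

```python
import math

def normal_cdf(x):
    """Standard Normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def one_prop_z_interval(successes, n, z_star=1.96):
    """statistic +/- critical value * standard error (SE uses p-hat)."""
    p_hat = successes / n
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def one_prop_z_test(successes, n, p0):
    """Two-sided one-proportion z test; returns (z, P-value). Uses p0, not p-hat."""
    assert n * p0 >= 10 and n * (1 - p0) >= 10  # Normality condition with p0
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    return z, 2 * (1 - normal_cdf(abs(z)))

# Hypothetical data: 280 of 500 sampled students answer "yes"; H0: p = 0.5.
lo, hi = one_prop_z_interval(280, 500)
z, p = one_prop_z_test(280, 500, 0.5)
print(round(lo, 3), round(hi, 3))  # 95% interval around p-hat = 0.56
print(round(z, 2), round(p, 4))    # z about 2.68, P about 0.007
# Sentence: since P < alpha = 0.05, we reject H0.
```

The Sentence step still has to be written in words, in context, exactly as the PCFS templates above prescribe; the code only supplies the F numbers.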
Footer Text 5/3/2017

+ Examples 1–5

(The data sets for Examples 1–5 appeared on the slides themselves; for each one, decide whether the setting is single sample, matched pairs, or two samples.)

+ Inference Summary

Means (hypothesis tests and confidence intervals):
• One-sample t procedures
• Matched pairs t procedures
• Two-sample t procedures

Proportions (hypothesis tests and confidence intervals):
• One Proportion Z procedures
• Two Proportion Z procedures

+ χ² Tests Comparison Chart

Goodness of Fit:
• Hypotheses – H0: The sample distribution matches the hypothesized distribution. Ha: The sample distribution does not match the hypothesized distribution.
• Conditions – SRS; all of the expected counts are at least 5.
• Test – Enter observed counts in L1. Enter expected counts in L2 (you may have to calculate these). Use the χ² GOF command. df = number of categories minus 1.

Equal Proportions:
• Hypotheses – H0: p1 = p2 = p3 = … Ha: Not all of the proportions are equal.
• Conditions – Independent SRSs or random assignment; all of the expected counts are at least 5.
• Test – Enter observed counts in [A]. Use the χ² Test command. LOOK AT THE EXPECTED COUNTS IN [B] TO CHECK CONDITIONS!

Association:
• Hypotheses – H0: There is no relationship between the two variables. Ha: There is some relationship between the two variables.
• Conditions – SRS; all of the expected counts are at least 5.
• Test – Enter observed counts in [A]. Use the χ² Test command. LOOK AT THE EXPECTED COUNTS IN [B] TO CHECK CONDITIONS!

So, the difference between the equal proportions test and the association test is: ARE WE COMPARING SEVERAL POPULATIONS, or DID THE DATA ARISE BY CLASSIFYING OBSERVATIONS INTO CATEGORIES?
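Behind the calculator's χ² GOF command is just the statistic Σ(Observed − Expected)²/Expected with df = categories − 1. A sketch with made-up die-roll counts; the critical value 11.07 (α = 0.05, df = 5) is the standard table value, quoted here for comparison:

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic and its degrees of freedom."""
    assert all(e >= 5 for e in expected), "all expected counts must be at least 5"
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return stat, df

# Hypothetical data: 120 rolls of a die; H0 says each face is equally likely.
observed = [18, 22, 16, 25, 12, 27]
expected = [120 / 6] * 6  # 20 expected per face -- condition satisfied

stat, df = chi_square_gof(observed, expected)
print(stat, df)  # about 8.1, df = 5
# Compare to the chi-square critical value 11.07 (alpha = 0.05, df = 5):
# 8.1 < 11.07, so we fail to reject H0 -- no convincing evidence the die is unfair.
```

The same statistic, with expected counts computed from row and column totals, drives the equal-proportions and association tests; only the hypotheses and the df formula change.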