AP Statistics Midterm Overview/Review
Fall 2015
The test consists of 50 multiple-choice questions.
You will be given a formula sheet and tables, and you may use your calculator. Calculators may not be shared.
Break-down of number of multiple choice questions by chapter:
Chapter 1: 8 questions
Chapter 2: 5 questions
Chapter 3: 7 questions
Chapter 4: 8 questions
Chapter 5: 8 questions
Chapter 6: 8 questions
Chapter 7: 6 questions
Facts about Chapter 1
1. Individuals are the people or objects described by the data. A variable is a characteristic of an individual. There are different types
of variables:
a. Categorical Variable – records which of several groups or categories an individual belongs to; also called a qualitative
variable
b. Quantitative Variable – numerical values for which it makes sense to do arithmetic operations
2. Presentation of Data:
a. Bar Chart – compares the sizes of the groups or categories by count or by proportion
b. Pie Chart – Compares parts to the whole
c. Dotplot – displays each observation of one quantitative variable as a dot above its value on a number line, showing the spread of the data
d. Histogram – displays one quantitative variable by grouping its values into classes and graphing the count (or percent) in each class
e. Stemplot – organizes and groups data, but allows us to see as many of the digits in each data value as
we wish
f. Time Plot- Plots each observation against the time at which it occurred (reveals trends)
g. Ogive – relative cumulative frequency graph, percentiles
h. Boxplot – a graph of the five number summary
3. When giving a numerical summary of a set of data, both the center and the spread should be given. The shape of a distribution can
be described as:
a. Symmetric – right and left sides of the histogram are approximately mirror images of each other
b. Skewed right – the right side of the distribution extends much farther out than the left
c. Skewed left – the left side of the distribution extends much farther out than the right
4. The mean and the median describe the center of a distribution in different ways. The mean is the
arithmetic average of the observations, and the median is the midpoint of the values. The mean and
median are close in symmetric distributions. In a skewed distribution, the mean is pulled out toward the long tail.
5. The mean and standard deviation are strongly influenced by outliers or skewness in a distribution. They
serve as good descriptions for symmetric distributions and are most useful for the normal distributions.
The median and quartiles are not affected by outliers. The five-number summary is the preferred
numerical summary for skewed distributions.
6. When you use the median to indicate the center of a distribution, its spread is given by the quartiles. The first
quartile Q1 has one-fourth of the observations below it, and the third quartile Q3 has three-fourths of the
observations below it (one-fourth above it). An observation is an outlier if it is smaller than Q1 – (1.5 x IQR) or larger
than Q3 + (1.5 x IQR), where IQR = Q3 – Q1 (see the worked sketch at the end of this chapter's facts).
7. The five-number summary is (Min, Q1, Median, Q3, Max). Boxplots are based on the five-number summary
and are useful for comparing two or more distributions. Outliers may be plotted as isolated points.
8. The variance, s², and the standard deviation, s, are common measures of spread about the mean. The
standard deviation s is zero when there is no spread and it gets larger as the spread increases.
9. When you add a constant, a, to all the values in a data set, the mean and median increase by a. Measures
of spread (s and IQR) do not change. When you multiply all the values in a data set by a constant, b, the
mean, median, IQR, and standard deviation are multiplied by b. These linear transformations are useful
for changing units of measurement.
10. Two-way tables organize data about two categorical variables: a row variable and a column variable. The row totals
and the column totals give the marginal distributions. The sum of the row totals and the sum of the column totals
are equal (both give the table total).
11. To find the conditional distribution of the row variable for one specific value of the column variable, look
only at that one column in the table. Find each entry in the column as a percent of the column total.
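A minimal Python sketch of facts 6 and 7 (the data values are hypothetical, and the quartiles use the "median of each half" convention; other software may use slightly different rules):

    # Hypothetical data set, sorted from smallest to largest.
    data = sorted([4.9, 5.3, 6.1, 6.8, 7.0, 7.4, 8.2, 9.0, 15.5])

    def median(values):
        n = len(values)
        mid = n // 2
        return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2

    n = len(data)
    lower_half = data[:n // 2]          # values below the overall median
    upper_half = data[(n + 1) // 2:]    # values above the overall median
    q1, med, q3 = median(lower_half), median(data), median(upper_half)
    iqr = q3 - q1
    fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)       # 1.5 x IQR rule
    outliers = [x for x in data if x < fences[0] or x > fences[1]]

    print("Five-number summary:", data[0], q1, med, q3, data[-1])
    print("IQR =", iqr, "fences =", fences, "outliers =", outliers)

Here the largest value falls above the upper fence, so it would be plotted as an isolated point on a boxplot.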
Facts about Chapter 2
1. A density curve is a mathematical model of a distribution; it lies on or above the horizontal axis, and the total area under the curve is 1.
2. A normal density curve has both the mean and median in the center with equal proportion to either side.
3. The median of the density curve is the equal areas point, the point with half the area under the curve to its
left and the remaining half of the area to its right.
4. The mean of the density curve is the balance point, at which the curve would balance if made of solid
material.
5. The mean is always pulled toward the tail for any skewed curve.
6. Normal distributions are the familiar bell-shaped curves. The standard deviation tells us about the spread of the
curve: a small standard deviation gives a tall, narrow curve, while a large standard deviation gives a
wide, flat curve.
7. The Empirical Rule: 68% of the observations fall within 1 SD of the mean, 95% of the observations fall
within 2 SDs of the mean, and 99.7% of the observations fall within 3 SDs of the mean.
8. We abbreviate a normal distribution with mean, μ, and standard deviation, σ, as N(μ, σ).
9. Standard normal distribution is the normal distribution N(0, 1).
10. A variable can be standardized by using the z-score: z = (x – μ)/σ. Remember that the tables give you the
proportion to the left of the given z-score (see the sketch after this chapter's facts)!
11. A Normal probability plot provides a good assessment of the adequacy of the Normal model for the data: if the plotted points lie close to a straight line, the Normal model is reasonable.
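A minimal Python sketch of facts 7 and 10, using a hypothetical N(266, 16) distribution; the standard Normal proportion to the left of z is computed with the error function instead of the table:

    from math import erf, sqrt

    def normal_cdf(z):
        # Proportion of the standard Normal distribution to the LEFT of z,
        # i.e. the value a z-table reports.
        return 0.5 * (1 + erf(z / sqrt(2)))

    mu, sigma = 266, 16          # hypothetical N(266, 16) distribution
    x = 240
    z = (x - mu) / sigma         # standardize: z = (x - mu) / sigma
    print("z =", round(z, 2))
    print("proportion below x =", round(normal_cdf(z), 4))

    # Empirical Rule check: about 95% of observations lie within 2 SDs of the mean.
    print("P(-2 < Z < 2) =", round(normal_cdf(2) - normal_cdf(-2), 4))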
Facts about Chapter 3
1. A response variable measures an outcome of a study, also called the dependent variable.
2. An explanatory variable attempts to explain the observed outcome, also called the independent variable.
3. A scatter plot shows the relationship between the two quantitative variables measured on the same
individuals.
4. Positive association means that above-average values of one variable tend to accompany above-average values of the other.
5. Negative association means that above-average values of one variable tend to accompany below-average values of the other.
6. The strength of a relationship in a scatter plot is determined by how closely the points follow a clear line.
7. Form – The general pattern a relationship follows. Forms include linear, curved, and clustered.
8. The least-squares regression line is the straight line ŷ = a + bx that minimizes the sum of the squared vertical distances of the
observed points from the line.
9. Correlation r is the slope of the least-squares regression line when we measure both x and y in
standardized units (see the sketch at the end of this chapter's facts).
10. Influential observations are individual points that substantially change the regression line - they are often
outliers in the x direction.
11. Correlation and regression need to be interpreted with caution. They describe the data well only if the
relationship between the variables is linear.
12. Extrapolation needs to be avoided. This is the use of a regression line or curve for prediction for values of
the explanatory variable outside the domain of the data from which the line was calculated. *Remember:
correlations based on averages are usually too high when applied to individuals*
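A minimal Python sketch of facts 8 and 9, computing r and the least-squares line from hypothetical paired data (on the exam these values come from the calculator or from summary statistics):

    from math import sqrt

    # Hypothetical paired data: x = explanatory variable, y = response variable.
    x = [1, 2, 3, 4, 5, 6]
    y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    sy = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))

    # Correlation: average product of the standardized x- and y-values.
    r = sum(((xi - xbar) / sx) * ((yi - ybar) / sy) for xi, yi in zip(x, y)) / (n - 1)

    # Least-squares line y-hat = a + bx, with slope b = r * sy / sx and intercept a = ybar - b * xbar.
    b = r * sy / sx
    a = ybar - b * xbar
    print("r =", round(r, 4))
    print("y-hat =", round(a, 3), "+", round(b, 3), "x")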
Facts about Chapter 4
1. Lurking variables are hidden variables that could explain the relationship between the explanatory and
response variables; in other words, they are unidentified variables other than x and y that influence the
relationship. *Note: correlation and regression can be misleading if you do not take lurking
variables into consideration. Also, high correlation does not imply causation.*
2. The basic principles of statistical design of experiments are control (comparing two or more treatments),
randomization, and replication.
3. Simple Random Sample (SRS) of size n consists of n individuals from the population chosen in such a way
that every set of n individuals has an equal chance to be the sample actually selected.
4. A stratified random sample first divides the population into groups of similar individuals, called strata, and then a
separate SRS is taken within each stratum (see the sketch at the end of this chapter's facts).
5. The design of a study is biased if it systematically favors certain outcomes. SRSs avoid this kind of
bias through randomization.
6. Voluntary Response Samples and Convenience Samples usually contain bias due to their design and
sampling technique.
7. An observational study observes individuals and measures variables of interest but does not attempt to
influence the responses, whereas an experiment deliberately imposes some treatment on individuals to
observe their responses.
8. Sampling is studying a part to gain information about the whole, whereas a census attempts to contact
every individual in the entire population.
9. Undercoverage, nonresponse, and response bias are detrimental to the reliability of a sample. Undercoverage
and nonresponse can leave out whole groups of similar people, and response bias can draw particular answers
out of the individuals who do respond.
10. In an experiment, each unit is given a treatment and then its response is observed.
11. The placebo effect (responding to any treatment, even an inactive one) matters in experiments: giving a control
group a placebo helps separate the effect of the actual treatment from the effect of expectations.
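A minimal Python sketch of facts 3 and 4, drawing an SRS and a stratified random sample from a small hypothetical roster (the names and stratum sizes are made up for illustration):

    import random

    # Hypothetical roster of 20 students, split into two strata by grade level.
    juniors = ["J" + str(i) for i in range(1, 13)]   # 12 juniors
    seniors = ["S" + str(i) for i in range(1, 9)]    #  8 seniors
    population = juniors + seniors

    random.seed(2015)

    # Simple random sample of size 5: every set of 5 students is equally likely to be chosen.
    srs = random.sample(population, 5)

    # Stratified random sample: a separate SRS within each stratum,
    # here proportional to the stratum sizes (3 juniors and 2 seniors).
    stratified = random.sample(juniors, 3) + random.sample(seniors, 2)

    print("SRS:", srs)
    print("Stratified sample:", stratified)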
Facts about Chapter 5
1. The probability of any outcome of a random phenomenon is the proportion of times the outcome would
occur in a very long series of repetitions. That is, probability is long-term relative frequency.
2. The probability P(A) of any event A satisfies 0 ≤ P(A) ≤ 1.
3. The sample space S of a random phenomenon is the set of all possible outcomes. If S is the sample space
in a probability model, then P(S) = 1.
4. The complement of any event A is the event that A does not occur, written as A^c. The complement rule
states that P(A^c) = 1 - P(A).
5. Two events A and B are disjoint (also called mutually exclusive) if they have no outcomes in common and
so can never occur simultaneously. If A and B are disjoint, P(A or B) = P(A) + P(B). This is the addition rule
for disjoint events.
6. Two events A and B are independent if knowing that one occurs does not change the probability that the
other occurs. If A and B are independent,
P(A and B) = P(A)P(B). This is the multiplication rule for independent events.
7. When P(A) > 0, the conditional probability of B given A is
P(B|A) = P(A and B)/P(A).
8. For any two events A and B, P(A or B) = P(A) + P(B) – P(A and B)
9. Equivalently, P(A U B) = P(A) + P(B) – P(A ∩ B). The symbol ∩ denotes intersection: the intersection of a collection of
events is the event that all of the events occur. The symbol U denotes union: the union of a collection of events is the
event that at least one of the events occurs.
10. The probability that both of two events A and B happen together can be found by P(A and B) = P(A)P(B|A).
Here P(B|A) is the conditional probability that B occurs given the information that A occurs (see the sketch after this chapter's facts).
11. If events A, B, and C are disjoint in the sense that no two have any outcomes in common, then P(one or
more of A, B, C)= P(A)+ P(B) + P(C). This rule extends to any number of disjoint events.
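A minimal Python sketch checking the rules in facts 6-10 on hypothetical probabilities (the numbers are made up; only the rules come from the chapter):

    # Hypothetical probabilities for two events A and B.
    p_a, p_b, p_a_and_b = 0.45, 0.30, 0.10

    p_a_complement = 1 - p_a                      # complement rule
    p_a_or_b = p_a + p_b - p_a_and_b              # general addition rule
    p_b_given_a = p_a_and_b / p_a                 # conditional probability P(B|A)
    independent = abs(p_a_and_b - p_a * p_b) < 1e-9   # independent only if P(A and B) = P(A)P(B)

    print("P(A^c) =", p_a_complement)
    print("P(A or B) =", p_a_or_b)
    print("P(B|A) =", round(p_b_given_a, 4))
    print("A and B independent?", independent)

Since P(A)P(B) = 0.135 does not equal P(A and B) = 0.10, these two events are not independent.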
Facts about Chapter 6
1. Random Variable: A variable whose value is a numerical outcome of a random phenomenon.
2. Discrete Random Variable: X has a countable number of possible values.
3. Continuous Random Variable: X takes all values in an interval of numbers.
4. The mean of a discrete random variable is the sum of each possible value multiplied by its
probability: μ = Σ xi pi
5. The variance, the measure of the spread of the distribution, is the sum of the squared deviations from the
mean weighted by their probabilities: σ² = Σ (xi - μ)² pi
6. The standard deviation of the discrete random variable is the square root of the variance: σ = √(Σ (xi - μ)² pi)
7. The Law of Large Numbers states that as the number of observations drawn increases, the mean of the
observed values eventually approaches the mean of the population and stays close to it.
*Do not confuse this with the false "law of large numbers", the mistaken belief that short sequences of
random events should show the kind of average behavior that in fact appears only in the long run.*
8. Rules for Means (the variables do not have to be independent):
a. If X is a random variable and a and b are fixed numbers, then μ_(a+bX) = a + b·μ_X
b. If X and Y are random variables, then μ_(X+Y) = μ_X + μ_Y
9. Rules for Variances
a. If X is a random variable and a and b are fixed numbers, then σ²_(a+bX) = b²·σ²_X
b. If X and Y are independent random variables, then
σ²_(X+Y) = σ²_X + σ²_Y
σ²_(X-Y) = σ²_X + σ²_Y (variances add even when subtracting)
10. Combining Random Variables: if X and Y are independent Normal random variables and a and b are fixed
numbers, then aX + bY is also Normally distributed; in particular, any sum or difference of independent
Normal random variables has a Normal distribution.
11. Binomial random variables count the number of successes in a fixed number of trials.
12. Geometric random variables count the number of trials needed to obtain the first success, so the number of trials varies.
13. Binomial Setting:
a. Observations are either success or failure
b. Fixed number of observations “n”
c. Observations are independent
d. The probability of success, p, is the same for each observation
14. Geometric Setting:
a. Observations are either success or failure
b. Probability for success is defined as some variable p
c. Observations are independent
d. The variable of interest is the number of trials required to obtain first success
15. In binomial distributions, as the number of trials, n, gets larger, the binomial distribution gets close to a
Normal distribution, and we can use Normal probability calculations to approximate hard-to-calculate
binomial probabilities.
16. Mean
Binomial distribution: μ = np (n = number of trials, p = probability of success)
Geometric distribution: μ = 1/p (p = probability of success)
17. Standard Deviation
Binomial distribution: σ = √(npq) (n = number of trials, p = probability of success, q = 1 - p = probability of failure)
Geometric distribution: not needed for this test!
18. Geometric Distribution
P(X = n) = p(1 - p)^(n-1)
P(X > n) = (1 - p)^n
19. Binomial Distribution
P(X = k) = (nCk) p^k (1 - p)^(n-k) (see the sketch below)
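A minimal Python sketch (Python 3.8+ for math.comb) of the formulas above, using a hypothetical discrete distribution and hypothetical values of n, p, and k:

    from math import comb, sqrt

    # Mean and standard deviation of a hypothetical discrete random variable.
    values = [0, 1, 2, 3]
    probs  = [0.1, 0.3, 0.4, 0.2]
    mu = sum(x * p for x, p in zip(values, probs))               # mu = sum of x_i * p_i
    var = sum((x - mu) ** 2 * p for x, p in zip(values, probs))  # sigma^2 = sum of (x_i - mu)^2 * p_i
    print("mu =", mu, "sigma =", round(sqrt(var), 4))

    # Binomial distribution: mean np, SD sqrt(npq), P(X = k) = (nCk) p^k (1-p)^(n-k).
    n, p, k = 10, 0.3, 4
    print("binomial mean =", n * p, "SD =", round(sqrt(n * p * (1 - p)), 4))
    print("P(X = k) =", round(comb(n, k) * p**k * (1 - p)**(n - k), 4))

    # Geometric distribution: mean 1/p, P(X = n) = p(1-p)^(n-1), P(X > n) = (1-p)^n.
    trial = 3
    print("geometric mean =", round(1 / p, 4))
    print("P(first success on trial 3) =", round(p * (1 - p) ** (trial - 1), 4))
    print("P(no success in the first 3 trials) =", round((1 - p) ** trial, 4))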
Facts about Chapter 7
1. A parameter describes a population; a statistic describes a sample.
2. A statistic is unbiased if the mean of its sampling distribution is equal to the true value of the parameter.
Bias refers to the center!
3. The variability of a statistic describes its spread. Larger samples give smaller spreads.
4. The population distribution of a variable describes the values of the variable for all individuals in a
population. The sampling distribution of a statistic describes the values of the statistic in all possible
samples of the same size from the same population. Don't confuse the sampling distribution with a
distribution of sample data, which gives the values of the variable for all individuals in a particular
sample.
5. For sampling distributions of p-hat:
a. The mean is p, so p-hat is an unbiased estimator of p
b. As long as 10n ≤ N (10% condition), you can use 𝜎𝑝̂ = √(p(1 - p)/n)
c. If np ≥ 10 and n(1 - p) ≥ 10 (large counts condition), then we can assume the sampling distribution is
approx. Normal (see the sketch after this chapter's facts)
6. For sampling distributions of x-bar:
a. The mean is µ, so x-bar is an unbiased estimator of µ
b. As long as 10n ≤ N (10% condition), you can use 𝜎𝑥̅ = σ/√n
c. If you are told the population distribution is approx. Normal, then the sampling distribution of x-bar will be approx.
Normal. No matter what the population distribution looks like, as long as the sample is big enough
(n > 30), the Central Limit Theorem (CLT) says that your sampling distribution will be approx.
Normal.
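A minimal Python sketch of facts 5 and 6, checking the conditions and computing the standard deviations of p-hat and x-bar for hypothetical sampling situations:

    from math import sqrt

    # Hypothetical situation for p-hat: SRS of n = 100 from a population of N = 5000
    # with true proportion p = 0.35.
    n, p, N = 100, 0.35, 5000
    print("10% condition met?", 10 * n <= N)
    print("large counts met?", n * p >= 10 and n * (1 - p) >= 10)
    print("sigma of p-hat =", round(sqrt(p * (1 - p) / n), 4))   # sqrt(p(1-p)/n)

    # Hypothetical situation for x-bar: SRS of n = 40 from a population with
    # mean mu = 25 and standard deviation sigma = 6.
    n2, sigma = 40, 6
    print("CLT applies (n > 30)?", n2 > 30)
    print("sigma of x-bar =", round(sigma / sqrt(n2), 4))        # sigma / sqrt(n)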