
Chapter 1
 Descriptive statistics—methods of summarizing data
 Inferential statistics—generalizing from sample data
NO FORMULAS
Chapter 2
 Variable
 Data
o Univariate
o Bivariate
o Multivariate
o Categorical(qualitative)
o Numerical(quantitative)
 Placebo
 Observation
 Experiment—planned intervention
 Treatment—the intervention in an experiment
 Control group—the one that does not get the treatment
 Blind study—the individuals do not know if they are getting the treatment or a placebo
 Double blind—neither the individual nor the person measuring the results knows who is getting the treatment
 Confounding variable—one whose impact cannot be distinguished from another
 Extraneous factor—one not currently of interest but thought to impact outcome
 Census—entire population
 Bias
o Selection—part of the population systematically excluded
o Measurement/response bias—when wording tends to influence the response
o Nonresponse—intended individuals do not respond
 Sampling
o Simple random sample (SRS)—all items (and all groups of the chosen size) have an equal chance of inclusion
o Stratified sample—population predivided into strata (economic factors, salary, distance from a river, age, etc.); choose an SRS from each
o Blocking—creating groups that are similar to reduce the impact of other variables or to study them further
o Cluster—population divided into areas (zip codes; in a big city, Chinatown, Little Italy, etc.); choose an SRS of the areas
 Randomization—methods for assigning to treatment “equally”
NO FORMULAS
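The sampling methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary layout for strata and clusters, and the `seed` parameter are all hypothetical and not part of the course material.

```python
import random

def simple_random_sample(population, n, seed=None):
    """SRS: draw n items without replacement, every subset of size n equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(population), n)

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take an SRS of n_per_stratum items from each stratum (dict: name -> items)."""
    rng = random.Random(seed)
    return {name: rng.sample(list(items), n_per_stratum)
            for name, items in strata.items()}

def cluster_sample(clusters, n_clusters, seed=None):
    """Choose an SRS of whole clusters and keep every item in the chosen ones."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}
```

A stratified sample always contains items from every stratum, while a cluster sample contains items only from the clusters that happened to be drawn.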
Chapter 3




 Frequency distribution—a table that displays categories and number of responses
 Relative frequency—proportion indicating percent in a category
 Bar chart—categorical data, bars do not touch
 Pie chart—used when # of categories is small enough to display easily
o Percent (as a decimal) × 360 = degrees in wedge
 Stem and Leaf—used for small data sets—retains individual pieces of data
 Dot plot—shows patterns—loses individual values in the data
 Histogram—numeric data, bars do touch
o Class width—pieces of data in a group—if not given use
to determine
o number of classes for the data
 Skewed—left (negative)/right (positive) indicates which tail is longer
 Cumulative—add next group to the previous
 Density—relative freq./class width
 Modes
o Unimodal, Bimodal, Multimodal
 Heavy tail—much data in the tails—slope from peak shallower than that of a normal curve
 Light tail—not much data in tails (possible outliers)—rapid decrease to long tails or just more rapid decrease if no outliers (pg 77)
 Sampling variability—the extent to which the samples differ from the population
NO MAJOR FORMULAS other than box and whisker, IQR, outliers
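As a rough illustration of the Chapter 3 definitions (frequency, relative frequency, pie wedge degrees, and histogram density), here is a minimal Python sketch. The function names and the way classes are specified (`low`, `width`, `n_classes`) are assumptions made for the example.

```python
from collections import Counter

def frequency_table(responses):
    """Return {category: (count, relative frequency, pie wedge in degrees)}."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {cat: (n, n / total, n / total * 360) for cat, n in counts.items()}

def histogram_density(values, low, width, n_classes):
    """Density of each class = relative frequency / class width."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - low) // width)  # index of the class that contains v
        if 0 <= k < n_classes:
            counts[k] += 1
    return [c / len(values) / width for c in counts]
```

For example, a category holding 3 of 4 responses has relative frequency 0.75 and a 270 degree wedge.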
Chapter 4


 Mean—average or balance point on a scale—equal areas to both sides—impacted by outliers
 Trimmed mean—order the set then remove the trim percent from each end
 Median—middle piece of data—not as impacted by outliers
 Mode—most frequent data
 Range—difference between high and low items
 Deviations—difference of each piece from the mean
 Standard deviation—the typical size of the deviations (the square root of the variance)
 Variance—standard deviation squared
 Box and Whisker—shows shape, spread and center of data—used to show relative normality
o Outlier—more than 1.5 × IQR below Q1 or above Q3
o Extreme outlier—more than 3 × IQR below Q1 or above Q3
o 25% of data in each part
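The outlier rule for a box and whisker plot can be checked with a short Python sketch. It assumes the "inclusive" quartile method of the standard `statistics` module; textbooks and calculators use slightly different quartile conventions, so fences may differ a little from hand-computed ones. The function name is hypothetical.

```python
import statistics

def iqr_outliers(data):
    """Return (IQR, outliers, extreme outliers) using the 1.5 and 3 times IQR fences."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    ext_low, ext_high = q1 - 3 * iqr, q3 + 3 * iqr
    outliers = [x for x in data if x < mild_low or x > mild_high]
    extreme = [x for x in data if x < ext_low or x > ext_high]
    return iqr, outliers, extreme
```

Every extreme outlier is also an outlier, since the 3 times IQR fences lie outside the 1.5 times IQR fences.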
 Empirical Rule—key word is approximately
o 1 sd—68%
o 2 sd—95%
o 3 sd—99.7%
 Chebyshev’s Rule—key word is “at least”
o at least 1 − 1/k² of the data is within k sd of the mean
 Z-score—tells how many sd an item is from the mean
 Critical z—tells percent of data below this point
FORMULAS: page one of the formula sheet and bottom of page two for z-score
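A small Python sketch of the z-score and of Chebyshev's "at least" bound, which is 1 − 1/k² for k standard deviations. The function names are assumptions for illustration, and the sample standard deviation (n − 1 divisor) is used.

```python
import statistics

def z_score(x, mean, sd):
    """How many standard deviations x lies from the mean."""
    return (x - mean) / sd

def chebyshev_bound(k):
    """Minimum proportion of data within k sd of the mean (meaningful for k > 1)."""
    return 1 - 1 / k ** 2

def proportion_within(data, k):
    """Observed proportion of the data within k sample sd of the sample mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return sum(abs(z_score(x, mean, sd)) <= k for x in data) / len(data)
```

The Empirical Rule percentages (68%, 95%, 99.7%) apply only to roughly normal data, while the Chebyshev bound holds for any data set.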
Chapter 5



 Scatterplot—plots bivariate data
 Correlation coefficient (r)—range -1 to 1—measures how well the data fit a linear relationship
o Does not depend on which is x and which is y, or units of measure
o -1 to -.8 and .8 to 1—strong
o -.8 to -.5 and .5 to .8—moderate
o -.5 to .5—weak
o CORRELATION DOES NOT IMPLY CAUSATION
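For reference, the correlation coefficient can be computed directly from its definition. The sketch below is a plain-Python illustration; the function names are hypothetical, and the strength labels simply encode the cutoffs listed above.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for paired data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Label r using the cutoffs from these notes (.8 and .5)."""
    size = abs(r)
    if size >= 0.8:
        return "strong"
    if size >= 0.5:
        return "moderate"
    return "weak"
```

Swapping x and y, or rescaling either variable by a positive factor, leaves r unchanged, which matches the note that r does not depend on which variable is x or on the units.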

 Spearman’s rank correlation
o Rank the x’s, rank the y’s, keep pairs together and use the ranks (plot them, compute r on them) to determine correlation
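The ranking procedure can be sketched in Python as follows. Giving tied values the average of their ranks is a common convention assumed here, not something stated in the notes; the function names are hypothetical.

```python
import math

def ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Correlation coefficient computed on the ranks, pairs kept together."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)
```

Because only the ranks matter, a curved but steadily increasing relationship such as y = x² on positive x still gives a rank correlation of 1.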
 Coefficient of determination (r²)—the proportion of variation attributed to a linear relationship
 Linear regression—y = a + bx
o Turn on diagnostics (2nd 0, x⁻¹, diagnostics on, enter)
o Also called the least squares line
o Interpolation—proper use of the line
o Extrapolation—may work but not recommended without stating it is extrapolation
o May be used to predict y from x but not x from y
 Residual—actual value minus predicted value (hint AP class—Actual – Predicted)
 Residual plot—plots (x, residual value)
o Should not have any particular pattern if the linear regression is a good fit
 Standard error or deviation of least squares line
o se = sqrt(SSResid/(n−2))
o notice n−2 not n−1 (hint to remember: there are two variables in a reg line)
 Transforming non-linear data—use the chart below to help determine how to transform the data based on which quadrant the shape is following
o + indicates raise the power on that variable
o − indicates to lower the power on that variable, including taking the sqrt or the log of the variable
o use when transformed
[chart not reproduced: four quadrants marked +, −, +, −]
 r² can be found by taking 1 − SSResid/SSTo when given
 ^ symbol indicates predicted value
FORMULAS for a and b for the line are on page one of the formula sheet; usually use the calculator
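To tie the regression items together, here is a minimal Python sketch that fits the least squares line y = a + bx and then computes the residuals, r² = 1 − SSResid/SSTo, and the standard deviation about the line with the n − 2 divisor. The function names are hypothetical; in class the calculator does this work.

```python
import math

def least_squares(xs, ys):
    """Return intercept a and slope b of the least squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def fit_summary(xs, ys):
    """Return (a, b, residuals, r squared, standard deviation about the line)."""
    a, b = least_squares(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # actual - predicted
    ss_resid = sum(e ** 2 for e in residuals)
    my = sum(ys) / len(ys)
    ss_total = sum((y - my) ** 2 for y in ys)
    r_squared = 1 - ss_resid / ss_total
    s_e = math.sqrt(ss_resid / (len(xs) - 2))  # n - 2, not n - 1
    return a, b, residuals, r_squared, s_e
```

The residuals of a least squares line always sum to zero (up to rounding), so a residual plot is judged by its pattern, not by its average.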
Chapter 6











 Sample space—all the possible outcomes of a chance experiment
 Event—any collection of outcomes from the sample space
 Simple event—has one outcome
 Disjoint events or mutually exclusive events—have no outcomes in common
 Probability—successful outcomes/all outcomes
 OR—P(E∪F)=P(E)+P(F)−P(E∩F) indicates addition of probabilities minus the overlap of the events
 AND—P(E∩F)=P(E)P(F) indicates multiplication of probabilities (when the events are independent)
 Conditional probability—P(E|F), probability of E given F, reduces the number of total items by restricting the numerator and denominator to the given condition (i.e. look at one column or row in a chart and use the total of the row or column to divide by)
o P(E|F)=P(E∩F)/P(F)
o can be cross multiplied if convenient to solve
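The addition rule and the "restrict to one row or column" view of conditional probability can be illustrated with a two-way table of counts. The table layout (a dictionary keyed by (row, column) pairs) and the function names below are assumptions made for this sketch; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def probability(table, event):
    """Successful outcomes / all outcomes, from a table of counts."""
    total = sum(table.values())
    favorable = sum(n for outcome, n in table.items() if event(outcome))
    return Fraction(favorable, total)

def conditional(table, event, given):
    """P(event | given): restrict the table to the given condition first."""
    restricted = {o: n for o, n in table.items() if given(o)}
    return probability(restricted, event)

# Hypothetical counts: rows are groups A/B, columns are answers yes/no.
table = {("A", "yes"): 30, ("A", "no"): 20, ("B", "yes"): 10, ("B", "no"): 40}
in_a = lambda outcome: outcome[0] == "A"
said_yes = lambda outcome: outcome[1] == "yes"
both = lambda outcome: in_a(outcome) and said_yes(outcome)

p_or = probability(table, in_a) + probability(table, said_yes) - probability(table, both)
p_a_given_yes = conditional(table, in_a, said_yes)
```

Here `p_a_given_yes` equals P(A and yes)/P(yes), the same value obtained by dividing the 30 in the "yes" column by that column's total of 40.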


Chapter 7


 Independent events—the outcome of one situation does not depend on another
o P(E|F)=P(E)
o Therefore independence implies P(E∩F)=P(E)P(F)
o Without replacement changes the probability of the event with each selection
 Random variable
 Probability distribution
o for a discrete variable—histogram of the probabilities of each possible outcome—sum of all the probabilities must equal 1
o for a continuous variable it becomes a smooth curve, the density function
 relates back to z-scores and normal curves
 measures of central tendency for probability distribution
o μx describes where the probability distribution is centered, i.e. its mean
o σx is the standard deviation of the distribution
o μx = Σ x·p(x), sometimes called the expected value
o σx² = Σ (x − μx)²·p(x) is the variance, and SD is the square root of variance
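For a discrete random variable, the mean (expected value) and standard deviation follow from the usual definitions μx = Σ x·p(x) and σx² = Σ (x − μx)²·p(x). The sketch below assumes the distribution is given as a dictionary mapping each value to its probability; that layout and the function name are hypothetical.

```python
import math

def mean_and_sd(distribution):
    """distribution: dict mapping each value x to p(x); probabilities must sum to 1."""
    total = sum(distribution.values())
    if not math.isclose(total, 1.0):
        raise ValueError("probabilities must sum to 1")
    mu = sum(x * p for x, p in distribution.items())
    variance = sum((x - mu) ** 2 * p for x, p in distribution.items())
    return mu, math.sqrt(variance)
```

For a fair six-sided die this gives an expected value of 3.5, a value the die can never actually show, which is why the expected value is read as a long-run average.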
 Measures of central tendency for a linear regression
Chapter 8






 Binomial Distribution
o Has exactly two outcomes
o P(s) = nCs π^s (1−π)^(n−s)
o n = number of trials, s = number of successes desired
o μx = nπ
o σx = sqrt(nπ(1−π))
 Geometric random variable
o Sequence of trials ends when there is a success
o p(x) = (1−π)^(x−1) π where x is the number of trials until a success
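The binomial and geometric formulas translate directly into Python. In this sketch `pi` stands for the success probability π (not the constant 3.14...), and the function names are hypothetical.

```python
import math

def binomial_pmf(s, n, pi):
    """P(exactly s successes in n trials) = nCs * pi^s * (1 - pi)^(n - s)."""
    return math.comb(n, s) * pi ** s * (1 - pi) ** (n - s)

def binomial_mean_sd(n, pi):
    """Mean n*pi and standard deviation sqrt(n*pi*(1 - pi))."""
    return n * pi, math.sqrt(n * pi * (1 - pi))

def geometric_pmf(x, pi):
    """P(first success occurs on trial x) = (1 - pi)^(x - 1) * pi."""
    return (1 - pi) ** (x - 1) * pi
```

For example, with a fair coin the chance of exactly 2 heads in 4 flips is 6 × 0.5⁴ = 0.375, and the chance that the first head comes on the third flip is 0.5³ = 0.125.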
 Normal Probability Plot—plots (normal (z) score, observation)
o a normal probability plot that is linear suggests normality
o if r < crit r from table on 368 then doubt is cast on linearity, and so on normality (probably not on AP test)
 statistic—a quantity computed from a sample
 sampling variability—the differences that occur in statistics based on the sample taken
 sampling distribution—the distribution of a statistic that occurs when many samples are taken
o for a sample mean
 must be SRS, pop is normal or dist is approx. normal based on large sample (n>30)
 σx̄ = σ/sqrt(n)
o for a proportion
 must be SRS; dist is approx. normal when np and n(1−p) > 5 (liberal test)
 Use p = .5 when not given
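A short simulation can make the idea of a sampling distribution concrete. The sketch below is illustrative only: the function names and the `seed` and `cutoff` parameters are assumptions, and samples are drawn with replacement as a stand-in for sampling from a very large population.

```python
import math
import random
import statistics

def sd_of_sample_mean(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def proportion_conditions_ok(n, p, cutoff=5):
    """Check the np and n(1 - p) > 5 condition from the notes."""
    return n * p > cutoff and n * (1 - p) > cutoff

def simulate_sample_means(population, n, reps, seed=0):
    """Means of many random samples of size n (drawn with replacement)."""
    rng = random.Random(seed)
    return [statistics.mean(rng.choices(population, k=n)) for _ in range(reps)]
```

With enough repetitions, the standard deviation of the simulated means should land close to `sd_of_sample_mean(sigma, n)`, where sigma is the population standard deviation.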
Chapter 9







Chapter 10

Chapter 11


 Confidence interval—an interval of plausible values for the population parameter; the confidence level is the percent of such intervals that capture it, i.e. for a 95% CI, about 95 of 100 CIs from the same sample size will contain the true parameter (mean or proportion)
o Bound—the amount added and subtracted from the statistic
o CI = statistic ± (crit z or t)(SD)
o Use t when SD for population is not known
o Standard error—is the estimated standard deviation
o Knowing the bound you can find the necessary sample size for a specific CI
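As an illustration of the interval and sample-size ideas, the following Python sketch handles the simplest case: a mean with known population SD, so the critical value is a z. The function names are hypothetical, and the t case (unknown population SD) is not covered because the standard library has no t distribution.

```python
import math
from statistics import NormalDist

def critical_z(confidence):
    """z with the given central area, e.g. about 1.96 for 95%."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def z_interval(xbar, sigma, n, confidence=0.95):
    """CI = statistic +/- (critical z)(SD of the statistic)."""
    bound = critical_z(confidence) * sigma / math.sqrt(n)
    return xbar - bound, xbar + bound

def sample_size_for_bound(sigma, bound, confidence=0.95):
    """Smallest n whose bound is no larger than the one requested."""
    return math.ceil((critical_z(confidence) * sigma / bound) ** 2)
```

The sample size is rounded up because rounding down would give a bound slightly larger than the one requested.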
 Null Hypothesis—H0—the claim about a population that is initially assumed true
o Always = to the value
 Alternate Hypothesis—Ha—the competing claim
o Can be <, > or ≠
 Hypotheses are always about a population parameter, μ or π
 Types of error
o Type I—rejecting H0 when it was true; its probability is denoted by α (level of significance)
o Type II—failing to reject a false H0; its probability is denoted by β
 to find β:
 Determine the critical value of the statistic by reverse calculating from the critical z score
 Use that critical value against the assumed (alternative) value to calculate the new z, look up that z, and compute the probability based on the type of test; this is β
o The smaller the α, the larger the β
 Observed significance level (p-value)—derived from the calculated test statistic (z or t)
 Rejecting or Failing to reject
o P-value < α reject
o P-value > α fail to reject
o Use this terminology because statistics is about the probability, not the certainty, that something happens
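The reject / fail to reject decision can be sketched for a one-sample z test of a mean. This assumes the population SD is known so that z (not t) applies; the function names and the `alternative` labels are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_p_value(xbar, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p-value) for H0: mu = mu0 with known population SD."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    below = NormalDist().cdf(z)  # area below z
    if alternative == "less":
        p = below
    elif alternative == "greater":
        p = 1 - below  # use 1 - for the upper tail
    else:
        p = 2 * min(below, 1 - below)
    return z, p

def decide(p_value, alpha=0.05):
    """Compare the p-value to alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"
```

For instance, a sample mean of 52 from n = 100 with σ = 10 tested against μ0 = 50 gives z = 2, and the upper-tail p-value of about 0.023 leads to rejecting H0 at α = 0.05.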
 Z gives percentage below the given value; use 1 − for the upper tail
 T gives percentage above the given value but only has values from 0 to 3, so a negative has the same value below as the positive has above
o df = n − 1
 With both be careful of the question asked
 Power—probability of rejecting a false null
o Or Power = 1 − prob of not rejecting the false null
o So Power = 1 − β
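The steps for finding β, and from it the power, can be sketched for an upper-tailed z test (Ha: μ > μ0) with known population SD. The function name and parameters are assumptions for the example; `mu_alt` is the assumed true mean under the alternative.

```python
import math
from statistics import NormalDist

def beta_and_power(mu0, mu_alt, sigma, n, alpha=0.05):
    """Return (beta, power) for an upper-tailed z test of H0: mu = mu0."""
    se = sigma / math.sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha)   # critical z for the chosen alpha
    x_crit = mu0 + z_crit * se                 # reverse calculate the critical mean
    new_z = (x_crit - mu_alt) / se             # critical mean against the assumed value
    beta = NormalDist().cdf(new_z)             # chance of failing to reject H0
    return beta, 1 - beta
```

Lowering α pushes the critical mean further out, which raises β and lowers the power, matching the note that a smaller α means a larger β.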
 Two sample means or proportions
o Generally state μd or πd = sample 1 – sample 2
o H0: μd or πd = 0 or the stated value
o Ha: >, <, ≠ 0 (or the stated value)
o Determine p-value and compare to α


Chapter 12



 Other formulas for two sample tests are on page 3 of the formula packet
 Paired samples—one before and one after a treatment; df = n − 1
 Chi squared testing
o One column or row array—Goodness of fit test
o Two way table
 Homogeneity—two separate samples are compared
 Independence—one sample taken then split in two ways,