Midterm Review
Review of Chapter One
This chapter presented some important basics. There were fundamental definitions, such as sample and population, along with some very
basic principles.
Data: observations (such as measurements, genders, survey responses) that have been collected.
Statistics: the science of collecting, analyzing, and drawing conclusions from sample data.
A population is the complete collection of all elements (scores, people, measurements, and so on) to be studied.
A census is the collection of data from every member of the population.
A sample is a subcollection of members selected from a population.
A parameter is a numerical measurement describing some characteristic of a population.
A statistic is a numerical measurement describing some characteristic of a sample.
Two Types of Data:
• Quantitative data: consists of numbers representing counts or
measurements. e.g. the heights of students.
• Qualitative (or categorical) data: can be separated into different categories that are distinguished by some nonnumerical characteristic. e.g. the genders (male/female) of students.
Two Types of Quantitative Data:
• Discrete Data: results when the number of possible values is either a finite number or a countable number.
• Continuous data result from infinitely many possible values that
correspond to some continuous scale that covers a range of values
without gaps, interruptions, or jumps.
Four Levels of Measurement of Data (see Table 1-1 on Page 10 for more details):
• Nominal: Categories only. Data cannot be arranged in an ordering scheme.
• Ordinal: Categories are ordered, but differences can’t be found
or are meaningless.
• Interval: Differences are meaningful, but there is no natural starting point and ratios are meaningless.
• Ratio: There is a natural zero starting point and ratios are meaningful.
Review of Chapter Two
In this chapter we considered methods for describing, exploring, and
comparing data sets.
We have the following important characteristics of data:
• Center: A representative or average value that indicates where
the middle of the data set is located.
• Variation: A measure of the amount that the data values vary
among themselves.
• Distribution: The nature or shape of the distribution of the data.
• Outliers: Sample values that lie very far away from the vast majority of the other sample values.
• Time: Changing characteristics of the data over time.
Frequency distribution: lists data values (either individually or by
groups of intervals), along with their corresponding frequencies (or
counts).
Relative frequency distribution: includes the same class limits as
a frequency distribution, but relative frequencies are used instead of
actual frequencies.
Cumulative frequency distribution: the cumulative frequency for a
class is the sum of the frequencies for that class and all previous classes.
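As a small sketch (the exam scores below are made-up data), all three kinds of distribution can be computed directly from raw values:

```python
from collections import Counter

# Hypothetical sample of exam scores, grouped into classes 60-69, 70-79, ...
scores = [62, 67, 71, 74, 75, 78, 81, 83, 85, 88, 91, 95]

# Frequency distribution: lower class limit -> frequency (count).
freq = Counter((s // 10) * 10 for s in scores)

# Relative frequency distribution: same classes, each count divided by n.
n = len(scores)
rel_freq = {c: f / n for c, f in sorted(freq.items())}

# Cumulative frequency: frequency of a class plus all previous classes.
cum, cum_freq = 0, {}
for c in sorted(freq):
    cum += freq[c]
    cum_freq[c] = cum
```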
Histogram: a bar graph in which the horizontal scale represents classes
of data values and the vertical scale represents frequencies. The heights
of the bars correspond to the frequency values, and the bars are drawn
adjacent to each other (without gaps).
Dotplot: a graph in which each data value is plotted as a point (or
dot) along a scale of values. Dots representing equal values are stacked.
Stem-and-Leaf Plot: represents data by separating each value into
two parts: the stem (such as the leftmost digit) and the leaf (such as
the rightmost digit).
Pie Chart: a graph of a frequency distribution for a categorical data
set. Each category is represented by a slice of the pie and the area
of the slice is proportional to the corresponding frequency or relative
frequency.
Measures of Center
Sample mean: x̄ = (Σx)/n, which is used to measure the center of a sample with sample size n. Here Σx represents the sum of all data values.

Population mean: µ = (Σx)/N, which is the average value of the entire population (with size N).
Median: the middle value in the ordered list of sample data. If n
is even, the median is the average of the two middle values.
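A quick sketch of both measures on made-up data (the standard library's `statistics.median` handles the even-n averaging automatically):

```python
import statistics

# Hypothetical sample data.
data = [3, 7, 8, 5, 12, 14, 21, 13, 18]

x_bar = sum(data) / len(data)     # sample mean: (sum of all values) / n
med = statistics.median(data)     # middle value of the ordered list
```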
Mean from a frequency distribution: x̄ = Σ(f · x)/Σf, where f denotes the class frequency and x is the class midpoint.
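For example (the midpoints and frequencies here are hypothetical), the grouped mean agrees with the ordinary mean of the data expanded class by class:

```python
# Hypothetical frequency distribution: class midpoints and frequencies.
midpoints   = [54.5, 64.5, 74.5, 84.5, 94.5]
frequencies = [2, 5, 12, 8, 3]

# x_bar = sum(f * x) / sum(f)
x_bar = sum(f * x for f, x in zip(frequencies, midpoints)) / sum(frequencies)
```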
Skewness: a distribution of data is skewed if it is not symmetric
and extends more to one side than the other. A distribution of data is
symmetric if the left half of its histogram is roughly a mirror image
of its right half.
Measures of Variation
Range: the difference between the highest value and the lowest value
of a set of data.
Sample variance: s² = Σ(x − x̄)²/(n − 1), which is used to measure the variation of sample data.

Sample standard deviation: s = √(s²) = √( Σ(x − x̄)²/(n − 1) ).
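Both formulas can be checked on a small made-up sample; note the n − 1 divisor, which the standard library's `statistics.variance` also uses:

```python
import math

# Hypothetical sample.
x = [5, 8, 9, 12, 16]
n = len(x)
x_bar = sum(x) / n

# s^2 = sum((x - x_bar)^2) / (n - 1)
s2 = sum((v - x_bar) ** 2 for v in x) / (n - 1)
s = math.sqrt(s2)
```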
Population variance: denoted by σ 2 , population standard deviation: denoted by σ, which are used to measure variation of the entire
population.
Formula to calculate the standard deviation from a frequency distribution:

s = √( [n · Σ(f · x²) − (Σ(f · x))²] / [n(n − 1)] ),

where n is the sample size (or the total of frequencies), x is the class midpoint, and f is the class frequency.
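A sketch of this shortcut formula on a hypothetical table, cross-checked against the ordinary sample standard deviation of the expanded data:

```python
import math

# Hypothetical frequency distribution: class midpoints x and frequencies f.
x = [54.5, 64.5, 74.5, 84.5, 94.5]
f = [2, 5, 12, 8, 3]

n = sum(f)                                           # total of frequencies
sum_fx  = sum(fi * xi for fi, xi in zip(f, x))       # sum of f * x
sum_fx2 = sum(fi * xi ** 2 for fi, xi in zip(f, x))  # sum of f * x^2

s = math.sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))

# The shortcut treats each midpoint as if it were repeated f times.
expanded = [xi for fi, xi in zip(f, x) for _ in range(fi)]
```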
Range Rule of Thumb:
• minimum “usual” value = (mean) − 2 × (standard deviation).
• maximum “usual” value = (mean) + 2 × (standard deviation).
Empirical Rule for Data with a Bell-shaped Distribution:
• About 68% of all values fall within 1 standard deviation of the
mean.
• About 95% of all values fall within 2 standard deviations of the
mean.
• About 99.7% of all values fall within 3 standard deviations of the
mean.
Chebyshev’s Theorem:
The proportion of any set of data lying within K standard deviations of
the mean is always at least 1 − 1/K 2 , where K is any positive number
greater than 1. For K = 2 and K = 3, we get the following statements:
• At least 3/4 (or 75%) of all values lie within 2 standard deviations
of the mean.
• At least 8/9 (or 89%) of all values lie within 3 standard deviations
of the mean.
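Both statements can be checked empirically; here is a small sketch (using sample statistics on made-up data) of the proportion lying within K standard deviations of the mean:

```python
import statistics

def within_k(data, k):
    """Fraction of the data lying within k standard deviations of the mean."""
    mu = statistics.mean(data)
    s = statistics.stdev(data)
    return sum(mu - k * s <= v <= mu + k * s for v in data) / len(data)

# Hypothetical data set; Chebyshev's bounds must hold for it (or any other).
data = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
```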
Measures of Relative Standing:
z score: z = (x − x̄)/s (sample), and z = (x − µ)/σ (population).
It is positive (negative) if the data value lies above (below) the mean.
If the z score is given, we can find the corresponding x value: x = x̄ + z · s.
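Both directions of the conversion are one-liners; a sketch:

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def from_z(z, mean, sd):
    """Recover the data value from a z score: x = mean + z * sd."""
    return mean + z * sd
```

For example, a score of 85 in a sample with x̄ = 70 and s = 10 has z = 1.5.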
Quartiles and Percentiles:
• Q1 (First quartile): separates the bottom 25% of the sorted values from the top 75%.
• Q2 (Second quartile): separates the bottom 50% of the sorted
values from the top 50%. The same as the median.
• Q3 (Third quartile): separates the bottom 75% of the sorted
values from the top 25%.
• percentile of value x = (number of values less than x) / (total number of values) · 100.
• See Figure 2-15 for converting from the kth percentile to the corresponding data value.
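The percentile formula above translates directly into code (the sample here is hypothetical):

```python
def percentile_of(value, data):
    """Percentile of `value`: 100 * (number of values less than it) / n."""
    return 100 * sum(v < value for v in data) / len(data)

# Hypothetical sorted sample.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```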
Review of Chapter Three
In this chapter, we discussed the basic concepts related to probability
and developed some basic skills to calculate probabilities in a variety
of important circumstances.
Event: any collection of results or outcomes of a procedure.
Simple event: an outcome or an event that cannot be further broken
down into simpler components.
Sample space: the collection of all possible simple events.
Probability: a number between 0 and 1 that reflects the likelihood
of occurrence of some event. P denotes a probability and P (A) denotes the probability of event A occurring.
Some Basic Properties of Probability:
• The probability of an impossible event is 0.
• The probability of an event that is certain to occur is 1.
• 0 ≤ P (A) ≤ 1 for any event A.
Some Important Formulas:
• P (A or B) = P (A) + P (B) − P (A and B).
• If A and B are mutually exclusive, then P (A or B) = P (A)+P (B).
• P (Ā) = 1 − P (A), where Ā is the complement of event A.
• P (at least one) = 1 − P (none).
• If all the n possible simple events of a procedure are equally likely
to occur, then
P (A) = (number of simple events in A) / n.
Conditional Probability: Let A and B be two events of a procedure.
The conditional probability of B given that A has occurred is
P (B|A) = P (A and B) / P (A).
Multiplication Rule: P (A and B) = P (A) · P (B|A).
Independence:
• Two events A and B are independent if P (A and B) = P (A)P (B).
If P (A) ≠ 0, then this definition is equivalent to P (B|A) = P (B),
which means that the probability of event B does not depend on
whether A has occurred or not.
• If A1 , A2 , · · · , Ak are independent, then
P (A1 and A2 and · · · and Ak ) = P (A1 )P (A2 ) · · · P (Ak ).
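These formulas can be checked by brute-force enumeration of a small, equally likely sample space; two fair dice is an illustrative choice, not an example from the text:

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
space = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(A) = (number of simple events in A) / n for equally likely outcomes."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

A = lambda o: o[0] + o[1] == 7     # the sum is 7
B = lambda o: o[0] == 3            # the first die shows 3

# Conditional probability: P(B|A) = P(A and B) / P(A).
p_b_given_a = prob(lambda o: A(o) and B(o)) / prob(A)
```

Here P(B|A) = P(B) = 1/6, so A and B happen to be independent, and the multiplication rule reduces to P(A and B) = P(A)P(B).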
Review of Chapter Four
In this chapter we introduced some important concepts like random
variable and probability distribution. Two important discrete probability distributions, binomial distribution and Poisson distribution, were
discussed.
• A random variable has values that are determined by chance.
• A probability distribution consists of all values of a random
variable, along with their corresponding probabilities. A probability distribution must satisfy two requirements:
Σ P (x) = 1, and, for each value of x, 0 ≤ P (x) ≤ 1.
• Important characteristics of a probability distribution can be
explored by constructing a probability histogram and by computing its mean and standard deviation using these formulas:
µ = Σ[x · P (x)],
σ = √( Σ[x² · P (x)] − µ² ).
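A sketch of both formulas on a hypothetical distribution:

```python
import math

# Hypothetical probability distribution of a discrete random variable x.
dist = {0: 0.2, 1: 0.5, 2: 0.3}

# Check the two requirements of a probability distribution.
assert abs(sum(dist.values()) - 1) < 1e-12
assert all(0 <= p <= 1 for p in dist.values())

mu = sum(x * p for x, p in dist.items())                 # mean
sigma = math.sqrt(sum(x * x * p for x, p in dist.items()) - mu ** 2)
```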
• In a binomial distribution, there are two categories of outcomes
and a fixed number of independent trials with a constant probability. The probability of x successes among n trials can be found by
using the binomial probability formula, or Table A-1, or software.
Binomial probability formula:

P (x) = n! / [(n − x)! · x!] · p^x · q^(n−x),   x = 0, 1, 2, . . . , n,

where n = number of trials, x = number of successes among n trials, p = probability of success in any one trial, and q = 1 − p.
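The formula translates directly into code; the standard library's `math.comb` supplies the n!/[(n − x)! x!] factor:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(x) = C(n, x) * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)
```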
• In a binomial distribution, the mean and standard deviation can be easily found by calculating the values of µ = np and σ = √(npq).
• A Poisson probability distribution applies to occurrences of some event over a specific interval, and its probabilities can be calculated with

P (x) = µ^x · e^(−µ) / x!,   x = 0, 1, 2, . . . ,

where µ is the mean of x. The standard deviation is σ = √µ.
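A direct sketch of the Poisson formula, checked by summing the probabilities:

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(x) = mu**x * e**(-mu) / x!"""
    return mu ** x * exp(-mu) / factorial(x)
```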
Review of Chapter Five
In this chapter, we introduced continuous probability distributions and
focused on the most important category: normal distributions.
Continuous probability distribution: described by a smooth density curve. Areas under this curve are interpreted as probabilities.
A continuous random variable has a uniform distribution if its values
spread evenly over the range of possibilities. The graph of a uniform
distribution results in a rectangular shape. Its density function is given
by
f(x) = 1/(b − a) if a ≤ x ≤ b, and f(x) = 0 otherwise.
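Because the density is a constant 1/(b − a) on [a, b], probabilities are just rectangle areas; a sketch:

```python
def uniform_prob(c, d, a, b):
    """P(c <= x <= d) for a uniform distribution on [a, b]: width * height."""
    lo, hi = max(c, a), min(d, b)       # clip [c, d] to the support [a, b]
    return max(hi - lo, 0) / (b - a)
```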
Normal distribution: a continuous probability distribution that is
specified by a particular type of bell-shaped and symmetric density
curve. Its density function is given by
f(x) = (1 / (√(2π) · σ)) · e^(−(1/2) · ((x − µ)/σ)²),
where µ is the mean and σ is the standard deviation.
Standard normal distribution: the normal distribution with µ = 0
and σ = 1. One can use Table A-2 to find probabilities for given z
scores.
Some useful formulas:
• P (a < z < b) = P (z < b) − P (z < a).
• P (z > a) = 1 − P (z < a).
• P (z > a) = P (z < −a) (symmetry).
Finding a z score from a known area (or probability): Using
the cumulative area from the left, locate the closest probability in the
body of Table A-2 and identify the corresponding z score.
Standardized procedure:
If x has a normal distribution with mean µ and standard deviation
σ, then z = (x − µ)/σ has a standard normal distribution. We have
P (x < a) = P (z < (a − µ)/σ).
Given the value range of x, find probability related to x:
(1) State the problem.
(2) Standardize z = (x − µ)/σ and find the value range for z.
(3) Use Table A-2 to find the desired probability.
Given the probability, find the value of x:
(1) State the problem.
(2) Use Table A-2 to find the z score from given probability.
(3) Convert back to x value by x = µ + (z · σ).
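The same two-step procedure can be sketched in code, with `math.erf` standing in for Table A-2 (the cumulative area from the left):

```python
from math import erf, sqrt

def normal_cdf(a, mu, sigma):
    """P(x < a): standardize to z = (a - mu) / sigma, then take the
    standard normal cumulative area from the left."""
    z = (a - mu) / sigma
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_between(a, b, mu, sigma):
    """P(a < x < b) = P(x < b) - P(x < a)."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
```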