STAT 2010, Business Stat
2006
Jaimie Kwon
STAT 2010, Elements of Statistics for
Business and Economics
Lecture Notes
Prof. Jaimie Kwon
Statistics Dept
Cal State East Bay
Disclaimer
These lecture notes are for internal use of Prof. Jaimie Kwon, but are
provided as a potentially helpful material for students taking the course.
A few things to note:
The lecture in class always supersedes what’s in the notes
These notes are provided “as-is” i.e. the accuracy and relevance of
the contents are not guaranteed
The contents are fluid due to constant update during the lecture
The contents may contain announcements etc. that are not relevant
to the current quarter
Students are free to report typos or make suggestions on the notes
via emailing or in person to improve the material, but they need to
understand the above nature of the notes
Do not distribute these notes outside the class
Best Practice for note-taking in class
I do not recommend relying on these lecture notes in place of the
notes you write down yourself
Bring a notepad and write down the material that I go over in class,
using these lecture notes as an independent reference; you won’t
miss a thing by not having a printout of these notes in (or
outside) the class
If you still want to print these notes, it’d be better to print them 4
pages on a single page (using “pages per sheet” feature in MS
Word), preferably double sided (to save trees)
Some canonical examples:
Benefit of low-fat diet (Jan 2006)
# of supporters of Bush/Gore in Florida exit poll (Florida, 2000)
Is driving an SUV more dangerous than driving a passenger car?
To cash in now and retire or keep working, for GM workers (Mar
2006)?
When do I have to leave home to be at school on time (this
morning)?
Has consumer confidence in the US increased or decreased from
last to this month (March 2006)?
Where do I put this $1,000? Google stock? Coca-Cola stock? A
mutual fund? Certificate of deposit (CD)? What are expected
returns and risks? (pay day)
The number of mothers opting for cesarean birth is on the rise.
On the other hand, cesarean babies have higher risk of breathing
problem (March 30, 2006)
Arnold is back (almost). The Californian governor’s approval
rating is 47% now, a 7% increase in a single month. (March 30,
2006)
What’s the daily number of reports related to statistics? Interval
variable? Categorical?
What’s common in above examples: decision under uncertainty
1 What is statistics?
Statistics: a way to extract information from data
Descriptive statistics: methods of organizing, summarizing, and
presenting data in such a way that useful information is produced
Graphical methods
Numerical summary of data
Inferential statistics: a body of methods used to draw conclusions or
inferences about characteristics of population based on sample data
Key paradigm of statistics
Population: the group of all items of interest
Parameter: a descriptive measure of a population
Sample: a set of data drawn from the population
Statistic: a descriptive measure of a sample
Statistical inference: the process of making an estimate, prediction,
or decision about a population based on sample data
Exercises 1.3, 4
2 Graphical and tabular descriptive statistics
2.1 Types of data
Variable: some characteristic of a population or sample
The values of the variable are the possible observations of the
variable. (Integers b/w 0-100, real numbers, M/F, A-F)
Data are the observed values of a variable (plural for datum)
Types of data/variable
Interval data/variable are real numbers, a.k.a. quantitative or
numerical
Nominal data/variable have categorical values without orders,
a.k.a. qualitative or categorical
Ordinal data/variable are similar to nominal but their values can
be ordered
(“Categorical variable” is the generic name for nominal and
ordinal variables)
Hierarchy? (Course grade: score to letter grade to pass/fail)
Exercises 2.1-2.3
2.2 Techniques for nominal data
Frequency distribution: a table of the categories and their counts
Relative frequency distribution : shows the proportion (not count) of
each category
A bar chart is used to display frequencies
A pie chart shows relative frequencies
Exercises 2.11
2.3 Graphical techniques for interval data
How to visualize the data? Histogram
E.g. Items with defects (Xr02-35)
x=c(4, 9, 13, 7, 5, 8, 12, 15, 5, 7, 3, 8, 15, 17, 19, 6, 4, 10, 8, 22,
16, 9, 5, 3, 9, 19, 14, 13, 18, 7); hist(x)
Example (recycle below): mean time spent on the internet; 0, 7, 12,
5, 33, 14, 8, 0, 9, 22 (hrs /month)
x=c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22); hist(x, nclass=4)
We’ve all seen histograms. Here’s how you draw one:
Build class intervals, equally wide, non-overlapping intervals that
cover the complete range of observations.
Create a frequency distribution, by counting the # of observations
that fall into each class interval
Draw the histogram, rectangles whose bases are class intervals
and heights are frequencies
How many class intervals?
More class intervals for {more, less} data points.
See Table 2.6 for rules of thumb;
Sturges’ formula: “1 + 3.3 log10(n)” (base-10 logarithm)
My favorite: eyeballing
How wide is each interval? Round (range/# of classes) to
something convenient.
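The steps above can be sketched in R on the internet-use data. This is only an illustration: the class width of 10 is one convenient rounding chosen here, not a rule from the notes.

```r
# Internet-use data from the notes (hours per month)
x <- c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22)
n <- length(x)

# Sturges' formula with a base-10 logarithm
k_raw <- 1 + 3.3 * log10(n)   # 4.3 for n = 10
k <- round(k_raw)             # 4 classes

# Raw class width = range / number of classes
width_raw <- (max(x) - min(x)) / k   # 33 / 4 = 8.25
width <- 10                          # rounded to something convenient (an assumption)

# Equally wide, non-overlapping intervals covering the whole range
breaks <- seq(0, by = width, length.out = k + 1)   # 0, 10, 20, 30, 40

# Frequency distribution: count observations in [0,10), [10,20), ...
counts <- table(cut(x, breaks = breaks, right = FALSE))
print(counts)

# Histogram drawn on the same class intervals
hist(x, breaks = breaks, right = FALSE)
```

The frequency table shows most students in the lowest class with a long tail to the right, which is what the histogram then displays.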
Reading histograms…
Symmetry and Skewness (positively/negatively)
How many peaks? unimodal, bimodal
Bell shape (symmetric, unimodal; important)
Which variables are likely to have
A positively skewed distribution?
A negatively skewed distribution?
Symmetric distribution?
Symmetric, bell shaped distribution?
Bimodal distribution?
Stem-and-leaf display
Ogive
Ex. 2.33, 35(a)(c)
2.4 Describing the relationship between two variables
Bivariate methods are used to study the relationship between two
variables (Cf. Univariate methods)
Dependent variable (Y) vs. independent variable (X)
Four possible combinations: {categorical, interval} {X, Y} variable
Two categorical variables:
E.g. Gender and choice of doctorate, 1998 (Ex. 2.56, Xr02-56)
Example: Blue collar/white collar/professional vs NYTimes/USA
today/SF Chronicles; ad targeting
A contingency table lists the frequency of each combination of
the values of two categorical variables
To study the differences in the row variable among the column
variable; compute the column totals and divide each frequency
by it to obtain column relative frequencies
Two interval variables:
E.g. Size vs. price of home (100 ft² vs. K dollars): which are the
dependent and independent variables? Use of X and Y. (e.g. Xm0209)
Draw scatter diagram using X and Y
Interpreting scatter diagrams:
Linear relationship: most of the points fall close to a straight line
through points (cf. least squares method)
Two main characteristics of linear relationship:
Strength (strong, medium, weak, none)
Direction (positively linear, negatively linear)
Nonlinear relationship
Ex. 2.55 (Xr02-55), 56 (Xr02-56)
2.5 Time series data
Bankrate, Hbrhomes graph (<> cross-sectional data)
Ex 2.73 (Xr02-73)
3 Art and science of graphical presentations
graphical excellence
graphical deception
presenting statistics: writing reports and oral presentations
4 Numerical descriptive techniques
4.1 Measures of central location
Label observations in a sample as x1 , x2 ..., xn
We typically use n for the sample size, N for population size
Population quantities are usually not computable, especially
when \(N = \infty\)
Example (recycle below): mean time spent on the internet; 0, 7, 12,
5, 33, 14, 8, 0, 9, 22 (hrs /month)
x=c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22);mean(x);hist(x)
Three measures of central location
Arithmetic mean:
sample mean \(\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\); population mean: \(\mu = \frac{\sum_{i=1}^{N} x_i}{N}\)
Median: the observation that falls in the middle of the sorted data
Mode: value that occurs with the greatest frequency
Which to use?
Mode is usually a poor measure.
Compared to mean, median is less sensitive to extreme
observations and in many cases more interpretable
Geometric mean: useful for finance, when averaging growth rate
over years
Let \(R_i\) be the rate of return in period i. The geometric mean \(R_g\) of
the returns \(R_1, \dots, R_n\) satisfies \((1+R_g)^n = (1+R_1)\cdots(1+R_n)\); solving for \(R_g\),
we have \(R_g = \sqrt[n]{(1+R_1)\cdots(1+R_n)} - 1\); example with \(R_1 = 100\%\) and \(R_2 = -50\%\) ($1,000 -> $2,000 -> $1,000 again)
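As a quick check of the example, here is a minimal R sketch comparing the geometric and arithmetic means for returns of +100% and -50% (variable names are illustrative).

```r
# Rates of return from the example: +100% in period 1, -50% in period 2
R <- c(1.00, -0.50)
n <- length(R)

# Geometric mean: solve (1 + Rg)^n = (1 + R1) * ... * (1 + Rn) for Rg
Rg <- prod(1 + R)^(1 / n) - 1

# Arithmetic mean of the same returns, for comparison
Ra <- mean(R)

cat("Geometric mean return :", Rg, "\n")  # 0: $1,000 ends up as $1,000
cat("Arithmetic mean return:", Ra, "\n")  # 0.25, which overstates the growth
```

The geometric mean of 0 matches the fact that the investment ends where it started, while the arithmetic mean suggests a 25% average gain.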
Ex 4.3, 4.10 (geometric mean)
4.2 Measures of variability
Measure of spread or variability of the data
Example: 8, 4, 9, 11, 13 (# of hours the students spent studying stat
last week)
Range = largest value observed - smallest value observed (too
simple)
Variance: sample variance \(s^2 = s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\), population
variance \(\sigma^2 = \sigma_x^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}\)
Why n-1? We will see in Chapter 10.1;
Compute the “deviations” first, then square, sum, and divide.
Why squaring? (absolute value is also possible; MAD)
The unit? (square of the original unit)
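Shortcut for sample variance: \(s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]\)
Standard deviation (SD): sample standard deviation \(s = \sqrt{s^2}\),
population standard deviation \(\sigma = \sqrt{\sigma^2}\)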
Same unit as the original data; easy to interpret
\(s^2 = 0\) (or \(\sigma^2 = 0\)) if and only if ___
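The definition and the shortcut can be compared on the study-hours example (8, 4, 9, 11, 13). The sketch below is illustrative; R's built-in `var()` and `sd()` use the same n-1 divisor.

```r
# Hours spent studying (example from the notes)
x <- c(8, 4, 9, 11, 13)
n <- length(x)

# Definition: deviations from the mean, squared, summed, divided by n - 1
dev <- x - mean(x)                 # -1, -5, 0, 2, 4
s2_def <- sum(dev^2) / (n - 1)     # 46 / 4 = 11.5

# Shortcut: uses only the sum of x and the sum of x^2
s2_short <- (sum(x^2) - sum(x)^2 / n) / (n - 1)   # (451 - 405) / 4 = 11.5

# Standard deviation, in the original unit (hours)
s <- sqrt(s2_def)

cat("variance (definition):", s2_def, "\n")
cat("variance (shortcut)  :", s2_short, "\n")
cat("standard deviation   :", s, "\n")
```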
Empirical Rule: Given a set of n measurements that is
approximately normal (bell-shaped), it follows that the interval with
endpoints
\(\bar{x} \pm s\) contains ~ 68% of the measurements
\(\bar{x} \pm 2s\) contains ~ 95% of the measurements
\(\bar{x} \pm 3s\) contains almost all of the measurements
E.g. Analysis of the monthly returns on an investment shows the
distribution is approximately bell shaped and mean=10% and
sd=4%. What can you say about the distribution of the return?
hist(rnorm(240, 10, 4), col='red')
How often is the return between 6 to 14%?
How often is the return larger than 14%?
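One way to check the Empirical Rule answers is to treat the returns as exactly normal with mean 10 and SD 4 (an idealization of "approximately bell shaped") and use `pnorm`:

```r
m <- 10   # mean return (%)
s <- 4    # standard deviation (%)

# Between 6% and 14% is mean +/- 1 SD: the Empirical Rule says about 68%
p_within <- pnorm(14, mean = m, sd = s) - pnorm(6, mean = m, sd = s)

# Above 14%: half of the remaining 32%, i.e. about 16%
p_above <- 1 - pnorm(14, mean = m, sd = s)

cat("P(6 < return < 14):", round(p_within, 4), "\n")
cat("P(return > 14)    :", round(p_above, 4), "\n")
```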
Coefficient of variation (CV): \(s / \bar{x}\) (sample) or \(\sigma / \mu\) (population)
Ex 4.23, 24((b) and (c) only; also compute standard deviations as
well), 27, 28
4.3 Percentiles and box plots
Percentiles are everywhere (test scores…)
The p’th percentile: the value for which p percent of observations
are less than that value and (100-p)% are greater than that value
Quartiles are 25th, 50th, 75th percentiles (divide the data into
quarters),
each called first/lower quartile, median, and third/upper quartile
each labeled Q1, Q2, Q3
(cf. quintiles and deciles)
Location of a p’th percentile in the sorted numbers is approximately
\(L_p = (n+1)\frac{p}{100}\)
Recycle the internet data example:
Simple, rounding approach
Detailed approach
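A sketch of both approaches on the internet data follows. The "detailed approach" is taken here to mean linear interpolation between the two neighbouring sorted values; that reading, and the function name, are assumptions.

```r
# Internet-use data (hours per month)
x <- c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22)
xs <- sort(x)          # 0 0 5 7 8 9 12 14 22 33
n <- length(xs)

# Location of the p'th percentile: L_p = (n + 1) * p / 100
L25 <- (n + 1) * 25 / 100   # 2.75

# Simple approach: round the location and take that sorted value
q1_simple <- xs[round(L25)]  # 3rd sorted value

# Detailed approach: interpolate between the neighbouring sorted values
percentile_value <- function(x, p) {
  xs <- sort(x)
  n <- length(xs)
  L <- (n + 1) * p / 100
  lower <- floor(L)
  frac <- L - lower
  if (lower < 1) return(xs[1])     # location falls before the first value
  if (lower >= n) return(xs[n])    # location falls after the last value
  xs[lower] + frac * (xs[lower + 1] - xs[lower])
}

q <- sapply(c(25, 50, 75), function(p) percentile_value(x, p))
names(q) <- c("Q1", "Q2", "Q3")
print(q)                      # 3.75, 8.5, 16
cat("IQR:", q["Q3"] - q["Q1"], "\n")
```

Here Q2 is closer to Q1 than to Q3, which ties in with the skewness question that follows.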
Relationship between the skewness and distribution of quartiles
If Q2 is closer to Q1 than Q3, then ____ skewed
If Q2 is closer to Q3 than Q1, then ____ skewed
Inter-quartile range (IQR) : Q3-Q1; spread of the middle 50% of the
observations
(horizontal) Box plots:
Q1, Q2, Q3 for the box boundaries;
Left and right ‘whiskers’ extend outward from the box boundaries
to the outermost values that are within 1.5 * IQR from the box
boundaries
Points outside the whiskers are ‘outliers’ (>1.5*IQR outward from
Q1 or Q3); interesting or incorrect points
Multiple box plots: Great tool for comparing distribution of multiple
groups
Ex 4.37, 4.43, 4.48 (do only “describe your findings” part; the
boxplot is provided in the handout; feel free to try Minitab to draw
the boxplot per in class instruction but it’s not required)
4.4 Measures of linear relationship
Numerical measure for direction and strength of the linear
relationship
Example: (which are X and which are Y?)
baseball wins vs. home/road attendance (Baseball attendance);
GMAT score vs. MBA GPA (xm04-16)
Covariance between variables X and Y:
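Population covariance \(\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}\),
Sample covariance: \(s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}\),
Shortcut for sample covariance: \(s_{xy} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right]\)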
Manual calculation:
i         x_i   y_i   x_i - x̄   y_i - ȳ   (x_i - x̄)(y_i - ȳ)   x_i y_i
1         2     13
…         6     20
n         7     27
Total
Average
Xi=2,6,7; yi=13, 20, 27;
How about yi=27, 20, 13?
How about yi=20, 27, 13?
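The manual calculation for x = 2, 6, 7 and the three orderings of y can be sketched in R as follows (function names are illustrative; R's built-in `cov()` gives the same values).

```r
x <- c(2, 6, 7)

# Sample covariance from the definition
sample_cov <- function(x, y) {
  n <- length(x)
  sum((x - mean(x)) * (y - mean(y))) / (n - 1)
}

# Sample covariance from the shortcut formula
shortcut_cov <- function(x, y) {
  n <- length(x)
  (sum(x * y) - sum(x) * sum(y) / n) / (n - 1)
}

y1 <- c(13, 20, 27)   # y increases with x
y2 <- c(27, 20, 13)   # y decreases with x
y3 <- c(20, 27, 13)   # no clear pattern

c1 <- sample_cov(x, y1)   #  17.5
c2 <- sample_cov(x, y2)   # -17.5
c3 <- sample_cov(x, y3)   #  -3.5
cat(c1, c2, c3, "\n")
cat(shortcut_cov(x, y1), shortcut_cov(x, y2), shortcut_cov(x, y3), "\n")
```

The sign follows the direction of the relationship, and the magnitude shrinks when the pattern is weak.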
Look at the sign (direction) and magnitude (strength) –
How do we judge magnitude of covariance?
Coefficient of correlation
Population correlation \(\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}\); sample correlation \(r = \frac{s_{xy}}{s_x s_y}\)
Correlation is between -1 and 1
Java Applet for correlation coefficient
Least squares method: an objective way of producing a straight line
through data points in scatter diagram
It produces a straight line such that the sum of squared
deviations between the points and the line is minimized
Equation for a line:
\(\hat{y} = b_0 + b_1 x\),
where
\(b_0\): intercept
\(b_1\): slope
\(\hat{y}\): the (predicted) value of y determined by the line
Use calculus to find coefficients \(b_0\), \(b_1\) which minimize
\(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)
Least squares line coefficients are given by
\(b_1 = \frac{s_{xy}}{s_x^2}\) and \(b_0 = \bar{y} - b_1 \bar{x}\).
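As an illustration, the correlation and least squares coefficients for the small data set x = 2, 6, 7 and y = 13, 20, 27 can be computed from the formulas and compared with R's `lm()`:

```r
x <- c(2, 6, 7)
y <- c(13, 20, 27)

# Sample covariance and correlation: r = s_xy / (s_x * s_y)
s_xy <- cov(x, y)
r <- s_xy / (sd(x) * sd(y))

# Least squares coefficients: b1 = s_xy / s_x^2, b0 = ybar - b1 * xbar
b1 <- s_xy / var(x)             # 17.5 / 7 = 2.5
b0 <- mean(y) - b1 * mean(x)    # 20 - 2.5 * 5 = 7.5

# Same line from R's built-in linear model fit
fit <- lm(y ~ x)

cat("r  =", round(r, 4), "\n")
cat("b0 =", b0, " b1 =", b1, "\n")
print(coef(fit))
```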
Ex 4.55, 56, 58 (xr04-58; computer use is OK but show your work)
4.5 Comparing graphical and numerical techniques
Comparing returns on two investments; centers=expected return;
spreads=risks (low-risk vs high-risk)
Business stat marks vs. math stat marks: unimodal, bimodal, …
Relationship b/w price and size of houses
4.6 General guidelines for exploring data
Look at the shape of the distribution; find Center; spread; peaks;
skewness (bell curve?)
Shapes guide on which numerical techniques to use
Optional (won't be graded): Ex 4.84, 4.86 (you have to use the
computer, preferably Minitab, for these two problems)
5 Data collection and sampling
5.1 Methods of collecting data
Direct observation (observational data): aspirin vs. heart attack
example; limitations; inexpensive
Surveys: Gallup Poll example; market research; response rate
Personal interview
Telephone interview
Self-administered survey
Questionnaire design
Experiment (experimental data): same example
Ex 5.1
5.2 Sampling
The chief motive for using a sample rather than the population: cost
Use sample quantities as ‘estimates’ for the corresponding
population quantities
E.g. Nielsen ratings (what is watched by 1000 television viewers);
quality control
“Target population” (the population about which we want to draw
inferences) vs. “sampled population” (the actual population from
which the sample has been taken)
E.g. The Literary Digest (1936): predicted Alfred Landon’s 3 to 2 victory
over the incumbent Franklin D. Roosevelt based on 10 million
sample ballots
That were sampled from phone directory
Of which “only” 2.3 million were returned (‘self-selected
samples’)
Ex. 5.6, 5.7
5.3 Sampling plans
A “simple random sample” is a sample selected in such a way that
every possible sample with the same # of observations is equally
likely to be chosen
Simple and good (do it “randomly”!!)
How to do it?? (random sample; jar; …)
A “stratified random sample” is obtained by separating the
population into mutually exclusive sets, or strata, and then drawing
simple random samples from each stratum
To extract more information
Criteria for separating a population into strata include: gender,
age, occupation,…
Sampling procedure and analysis can be complicated: plan
ahead and consult stat pros!
A “cluster sample” is a simple random sample of groups or clusters
of elements
Reduces geographic distances the surveyor must cover to gather
data (reduces cost)
Increases sampling error
Sample size and accuracy: The larger the sample size is, the more
accurate the sample estimates become
Details in Chapters 10 and 12
Ex 5.11, 14-16
5.4 Sampling and nonsampling errors
Sampling error: differences between the sample and the population
that exist only because of the observations that happened to be
selected for the sample
E.g. the mean annual income of North American blue-collar
workers
Estimate the mean income \(\mu\) of the population by the mean \(\bar{x}\) of
the sample. The value of \(\bar{x}\) will deviate from \(\mu\) simply by chance
This deviation can be large simply due to bad luck
The only way to reduce the expected size of this error is to take a
larger sample
Given a fixed sample size, we state the probability that the
sampling error is less than a certain amount (Ch. 10)
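The effect of sample size on sampling error can be illustrated with a small simulation. The population below is entirely made up (a right-skewed set of 10,000 incomes), and the seed and sample sizes are arbitrary choices for this sketch.

```r
set.seed(2010)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 10,000 incomes (made-up numbers, right-skewed)
population <- rlnorm(10000, meanlog = log(40000), sdlog = 0.5)
mu <- mean(population)

# Typical size of the sampling error |xbar - mu| for samples of size n
typical_error <- function(n, reps = 1000) {
  xbars <- replicate(reps, mean(sample(population, n)))
  mean(abs(xbars - mu))
}

err_small <- typical_error(25)
err_large <- typical_error(400)

cat("typical error, n = 25 :", round(err_small), "\n")
cat("typical error, n = 400:", round(err_large), "\n")
```

The sample mean still misses the population mean in every run, but the typical miss is noticeably smaller with the larger sample.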
Nonsampling error: more serious; taking a larger sample won’t help
here; due to mistakes made in the acquisition of data or due to the
sample observations being selected improperly
Error in data acquisition
“Non-response error”: error or bias introduced when responses
are not obtained from some members of the sample
Selection bias
Ex 5.17, 5.18
6 Probability
Probability is critical in statistical inference since it provides the link
between the population and the sample
6.1 Assigning probability to events
A “random experiment” is a process that leads to one of several
possible outcomes
E.g. coin flipping; grade on a stat test; time to assemble
computer; party preference
A “sample space” of a random experiment is a set of all possible
outcomes of the experiment (exhaustive and mutually exclusive)
\(S = \{O_1, O_2, \dots, O_k\}\)
Requirements of probabilities: given a sample space S, the
probabilities assigned to the outcomes must satisfy two requirements:
The probability of any outcome must be between 0 and 1, i.e.
\(0 \le P(O_i) \le 1\)
The sum of the probabilities of all the outcomes in the sample
space must be 1, i.e.
\(\sum_{i=1}^{k} P(O_i) = 1\)
Three approaches to assigning probabilities
The classical approach
The relative frequency approach
The subjective approach
An “event” is a set of outcomes in a sample space
A “simple event” is an individual outcome
The “probability of an event” is the sum of probabilities of the simple
events that constitute the event
Most useful way to interpret probability is the relative frequency
approach for a hypothetical, infinite number of experiments
Ex. 6.1-3 (in class), 8
6.2 Joint, marginal, and conditional probability
Want to consider ‘combinations’ of events
Example: relationship between whether a mutual fund outperforms
market and whether the manager of the fund has an MBA from a
top-20 program
Consider a population of 1,000 mutual funds
                                Mutual fund     Mutual fund does
                                outperforms     not outperform
                                market          market              Totals
The manager has MBA             110             290
The manager does not have MBA   60              540
Totals                                                              1,000
The “intersection of events A and B,” denoted “A and B,” is the
event that occurs when both A and B occur.
The probability of the intersection is called the “joint probability”
P(A randomly selected mutual fund outperforms and its manager
has an MBA degree) =
What is the joint probability if we sample a mutual fund from the
above population?
                                Mutual fund     Mutual fund does
                                outperforms     not outperform
                                market          market              Totals
The manager has MBA             .11             .29
The manager does not have MBA   .06             .54
Totals
“Marginal probabilities” are computed by adding across rows or
down columns
P(A randomly selected mutual fund manager has MBA degree) = ?
i.e., When a mutual fund is randomly selected, the probability
that its manager has an MBA is ___
i.e., ___ all mutual fund managers have an MBA
Try, P(A randomly selected mutual fund outperforms the market) = ?
“Given that a fund is managed by an MBA, what’s the
probability that it outperforms the market?”
Given A, what’s the probability of B?
The “Conditional probability of B given A”, written P(B|A), is the
probability of event B given the occurrence of another related event
A.
Formally, it can be computed as P(B|A)=P(A and B)/P(A)
Two events A and B are “independent” if P(A|B)=P(A) or
P(B|A)=P(B)
i.e., the probability of one event is not affected by the occurrence
of the other event
Checking dependence: For a table like the one above, we could check all
four combinations, but showing \(P(B) \ne P(B|A)\) for just one
pair A and B is enough. On the other hand, showing
independence would be more work
The “union” of events A and B is the event that occurs when either A
or B or both occur. It is denoted as “A or B”
E.g. determine P(A1 or B1)
Approach #1 : sum the components
#2 : 1- P(the other component)
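The table calculations in this section can be sketched in R using the joint probabilities from the mutual fund example. The labeling A1 = "manager has MBA" and B1 = "fund outperforms" is an assumption made for this sketch.

```r
# Joint probabilities from the mutual fund table
P <- matrix(c(0.11, 0.29,
              0.06, 0.54),
            nrow = 2, byrow = TRUE,
            dimnames = list(manager = c("MBA", "no MBA"),
                            fund = c("outperforms", "does not")))

# Marginal probabilities: add across rows or down columns
p_manager <- rowSums(P)   # P(MBA), P(no MBA)
p_fund <- colSums(P)      # P(outperforms), P(does not)

# Conditional probability: P(outperforms | MBA) = P(MBA and outperforms) / P(MBA)
p_out_given_mba <- unname(P["MBA", "outperforms"] / p_manager["MBA"])

# Dependence check: compare P(B | A) with P(B)
dependent <- abs(p_out_given_mba - p_fund["outperforms"]) > 1e-12

# Union P(A1 or B1), approach #1: sum the three cells that make up the union
p_union_sum <- P["MBA", "outperforms"] + P["MBA", "does not"] +
  P["no MBA", "outperforms"]
# Approach #2: one minus the only cell outside the union
p_union_comp <- 1 - P["no MBA", "does not"]

print(p_manager)
print(p_fund)
cat("P(outperforms | MBA):", p_out_given_mba, "\n")
cat("P(A1 or B1):", p_union_sum, "=", p_union_comp, "\n")
```

Since P(outperforms | MBA) differs from P(outperforms), the two events are dependent in this table.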
Ex 6.86
6.3 Probability rules and trees
Want to calculate the probability of more complex events from the
probability of simpler events
Complement rule: the “complement” of event A is … and is denoted
by \(A^C\). The rule says \(P(A^C) = 1 - P(A)\); e.g.
Multiplication rule: P(A and B) = P(A|B)P(B) or, P(B|A)P(A)
Proof:
If independent,… it reduces to:
The joint probability of any two independent events A and B is
P(A and B)=P(A)P(B)
Ex 6.5: 7 males and 3 females. P(two randomly selected students
are both female)?
Ex 6.5: 7 males and 3 females. P(two randomly selected students
by two professors to answer questions are both female)?
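A sketch of the two cases follows. It assumes the first case selects two different students (no replacement) and that in the second case each professor picks independently from all 10 students, so the same student could be picked twice.

```r
n_female <- 3
n_total <- 10

# Two different students: P(F first) * P(F second | F first)
p_both_without <- (n_female / n_total) * ((n_female - 1) / (n_total - 1))

# Two independent picks from the whole class: P(F) * P(F)
p_both_with <- (n_female / n_total)^2

cat("Without replacement:", round(p_both_without, 4), "\n")  # 1/15
cat("Independent picks  :", p_both_with, "\n")               # 0.09
```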
Addition rule: P(A or B)=P(A)+P(B)-P(A and B)
[revisit the above example]
When two events are mutually exclusive (two events cannot occur
together), the joint prob is 0, thus the above reduces to…
P(paper A)=?, P(paper B)=?, P(both papers)=?. Then P(either
paper)=?
Probability trees
First choice, second choice, joint probability
{F,M}, {F,M}|F and {F,M}|M, {FF, FM , MF, MM} (for the two
cases above)
Ex. 6.47, 51-55, 67, 68
6.4 Bayes’ Law
Skip
6.5 Identifying the correct method
Read
7 Random variables and discrete probability
distributions
Motivation: Want to tell if a coin is fair. Throw it 100 times. Reject the
null hypothesis that the coin is fair if # of heads is too large or small.
But where do we draw the line? 90? 70? How extreme is the observed
value? Need to know probability distribution of the number of heads
from a balanced coin.
7.1 Random variables and probability distributions
E.g. # of heads in flipping of two coins; total of two dice
Random variable : a function or rule that assigns a number to each
outcome of an experiment
Two types of random variable :
Discrete random variable: takes on a countable number of
values; e.g.
Continuous random variable: takes on uncountable number of
values.
Probability distribution: a table, formula, or graph, that describes the
values of a random variable and the probability associated with
these values.
X vs. x: X: name of a random variable; x: value of the random
variable
P(X=x) or P(x)
Requirements for a discrete probability distribution function
(distribution of a discrete random variable):
\(0 \le P(x) \le 1\) for all x
\(\sum_x P(x) = 1\)
Example. Consider a game where the player draws a card from a
deck of cards and wins $100 for spade ace, $5 for any heart and $0
for anything else. If we let X be the winning (in $), specify P(x).
x
P(x)
Example. Consider investing money to a start-up company. After a
year, it either fails, has moderate success, or has a big success with
probabilities 0.8, 0.15 and __, respectively. In each case, the
investment return is given by $0, $1,000 and $10,000.
What’s the quantity to consider as a random variable X?
What’s P(X>0)? What’s P(X=0)?
x
P(x)
Population mean: \(\mu = E(X) = \sum_x x P(x)\) (“the expected value of X”)
Population variance: \(\sigma^2 = V(X) = \sum_x (x - \mu)^2 P(x)\)
Shortcut calculation for population variance: \(V(X) = \sum_x x^2 P(x) - \mu^2\)
Population SD: \(\sigma = \sqrt{\sigma^2}\)
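These formulas can be tried on the card game example above, assuming a standard 52-card deck: one ace of spades pays $100, the 13 hearts pay $5, and the other 38 cards pay $0.

```r
# Winnings and their probabilities (standard 52-card deck assumed)
x <- c(0, 5, 100)
p <- c(38, 13, 1) / 52

# Requirements for a discrete distribution
valid <- all(p >= 0 & p <= 1) && abs(sum(p) - 1) < 1e-12

# Expected value, variance (definition and shortcut), and SD
mu <- sum(x * p)
v_def <- sum((x - mu)^2 * p)
v_short <- sum(x^2 * p) - mu^2
sigma <- sqrt(v_def)

cat("valid distribution:", valid, "\n")
cat("E(X) =", round(mu, 3), "\n")
cat("V(X) =", round(v_def, 2), " (shortcut:", round(v_short, 2), ")\n")
cat("SD   =", round(sigma, 2), "\n")
```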
Note that we’re using the same terms as in Chapter 4. It’s not a
coincidence. Consider a population consisting of N individuals and
assume that for a variable X, the population relative frequency of the
value x, (# of individuals that are x)/N, is given by P(x). Then the
population mean \(\mu = \frac{\sum_{i=1}^{N} x_i}{N}\) as a descriptive measure for the
population is the same as \(E(X) = \sum_x x P(x)\), the expected value of X.
The same can be said for the population variance and standard
deviation.
Laws of expected value and variance: for a random variable X and a
constant c,
E(c)=c
E(X+c) = E(X)+c
E(cX)=cE(X)
V(c)=0
V(X+c)=V(X)
V(cX)=c²V(X)
Example. The monthly sales at a computer store has a mean of
$25,000 and SD of $4,000. Also,
Profits = 30% of the sales – fixed costs of $6000.
Find the mean and SD of monthly profits
[conventional method vs. empirical rule method]
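Applying the laws of expected value and variance to Profit = 0.30 × Sales − 6000 gives the following sketch (variable names are illustrative).

```r
mean_sales <- 25000
sd_sales <- 4000
rate <- 0.30     # profit share of sales
fixed <- 6000    # fixed costs

# E(cX + d) = c E(X) + d
mean_profit <- rate * mean_sales - fixed

# V(cX + d) = c^2 V(X): the constant shifts the mean but not the spread
var_profit <- rate^2 * sd_sales^2
sd_profit <- sqrt(var_profit)

cat("Mean monthly profit:", mean_profit, "\n")  # 1500
cat("SD of monthly profit:", sd_profit, "\n")   # 1200
```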
Ex. 7.1(d), 2(d), 7, 19(a)(d), 39 (in answering 7.39, use the fact that
answers to 7.38 is E(X) =4.00 and V(X) = 2.40)
7.2 Bivariate distributions
Bivariate distribution provides probabilities of combinations of two
random variables (Cf. univariate distribution)
Joint probability are written P(x,y): again, table or formula
X and Y are # of houses sold by two agents, Xavier and Yvonne per
day; P(x,y) = P(X=x,Y=y) are given below:
              x
y         0       1
0         .11     .29
1         .06     .54
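Requirements:
\(0 \le P(x,y) \le 1\) for all x, y
\(\sum_x \sum_y P(x,y) = 1\)
Marginal probabilities \(P(x) = \sum_{\text{all } y} P(x,y)\), \(P(y) = \sum_{\text{all } x} P(x,y)\)
\(E(X) = \mu_X\), \(V(X) = \sigma_X^2\), with SD \(\sigma_X\)
\(E(Y) = \mu_Y\), \(V(Y) = \sigma_Y^2\), with SD \(\sigma_Y\)
Covariance: \(Cov(X,Y) = \sum_x \sum_y (x - \mu_X)(y - \mu_Y) P(x,y)\)
Shortcut calculation: \(Cov(X,Y) = \sum_x \sum_y xy P(x,y) - \mu_X \mu_Y\)
Coefficient of correlation: \(\rho = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}\)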
Two discrete random variables X and Y are independent if two
events {X=x} and {Y=y} are independent for any x and y
In other words, if P(x,y)=P(x)P(y) for any x and y
More informally, if X and Y don’t affect each other
Laws of expected value and variance of the sum of two variables
X+Y, P(x+y)
X+Y
E(X+Y) = E(X) + E(Y)
V(X+Y) = V(X) + V(Y) + 2COV(X,Y)
If X and Y are indep, COV(X,Y)=0 and \(\rho = 0\)
Total # of houses sold by Xavier and Yvonne
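The laws for X + Y can be checked numerically on the Xavier and Yvonne table. The sketch assumes rows are y = 0, 1 and columns are x = 0, 1; the covariance and V(X+Y) do not depend on that orientation.

```r
# Joint distribution P(x, y); rows are y = 0, 1 and columns are x = 0, 1
# (this orientation is an assumption about the table layout)
P <- matrix(c(0.11, 0.29,
              0.06, 0.54), nrow = 2, byrow = TRUE)
xv <- c(0, 1)
yv <- c(0, 1)

# Marginal distributions
px <- colSums(P)
py <- rowSums(P)

# Means and variances of X and Y
EX <- sum(xv * px); VX <- sum(xv^2 * px) - EX^2
EY <- sum(yv * py); VY <- sum(yv^2 * py) - EY^2

# Shortcut covariance and the coefficient of correlation
cov_xy <- sum(outer(yv, xv) * P) - EX * EY
rho <- cov_xy / sqrt(VX * VY)

# Mean and variance of the total X + Y, directly from the joint table
S <- outer(yv, xv, "+")
ES <- sum(S * P)
VS <- sum(S^2 * P) - ES^2

cat("E(X+Y) =", ES, " vs E(X)+E(Y) =", EX + EY, "\n")
cat("V(X+Y) =", VS, " vs V(X)+V(Y)+2Cov =", VX + VY + 2 * cov_xy, "\n")
cat("Cov =", cov_xy, " rho =", round(rho, 4), "\n")
```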
Ex. 7.43-46, 55, 56
Quiz #1 scores (out of 36)
Mean = 32.09
Median = 32
SD = 2.6
7.3 Binomial distribution
Binomial random variable is the number of successes in n
independent trials with a constant success probability p. We write
X~bin(n,p) to describe that a random variable X follows such a
binomial distribution.
Such experiment is called a binomial experiment:
Consists of a fixed # of trials (n)
Two possible outcomes (‘success’ and ‘failure’)
The success probability is p.
The trials are indep.
Each trial is called a ‘Bernoulli process’
E.g. Flipping coin; draw cards (not binomial); political survey (not
quite but come close)
E.g. a clueless student takes an exam consisting of 5 multiple choice
(1 out of 4) questions.
Delineate n and p
What’s the probability that he gets no answers correct? P(X=0);
two answers correct? P(X=2)=?
What’s the chance that he fails the quiz? P(fail the quiz) = \(P(X \le 2)\) = ?
For a class full of similar students, what’s the mean score? SD?
hist(rbinom(20, 5, 1/4))
Mathematically, we can show that if X~bin(n,p),
\(P(x) = P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}\) for x=0,1,…,n
Here, \(\binom{n}{x} = \frac{n!}{x!\,(n-x)!}\), which reads “n choose x,” is the number of
different ways of choosing x objects from n objects.
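The formula can be evaluated directly for the clueless-student example (n = 5, p = 1/4) and compared with R's `dbinom`. The helper name `binom_pmf` is made up for this sketch.

```r
# Binomial probability from the formula: (n choose x) p^x (1 - p)^(n - x)
binom_pmf <- function(x, n, p) choose(n, x) * p^x * (1 - p)^(n - x)

n <- 5
p <- 1 / 4

p0 <- binom_pmf(0, n, p)   # no answers correct
p2 <- binom_pmf(2, n, p)   # exactly two correct

cat("P(X = 0) =", round(p0, 4), "\n")   # 0.2373
cat("P(X = 2) =", round(p2, 4), "\n")   # 0.2637

# The whole distribution, formula vs. built-in
print(rbind(formula = binom_pmf(0:n, n, p), dbinom = dbinom(0:n, n, p)))
```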
\(P(X \le x)\): cumulative probability
Binomial table: Table 1 in appendix B provides values of cumulative
probability for selected n and p. (x, P(X<=x))
\(P(X \ge 3)\) = ?
Can compute by \(1 - P(X \le 2)\)
In general, \(P(X \ge x) = 1 - P(X \le x-1)\)
P(X=3)=?
\(P(X = x) = P(X \le x) - P(X \le x-1)\)
General formula for mean and var of a binomial random variable :
\(\mu = np\)
\(\sigma^2 = np(1-p)\)
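In place of the binomial table, R's `pbinom` returns the cumulative probability P(X ≤ x). A sketch for the same quiz example (n = 5, p = 1/4):

```r
n <- 5
p <- 1 / 4

# Cumulative probability, as in the binomial table
p_le2 <- pbinom(2, n, p)                     # P(X <= 2)

# P(X >= 3) = 1 - P(X <= 2)
p_ge3 <- 1 - pbinom(2, n, p)

# P(X = 3) = P(X <= 3) - P(X <= 2)
p_eq3 <- pbinom(3, n, p) - pbinom(2, n, p)

# Mean and standard deviation from the general formulas
mu <- n * p
sigma <- sqrt(n * p * (1 - p))

cat("P(X <= 2) =", round(p_le2, 4), "\n")   # 0.8965
cat("P(X >= 3) =", round(p_ge3, 4), "\n")   # 0.1035
cat("P(X = 3)  =", round(p_eq3, 4), "\n")   # 0.0879
cat("mean =", mu, " SD =", round(sigma, 4), "\n")
```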
Ex. 7.81-83, 89 (computer), 90
7.4 Poisson Distribution
Another useful discrete probability distribution. # of occurrences of
events in an interval of time or specific region of space.
Some examples
Formula: \(P(x) = \frac{e^{-\mu} \mu^x}{x!}\) where e=2.71828… and \(\mu\) is the mean number of occurrences in the interval
Skip
8 Continuous probability distributions
8.1 Probability density functions
Need a completely different approach to deal with continuous
random variables since
There are infinitely, uncountably many possible values
the probability of individual value is virtually zero, i.e. P(X=x) = 0
for any x
Example. duration of a commute
Table of (intervals: relative frequency)
E.g. 0-10 min: .3, 10-20 min: .5, 20-30 min: .2
We can only determine the probability of a range of values
The probabilities sum up to 1
If we divide relative frequency by interval width, we have a set of
rectangles whose area equals the probability that the random
variable will fall into each interval.
Imagine very large # of small intervals. A function f(x) that
approximates the curve is called a probability density function (pdf):
Requirements for a pdf over a range a ≤ x ≤ b
f(x)≥0 for all x
the total area under the curve between a and b is 1
Probability of an interval: the area under the curve
Integral calculus helps… but we don’t want to do it.
Uniform distribution
Uniform pdf is given by f(x) = 1/(b-a) where a ≤x ≤b
P(x1 < X < x2) = (x2-x1)*(1/(b-a))
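A small sketch of the uniform probability formula follows. The endpoints a = 20, b = 60 and the interval (30, 45) are made-up values for illustration; `punif` is R's uniform cumulative distribution function.

```r
# Hypothetical uniform distribution on [a, b]
a <- 20
b <- 60
x1 <- 30
x2 <- 45

# Area of the rectangle: base (x2 - x1) times height 1 / (b - a)
p_formula <- (x2 - x1) * (1 / (b - a))

# Same probability from the cumulative distribution function
p_builtin <- punif(x2, min = a, max = b) - punif(x1, min = a, max = b)

cat("P(30 < X < 45) =", p_formula, "(formula),", p_builtin, "(punif)\n")
```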
Ex. 8.1, 9,10
8.2 Normal distribution
The most important distribution in probability and statistics
Normal pdf: \(p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\) where e=2.71828… and π=3.14159…
We write: \(X \sim N(\mu, \sigma^2)\), or X follows a normal distribution with mean
\(\mu\) and standard deviation \(\sigma\)
Example: For a certain professor, the duration of the morning
commute follows a normal distribution with mean 30 and standard
deviation 10, i.e. the commute duration \(X \sim N(30, 10^2)\). Then we
want to answer questions like:
What’s the probability that the trip will take more than 50 minutes?
What’s the probability that the trip will take between 20 and 50
minutes?
On 2.5% of days, the trip will take longer than ___ minutes.
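These questions can be answered by computer with `pnorm` and `qnorm` rather than the table. A minimal sketch for the commute example (note that R's functions take the standard deviation, not the variance):

```r
mu <- 30      # mean commute (minutes)
sigma <- 10   # standard deviation (minutes)

# P(X > 50)
p_gt50 <- 1 - pnorm(50, mean = mu, sd = sigma)

# P(20 < X < 50)
p_20_50 <- pnorm(50, mean = mu, sd = sigma) - pnorm(20, mean = mu, sd = sigma)

# Duration exceeded on only 2.5% of days: the 97.5th percentile
cutoff <- qnorm(0.975, mean = mu, sd = sigma)

cat("P(X > 50)      =", round(p_gt50, 4), "\n")    # about 0.0228
cat("P(20 < X < 50) =", round(p_20_50, 4), "\n")   # about 0.8186
cat("2.5% cutoff    =", round(cutoff, 1), "minutes\n")  # about 49.6
```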
Example: For a certain population (say, a large school), the
student’s SAT score is normally distributed with mean 500 and
standard deviation 50, i.e. the SAT score \(X \sim N(500, 50^2)\). Then we
want to answer questions like:
What’s the probability that a randomly selected student scores
more than 600?
What’s the probability that a randomly selected student scores
between 400 and 550?
To be in top 5% in the population, how much does a student
need to score?
To be in bottom 5% in the population, how much does a student
need to score?
Symmetrical, bell shaped
Centered around the mean
The spread specified by the variance \(\sigma^2\)
Try applets:
Normal Distribution Parameters
Normal Distribution Areas
Calculating normal probabilities
Compute the area in the interval under the curve.
Use the probability table
Need a separate table for different \(\mu\) and \(\sigma\)? No - by
standardizing the random variable
If \(X \sim N(\mu, \sigma^2)\), the transformed variable, denoted by Z, is called the
“standard normal random variable”: \(Z = \frac{X - \mu}{\sigma} \sim N(0, 1)\)
“probability statement about X \(\Leftrightarrow\) probability statement about Z”
If \(X \sim N(30, 10^2)\), what is P(25 < X < 40)?
\(P(25 < X < 40) = P\left(\frac{25-30}{10} < \frac{X-30}{10} < \frac{40-30}{10}\right) = P(-.5 < Z < 1)\)
“Z=-.5 corresponds to a value of X that is one-half a standard
deviation below the mean”
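The standardization step can be verified numerically. The helper `table_area`, which mimics a table of P(0 < Z < z), is a made-up name for this sketch.

```r
mu <- 30
sigma <- 10

# Standardize the endpoints: z = (x - mu) / sigma
z_lo <- (25 - mu) / sigma   # -0.5
z_hi <- (40 - mu) / sigma   #  1.0

# Probability statement about Z
p_z <- pnorm(z_hi) - pnorm(z_lo)

# Same probability stated directly about X
p_x <- pnorm(40, mean = mu, sd = sigma) - pnorm(25, mean = mu, sd = sigma)

# Table-style lookup: area between 0 and z for positive z
table_area <- function(z) pnorm(z) - 0.5
p_table <- table_area(0.5) + table_area(1)   # uses symmetry for the left part

cat("P(-.5 < Z < 1) =", round(p_z, 4), "\n")     # about 0.5328
cat("P(25 < X < 40) =", round(p_x, 4), "\n")
cat("via table areas =", round(p_table, 4), "\n")
```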
The table gives P(0 < Z < z) for positive z.
P(Z > 0) =
P(Z < 0) =
P(Z > 2) =
1-P(0 < Z < 2) =
P(Z < -3) =
P(Z>3) =
P(0 < Z < 1) =
P(-.5 < Z < 0) =
P(0 < Z < .5) =
P(-.5 < Z < 1) =
P(-.5 < Z < 0) + P(0 < Z < 1)
= P(0 < Z < .5) + P(0 < Z < 1) =
P(1 < Z < 2) =
P(0 < Z < 2) – P(0 < Z < 1) =
P(-2 < Z < -1) =
P(1 < Z < 2) =
We sometimes need to compute \(z_A\), the value z such that the area
to the right under the standard normal curve is A, i.e., such that \(P(Z > z_A) = A\)
Use the table backward
\(z_{0.025}\) =
\(z_{0.05}\) =
\(z_A\) = 100(1-A)th percentile of a standard normal random variable
If \(X \sim N(\mu, \sigma^2)\), find x such that P(X > x) = A
For example, if \(X \sim N(600, 50^2)\), find x such that P(X > x) = 0.05
Convert the problem to Z
Find z0.05
Convert back to X space
How about top 10 percent? How about top 1 percent?
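The three steps (convert to Z, find the critical value, convert back) can be sketched in Python. This is only an illustration: `upper_tail_cutoff` is a hypothetical helper name, and `NormalDist.inv_cdf` stands in for reading the table backward.

```python
from statistics import NormalDist

def upper_tail_cutoff(mu, sigma, area):
    """Return x such that P(X > x) = area for X ~ N(mu, sigma^2)."""
    # z_A has area A to its right, i.e. it is the 100(1 - A)th percentile of Z
    z_a = NormalDist().inv_cdf(1 - area)
    # Convert back to X space: x = mu + z_A * sigma
    return mu + z_a * sigma

x_top5 = upper_tail_cutoff(600, 50, 0.05)
x_top10 = upper_tail_cutoff(600, 50, 0.10)
x_top1 = upper_tail_cutoff(600, 50, 0.01)

print(round(x_top5, 1), round(x_top10, 1), round(x_top1, 1))
```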
Ex.8.19-24, 31-32, 37-41, 58
8.3 Exponential distribution
8.4 Other continuous distributions
Student-t distribution
Very commonly used in statistical inference. (chapters 12, 13, 15,
17, 18, 19)
We use the symbol $T(\nu)$ to denote the random variable that follows
the student-t distribution with $\nu$ degrees of freedom.
This we write as $T(\nu) \sim t(\nu)$ (a la $X \sim N(\mu, \sigma^2)$)
We sometimes write $T(\nu)$ as T, if $\nu$ is clear from the context.
Example: if a random variable T follows the student-t distribution
with 10 degrees of freedom, then:
What’s the probability that T will be greater than 1.812?
What’s the value of t such that T is greater than t 5% of time?
What’s the value of t such that T is less than t 5% of time?
The distribution looks very close to the standard normal; the larger $\nu$
is, the closer it is.
$E(T) = 0$ and $V(T) = \nu/(\nu - 2)$ for $\nu > 2$
Computing student-t probabilities
Student-t probabilities can be computed using a computer (TDIST
in Excel)
Finding student-t values $t_{A,\nu}$ such that $P(T > t_{A,\nu}) = A$ (TINV in Excel)
Table 4 of the book
$t_{.05,10} = 1.812$
$t_{.05,25} = 1.708$
$-t_{.05,25} = -1.708$
Chi-squared distribution
$\chi^2 \sim \chi^2(\nu)$
Looks like …. For different $\nu$
Finding chi-squared values $\chi^2_{A,\nu}$ such that $P(\chi^2 > \chi^2_{A,\nu}) = A$
(use table 5)
$\chi^2_{.05,8} = 15.5073$
$\chi^2_{.95,8} = 2.73264$
F distribution
$F \sim F(\nu_1, \nu_2)$
Finding $F_{A,\nu_1,\nu_2}$ such that $P(F > F_{A,\nu_1,\nu_2}) = A$
Ex. 8.83, 84
Midterm score (Out of 60)
mean(x) = 48.4
median(x) = 49
9 Sampling distributions
9.1 Sampling distribution of the mean
Example (same as above): For a certain population (say, a large
school), the students’ SAT scores are normally distributed with mean
500 and standard deviation 50, i.e. the SAT score $X \sim N(500, 50^2)$.
If we randomly sample 25 students from the school and have them
take the SAT, what can we say about the distribution of the sample
mean SAT score $\bar{X}$? In particular,
What’s the mean of $\bar{X}$?
What’s the standard deviation of $\bar{X}$?
What’s the distribution of $\bar{X}$?
How does the conclusion change if the original distribution of the
individual scores was not normal?
(exact; also, we don’t need many n)
In particular, what’s $P(\bar{X} > 550)$? What’s $P(\bar{X} < 450)$? What’s
$P(450 < \bar{X} < 550)$?
Compare this with $P(X > 550)$, $P(X < 450)$ and $P(450 < X < 550)$
[the effect of smaller standard deviation]
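The comparison between a single score and the mean of 25 scores can be sketched as follows. This is a rough numerical check under the stated model N(500, 50^2) with n = 25; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 500, 50, 25

X = NormalDist(mu, sigma)                 # one student's score
Xbar = NormalDist(mu, sigma / sqrt(n))    # sample mean: same mean, SD = sigma / sqrt(n) = 10

p_single = 1 - X.cdf(550)                 # P(X > 550), one SD above the mean
p_mean = 1 - Xbar.cdf(550)                # P(Xbar > 550), five standard errors above
p_single_between = X.cdf(550) - X.cdf(450)
p_mean_between = Xbar.cdf(550) - Xbar.cdf(450)

print(round(p_single, 4), p_mean)
print(round(p_single_between, 4), round(p_mean_between, 6))
```

The sample mean is far less likely than a single score to land 50 points from 500, which is the effect of the smaller standard deviation.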
Fair die example; 1 die; 2 dice; 5? 10? Sampling distribution of the
mean of fair dice and CLT
Let X be the outcome of a single throw of a fair die
Distribution of X
$\mu_X$ and $\sigma_X^2$ are computed to be 3.5 and 2.92
The “sampling distribution of the mean” of two fair dice, $\bar{X}$.
Takes on what values?
1.0, 1.5, 2.0, …., 5.5, 6.0
The “sampling distribution” of a statistic is created by repeated
sampling from one population.
$\mu_{\bar{X}}$ and $\sigma_{\bar{X}}^2$, computed to be 3.5 and 1.46 (half of the original)
Consistent with what the theory tells us. See below.
Sampling distribution of $\bar{X}$ for larger n=5, 10, 25.
$$\mu_{\bar{X}} = \mu_X, \qquad \sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n} \quad \text{or} \quad \sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}}$$
(distribution becomes narrower when n increases)
Sampling distribution of $\bar{X}$ becomes increasingly bell shaped.
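The dice numbers quoted above can be reproduced by brute-force enumeration. The sketch below uses exact fractions; `mean_and_variance` is a hypothetical helper that treats every listed outcome as equally likely.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def mean_and_variance(values):
    """Mean and variance of equally likely outcomes (divide by N, not N - 1)."""
    values = list(values)
    m = sum(values, Fraction(0)) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

# One fair die
mu_x, var_x = mean_and_variance(Fraction(f) for f in faces)

# Mean of two fair dice: enumerate all 36 equally likely pairs
means_of_two = [Fraction(a + b, 2) for a, b in product(faces, repeat=2)]
mu_xbar, var_xbar = mean_and_variance(means_of_two)

print(float(mu_x), round(float(var_x), 2))        # 3.5 and about 2.92
print(float(mu_xbar), round(float(var_xbar), 2))  # 3.5 and about 1.46
```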
To summarize….
The sampling distribution of the sample mean $\bar{X}$:
$\mu_{\bar{X}} = \mu_X$, and
$\sigma_{\bar{X}}^2 = \sigma^2/n$ or, equivalently, $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.
Also, the distribution is approximately normal regardless of the
original population distribution, for a sufficiently large sample size
(say, $n \ge 30$). The larger the sample size is, the more closely the
sampling distribution of $\bar{X}$ will resemble a normal distribution.
If the original distribution of X is normal, then $\bar{X}$ is exactly normal.
The result is called the Central Limit Theorem (CLT):
The sampling distribution of the sample mean of a random sample
drawn from any population is approximately normal for a sufficiently
large sample size (say, $n \ge 30$). The larger the sample size is, the
more closely the sampling distribution of $\bar{X}$ will resemble a normal
distribution.
Implication for the inference?
A claim has been made that the SAT score for a private school
has the distribution $X \sim N(550, 100^2)$. To check this claim, a
sample of 25 students was surveyed and the sample mean
was found to be 500. What is $P(\bar{X} < 500)$ if the dean’s
claim were true?
$P(\bar{X} < 500) = P(Z < (500-550)/(100/5)) = P(Z < -2.5)$ = …
What’s the conclusion?? The precursor of hypothesis testing
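$z_{.025} = 1.96$
$P(-1.96 < Z < 1.96) = .95$
$P\left(-1.96 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = .95$
$P(\mu - 1.96\,\sigma/\sqrt{n} < \bar{X} < \mu + 1.96\,\sigma/\sqrt{n}) = .95$
In general, $P(\mu - z_{\alpha/2}\,\sigma/\sqrt{n} < \bar{X} < \mu + z_{\alpha/2}\,\sigma/\sqrt{n}) = 1 - \alpha$
For the above example ($\mu = 550$, $\sigma/\sqrt{n} = 20$), $P(510.8 < \bar{X} < 589.2) = .95$
$P(498.5 < \bar{X} < 601.5) = .99$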
the precursor of interval estimation
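As a rough check of the reasoning for the claimed distribution N(550, 100^2) with n = 25, the sketch below computes the tail probability of the observed sample mean and the central intervals for the sample mean. `central_interval` is a hypothetical helper name, and `NormalDist` replaces the table lookup.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 550, 100, 25
se = sigma / sqrt(n)                      # standard deviation of the sample mean = 20

# How surprising is a sample mean of 500 if the claim is true?
p_low = NormalDist(mu, se).cdf(500)       # P(Xbar < 500) = P(Z < -2.5)

def central_interval(level):
    """Interval around mu that contains the sample mean with probability `level`."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # z_{alpha/2}
    return mu - z * se, mu + z * se

lo95, hi95 = central_interval(0.95)
lo99, hi99 = central_interval(0.99)

print(round(p_low, 4))
print(round(lo95, 1), round(hi95, 1))
print(round(lo99, 1), round(hi99, 1))
```

A probability this small suggests the observed mean is hard to reconcile with the claim, which is the informal argument that hypothesis testing later makes precise.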
Ex. 9.5, 6, 7, 9, 10, 11, 15, 16
9.2 Sampling distribution of a proportion
Among a very large population, 48% support a certain bill and 52%
do not. If we randomly select 100 people, what can we say about
the sampling distribution of the sample proportion of the people who
support the bill? Among others,
What’s the mean of the sample proportion?
What’s the standard deviation of the sample proportion?
What’s the distribution of the sample proportion?
What’s the chance that the sample proportion is greater than
50%?
In a binomial experiment, the estimator of the population proportion of
successes is the sample proportion $\hat{p} = X/n$, the # of successes
divided by the sample size.
Normal approximation to binomial experiment: Distribution of a
sample proportion is given by:
$E(\hat{P}) = p$
$V(\hat{P}) = \sigma_{\hat{p}}^2 = \frac{p(1-p)}{n}$ or $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$.
Also, the variable $Z = \frac{\hat{P} - p}{\sqrt{p(1-p)/n}}$ is approximately standard normal,
provided that n is large. (i.e. both $np \ge 5$ and $n(1-p) \ge 5$)
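For the bill-support example (p = 0.48, n = 100), the questions posed at the start of this section can be answered with the normal approximation. A minimal sketch, with illustrative variable names:

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.48, 100

mean_phat = p                              # E(p-hat) = p
sd_phat = sqrt(p * (1 - p) / n)            # sqrt(p(1-p)/n)

# Rule of thumb for the normal approximation: np >= 5 and n(1-p) >= 5
approx_ok = n * p >= 5 and n * (1 - p) >= 5

# Chance that the sample shows a majority even though only 48% support the bill
p_majority = 1 - NormalDist(mean_phat, sd_phat).cdf(0.50)

print(round(sd_phat, 4), approx_ok, round(p_majority, 4))
```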
Ex. 9.30, 34
9.3 Sampling distribution of the difference between two means
For two separate populations A and B (say, two large schools), the
SAT scores of individual students follow $N(550, 50^2)$ and $N(500, 50^2)$
distributions, respectively. In other words, if we let $X_1$ and $X_2$
denote the respective random variables, $X_1 \sim N(550, 50^2)$ and $X_2 \sim N(500, 50^2)$.
If we randomly select 25 students each from populations
A and B, what is the distribution of the difference between the two
sample means, $\bar{X}_1 - \bar{X}_2$? In particular,
What’s the mean of $\bar{X}_1 - \bar{X}_2$?
What’s the standard deviation of $\bar{X}_1 - \bar{X}_2$?
What’s the distribution of $\bar{X}_1 - \bar{X}_2$?
What’s $P(\bar{X}_1 - \bar{X}_2 > 60)$?
How do the above change if $X_1$ and $X_2$ don’t follow a normal
distribution?
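For independent random samples of size $n_1$ and $n_2$ drawn from
two normal populations $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, respectively, the
difference of sample means $\bar{X}_1 - \bar{X}_2$ has a normal distribution. Even
when the two original distributions are not normal, the distribution of
$\bar{X}_1 - \bar{X}_2$ is approximately normal if both $n_1$ and $n_2$ are large (say both
$n_1 \ge 30$ and $n_2 \ge 30$). Also,
$$\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2 \quad \text{and} \quad \sigma^2_{\bar{X}_1 - \bar{X}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$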
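Applied to the two-school example (N(550, 50^2) and N(500, 50^2), 25 students each), the mean and variance formulas give the following. This is a sketch with illustrative variable names.

```python
from math import sqrt
from statistics import NormalDist

mu1, sigma1, n1 = 550, 50, 25
mu2, sigma2, n2 = 500, 50, 25

mean_diff = mu1 - mu2                                  # mean of Xbar1 - Xbar2
sd_diff = sqrt(sigma1**2 / n1 + sigma2**2 / n2)        # variances add for independent samples

p_over_60 = 1 - NormalDist(mean_diff, sd_diff).cdf(60) # P(Xbar1 - Xbar2 > 60)

print(mean_diff, round(sd_diff, 3), round(p_over_60, 4))
```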
Ex. 9.45, 46
10 Intro to estimation
So far, we assumed known parameters and studied the sampling
distribution of various statistics.
What if we don’t know the values of the parameters but have observed a
single value of a statistic?
We want to say something about the parameters.
For a certain population (say, a large school), the student’s SAT
score is normally distributed with mean 500 and standard deviation
50.
If we randomly sample 25 students from the school and have them
take SAT, what can we say about the distribution of the sample
mean SAT score? In particular,
A certain school has the population mean score of 500 and standard
deviation of 50. If we randomly sample 25 students from the school
and have them take SAT, what can we say about the distribution of
the sample mean SAT score? In particular, ….
A certain school has the unknown population mean score $\mu$ and
standard deviation of 50. When we randomly sampled 25 students
from the school and had them take SAT, we observed $\bar{x}$ = 520.
What can we say about the distribution of the sample mean SAT
score?
A certain school has the unknown population mean score $\mu$ and
unknown standard deviation $\sigma$. When we randomly sampled 100
students from the school and had them take SAT, we observed $\bar{x}$ =
52 and s = 45. What can we say about the distribution of the sample
mean SAT score?
Two general procedures for inference: estimation and hypothesis
testing
10.1 Concept of estimation
A “point estimator” draws inferences about a population by
estimating the value of an unknown parameter using a single value
or point
An “interval estimator” draws inferences about a population by
estimating the value of an unknown parameter using an interval
E.g. the mean weekly income of a sample of 25 students is $400. (can
also use $380-$420)
An “unbiased estimator” of a population parameter is an estimator
whose expected value is equal to that parameter
An unbiased estimator is said to be “consistent” if the difference
between the estimator and the parameter grows smaller as the
sample size grows larger
$\bar{X}$ is a consistent estimator of $\mu$
10.2 Estimating the population mean when the population SD is
known
In general, a confidence interval is of the form:
(the estimate) ± (a constant) × (standard error of the estimate)
E.g. (SAT score) We believe $X \sim N(\mu, 50^2)$. For a certain school, if $\bar{x}$
= 520 for n = 25, what is the 95% CI for $\mu$? 90% CI? 99% CI?
The $100(1-\alpha)\%$ confidence interval estimator of the unknown population
mean $\mu$ is
$$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right), \quad \text{or} \quad \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
“Lower confidence limit” and “upper confidence limit”
The probability $1-\alpha$ is called the “confidence level”
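For the SAT example (sample mean 520, known SD 50, n = 25) the interval formula can be evaluated at the three confidence levels. A minimal sketch; `z_interval` is a hypothetical helper name and the critical values come from `NormalDist.inv_cdf` rather than the table.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, level):
    """Confidence interval for the mean when the population SD is known."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # z_{alpha/2}
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

intervals = {level: z_interval(520, 50, 25, level) for level in (0.90, 0.95, 0.99)}

for level, (lcl, ucl) in intervals.items():
    print(f"{level:.0%}: ({lcl:.2f}, {ucl:.2f})")
```

Higher confidence levels use a larger critical value and therefore give wider intervals.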
Table of ($100(1-\alpha)\%$, $\alpha$, $\alpha/2$, $z_{\alpha/2}$) for 90%, 95%, 99% confidence
levels
Why? : P[$100(1-\alpha)\%$ confidence interval containing $\mu$] = $(1-\alpha)$
The variable $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is standard normal or approximately
standard normal
CI is random, but $\mu$ is not.
Interpreting the CI: It’s important to realize that we observe only one
sample and only one value of $\bar{x}$. Cannot be correct all the time. Aim
to be correct 95% of time.
The sampling error of the $100(1-\alpha)\%$ confidence interval is $z_{\alpha/2}\,\sigma/\sqrt{n}$.
We want {larger, smaller} sampling error, or {wider, narrower} CI
Larger $\sigma$ leads to {narrower, wider} interval
Increasing the confidence level $100(1-\alpha)\%$ leads to {narrower,
wider} interval
Increasing sample size n leads to {narrower, wider} interval
Ex. 10.9, 10, 11, 15, 21
10.3 Selecting the sample size
Sample size to estimate the mean within W at the $(1-\alpha)$ confidence level:
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{W}\right)^2$$
Ex. In estimating the population mean SAT score, I want to estimate
it within W=10 at confidence level = .90, i.e. alpha = .1. How many
samples do I need?
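The sample-size formula can be sketched as below. The exercise does not state the population SD, so the value of 50 used here is an assumption carried over from the earlier SAT examples; `sample_size_for_mean` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, width, confidence):
    """Smallest n such that the sampling error z_{alpha/2} * sigma / sqrt(n) is at most `width`."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z * sigma / width) ** 2)   # always round up

# sigma = 50 is an assumed value, not given in the exercise
n_needed = sample_size_for_mean(sigma=50, width=10, confidence=0.90)
print(n_needed)
```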
Ex. 10.41, 42, 51
11 Introduction to hypothesis testing
Is there enough statistical evidence to enable us to conclude that
a belief or hypothesis about a parameter is supported by the data?
11.1 Concepts of hypothesis testing
E.g. Does a particular school A have a mean SAT score greater than
the national average of $\mu_0$ = 500? We assume $X \sim N(\mu, 50^2)$ and just
observed $\bar{x}$ = 510 for n=100.
The null hypothesis H0:
“the private school has the same mean SAT score as the
national average of 500 (usually specified as the status quo)”
The alternative hypothesis H1:
“the private school has the mean SAT score higher than 500”
There are two possible decisions: accept H0 or reject H0.
Truth \ Decision | Accept H0 | Reject H0
H0 true | correct | Type I error
H1 true | Type II error | correct
More common to say “cannot reject H0” than “Accept H0”
The decision is either correct or wrong. When the decision is
wrong, we commit either:
Type I error: wrongfully reject H0
$\alpha$ = P(Type I error)
Type II error: wrongfully accept H0
$\beta$ = P(Type II error)
The type I error probability of a certain testing procedure is called
the significance level of the test, or sometimes just level of the
test and is written as $\alpha$.
The test statistic: a statistic which we base our decision upon
“The observed sample mean score”
If the value of test stat is inconsistent with H0 (and more
consistent with H1), we reject H0.
“Sufficient evidence” = “evidence beyond a reasonable doubt”
We use “sampling distribution” of the test statistic to decide
how sufficient the evidence is
The rejection region is a range of values such that if the test
statistic falls into that range, we reject the H0 in favor of H1
Testing begins by assuming H0 is true. We reject the H0 if the test
statistic has the value that is inconsistent with H0 but is consistent
with H1. But how inconsistent does it have to be for us to reject it?
That’s up to us. How aggressive do we want to be in rejecting the
null?
Aggressive
More likely to reject the correct H0
More likely to commit type I error
Test with a larger $\alpha$
Conservative
Less likely to reject the correct H0
Less likely to commit type I error
Test with a smaller $\alpha$
A particular decision rule (test) is obtained by deciding on level $\alpha$,
the type I error probability we are willing to accept. Typically, $\alpha$ = 0.05
= 5% is used.
The conclusion of the test is stated either as:
“We reject H0 at 5% significance level”
“We cannot reject H0 at 5% significance level”, etc.
Somehow, we don’t say
“accept H0 at 5% significance level” or
“accept H1 at 5% significance level”.
The P-value of a test is the probability of observing a test statistic
at least as extreme as the one observed given that H0 is true.
In the example, H0: $\mu$ = 500. The alternative H1 could be of the
form:
H1: $\mu$ > 500
H1: $\mu$ < 500
H1: $\mu$ ≠ 500
Testing either of the first two is called one-sided hypothesis
testing and testing the third is called two-sided hypothesis
testing.
11.2 Testing the population mean when the population standard
deviation is known
Recall the example: Does a particular school A have a mean SAT
score greater than the national average of $\mu_0$ = 500? We assume
$X \sim N(\mu, 50^2)$ and just observed $\bar{x}$ = 510 for n=100.
For testing H0: $\mu = \mu_0$ vs. H1: $\mu > \mu_0$,
the test at level $\alpha$ rejects H0 if
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} > z_\alpha$$
The p-value = P(Z > z)
That test has type I error probability = $\alpha$.
If H0 is rejected, we say “the result is statistically significant at
significance level $\alpha$”
Can we reject H0 at $\alpha$ = .10?
At $\alpha$ = .05?
At $\alpha$ = .01?
($z_\alpha$ = 1.28, 1.64, 2.33 for 10%, 5%, 1%)
P-value = P(Z > observed z)
In the example, P-value = $P(\bar{X} > 510$, given that H0 is true)
= P(Z>2) = .0228
A large P-value means the data are consistent with H0
A small P-value is evidence against H0 and in favor of H1
The level $\alpha$ hypothesis test is equivalent to “Reject H0 if p is
{ > or < } $\alpha$”
It is a better practice to report the P-value than just “accept” or
“reject”
The test at level $\alpha$ rejects H0 if and only if P-value < $\alpha$.
Equivalently, P-value is the smallest significance level at which a
test can reject H0.
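The school A example (sample mean 510, n = 100, SD 50, null mean 500) can be worked through in a few lines. A sketch only: `upper_tail_z_test` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_z_test(xbar, mu0, sigma, n):
    """z statistic and P-value for H0: mu = mu0 vs. H1: mu > mu0, sigma known."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    p_value = 1 - NormalDist().cdf(z)      # P(Z > observed z)
    return z, p_value

z, p_value = upper_tail_z_test(510, 500, 50, 100)

# Reject H0 at level alpha exactly when the P-value is below alpha
decisions = {alpha: p_value < alpha for alpha in (0.10, 0.05, 0.01)}

print(z, round(p_value, 4))
print(decisions)
```

Reporting the P-value lets the reader apply any significance level, which is why it is preferred over a bare reject/cannot-reject statement.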
For testing H0: $\mu = \mu_0$ vs. H1: $\mu < \mu_0$,
the test at level $\alpha$ rejects H0 if
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} < -z_\alpha.$$
The p-value = P(Z < z)
E.g. For another school B, the test score $X \sim N(\mu, 50^2)$. Is school B
significantly worse than the national average?
We observed $\bar{x}$ = 492.5 for n = 100.
Can we reject H0 at $\alpha$ = 0.05?
What’s the P-value?
P-value = P(Z<-1.5)= 0.0668
For testing H0: $\mu = \mu_0$ vs. H1: $\mu \ne \mu_0$,
the test at level $\alpha$ rejects H0 if
$$|z| = \left|\frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}\right| > z_{\alpha/2}.$$
(Or, equivalently, if $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$)
The p-value = 2 P(Z>|z|)
E.g. For another school C, the test score $X \sim N(\mu, 50^2)$. Is school C
significantly different from the national average? Observed $\bar{x}$ =
492.5 for n=100.
Can we reject H0 at $\alpha$ = 0.05?
What’s the P-value?
P-value = 2 P(Z>1.5) = 2(0.0668) = 0.1336
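The lower-tail and two-sided P-values for an observed mean of 492.5 (n = 100, SD 50, null mean 500) can be checked numerically. A short sketch with illustrative variable names:

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar = 500, 50, 100, 492.5
z = (xbar - mu0) / (sigma / sqrt(n))          # -1.5

Z = NormalDist()
p_lower = Z.cdf(z)                            # H1: mu < mu0, P(Z < z)
p_two_sided = 2 * (1 - Z.cdf(abs(z)))         # H1: mu != mu0, 2 P(Z > |z|)

print(z, round(p_lower, 4), round(p_two_sided, 4))
```

The same data give a two-sided P-value twice as large as the one-sided one, so neither test rejects at the 5% level here.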
Interpreting the test results
Ex. 11.7-9, 13-15, 28
11.3 What about type II error probability?
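Recall that a test at level $\alpha$ of H0: $\mu = \mu_0$ vs. H1: $\mu > \mu_0$ rejects if
$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} > z_\alpha$. Let’s suppose that indeed H1 is true, specifically, the
true $\mu = \mu_1 > \mu_0$. Then
$\beta$ = P(Type II error when $\mu = \mu_1$)
= P(Not reject H0 when $\mu = \mu_1$)
= $P\left(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \le z_\alpha \text{ when } \mu = \mu_1\right)$
= $P\left(\bar{X} \le \mu_0 + z_\alpha\,\sigma/\sqrt{n} \text{ when } \mu = \mu_1\right)$
= $P\left(\frac{\bar{X} - \mu_1}{\sigma/\sqrt{n}} \le z_\alpha + \frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} \text{ when } \mu = \mu_1\right)$
= $P\left(Z \le z_\alpha + \frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}}\right)$
$\beta$: increases as $\mu_1$ (> $\mu_0$) gets closer to $\mu_0$ (i.e., as the problem
becomes harder)
Decreasing Type I error (smaller $\alpha$) leads to larger type II error
There’s no free lunch
If n increases, Type II error decreases for the given $\alpha$
Since there is more information
Power of the test = 1 − P(Type II error) = P(correctly rejecting the
null)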
OC (operating characteristic) curve
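The final expression for the type II error probability is easy to evaluate. In the sketch below the alternative mean of 510 is a hypothetical value chosen for illustration (the notes do not fix one), and `type_ii_error` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu0, mu1, sigma, n, alpha):
    """beta = P(Z <= z_alpha + (mu0 - mu1) / (sigma / sqrt(n))) for the upper-tail z test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    shift = (mu0 - mu1) / (sigma / sqrt(n))
    return NormalDist().cdf(z_alpha + shift)

# Hypothetical alternative: true mean 510 when H0 says 500 (sigma = 50, n = 100)
beta = type_ii_error(500, 510, 50, 100, 0.05)
power = 1 - beta

beta_closer = type_ii_error(500, 505, 50, 100, 0.05)    # mu1 closer to mu0: harder problem
beta_bigger_n = type_ii_error(500, 510, 50, 400, 0.05)  # larger sample: more information
beta_stricter = type_ii_error(500, 510, 50, 100, 0.01)  # smaller alpha: no free lunch

print(round(beta, 4), round(power, 4))
print(round(beta_closer, 4), round(beta_bigger_n, 4), round(beta_stricter, 4))
```

Evaluating beta over a range of alternative means traces out the operating characteristic curve.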
Ex. 11.48, 49, 61 (??)
11.4 The road ahead
You’re pretty much done for the quarter!
Three steps
Define the problem
Identify the appropriate method
Interpret the results
Describe a population
Compare two populations
Compare two or more populations
Analyze the relationship between two variables
12 Inference about a population
12.1 Inferences about population mean for a normal population,
$\sigma$ unknown
Does a particular school have a mean SAT score greater than the
national average of $\mu_0$ = 500? We assume $X \sim N(\mu, \sigma^2)$ with unknown
$\sigma$. We just observed $\bar{x}$ = 510 and s = 45.0 for n = 25.
$100(1-\alpha)\%$ confidence interval for $\mu$, $\sigma$ unknown:
$$\bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$$
The test statistic is $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$, which follows the student-t distribution with n-1 degrees of freedom under H0.
For H0: $\mu = \mu_0$ vs. H1: $\mu > \mu_0$,
test at significance level $\alpha$ rejects H0 if
$t > t_{\alpha,n-1}$
The P-value is P(T > computed t)
For H0: $\mu = \mu_0$ vs. H1: $\mu < \mu_0$,
test at significance level $\alpha$ rejects H0 if
$t < -t_{\alpha,n-1}$
The P-value is P(T < computed t)
For H0: $\mu = \mu_0$ vs. H1: $\mu \ne \mu_0$,
test at significance level $\alpha$ rejects H0 if
$|t| > t_{\alpha/2,n-1}$
(Or, equivalently, if $t < -t_{\alpha/2,n-1}$ or $t > t_{\alpha/2,n-1}$)
The P-value is 2P(T > |computed t|)
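For the example with sample mean 510, s = 45.0 and n = 25, the t statistic and interval can be computed as below. The Python standard library has no Student-t quantile function, so the two critical values are hard-coded from a standard t table (24 degrees of freedom); that substitution is an assumption of this sketch.

```python
from math import sqrt

xbar, s, n, mu0 = 510, 45.0, 25, 500

se = s / sqrt(n)                       # estimated standard error = 9
t_stat = (xbar - mu0) / se             # test statistic with n - 1 = 24 df

# Critical values read from a standard t table for 24 degrees of freedom
T_05_24 = 1.711                        # t_{.05,24}
T_025_24 = 2.064                       # t_{.025,24}

reject_upper = t_stat > T_05_24        # H1: mu > 500 at the 5% level
ci_95 = (xbar - T_025_24 * se, xbar + T_025_24 * se)

print(round(t_stat, 3), reject_upper)
print(tuple(round(v, 2) for v in ci_95))
```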
The effect of non-normality on the inference based on t distribution
What kind of non-normality? (skewed, heavy tailed)
Effect on the power, level of test, etc.
Consider “robust methods” of estimation and inference
Checking required condition
Normality by histogram; if n is large, OK.
Ex. 12.1, 2d, 3d, 4d, 8a, 9, 13, 21
12.2 Inference about a population variance
Skip
12.3 Inference about a population proportion
E.g. from an exit poll of n = 765 voters, x = 407 people were
observed to have voted for a bill.
What’s p̂ ?
What’s the 95% Confidence interval?
For H0: p = .5 vs. H1: p > .5, can we reject H0 at $\alpha$ = 5%?
$\hat{p} = x/n$
If $np \ge 5$ and $n(1-p) \ge 5$, the distribution of $\hat{p}$ can be approximated by
$$N(\mu_{\hat{p}}, \sigma_{\hat{p}}^2) = N\left(p, \frac{p(1-p)}{n}\right)$$
$100(1-\alpha)\%$ confidence interval for the population proportion p is
given by
$$\hat{p} \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{p}} \quad \text{where} \quad \hat{\sigma}_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
For H0: p = p0 vs. H1: p > p0, the test at level $\alpha$ rejects H0 if
$$z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}} > z_\alpha,$$
where $\sigma_{\hat{p}} = \sqrt{p_0(1-p_0)/n}$ is evaluated at p0.
P-value = P(Z > z)
For H0: p = p0 vs. H1: p < p0, the test at level $\alpha$ rejects H0 if
$$z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}} < -z_\alpha.$$
P-value = P(Z < z)
For H0: p = p0 vs. H1: p ≠ p0, the test at level $\alpha$ rejects H0 if
$$|z| = \left|\frac{\hat{p} - p_0}{\sigma_{\hat{p}}}\right| > z_{\alpha/2}.$$
(Or, equivalently, if $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$)
P-value = 2 P(Z > |z|)
Sample size for estimating p within W at confidence level $1-\alpha$:
$$n = \left(\frac{z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})}}{W}\right)^2.$$
Conservative estimate is given by the formula for $\hat{p}$ = ½.
Wish to estimate the above proportion within .03. What’s the required
n?
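The exit-poll questions (n = 765, x = 407) can be worked through as follows. A sketch with illustrative variable names; the confidence interval uses the estimated proportion in its standard error, while the test uses the null value 0.5.

```python
from math import ceil, sqrt
from statistics import NormalDist

x, n, p0 = 407, 765, 0.5
p_hat = x / n

z975 = NormalDist().inv_cdf(0.975)                 # z_{.025}, about 1.96

# 95% confidence interval: standard error uses p-hat
se_hat = sqrt(p_hat * (1 - p_hat) / n)
ci_95 = (p_hat - z975 * se_hat, p_hat + z975 * se_hat)

# Test of H0: p = .5 vs. H1: p > .5: standard error uses p0
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 1 - NormalDist().cdf(z)
reject = p_value < 0.05

# Conservative sample size (p-hat = 1/2) to estimate p within W = .03 at 95% confidence
W = 0.03
n_needed = ceil((z975 * sqrt(0.5 * 0.5) / W) ** 2)

print(round(p_hat, 4), tuple(round(v, 4) for v in ci_95))
print(round(z, 3), round(p_value, 4), reject)
print(n_needed)
```

Note that the one-sided test rejects at 5% even though the two-sided 95% interval still contains 0.5; the two procedures answer slightly different questions.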
Ex. 12.54, 58, 66
13 Inference about comparing two populations
Keywords: pooled variance estimator; equal-variances test statistic
How to tell if two variances are equal? Formal methods exist, but an
informal method would be fine for now.
If there is no strong evidence against equal variances, it’s usually
“better” to assume equal variances. (why?)
Checking required conditions
Draw a histogram to check normality; if the sample size is large, we’re
OK; this t-test is robust too; if not normal, there are nonparametric
methods
E.g. comparing mean SAT score for school 1 and school 2
$X_1 \sim N(\mu_1, \sigma_1^2)$, $X_2 \sim N(\mu_2, \sigma_2^2)$
For n1=25 and n2=25 samples, $\bar{x}_1$ = 530; $\bar{x}_2$ = 500 and s1=90 and
s2=120
Assuming equal variances; sp=106; denom=30; 95% CI for $\mu_1 - \mu_2$? (-30, 90); df=48; P-value = .16
Assuming unequal variances; nu=44.5 or 45 or 44; 95% CI for
$\mu_1 - \mu_2$ = (-30, 90) (slightly larger than the previous one)
13.1 Inference about the difference between two means:
independent samples
Consider three cases
Case 1. Both population distributions are normally distributed
with $\sigma_1 = \sigma_2$
Case 2. Both sample sizes n1 and n2 are large
Case 3. The sample sizes n1 or n2 are small and the population
distributions are non-normal
Concentrate on Case 1 for now
Population distributions are normal with equal variances
Statistics: $\bar{y}_1, \bar{y}_2, s_1, s_2$
Confidence interval for $\mu_1 - \mu_2$, independent samples:
$$\bar{y}_1 - \bar{y}_2 \pm t_{\alpha/2}(n_1 + n_2 - 2)\, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \quad \text{where} \quad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
Why? (sampling distribution of $\bar{y}_1 - \bar{y}_2$)
Reasonably stable for mound-shaped distributions and
approximately equal SD
A statistical test for $\mu_1 - \mu_2$, independent samples:
H0: $\mu_1 - \mu_2 = D_0$
Ha: $\mu_1 - \mu_2 > D_0$
(D0 is a specified value, often 0)
T.S. : $t = \frac{\bar{y}_1 - \bar{y}_2 - D_0}{s_p\sqrt{1/n_1 + 1/n_2}}$
R.R. : for a level $\alpha$, reject H0 if $t > t_\alpha(n_1 + n_2 - 2)$
Check assumptions and draw conclusions
Test whether the mean score of school 1 is higher than school 2.
Use $\alpha$=.05.
P-value of the test?
95% confidence interval on the difference of means?
sp=?
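The school 1 vs. school 2 numbers quoted at the start of this chapter (n1 = n2 = 25, means 530 and 500, SDs 90 and 120) can be reproduced as follows. The critical value 2.011 for 48 degrees of freedom is an approximate table value hard-coded for this sketch, since the standard library has no t quantile function.

```python
from math import sqrt

n1, n2 = 25, 25
xbar1, xbar2 = 530, 500
s1, s2 = 90, 120

# Pooled (equal-variances) analysis
sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
se_pooled = sp * sqrt(1 / n1 + 1 / n2)          # the "denom" in the notes
t_stat = (xbar1 - xbar2 - 0) / se_pooled        # D0 = 0
df = n1 + n2 - 2

T_025_48 = 2.011                                # approximate t_{.025,48} from a t table
ci_95 = (xbar1 - xbar2 - T_025_48 * se_pooled,
         xbar1 - xbar2 + T_025_48 * se_pooled)

# Degrees of freedom for the unequal-variances (approximate) t-test
a, b = s1**2 / n1, s2**2 / n2
welch_df = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

print(round(sp, 1), round(se_pooled, 1), round(t_stat, 2), df)
print(tuple(round(v, 1) for v in ci_95), round(welch_df, 1))
```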
Three critical conditions
Two random samples are independent
Population distributions are normal or mound-shaped
Two population variances are equal
Approximate t-test for independent samples, unequal variance
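T.S. $t' = \frac{(\bar{y}_1 - \bar{y}_2) - D_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ and d.f. is
$$df = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$
[ or $df = \frac{(n_1 - 1)(n_2 - 1)}{(1 - c)^2(n_1 - 1) + c^2(n_2 - 1)}$ where $c = \frac{s_1^2/n_1}{s_1^2/n_1 + s_2^2/n_2}$ (round down to the
nearest integer)]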
Similar for confidence interval.
Ex. 13.1a, 2a, 3a, 5b, 7
13.2 Observational and experimental (controlled study) data
The latter is more expensive but can shed more light on causality
E.g. Slytherin may not be a better school than Gryffindor; it may be
that simply there are more good students going there; what kind of
experimental study would be possible?
13.3 Inference about the difference between two means: matched
pairs experiment
What if each measurement in one sample is “matched” or “paired”
with a particular measurement in the other sample?
E.g. comparing repair estimates from two garages for each of 15
cars damaged by accidents
Two-sample t-test gives a nonsense result. Why?
Also called ‘paired t-test’ (more common than ‘matched t-test’)
Ask: does some natural relationship exist between each pair of
observations?
E.g. SAT score for before-and-after attending certain prep school;
Not the two-sample t-test. But run the regular t-test on the
differences
Solution: use differences $d_i = y_{1i} - y_{2i}$ and obtain their sample mean and
SD, $\bar{d}$, $s_d$.
Test hypotheses about $\mu_d = \mu_1 - \mu_2$
Paired t test
H0: $\mu_d = D_0$
Ha: $\mu_d > D_0$
(D0 is a specified value, often 0)
T.S. : $t = \frac{\bar{d} - D_0}{s_d/\sqrt{n}}$
R.R. : for a level $\alpha$, reject H0 if $t > t_\alpha(n - 1)$
Check assumptions and draw conclusions
Confidence interval for $\mu_d$ based on paired data: $\bar{d} \pm t_{\alpha/2}(n - 1)\frac{s_d}{\sqrt{n}}$
Of course, assuming
the distribution of the $d_i$’s is (close to) a normal distribution
the differences are independent
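A paired analysis reduces to a one-sample t-test on the differences. The sketch below uses made-up repair estimates for five cars (the notes give no data, so these numbers are purely hypothetical), and the critical value for 4 degrees of freedom is hard-coded from a standard t table.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical repair estimates (in dollars) for the same five cars at two garages
garage_1 = [760, 1020, 950, 1130, 1300]
garage_2 = [730, 910, 840, 1150, 1270]

# Work with the per-car differences, not the two samples separately
d = [a - b for a, b in zip(garage_1, garage_2)]
n = len(d)
d_bar = mean(d)
s_d = stdev(d)                              # sample SD (divides by n - 1)

t_stat = (d_bar - 0) / (s_d / sqrt(n))      # D0 = 0

T_05_4 = 2.132                              # t_{.05,4} from a standard t table
reject = t_stat > T_05_4                    # Ha: mu_d > 0 at the 5% level

print(d, d_bar, round(s_d, 2))
print(round(t_stat, 3), reject)
```

Differencing removes the car-to-car variation, which is what a two-sample t-test on the same data would wrongly treat as noise.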
13.4 Inference about the ratio of two variances
Skip
13.5 Inference about the difference between two population
proportions
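Use $\hat{\pi}_1 = y_1/n_1$ and $\hat{\pi}_2 = y_2/n_2$
Confidence intervals for $\pi_1 - \pi_2$ are given by $\hat{\pi}_1 - \hat{\pi}_2 \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{\pi}_1 - \hat{\pi}_2}$ where
$$\hat{\sigma}_{\hat{\pi}_1 - \hat{\pi}_2} = \sqrt{\frac{\hat{\pi}_1(1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2(1 - \hat{\pi}_2)}{n_2}}$$
Statistical test for H0: $\pi_1 - \pi_2 = 0$ etc. is based on
$$z = \frac{\hat{\pi}_1 - \hat{\pi}_2}{\sqrt{\frac{\hat{\pi}_1(1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2(1 - \hat{\pi}_2)}{n_2}}}.$$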
E.g. accident rate of vehicles with ABS and those without ABS
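As an illustration, the sketch below applies these formulas to the accident counts that appear in the practice final of these notes (42 of 500 cars without ABS, 38 of 500 with ABS). Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

y1, n1 = 42, 500        # accidents among cars without ABS
y2, n2 = 38, 500        # accidents among cars with ABS

pi1, pi2 = y1 / n1, y2 / n2
se = sqrt(pi1 * (1 - pi1) / n1 + pi2 * (1 - pi2) / n2)

z = (pi1 - pi2) / se                          # test statistic for a zero difference
p_value_upper = 1 - NormalDist().cdf(z)       # Ha: pi1 - pi2 > 0

z975 = NormalDist().inv_cdf(0.975)
ci_95 = (pi1 - pi2 - z975 * se, pi1 - pi2 + z975 * se)

print(round(pi1, 3), round(pi2, 3), round(z, 3), round(p_value_upper, 3))
print(tuple(round(v, 4) for v in ci_95))
```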
14 Statistical inference: review of chapters 12 and 13
Graphical methods and numerical measures for univariate data
Variable type (X) | Methods
Categorical | Frequency, relative frequency, $\hat{p}$; bar-chart, pie-chart
Interval | $\bar{x}$, median, s, percentiles; histogram, boxplot, (stem-and-leaf, ogive)
Graphical methods and numerical measures for bivariate data
X \ Y | Y categorical | Y interval
X categorical | contingency table, $\hat{p}_1$, $\hat{p}_2$; bar-chart | $\bar{x}_1$, $\bar{x}_2$, $s_1$, $s_2$; side-by-side boxplots
X interval | ? | r, Cov(X,Y), $\hat{y} = b_0 + b_1 x$; scatter plot
Univariate statistical inference techniques
Variable type (X) | Methods
Categorical | One-sample proportion
Interval | One-sample z-test; one-sample t-test
Bivariate statistical inference techniques
X \ Y | Y categorical | Y interval
X categorical | Two-sample proportion; Chi-squared analysis | Two-sample t-test; ANOVA
X interval | Logistic regression | Regression
15 Analysis of variance
Skip
16 Chi-squared tests
Skip
17 Simple linear regression and correlation
Regression analysis is used to predict the value of one variable (Y)
on the basis of other variables ( X 1 , X 2 ,... X k )
E.g. midterm score vs. final score for a class
17.1 Model
Simple linear regression model:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where $\varepsilon$ is the “error variable”
17.2 Estimating the coefficients
Least squares regression line is obtained by finding b0, b1 which
minimize $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $\hat{y}$, the (predicted) value of y, is
determined by the line
$$\hat{y} = b_0 + b_1 x$$
“Least squares line coefficients” are given by
$$b_1 = \frac{s_{xy}}{s_x^2} \quad \text{and} \quad b_0 = \bar{y} - b_1\bar{x}.$$
See the old notes for formula for $s_{xy}$, $s_x$, etc.
E.g. Computing the regression line from basic statistics
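Computing the line from basic statistics can be sketched as follows. The five midterm/final score pairs are hypothetical values invented for illustration; only the formulas come from the notes.

```python
from statistics import mean

# Hypothetical (midterm, final) scores for five students
midterm = [40, 45, 50, 55, 60]
final = [60, 70, 75, 70, 75]

n = len(midterm)
x_bar, y_bar = mean(midterm), mean(final)

# Sample covariance and sample variance of x (both divide by n - 1)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(midterm, final)) / (n - 1)
s_x2 = sum((x - x_bar) ** 2 for x in midterm) / (n - 1)

b1 = s_xy / s_x2                 # slope
b0 = y_bar - b1 * x_bar          # intercept

print(s_xy, s_x2)
print(round(b1, 3), round(b0, 3))   # fitted line: y-hat = b0 + b1 * x
```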
17.3 Required conditions on the error variable
Conditions:
The probability distribution of $\varepsilon$ is normal
$E(\varepsilon)$=0
The standard deviation of $\varepsilon$ is $\sigma_\varepsilon$, which is constant regardless of
the value of x.
The values of $\varepsilon$ for different observations are independent
(Or, people say $\varepsilon_i \sim$ iid $N(0, \sigma_\varepsilon^2)$)
17.4 Assessing the model
How well does our model fit the data? (There may be no relationship
at all!)
“The sum of squares for error (SSE)” can also be computed as
$$SSE = \sum_i (y_i - \hat{y}_i)^2 = (n - 1)\left(s_y^2 - \frac{s_{xy}^2}{s_x^2}\right)$$
The {smaller, larger} SSE suggests more accurate model
“The standard error (SE) of estimate” is an estimate of $\sigma_\varepsilon$, given by
$$s_\varepsilon = \sqrt{\frac{SSE}{n - 2}}$$
The {smaller, larger} $s_\varepsilon$ suggests more accurate model
To formally test whether the slope is non-zero, do the following:
H0: $\beta_1$=0 vs. H1: $\beta_1$≠0
Test statistic is given by
$$t = \frac{b_1 - 0}{s_{b_1}} \quad \text{where} \quad s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n - 1)s_x^2}}$$
If assumptions regarding error variable hold: Under H0, t follows
student t distribution with $\nu$=n-2 degrees of freedom
At significance level $\alpha$, reject H0 if $|t| > t_{\alpha/2}(n-2)$
$100(1-\alpha)\%$ C.I. for $\beta_1$ is given by
$$b_1 \pm t_{\alpha/2}(n - 2)\, s_{b_1}$$
Coefficient of determination is given by
$$R^2 = \frac{s_{xy}^2}{s_x^2 s_y^2} = 1 - \frac{SSE}{\sum_i (y_i - \bar{y})^2} = r^2$$
=(explained variation in y)/(total variation in y)
= (Regression SS)/(Total SS)
The {higher, lower} value of $R^2$ means better fit of the linear
model
Need to be able to extract SOME information from Minitab output
Typical disclaimer: correlation doesn’t imply causality
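The assessment quantities can all be computed from summary statistics. The sketch below uses hypothetical values (n = 5, sample variances 62.5 and 37.5, sample covariance 37.5) chosen for illustration, with the critical value for 3 degrees of freedom hard-coded from a standard t table.

```python
from math import sqrt

# Hypothetical summary statistics for n = 5 (x, y) pairs
n = 5
s_x2, s_y2, s_xy = 62.5, 37.5, 37.5

b1 = s_xy / s_x2                                  # least squares slope

sse = (n - 1) * (s_y2 - s_xy**2 / s_x2)           # sum of squares for error
s_eps = sqrt(sse / (n - 2))                       # standard error of estimate
s_b1 = s_eps / sqrt((n - 1) * s_x2)               # standard error of the slope

t_stat = (b1 - 0) / s_b1                          # test of H0: beta1 = 0
T_025_3 = 3.182                                   # t_{.025,3} from a standard t table
reject = abs(t_stat) > T_025_3

r2 = s_xy**2 / (s_x2 * s_y2)                      # coefficient of determination
r2_alt = 1 - sse / ((n - 1) * s_y2)               # same value via 1 - SSE / total SS

print(sse, round(s_eps, 3), round(s_b1, 4))
print(round(t_stat, 3), reject, round(r2, 3), round(r2_alt, 3))
```

With only five observations, even a slope that explains 60% of the variation is not significant at the 5% level.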
17.5 (Optional) Applications in finance
17.6 Using the regression equation
E.g. Things we really care about: Predicting the final score from the
midterm score
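Predicting the particular value of y for a given x:
$$\hat{y} \pm t_{\alpha/2}(n - 2)\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n - 1)s_x^2}}$$
Estimating the expected value of y for a given x:
$$\hat{y} \pm t_{\alpha/2}(n - 2)\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n - 1)s_x^2}}$$
These intervals get {wider, narrower} as $x_g$ moves away from $\bar{x}$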
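The two interval formulas differ only by the extra 1 under the square root. The sketch below compares their half-widths using hypothetical fitted values (intercept 40, slope 0.6, standard error of estimate sqrt(20), n = 5, mean of x equal to 50, sample variance of x equal to 62.5); the t critical value for 3 degrees of freedom is taken from a standard table.

```python
from math import sqrt

# Hypothetical fitted model and summary statistics
n, x_bar, s_x2 = 5, 50, 62.5
b0, b1, s_eps = 40.0, 0.6, sqrt(20)
T_025_3 = 3.182                                   # t_{.025,3} from a standard t table

def half_widths(x_g):
    """Half-widths of the 95% prediction interval and of the interval for the expected value."""
    core = 1 / n + (x_g - x_bar) ** 2 / ((n - 1) * s_x2)
    predict = T_025_3 * s_eps * sqrt(1 + core)    # particular value of y
    expect = T_025_3 * s_eps * sqrt(core)         # expected value of y
    return predict, expect

y_hat = b0 + b1 * 55
pred_55, mean_55 = half_widths(55)
pred_50, mean_50 = half_widths(50)                # at the mean of x

print(round(y_hat, 1), round(pred_55, 2), round(mean_55, 2))
print(round(pred_50, 2), round(mean_50, 2))
```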
18 Multiple regression
Extension of the simple linear regression to multiple X variables
E.g. predicting final from midterm, quiz #1, quiz #2 scores
19 Logistics and things you/I really care about
Glossary
See the syllabus
19.1 Couple of words about quiz #2 and final
19.1.1
Quiz #2
n=25; sample mean = 23.64; sample median = 24; sample
SD=4.26
Correlation between quiz#2 and midterm score=0.572
19.1.2
Final
No need for cheat-sheet for part I (you will be given a formula sheet)
Make your own cheat-sheet for part II (covers chapter 8~)
Need Assist form for part I of the final!
A reminder that effective Winter Quarter 2006, Assist (“Scantron”)
forms will no longer be provided by the Statistics Department for the
STAT 1000 and STAT 2010 standardized tests.
Students can purchase Assist forms at the CSUEB Bookstore for 50
cents
INSTRUCTIONS FOR COMPLETING ASSIST FORMS:
1. Students should enter as much of their names as possible in the
“Your Last Name” boxes.
2. The two letters of their Net ID are entered in the "First Initial"
and "Middle Initial" boxes on that same line.
3. The four digits of their Net ID should be entered in the first four
boxes of the "Social Security Number" section.
19.2 Practice midterm (50 minutes)
…
19.3 Practice final (50 minutes)
Is the anti-lock brake system (ABS) in cars really effective? If it were
effective,
The number of accidents would decrease, and
The cost of accident repairs would be less
Data were collected on 500 cars with ABS and 500 cars without.
The number of cars involved in accidents was recorded, as was the
cost of repairs.
42 out of 500 cars without ABS had an accident and 38 out of 500
cars with ABS had an accident in a given year. What can we
conclude?
For the repair cost for the two groups, we obtain:
n1 =42, $\bar{x}_1$ = 2,075 and s1= 671
n2 =38, $\bar{x}_2$ = 1,714 and s2= 624
For the two situations above, perform:
Compute the 95% CI for the parameter of interest
Set up the null and alternative hypotheses
Compute test statistic and perform the test at 5% significance
level
Compute the P-value for the test (if you can)
STAT/MATH 6401, Advanced Probability I, Fall 2005 Course Note
Dr. Jaimie Kwon
June 27, 2017
Table of Contents
1
What is statistics? ...........................................................................3
2
Graphical and tabular descriptive statistics .....................................4
2.1
Types of data ...........................................................................4
2.2
Techniques for nominal data ....................................................5
2.3
Graphical techniques for interval data ......................................5
2.4
Describing the relationship b/w two variables ...........................7
2.5
Time series data .......................................................................8
3
Art and science of graphical presentations .....................................8
4
Numerical descriptive techniques ...................................................9
5
4.1
Measures of central location .....................................................9
4.2
Measures of variability ............................................................10
4.3
Measures of relative standing and box plots ...........................11
4.4
Measures of linear relationship ...............................................13
4.5
Comparing graphical and numerical techniques .....................15
4.6
General guidelines for exploring data .....................................15
Data collection and sampling ........................................................16
5.1
Methods of collecting data ......................................................16
- 87 -
STAT 2010, Business Stat
2006
Jaimie Kwon
5.2
Sampling ................................................................................16
5.3
Sampling plans .......................................................................17
5.4
Sampling and nonsampling errors ..........................................18
6
Probability ....................................................................................19
6.1
Assigning probability to events ...............................................19
6.2
Joint, marginal, and conditional probability .............................20
6.3
Probability rules and trees ......................................................23
6.4
Bayes’ Law .............................................................................24
6.5
Identifying the correct method ................................................24
7
Random variables and discrete probability distributions ...............25
7.1
Random variables and probability distributions.......................25
7.2
Bivariate distributions .............................................................28
7.3
Binomial distribution ...............................................................31
7.4
Poisson Distribution ................................................................32
8
Continuous probability distributions ..............................................33
8.1
Probability density functions ...................................................33
8.2
Normal distribution .................................................................34
8.3
Exponential distribution ..........................................................40
8.4
Other continuous distributions ................................................40
9
Sampling distributions ..................................................................44
9.1
Sampling distribution of the mean ..........................................44
9.2
Sampling distribution of a proportion ......................................49
9.3
Sampling distribution of the difference between two means ...51
10
Intro to estimation ......................................................................53
10.1
Concept of estimation ..........................................................54
10.2
Estimating the population mean when the population SD is known ......55
10.3
Selecting the sample size ....................................................57
11
Introduction to hypothesis testing ..............................................58
11.1
Concepts of hypothesis testing ............................................58
11.2
Testing the population mean when the population standard deviation is known ......61
11.3
What about type II error probability? ....................................65
11.4
The road ahead ...................................................................66
12
Inference about a population .....................................................67
12.1
Inference about a population mean when the SD is unknown ......67
12.2
Inference about a population variance .................................68
12.3
Inference about a population proportion ..............................68
13
Inference about comparing two populations ................................71
13.1
Inference about the difference between two means: independent samples ......71
13.2
Observational and experimental (controlled study) data ......73
13.3
Inference about the difference between two means: matched pairs experiment ......73
13.4
Inference about the ratio of two variances ...........................75
13.5
Inference about the difference between two population proportions ......75
14
Statistical inference: review of chapters 12 and 13 ....................76
15
Analysis of variance ..................................................................78
16
Chi-squared tests ......................................................................78
17
Simple linear regression and correlation ....................................79
17.1
Model ..................................................................................79
17.2
Estimating the coefficients ...................................................79
17.3
Required conditions on the error variable ............................79
17.4
Assessing the model ...........................................................80
17.5
(Optional) Applications in finance ........................................81
17.6
Using the regression equation .............................................81
18
Multiple regression ....................................................................81
19
Logistics and things you/I really care about .........................................83
19.1
A couple of words about quiz #2 and final ..............................83
19.1.1
Quiz #2..........................................................................83
19.1.2
Final ..............................................................................83
19.2
Practice midterm (50 minutes) .............................................85
19.3
Practice final (50 minutes) ...................................................86