Download Why such a big deal about a sample of 30 or more?

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Degrees of freedom (statistics) wikipedia , lookup

Foundations of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Inductive probability wikipedia , lookup

History of statistics wikipedia , lookup

Taylor's law wikipedia , lookup

Misuse of statistics wikipedia , lookup

Probability amplitude wikipedia , lookup

Law of large numbers wikipedia , lookup

Student's t-test wikipedia , lookup

Transcript
Peter Michael Saya II
Dr. Chauhan
Population vs a random sample
. Population
mean value (μ)
sample
X
mean value
A sample mean is expected to be close to Population mean.
If one of the values is unknown, use another one to predict the unknown
Probability: use a known population mean value to predict a sample mean
Since a sample mean varies from sample to sample we need to
study it’s distribution:
Statement of a Useful Theorem
 Consider a population of data with a mean value (μ) of
100 and a standard deviation (σ) of 10.
 If many random samples, each of size n=9, are drawn,
and from each sample a mean, (xbar) is calculated, we
can conclude that
1. The average value of all such xbars=
2. The standatr Dafa
3. The graph of x bar:NORMAL UNDER TWO
CONDITIONS:
4. The result is useful in the sense that if we know how
x bar behaves, we could use this in probability as well
as in stat
The distribution of sample mean
• Many samples each of size n: Three conclusions:
•
1) The mean value of all such sample means=µ
•
2) The standard deviation of all such sample means=
•
•
3) The graph of all such sample means is NORMAL if

•
•
•
•
•
•
n
The population is normal or
the sample size,n, is large
How large is too large depends on the shape of the
the population data but we are just told to use a
rule of 30 or above.
• This result is many applications, and I have explored on two of them
and will talk about one.
• Basically, if a value is normally distributed we can calculate the
probability of its being say bigger than a specified value by using a
normal table.
My research involves …..
 In my project, three different types of non normal
populations were generated
 Uniform
 Exponential
 T
 For each population, I investigated if we really need a
sample of 30 to assume that the graph of
X
is normal
Population Distributions used:
 In my project, three different types of populations
were used
 Each population was composed of random data
generated by Minitab.
 The three population types were:
 Uniform
 Exponential
t
As stated earlier…
 There are a couple requirements in order to use
specific formulas and get valid results:
 The sample size must be large
and
 The population must be normally distributed

The uniform, exponential, and t distributions are NOT
normally distributed so therefore in order to use specific
formulas, the sample size must be at least 30!
OR DOES IT?
Probability Theory
 Minitab generated 160,000 values for each of the 3
types of distributions
 In each population, samples of different sizes were
created
 The mean of each sample (xbar) was then calculated
by Minitab
 Selecting a random xbar value and using the above
equation, a z-score was calculated and the associated
probability of that z-score was determined using the
normal table
Probability Theory
 That theoretical probability was then compared to the
actual probability
 Actual probability was found by numerically sorting
the xbars in Minitab and counting the values above a
predetermined number
 That frequency was divided by 4,000 (the no. of
samples drawn)
 This is the actual probability
Probability Theory Results:
Uniform Distribution
σ=2.88133
μ=5.00
n
actual P( X>5.5)
calc P(x>5.5)
% difference
5
0.34800
0.3520
1.15%
10
0.29225
0.2946
0.80%
15
0.24950
0.2514
0.76%
20
0.22300
0.2206
-1.08%
25
0.19575
0.1949
-0.43%
30
0.17175
0.1736
1.08%
35
0.15525
0.1539
-0.87%
40
0.13175
0.1379
4.67%
Sample size does not matter for uniform distributions!
Probability Theory:
Exponential Distribution
 Minitab can create different types of exponential
distributions
 The shape of the exponential distribution depends on
a value that Minitab calls “scale”
 The scale is the mean of the entire distribution
 I used scales of 5, 10, and 15 to see if changing the scale
affects whether or not more or less than 30 is needed
as a sample size for valid results
Probability Theory Results:
Exponential Distribution
scale= 5
σ=5.01505
μ=5.00
n
actual P(x>5.5)
calc P(x>5.5)
% difference
5
0.35175
0.41290
17.38%
10
0.34075
0.37830
11.02%
15
0.32625
0.35200
7.89%
20
0.30900
0.33000
6.80%
25
0.30000
0.30850
2.83%
30
0.27850
0.29460
5.78%
35
0.27250
0.27760
1.87%
40
0.26500
0.26430
-0.26%
Sample size affects the probability. Sample size of 25 gave
permissible results (permissible results are % difference < ~5%)
Probability Theory Results:
Exponential Distribution
scale= 10
σ=9.741
μ=10.0
n
actual P(x>11)
calc P(x>11)
% difference
5
0.35475
0.40900
15.29%
10
0.33950
0.37070
9.19%
15
0.32450
0.34460
6.19%
20
0.30700
0.32280
5.15%
25
0.29775
0.30150
1.26%
30
0.27950
0.28430
1.72%
35
0.26600
0.27090
1.84%
40
0.25450
0.25460
0.04%
Sample size affected the probability. A size of 25 or higher gave
permissible results.
Probability Theory Results:
Exponential Distribution
scale= 15
σ=14.6591
μ=15.00
n
actual P(x>16)
calc P(x>16)
% difference
5
0.39525
0.44040
11.42%
10
0.38325
0.41680
8.75%
15
0.38175
0.39740
4.10%
20
0.36475
0.38210
4.76%
25
0.34775
0.36690
5.51%
30
0.34550
0.35570
2.95%
35
0.32800
0.34460
5.06%
40
0.32225
0.33360
3.52%
Sample size affected the probability. A sample size of 30 or greater resulted in
permissible results.
Probability Theory:
t-distribution
 Minitab can create different types of t-distributions
 The shape of the t-distribution depends on a value that
Minitab calls degrees of freedom
 The higher the degrees of freedom, the more normal
the distribution looks
 The lower the degrees of freedom, the further the
values deviate from μ, which equals zero for t-graphs
 I used degrees of freedom 3 and 10 to see if changing
the degrees of freedom affects whether or not more or
less than 25 is needed as a sample size for valid results
Probability Theory Results:
t-distribution
D.o.F. = 3
σ=1.60332
μ=0
n
actual P(x>0.1)
calc P(x>0.1)
% difference
5
0.42975
0.44830
4.32%
10
0.42025
0.42470
1.06%
15
0.402
0.40900
1.74%
20
0.39250
0.39360
0.28%
25
0.37850
0.38210
0.95%
30
0.36875
0.37070
0.53%
35
0.36250
0.35940
-0.86%
40
0.34675
0.35200
1.51%
Sample size does not matter for t distributions!
Probability Theory Results:
t-distribution
D.o.F.=10
σ=1.1066
μ=0
n
actual P(x>0.1)
calc P(x>0.1)
% difference
5
0.41225
0.4207
2.05%
10
0.384
0.3859
0.49%
15
0.35
0.3632
3.77%
20
0.33650
0.3446
2.41%
25
0.31800
0.32640
2.64%
30
0.30525
0.31210
2.24%
35
0.28625
0.29810
4.14%
40
0.27475
0.28430
3.48%
Sample size does not matter for t distributions!
Two parts to the research:
Probability theory
1.
How does changing the sample size affect the
probability that the mean of a sample lies above a
certain value?


Calculate a theoretical probability using z-scores and
compare it to the actual probability from the data in
Minitab
Two parts to the research:
Confidence Intervals of the Population Mean (μ)
1.
How does changing the sample size affect the
frequency at which the mean of the population is
included in a 95% interval?




95% confidence interval means that 95% of the calculated
intervals should include the population mean
Using Minitab, a frequency count of the intervals that
included μ could be determined
Then dividing that frequency by the total number of
intervals and multiplying by 100 gave a percentage

This percentage was then compared to 95%