Advanced Math 45
Statistics #2: Ungrouped Data
3. Calculations Using Ungrouped or Raw Data
1. Measures of Central Tendency
2. Measures of Central Tendency from a Frequency Table
3. Which is the best measure of Central Tendency?
4. Measures of Spread
5. Measures of Spread from a Frequency Table
6. Percentiles
1. Measures of Central Tendency
Measures of central tendency tell us about the “centre” of a set of data. The three
that are used most frequently are the mean, median and mode.
The Mode – this is the piece of data that occurs most frequently.
Data: 3, 2, 6, 3, 7, 12, 3, 14, 2
Mode: 3
Some sets of data can have more than one mode or no mode at all.
Data: 2, 3, 2, 3, 5, 7, 2, 5, 3
Modes: 2 and 3
Data: 5, 5, 5, 3, 3, 3, 7, 7, 7
No mode (every value occurs equally often).
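The three mode examples above can be checked with a short Python sketch. The function name `modes` is only illustrative, and the rule that a data set has "no mode" when every value occurs equally often is taken from these notes (other texts and libraries may report all values as modes instead).

```python
from collections import Counter

def modes(data):
    """Return the most frequent value(s), or [] if every value ties."""
    counts = Counter(data)
    highest = max(counts.values())
    # Convention from these notes: if all distinct values occur
    # equally often, the data set has no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([3, 2, 6, 3, 7, 12, 3, 14, 2]))  # [3]
print(modes([2, 3, 2, 3, 5, 7, 2, 5, 3]))    # [2, 3]
print(modes([5, 5, 5, 3, 3, 3, 7, 7, 7]))    # []
```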
The Median – the middle piece of the ordered data.
If there is an odd number of pieces of data then the middle piece of data is the
((n+1)/2)th piece of data, where n is the number of pieces of data.
Data: 3, 2, 2, 6, 3, 7, 12, 3, 14
Ordered Data: 2, 2, 3, 3, 3, 6, 7, 12, 14
In this case we have 9 pieces of data so (9+1)/2 means the 5th piece of data is the
middle piece of data.
So Median: 3
If there is an even number of pieces of data then you take the average of the middle
two pieces of data, i.e. add the (n/2)th and (n/2 + 1)th pieces of data and divide by 2.
Data: 3, 2, 6, 3, 7, 12, 3, 14
Ordered Data: 2, 3, 3, 3, 6, 7, 12, 14
In this case we have 8 pieces of data so take the 4th (8/2) and 5th (8/2 + 1) pieces of
data and divide their sum by 2.
$$\text{Median} = \frac{3 + 6}{2} = 4.5$$
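Both median rules (odd and even n) can be sketched in Python as follows. The function name `median` is an illustrative choice; note that Python counts positions from 0, so the "5th piece of data" is index 4.

```python
def median(data):
    """Middle value of the ordered data (average of the middle two if n is even)."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the ((n+1)/2)th piece of data is index n // 2 when counting from 0.
        return ordered[n // 2]
    # Even n: average the (n/2)th and (n/2 + 1)th pieces of data.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median([3, 2, 2, 6, 3, 7, 12, 3, 14]))  # 3
print(median([3, 2, 6, 3, 7, 12, 3, 14]))     # 4.5
```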
The Mean ($\bar{x}$) – divide the total of the data by the number of pieces of data
Data: 3, 2, 6, 3, 7, 12, 3, 14
$$\bar{x} = \frac{3 + 2 + 6 + 3 + 7 + 12 + 3 + 14}{8} = \frac{50}{8} = 6.25$$
Summation (sigma) notation
$\sum x$ means the sum of all the x's,
so in our example above $\sum x = 50$
and
$$\bar{x} = \frac{\sum x}{n} = \frac{50}{8} = 6.25$$
Note that different symbols are used for the mean depending on whether it is the mean
of a sample or of the population.
Mean of a sample: 𝑥̅
Mean of the population : μ
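As a quick check of the sigma notation, the sum and the mean of the example data can be computed in Python. The variable names are illustrative only.

```python
data = [3, 2, 6, 3, 7, 12, 3, 14]

total = sum(data)   # this is the "sum of all the x's"
n = len(data)       # number of pieces of data
mean = total / n

print(total, n, mean)  # 50 8 6.25
```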
Assignment #2 Qu 1-3
2. Measures of Central Tendency From a Frequency Table
With larger quantities of data it is often useful to put data into a frequency table.
Example: Find the mean, median and mode from the following frequency table.
x    f
0    5
1    9
2    4
3    2
4    3
5    1
Mode:
mode = 1 (x value with the highest frequency)
Median:
x    f    Cumulative Freq
0    5    5
1    9    5 + 9 = 14
2    4    14 + 4 = 18
3    2    18 + 2 = 20
4    3    20 + 3 = 23
5    1    23 + 1 = 24
$\sum f = 24$
By totalling the frequency column we can see that there are 24 pieces of data so we
need to find the 12th and 13th pieces of data, both of which have the value 1 in this
case.
$$\text{Median} = \frac{1 + 1}{2} = 1$$
Mean:
The easiest way to find the total of ALL the data is to add a frequency × data value
column (fx)
x    f    fx
0    5    0 × 5 = 0
1    9    1 × 9 = 9
2    4    2 × 4 = 8
3    2    3 × 2 = 6
4    3    4 × 3 = 12
5    1    5 × 1 = 5
$\sum f = 24$
$\sum fx = 40$
Mean:
$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{40}{24} \approx 1.67$$
Assignment #2 Qu 4
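The frequency-table calculations above can be sketched in Python. Storing the table as a dictionary from x to f and the helper name `value_at` are assumptions made for this illustration, not part of the notes.

```python
freq = {0: 5, 1: 9, 2: 4, 3: 2, 4: 3, 5: 1}  # x -> f

n = sum(freq.values())                          # total frequency, 24
mode = max(freq, key=freq.get)                  # x value with the highest frequency
mean = sum(x * f for x, f in freq.items()) / n  # (sum of fx) / (sum of f)

def value_at(position):
    """Value of the position-th piece of data (counting from 1), via cumulative frequency."""
    running = 0
    for x in sorted(freq):
        running += freq[x]
        if position <= running:
            return x
    raise ValueError("position is larger than the number of pieces of data")

if n % 2 == 1:
    median = value_at((n + 1) // 2)
else:
    median = (value_at(n // 2) + value_at(n // 2 + 1)) / 2

print(mode, median, round(mean, 2))  # 1 1.0 1.67
```

This simple `mode` line returns only one value, so it would need extending for tables with tied highest frequencies.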
3. Which is the best measure of central tendency?
Or, put another way,
With three averages to choose from (mean, median and mode), which should we
use?
The answer is – it depends on the situation! Each kind of measure is appropriate for
certain types of data.
The following table shows the advantages and disadvantages of these different
averages.
Average: Mean
  Advantages: All the data is used to find the answer
  Disadvantages: Very large or very small numbers can distort the answer
Average: Median
  Advantages: Very big and very small values don't affect it
  Disadvantages: Takes a long time to calculate for a very large set of data
Average: Mode or modal class
  Advantages: The only average we can use when the data is not numerical
  Disadvantages:
    1. There may be more than one mode
    2. There may be no mode at all if none of the data is the same
    3. It may not accurately represent the data
Example
This table shows the annual salary of people who work at a garden centre.
Annual salary (£)    Number of people
0 - 9,999            10
10,000 - 19,999      9
20,000 - 29,999      9
30,000 - 39,999      1
40,000 - 49,999      1
The modal class is £0 - £9,999.
Question
What is the disadvantage of using the modal class?
Answer
Even though the range £0 - £9,999 contains the largest number of people, the next two
ranges have comparable numbers, so the modal class is not representative of the data.
Question
What is the disadvantage of using the mean?
Answer
Almost everyone earns under £30,000. The mean would be distorted by the fact that two
people earn much more than this.
Assignment #2 Qu 5 & 6
4. Measures of Spread
Range: The range of a set of data is the difference between the largest value and the
smallest value.
Given the data set:
12, 5, 17, 9, 16, 3
Range = 17 – 3 = 14.
Because the range of a set of data is the difference between its extreme values,
it can be unduly influenced by outliers.
A better way of looking at the spread of data is to use the standard deviation.
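To see how strongly a single outlier affects the range, here is a minimal Python sketch. The first list is the data set from the notes; the extra value 100 is a hypothetical outlier added purely for illustration.

```python
data = [12, 5, 17, 9, 16, 3]
data_range = max(data) - min(data)
print(data_range)  # 14

# Hypothetical outlier (not from the notes): one extreme value stretches the range.
with_outlier = data + [100]
outlier_range = max(with_outlier) - min(with_outlier)
print(outlier_range)  # 97
```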
Standard Deviation
The Standard Deviation is a measure of how spread out numbers are.
Its symbol is σ (the Greek letter sigma)
The formula is easy: it is the square root of the Variance. So now you ask, "What is
the Variance?"
Variance
The Variance is defined as:
The average of the squared differences from the Mean.
To calculate the variance follow these steps:
1. Work out the mean (the simple average of the numbers).
2. Then for each number: subtract the Mean and square the result (the squared
   difference).
3. Then work out the average of those squared differences.
Example
You and your friends have just measured the heights of your dogs (in millimeters):
The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.
Your first step is to find the Mean:
Answer:
$$\text{Mean} = \frac{600 + 470 + 170 + 430 + 300}{5} = \frac{1970}{5} = 394$$
so the mean (average) height is 394 mm.
Now, we calculate each dog's difference from the Mean:
206, 76, −224, 36, −94
To calculate the Variance, take each difference, square it, and then average the
result:
$$\text{Variance} = \frac{206^2 + 76^2 + (-224)^2 + 36^2 + (-94)^2}{5} = \frac{108520}{5} = 21704$$
So, the Variance is 21,704.
And the Standard Deviation is just the square root of Variance, so:
Standard Deviation: σ = √21,704 = 147.32... = 147 (to the nearest mm)
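The three steps for the variance, followed by the square root, can be sketched in Python for the dog heights. This treats the five dogs as the whole population (dividing by n); the variable names are illustrative.

```python
import math

heights = [600, 470, 170, 430, 300]  # mm
n = len(heights)

mean = sum(heights) / n                             # step 1: the mean
squared_diffs = [(h - mean) ** 2 for h in heights]  # step 2: squared differences
variance = sum(squared_diffs) / n                   # step 3: average them
sd = math.sqrt(variance)                            # standard deviation

print(mean, variance, round(sd, 2))  # 394.0 21704.0 147.32
```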
And the good thing about the Standard Deviation is that it is useful. Now we can
show which heights are within one Standard Deviation (147 mm) of the Mean, i.e.
between 247 mm and 541 mm: 470 mm, 430 mm and 300 mm are, while 600 mm and
170 mm are not.
So, using the Standard Deviation we have a "standard" way of knowing what is
normal, and what is extra large or extra small.
Rottweilers are tall dogs. And Dachshunds are a bit short ... but don't tell them!
But ... there is a small change with Sample Data
Our example was for a Population (the 5 dogs were the only dogs we were
interested in).
But if the data is a Sample (a selection taken from a bigger Population), then the
calculation changes!
When you have "n" data values that are:
- The Population: divide by n when calculating Variance (like we did)
- A Sample: divide by n − 1 when calculating Variance
All other calculations stay the same, including how we calculated the mean.
Example: if our 5 dogs were just a sample of a bigger population of dogs, we would
divide by 4 instead of 5 like this:
Sample Variance = 108,520 / 4 = 27,130
Sample Standard Deviation = √27,130 = 164.71... = 165 (to the nearest mm)
Think of it as a "correction" when your data is only a sample.
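Python's standard `statistics` module provides both versions, which makes the n versus n − 1 difference easy to compare. This is only a sketch using the dog data from the example: `pvariance` and `pstdev` divide by n (population), while `variance` and `stdev` divide by n − 1 (sample).

```python
import statistics

heights = [600, 470, 170, 430, 300]  # mm

population_variance = statistics.pvariance(heights)  # divides by n
sample_variance = statistics.variance(heights)       # divides by n - 1

print(f"{population_variance:.0f} {statistics.pstdev(heights):.2f}")  # 21704 147.32
print(f"{sample_variance:.0f} {statistics.stdev(heights):.2f}")       # 27130 164.71
```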
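Formulas:
The "Population Standard Deviation":
$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{N}}$$
The "Sample Standard Deviation" (written s here):
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}}$$
Looks complicated, but the important change is to divide by N − 1 (instead of N) when
calculating a Sample Variance. Here N is the number of data values, written n
elsewhere in these notes.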
A way to make calculating the standard deviation easier is to create a table.
x      (x − x̄)             (x − x̄)²
600    600 − 394 = 206     206² = 42436
470    470 − 394 = 76      76² = 5776
170    170 − 394 = −224    (−224)² = 50176
430    430 − 394 = 36      36² = 1296
300    300 − 394 = −94     (−94)² = 8836
$\sum x = 1970$
$\sum (x - \bar{x})^2 = 108520$
Mean:
$$\bar{x} = \frac{\sum x}{n} = \frac{1970}{5} = 394$$
Standard Deviation:
$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = \sqrt{\frac{108520}{5}} = 147.32$$
Why do we square the differences? Check out what happens if you add up the
(x − x̄) column: 206 + 76 − 224 + 36 − 94 = 0. The positive and negative differences
cancel, so the plain sum is not very useful!
There is a second version of the standard deviation formula,
i.e.
$$\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$$
Let's see how we get this:
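$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = \sqrt{\frac{\sum (x^2 - 2\bar{x}x + \bar{x}^2)}{n}}$$
$$= \sqrt{\frac{\sum x^2 - \sum 2\bar{x}x + \sum \bar{x}^2}{n}}$$
$$= \sqrt{\frac{\sum x^2 - 2\bar{x}\sum x + \bar{x}^2 \sum 1}{n}}$$
$$= \sqrt{\frac{\sum x^2}{n} - 2\bar{x}\bar{x} + \frac{n\bar{x}^2}{n}}$$
$$= \sqrt{\frac{\sum x^2}{n} - 2\bar{x}^2 + \bar{x}^2}$$
$$= \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$$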
This is often the easier formula to work with.
x      x²
600    360000
470    220900
170    28900
430    184900
300    90000
$\sum x = 1970$
$\sum x^2 = 884700$
$$\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} = \sqrt{\frac{884700}{5} - 394^2} = 147.322\ldots$$
as before.
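As a numerical check that the two formulas agree, the following Python sketch computes the population standard deviation of the dog heights both ways. The variable names `sd_definition` and `sd_shortcut` are illustrative only.

```python
import math

heights = [600, 470, 170, 430, 300]
n = len(heights)
mean = sum(heights) / n

# Definition: square root of the average squared difference from the mean.
sd_definition = math.sqrt(sum((x - mean) ** 2 for x in heights) / n)

# Shortcut: square root of (mean of the squares minus the square of the mean).
sd_shortcut = math.sqrt(sum(x * x for x in heights) / n - mean ** 2)

print(math.isclose(sd_definition, sd_shortcut))  # True
print(round(sd_shortcut, 2))                     # 147.32
```

On a computer the shortcut form can lose accuracy when the mean is very large compared with the spread, because it subtracts two nearly equal numbers; for hand or calculator work with data like this it is fine.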
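For a sample, divide $\sum (x - \bar{x})^2$ by n − 1 instead of n. In the second form this
becomes $\sqrt{\frac{\sum x^2 - n\bar{x}^2}{n - 1}}$ (simply replacing n by n − 1 in $\frac{\sum x^2}{n} - \bar{x}^2$ does not work).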
Many calculators will work out the values of the summations, given the input data.
Assignment #3 Qu 1-6
5. Measures of Spread from a Frequency Table.
Finding the range from a frequency table doesn’t differ from before.
i.e. Range = largest value – smallest value
x    f
0    5
1    9
2    4
3    2
4    3
5    1
From the table we can see that the range = 5 – 0 = 5
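To calculate the standard deviation we have to modify the formula slightly.
$$\sigma = \sqrt{\frac{\sum f(x - \bar{x})^2}{n}} \quad \text{or} \quad \sigma = \sqrt{\frac{\sum fx^2}{n} - \bar{x}^2}$$
where $n = \sum f$ is the total number of pieces of data.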
Let's see how this works, using the example from above, where the mean is 1 2/3.
x    f    (x − x̄)    (x − x̄)²    f(x − x̄)²
0    5    −1 2/3      2 7/9        13 8/9
1    9    −2/3        4/9          4
2    4    1/3         1/9          4/9
3    2    1 1/3       1 7/9        3 5/9
4    3    2 1/3       5 4/9        16 1/3
5    1    3 1/3       11 1/9       11 1/9
$\sum f(x - \bar{x})^2 = 49\tfrac{1}{3}$
$$\sigma = \sqrt{\frac{\sum f(x - \bar{x})^2}{n}} = \sqrt{\frac{49\tfrac{1}{3}}{24}} = 1.43\ldots$$
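or, using the alternate formula
x    f    fx            x²    fx²
0    5    0 × 5 = 0     0     0
1    9    1 × 9 = 9     1     9
2    4    2 × 4 = 8     4     16
3    2    3 × 2 = 6     9     18
4    3    4 × 3 = 12    16    48
5    1    5 × 1 = 5     25    25
$\sum f = 24$
$\sum fx = 40$
$\sum fx^2 = 116$
$$\sigma = \sqrt{\frac{\sum fx^2}{n} - \bar{x}^2} = \sqrt{\frac{116}{24} - \left(1\tfrac{2}{3}\right)^2} = 1.43\ldots$$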
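Both frequency-table formulas can be checked exactly with Python's `fractions` module, which avoids rounding the thirds and ninths. The dictionary layout and variable names below are assumptions for this sketch.

```python
import math
from fractions import Fraction

freq = {0: 5, 1: 9, 2: 4, 3: 2, 4: 3, 5: 1}  # x -> f
n = sum(freq.values())                        # 24

mean = Fraction(sum(f * x for x, f in freq.items()), n)  # 5/3, i.e. 1 2/3

# First formula: sum of f * (x - mean)^2, then divide by n.
ss = sum(f * (x - mean) ** 2 for x, f in freq.items())   # 148/3, i.e. 49 1/3
variance_definition = ss / n

# Alternate formula: (sum of f * x^2) / n - mean^2.
variance_shortcut = Fraction(sum(f * x * x for x, f in freq.items()), n) - mean ** 2

print(ss, variance_definition == variance_shortcut)  # 148/3 True
print(round(math.sqrt(variance_definition), 2))      # 1.43
```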
Assignment #3 Qu 7
6. Percentiles
Percentiles divide a set of data, arranged in order of magnitude, into 100 equal
parts; hence the median is the 50th percentile. They are used in health care to
compare individual scores with what is considered to be "normal", e.g. the weight of
babies; in education to determine grades (the top 5% get an A, the next 10% a B, etc.);
or in industry to find the best candidates for a job.
A percentile rank can be calculated using the formula:
$$PR = \frac{B + \frac{1}{2}E}{n} \times 100\%$$
where B = number of scores below the given score
E = number of scores equal to the given score
n = total number of scores
Example: Students in a test scored as follows.
54    65    71    71    71    71    74    75
76    76    77    80    80    80    92    98
Jane scored 71. What was her percentile rank?
$$PR = \frac{2 + \frac{1}{2}(4)}{16} \times 100\% = 25\%$$
i.e. 25% scored less than or the same as Jane.
Jane's friend John scored 77 on this test. He will get an interview for a job if he is in the
top 30%. Will he get an interview?
$$PR = \frac{10 + \frac{1}{2}(1)}{16} \times 100\% = 65.625\%$$
Percentiles are always rounded up to the next whole number so John's percentile
rank would be 66, i.e. 66% of the class scored less than or equal to him, so 34%
were better than him, so no, he won't get the interview!
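The percentile rank formula can be sketched in Python as below. The function name `percentile_rank` is illustrative, and rounding up with `math.ceil` follows the rounding rule stated in these notes.

```python
import math

def percentile_rank(scores, score):
    """Percentile rank: (B + E/2) / n * 100."""
    below = sum(1 for s in scores if s < score)   # B: scores below the given score
    equal = sum(1 for s in scores if s == score)  # E: scores equal to the given score
    return (below + 0.5 * equal) / len(scores) * 100

scores = [54, 65, 71, 71, 71, 71, 74, 75, 76, 76, 77, 80, 80, 80, 92, 98]

print(percentile_rank(scores, 71))             # 25.0   (Jane)
print(percentile_rank(scores, 77))             # 65.625 (John)
print(math.ceil(percentile_rank(scores, 77)))  # 66
```

Being in the top 30% means a percentile rank of at least 70, so a rank of 66 falls short.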
Assignment #4