MATH 20812: PRACTICAL STATISTICS I
SEMESTER 2
INTRODUCTION TO STATISTICS
1.1 A Definition of Statistics
Statistics is the analysis of numerical data for the purpose of reaching a decision or
communicating information in the face of uncertainty (variability).
1.2 Role of Statistics
1. Description: summary calculations and graphical displays (e.g. minimum,
maximum, average, histograms) necessary for statistical evaluations.
2. Inference/Induction: generalisation to the whole from a thorough examination of
a part (e.g. political polls).
3. Deduction: ascribing properties to specific cases from the general situation (e.g. we can
establish 0.04 as the probability that a randomly chosen student will be a
mechanical engineer if we know that 4% of all students on campus are taking that
major).
1.3 Elements of Statistical Data
1. Observation: the basic statistical element or single data point (e.g. starting salary
of a graduating engineer, status of a machine as “defective”/“nondefective”).
2. Population: collection of all possible observations of a specified characteristic of
interest (e.g. starting salaries of all graduating engineers).
3. Sample: collection of observations representing only a portion of the population
(e.g. starting salaries of all graduating engineers from Plymouth).
1.4 Data Types
1. Quantitative data: numerical values
2. Qualitative data: attributes (e.g. gender, occupation)
1.5 Data Displays
1.5.1 Histogram (Frequency Distribution)
• Place the observations into a series of contiguous blocks called class intervals.
• Determine the number of observations falling into each interval. The respective
count provides the class frequency.
• Draw a bar chart with frequency on the vertical axis and the observed values on
the horizontal axis.
Cost per pound of raw materials used in processing batches of a chemical feedstock
12.01 14.9  11.24 14.4  12.87 11.3  16.98 15.23 12.38 12.9
12.29 11.51 12.06 15.06 11.16 10.12 14.82 17.02 12.56 13.95
11.13 12.14 13.41 15.67 12.17 12.15 12.57 14.21 13.21 16.5
13.81 15.58 11.1  12.03 11.2  12.27 17.05 12.57 13.11 13.31
13.84 11.62 12.33 16.31 12.74 14.25 18.63 13.34 13.43 14.78
Class intervals     Class frequency             Cumulative      Relative
(Cost per pound)    (Number of raw materials)   frequency       frequency
10.0-under 11.0      1                          1               1/50
11.0-under 12.0      8                          1 + 8 = 9       8/50
12.0-under 13.0     16                          9 + 16 = 25     16/50
13.0-under 14.0      9                          25 + 9 = 34     9/50
14.0-under 15.0      6                          34 + 6 = 40     6/50
15.0-under 16.0      4                          40 + 4 = 44     4/50
16.0-under 17.0      3                          44 + 3 = 47     3/50
17.0-under 18.0      2                          47 + 2 = 49     2/50
18.0-under 19.0      1                          49 + 1 = 50     1/50
Total               50                                          1
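The three steps above can be sketched in Python. This is a minimal illustration, not part of the original notes: the function name `frequency_table` and its parameters are hypothetical, and it assumes equal-width classes with no observation falling exactly on a class boundary.

```python
import math

# Cost-per-pound observations from the table above (50 values).
costs = [
    12.01, 14.9, 11.24, 14.4, 12.87, 11.3, 16.98, 15.23, 12.38, 12.9,
    12.29, 11.51, 12.06, 15.06, 11.16, 10.12, 14.82, 17.02, 12.56, 13.95,
    11.13, 12.14, 13.41, 15.67, 12.17, 12.15, 12.57, 14.21, 13.21, 16.5,
    13.81, 15.58, 11.1, 12.03, 11.2, 12.27, 17.05, 12.57, 13.11, 13.31,
    13.84, 11.62, 12.33, 16.31, 12.74, 14.25, 18.63, 13.34, 13.43, 14.78,
]

def frequency_table(data, start, width, n_classes):
    """Return (lower, upper, frequency, cumulative, relative) per class."""
    counts = [0] * n_classes
    for x in data:
        # Class i covers the interval "start + i*width, under start + (i+1)*width".
        counts[math.floor((x - start) / width)] += 1
    rows, running = [], 0
    for i, f in enumerate(counts):
        running += f  # cumulative frequency so far
        rows.append((start + i * width, start + (i + 1) * width,
                     f, running, f / len(data)))
    return rows

table = frequency_table(costs, start=10.0, width=1.0, n_classes=9)
for lower, upper, f, cum, rel in table:
    print(f"{lower:.1f}-under {upper:.1f}  {f:2d}  {cum:2d}  {rel:.2f}")
```

The class frequencies feed the histogram, the cumulative column feeds the ogive, and the relative column feeds the relative frequency distribution.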
[Figure: Histogram. Vertical axis: Number of Raw Materials; horizontal axis: Cost (£).]
1.5.2 Frequency Polygon
• Plot a dot (‘•’) above the midpoint of each interval at a height matching the class
frequency.
• Connect the dots with outside line segments touching the horizontal axis one-half
of an interval width below and above the lowest and highest intervals.
[Figure: Frequency Polygon. Vertical axis: Number of Raw Materials; horizontal axis: Cost (£).]
1.5.3 Ogive (Cumulative Frequency Distribution)
• Plot a dot (‘•’) above the upper class limit at a height equal to the cumulative
frequency for that interval.
• Connect the dots by line segments, with the lowest line touching the horizontal
axis at the lower limit of the smallest class.
[Figure: Cumulative Frequency Distribution. Vertical axis: Cumulative Sum of Number of Raw Materials; horizontal axis: Cost (£).]
1.5.4 Relative Frequency Distribution: same as the frequency polygon with the
ordinates divided by the total number of observations.
[Figure: Relative Frequency Distribution. Vertical axis: Number of Raw Materials/50; horizontal axis: Cost (£).]
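The plotting rules of sections 1.5.2 to 1.5.4 amount to computing lists of (x, y) points from the frequency table. The sketch below only builds those coordinates for the cost-per-pound data; the variable names are illustrative and no plotting library is assumed.

```python
from itertools import accumulate

# Class limits and frequencies from the cost-per-pound frequency table.
width = 1.0
lower_limits = [10.0 + i * width for i in range(9)]
frequencies = [1, 8, 16, 9, 6, 4, 3, 2, 1]
n = sum(frequencies)

# Frequency polygon: a dot above each class midpoint, closed off on the
# horizontal axis half an interval width outside the lowest and highest classes.
midpoints = [lo + width / 2 for lo in lower_limits]
polygon = ([(midpoints[0] - width, 0)]
           + list(zip(midpoints, frequencies))
           + [(midpoints[-1] + width, 0)])

# Ogive: a dot above each upper class limit at the cumulative frequency,
# starting on the axis at the lower limit of the smallest class.
cumulative = list(accumulate(frequencies))
ogive = [(lower_limits[0], 0)] + [
    (lo + width, c) for lo, c in zip(lower_limits, cumulative)
]

# Relative frequency distribution: polygon ordinates divided by n.
relative_polygon = [(x, f / n) for x, f in polygon]
```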
1.5.5 Stem-and-Leaf Plot: arrange the data tabularly by separating the value of each
observation into a stem digit and a leaf digit.
Precipitation levels during the month of April
2.9 3.0 1.8 4.2 1.2 4.5 4.0 1.6 3.6 3.7
3.2 1.5 5.4 3.6 2.3 3.6 3.6 2.7 3.2 4.0
3.9 2.1 2.9 2.9 1.0 2.2 5.4 3.5 3.6 4.0
4.0 4.0 0.3 2.2 3.3 3.8 4.8 3.3 2.7 1.8
4.4 2.6 3.9 0.8 3.1 3.1 3.7 0.3 1.5 3.4
3.4 3.3 1.2 5.9 5.0 3.4 2.6 3.3 5.8 0.6
0.7 2.9 3.1 2.9 2.0 3.2 3.4 2.9 0.5 2.4
1.1 0.7 2.6 2.9 3.7 3.6 2.9 1.2 0.4 2.8
2.2 2.0 4.1 3.8 2.4 1.8
STEM (number to the left of the decimal point) : LEAF (number to the right of the decimal point)
0 : 334
0 : 56778
1 : 01222
1 : 556888
2 : 001222344
2 : 66677899999999
3 : 011122233334444
3 : 56666667778899
4 : 00000124
4 : 58
5 : 044
5 : 89
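As an illustration, the stem-and-leaf display for the precipitation data can be built with a few lines of Python. This is a sketch under two assumptions that are not stated in the notes: each value has exactly one decimal digit, and each stem is split into two rows (leaves 0 to 4, then leaves 5 to 9), as in the display above. The function name `stem_and_leaf` is hypothetical.

```python
# Precipitation levels during the month of April (86 observations).
rain = [
    2.9, 3.0, 1.8, 4.2, 1.2, 4.5, 4.0, 1.6, 3.6, 3.7,
    3.2, 1.5, 5.4, 3.6, 2.3, 3.6, 3.6, 2.7, 3.2, 4.0,
    3.9, 2.1, 2.9, 2.9, 1.0, 2.2, 5.4, 3.5, 3.6, 4.0,
    4.0, 4.0, 0.3, 2.2, 3.3, 3.8, 4.8, 3.3, 2.7, 1.8,
    4.4, 2.6, 3.9, 0.8, 3.1, 3.1, 3.7, 0.3, 1.5, 3.4,
    3.4, 3.3, 1.2, 5.9, 5.0, 3.4, 2.6, 3.3, 5.8, 0.6,
    0.7, 2.9, 3.1, 2.9, 2.0, 3.2, 3.4, 2.9, 0.5, 2.4,
    1.1, 0.7, 2.6, 2.9, 3.7, 3.6, 2.9, 1.2, 0.4, 2.8,
    2.2, 2.0, 4.1, 3.8, 2.4, 1.8,
]

def stem_and_leaf(data):
    """Return (stem, leaves) rows, two rows per stem: leaves 0-4 and 5-9."""
    rows = {}
    for x in data:
        # round() guards against floating-point noise such as 2.9 * 10 = 28.999...
        stem, leaf = divmod(round(x * 10), 10)
        rows.setdefault((stem, leaf >= 5), []).append(leaf)
    return [(stem, "".join(str(d) for d in sorted(leaves)))
            for (stem, _), leaves in sorted(rows.items())]

for stem, leaves in stem_and_leaf(rain):
    print(f"{stem} : {leaves}")
```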
1.5.6 Bar Chart: when each category or attribute occurs with some frequency,
summarise the frequencies in a bar chart.
Number of professional women employed (in thousands) in 1986
Category                        Frequency
Engineering/Computer Science      347
Health care                      1937
Education                        2833
Social/Legal                      698
Arts/Athletics/Entertainment      901
All others                        355
[Figure: Bar Chart. Vertical axis: Number of Professional Women Employed; categories: Eng, Health, Edu, Social, Arts, Other.]
1.5.7 Pie Chart: size the piece of pie according to the category’s relative frequency,
with the angle of the slice corresponding to its proportion of 360 degrees.
Number of professional women employed (in thousands) in 1986
Category                        Frequency   Angle (degrees)
Engineering/Computer Science      347       (347/7071)*360 ≈ 18
Health care                      1937       (1937/7071)*360 ≈ 99
Education                        2833       (2833/7071)*360 ≈ 144
Social/Legal                      698       (698/7071)*360 ≈ 36
Arts/Athletics/Entertainment      901       (901/7071)*360 ≈ 46
All others                        355       (355/7071)*360 ≈ 18
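The angle column is each category's relative frequency multiplied by 360 and rounded to the nearest degree. A small sketch of that arithmetic (the dictionary name `employed` is illustrative):

```python
# Number of professional women employed (in thousands) in 1986.
employed = {
    "Engineering/Computer Science": 347,
    "Health care": 1937,
    "Education": 2833,
    "Social/Legal": 698,
    "Arts/Athletics/Entertainment": 901,
    "All others": 355,
}

total = sum(employed.values())  # 7071

# Slice angle = relative frequency * 360, rounded to whole degrees.
angles = {cat: round(f / total * 360) for cat, f in employed.items()}

for cat, angle in angles.items():
    print(f"{cat}: {angle} degrees")
```

Because each angle is rounded separately, the rounded angles need not add up to exactly 360.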
[Figure: Pie Chart. Slices: Eng, Health, Edu, Social, Arts, Other.]
1.5.8 Scatter Diagram: Plot of two quantitative variables (one versus the other).
Year   Average Distance   Average Fuel Consumption
       (1,000 miles)      (gallons)
1960    9,450             661
1965    9,390             667
1970    9,980             735
1975    9,630             712
1976    9,760             711
1978   10,050             715
1979    9,480             664
1980    9,140             603
1981    9,000             579
1982    9,530             587
1983    9,650             578
1984    9,790             553
1985    9,830             549
[Figure: Scatter Diagram. Vertical axis: Average Fuel Consumption (Gallons); horizontal axis: Average Distance (Miles).]
1.5.9 Box Plot: Shows variability of the data (will revisit later).
Number of busy teleports in a computer network
10 4 15 17 6 12 9 13 15 5
[Figure: Box Plot. Vertical axis: Number of Busy Teleports, from 4 to 17; the plot marks the Minimum, the 1/4 way, 1/2 way and 3/4 way points, and the Maximum.]
1.6 Summary Data Measures
1. Parameter: summary of data when data constitute a population (e.g. average
salary of all graduating engineers).
2. Statistic: summary of sample data (e.g. average salary of graduating engineers
from Plymouth).
1.7 Summary Measures of Location
1.7.1 The Mean: The mean is the most commonly used measure of central
tendency, indicating the central point around which observations tend to cluster. For
individual data such as:
755 613 584 693 622,
mean = (sum of observations)/(number of observations)
= (755 + 613 + 584 + 693 + 622)/5
= 653.4.
For grouped data such as:
Class interval (Age)   Class interval midpoint   Frequency
20-under 25            22.5                       5
25-under 30            27.5                      13
30-under 35            32.5                      12
35-under 40            37.5                       8
40-under 80            60                        12,
mean = (sum of (frequency*midpoint))/(number of observations)
= (5*22.5 + 13*27.5 + 12*32.5 + 8*37.5 + 12*60)/50
= 37.6.
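Both calculations can be checked with a short Python sketch. The function names `mean` and `grouped_mean` are illustrative, not taken from the notes.

```python
def mean(values):
    """(sum of observations) / (number of observations)."""
    return sum(values) / len(values)

def grouped_mean(midpoints, frequencies):
    """(sum of frequency * midpoint) / (number of observations)."""
    n = sum(frequencies)
    return sum(f * x for f, x in zip(frequencies, midpoints)) / n

individual = [755, 613, 584, 693, 622]
age_midpoints = [22.5, 27.5, 32.5, 37.5, 60]
age_frequencies = [5, 13, 12, 8, 12]

print(mean(individual))                              # individual data
print(grouped_mean(age_midpoints, age_frequencies))  # grouped age data
```

Note that the grouped mean treats every observation in a class as if it sat at the class midpoint, so it is an approximation to the mean of the underlying individual values.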
1.7.2 The Median: Another measure of central tendency, especially useful when
the data have a skewed frequency distribution. For individual data such as:
755 613 584 693 622,
the median is located as follows:
• order the data by increasing magnitude to get:
584 613 622 693 755;
• the median is the value above and below which an equal number of observations
lie (basically the value in the middle) ⇒ median = 622;
• when there is an even number of observations the median is the average of the
middle two.
For grouped data such as:
Class interval (Age)   Frequency   Cumulative frequency
20-under 25             5           5
25-under 30            13          18
30-under 35            12          30
35-under 40             8          38
40-under 80            12          50
use the following scheme:
• calculate (number of observations + 1)/2 = (50 + 1)/2 = 25.5;
• find the greatest cumulative class frequency less than or equal to 25.5 (= 18) and
note down the upper limit of the corresponding interval (= 30);
• find the cumulative frequency and the upper class limit of the next higher interval
(= 30 and 35 respectively);
• then, median = 30 + (25.5 - 18)*(35 - 30)/(30 - 18) = 33.1.
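The interpolation scheme for grouped data can be sketched as follows. The function name `grouped_median` is hypothetical, and the sketch assumes the median does not fall in the first class (otherwise there is no lower class to interpolate from and the code raises an error).

```python
def grouped_median(upper_limits, cumulative):
    """Median of grouped data by linear interpolation between class limits."""
    n = cumulative[-1]
    target = (n + 1) / 2
    # Last class whose cumulative frequency is <= target.
    idx = max(i for i, c in enumerate(cumulative) if c <= target)
    a, k = upper_limits[idx], cumulative[idx]          # e.g. 30 and 18
    b, h = upper_limits[idx + 1], cumulative[idx + 1]  # e.g. 35 and 30
    return a + (target - k) * (b - a) / (h - k)

age_upper_limits = [25, 30, 35, 40, 80]
age_cumulative = [5, 18, 30, 38, 50]

median_age = grouped_median(age_upper_limits, age_cumulative)
print(round(median_age, 1))
```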
1.7.3 The Mode: most frequently occurring value. For individual data such as:
0 0 3 7 1 0 2,
mode = 0.
For grouped data it is the midpoint of the class interval with the highest frequency (for
the age data, mode = 27.5).
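A brief sketch of both cases, using `collections.Counter` from the Python standard library. How ties are resolved is not discussed in the notes; here the first value or class reaching the highest count is reported.

```python
from collections import Counter

# Individual data: the most frequently occurring value.
values = [0, 0, 3, 7, 1, 0, 2]
mode_value, mode_count = Counter(values).most_common(1)[0]

# Grouped data: midpoint of the class with the highest frequency.
age_midpoints = [22.5, 27.5, 32.5, 37.5, 60]
age_frequencies = [5, 13, 12, 8, 12]
modal_class = age_frequencies.index(max(age_frequencies))
grouped_mode = age_midpoints[modal_class]

print(mode_value, grouped_mode)
```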
1.7.4 Percentile: a point below which a stated percentage of the observations lie,
e.g. the median is the 50th percentile. To find the 100pth percentile, where p is any
value between 0 and 1, use the following scheme:
• let n represent the total number of observations and let k be the largest integer less
than or equal to (n + 1)*p;
• arrange the data values by increasing magnitude;
• find the kth and the (k + 1)st values in the ordered list. Denote them by a and b respectively;
• then, 100pth percentile = a + ((n + 1)*p - k)*(b - a).
For grouped data a similar scheme applies:
• let k be the greatest cumulative frequency less than or equal to (n + 1)*p and let a
be the upper limit of the corresponding interval;
• let h and b denote the cumulative frequency and upper limit of the next higher
interval;
• then, 100pth percentile = a + ((n + 1)*p - k)*((b - a)/(h - k)).
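For individual data the scheme translates almost line by line into Python. The function name `percentile` is illustrative. One assumption goes beyond the notes: when (n + 1)*p falls below 1 or reaches n, there is no kth or (k + 1)st value to interpolate between, so the sketch simply returns the smallest or largest observation.

```python
import math

def percentile(data, p):
    """100p-th percentile by the (n + 1)*p interpolation scheme, 0 < p < 1."""
    values = sorted(data)          # arrange by increasing magnitude
    n = len(values)
    position = (n + 1) * p
    k = math.floor(position)       # largest integer <= (n + 1)*p
    # Assumption (not in the notes): clamp to the extremes when the
    # position lies outside the range of ordered values.
    if k < 1:
        return values[0]
    if k >= n:
        return values[-1]
    a, b = values[k - 1], values[k]  # kth and (k + 1)st ordered values
    return a + (position - k) * (b - a)

data = [755, 613, 584, 693, 622]
print(percentile(data, 0.25), percentile(data, 0.50), percentile(data, 0.75))
```

For p = 0.5 and the five values above, (n + 1)*p = 3, so the result is the third ordered value, which agrees with the median of section 1.7.2.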
1.8 Summary Measures of Variability
1.8.1 The Range: the largest observation minus the smallest observation. For
individual data such as: 755 613 584 693 622, the range = 755 - 584 = 171.
For grouped data it is the difference between the lowest and highest class limits (for
the age data, the range = 80 - 20 = 60). In a box plot the range is the overall length of
the plot.
1.8.2 Interquartile Range: difference between the 75th percentile and the 25th
percentile - representing the scatter in the middle 50% of the observations. In a box
plot it is the length of the box.
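A quick check of both measures on the five individual values. The sketch relies on `statistics.quantiles` from the Python standard library (Python 3.8 or later); its default "exclusive" method interpolates at position (n + 1)*p in the ordered data, which is the same convention as the percentile scheme in section 1.7.4.

```python
import statistics

data = [755, 613, 584, 693, 622]

# Range: largest observation minus smallest observation.
data_range = max(data) - min(data)

# Quartiles: n=4 cuts the data into four parts and returns the
# 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```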
1.8.3 Variance: the most useful measure of variability, based on deviations of
individual observations about the central value, the mean. For individual data such as:
755 613 584 693 622,
we calculate it as follows:
• compute the mean, which we know to be 653.4 from above;
• take the deviation of each observation from the mean:
755 - 653.4 = 101.6
613 - 653.4 = -40.4
584 - 653.4 = -69.4
693 - 653.4 = 39.6
622 - 653.4 = -31.4;
• then,
variance = (sum of (deviation)^2)/(number of observations)
= ((101.6)^2 + (-40.4)^2 + (-69.4)^2 + (39.6)^2 + (-31.4)^2)/5
= 3,865.04;
• if the observations are from a sample, then
variance = (sum of (deviation)^2)/(number of observations - 1).
For grouped data such as:
Class interval (Age)   Class interval midpoint   Frequency
20-under 25            22.5                       5
25-under 30            27.5                      13
30-under 35            32.5                      12
35-under 40            37.5                       8
40-under 80            60                        12,
use the following scheme:
• compute the mean, which we know to be 37.6;
• take the deviation of each class interval midpoint from the mean:
22.5 - 37.6 = -15.1
27.5 - 37.6 = -10.1
32.5 - 37.6 = -5.1
37.5 - 37.6 = -0.1
60 - 37.6 = 22.4;
• then,
variance = (sum of (class frequency*(deviation)^2))/(number of observations)
= (5*(-15.1)^2 + 13*(-10.1)^2 + 12*(-5.1)^2 + 8*(-0.1)^2 + 12*(22.4)^2)/50
= 175.99;
• replace (number of observations) by (number of observations - 1) if the observations
are from a sample.
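The two schemes can be sketched in Python as below. The function names and the `sample` flag are illustrative. For the age data the deviation of the last midpoint is 60 - 37.6 = 22.4, which gives a population variance of 175.99.

```python
def variance(values, sample=False):
    """Mean squared deviation; divide by n - 1 when the data are a sample."""
    n = len(values)
    mean = sum(values) / n
    squared = sum((x - mean) ** 2 for x in values)
    return squared / (n - 1 if sample else n)

def grouped_variance(midpoints, frequencies, sample=False):
    """Frequency-weighted squared deviations of the class midpoints."""
    n = sum(frequencies)
    mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
    squared = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, midpoints))
    return squared / (n - 1 if sample else n)

individual = [755, 613, 584, 693, 622]
age_midpoints = [22.5, 27.5, 32.5, 37.5, 60]
age_frequencies = [5, 13, 12, 8, 12]

print(variance(individual))                # population form
print(variance(individual, sample=True))   # sample form
print(grouped_variance(age_midpoints, age_frequencies))
```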
1.8.4 Standard Deviation: positive square root of the variance; a more convenient
summary than the variance as it is in the same units as the observations themselves. It is
useful in describing frequency distributions of many populations (especially when the
normal curve fits the frequency distribution): approximately 68% of all population values lie
within mean ± standard deviation; approximately 95% within mean ± 2*standard deviation;
and approximately 99.7% within mean ± 3*standard deviation.
1.9 Composite Measures: Coefficient of variation = (standard deviation)/(mean)
measures the variability relative to the mean. Coefficient of skewness = 3*(mean - median)/(standard deviation) expresses the direction of and the degree to which a
frequency distribution is skewed.
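As a worked illustration, the standard deviation and both composite measures can be computed for the five individual values used earlier. The sketch uses the population form of the variance (division by n); that choice is an assumption made for the example, and the variable names are illustrative.

```python
import math

data = [755, 613, 584, 693, 622]
n = len(data)

mean = sum(data) / n
median = sorted(data)[n // 2]  # middle value; n is odd here

# Standard deviation: positive square root of the (population) variance.
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

coefficient_of_variation = sd / mean
coefficient_of_skewness = 3 * (mean - median) / sd

print(sd, coefficient_of_variation, coefficient_of_skewness)
```

The mean (653.4) exceeds the median (622), so the coefficient of skewness is positive, indicating a longer tail towards the larger values.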
1.10 Formulae for Sample Mean and Variance: mathematical representations of
the forms introduced in sections 1.7.1 and 1.8.3. First, some notation:
$X_i$: ith observation when the data are individual and the midpoint of the ith class
interval when the data are grouped;
$f_i$: frequency of the ith class interval when the data are grouped;
$n$: number of observations;
$m$: number of class intervals when the data are grouped;
$\bar{X}$: sample mean;
$s^2$: sample variance.
We have the following:
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \quad \text{for individual data;} \]
\[ \bar{X} = \frac{\sum_{i=1}^{m} f_i X_i}{n} \quad \Big(\text{where } n = \sum_{i=1}^{m} f_i\Big) \quad \text{for grouped data;} \]
\[ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1} \quad \text{for individual data;} \]
\[ s^2 = \frac{\sum_{i=1}^{m} f_i (X_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{m} f_i X_i^2 - n\bar{X}^2}{n - 1} \quad \text{for grouped data.} \]