Basic Statistics
• Statistics in Engineering (collect, organize, analyze, interpret)
• Collecting Engineering Data
• Data Presentation and Summary
  - Types of Data
  - Graphical Data Presentation
  - Numerical Data Presentation
• Probability Distributions
  - Discrete Probability Distribution
  - Continuous Probability Distribution
Collecting Engineering Data
Direct observation
Experiments
  - better way to produce data
Surveys
  - depends on the response rate
Personal Interview:
  - higher expected response rate and
  - fewer incorrect respondents
• Population: the entire collection of objects or outcomes about which data
are collected.
• Sample: a subset of the population containing the observed objects or
outcomes.
• Parameter: a summary measure of the population, e.g. $\mu$, $\sigma$, $p$.
• Statistic: a summary measure of the sample, e.g. $\bar{x}$, $s$, $\hat{p}$.
Population vs Sample
Parameter vs Statistic
• Statistics can be divided into two branches.
• 1) Descriptive statistics: describes the basic features of data by providing simple
summaries of the sample in the form of suitable graphical or numerical analyses.
Graphical representations:
  - stem-and-leaf plot
  - line chart
  - histogram
  - boxplot
Numerical analyses:
  - measure of central tendency
  - measure of dispersion
  - measure of position
• 2) Inferential statistics: draws conclusions about the actual population based on
sample data.
Types of Data
Qualitative vs Quantitative

Qualitative / Categorical Data
i. Deals with descriptions.
ii. Data can be observed but not measured.
Examples:
i. Defect or no defect
ii. Gender
iii. Ethnic group
iv. Colors
v. Textures

Quantitative / Numeric Data
i. Deals with numbers.
ii. Data which can be measured.
Examples:
i. Income
ii. CGPA
iii. Diameter
iv. Weight
v. Cost
The most popular charts for qualitative data:
• bar chart/column chart
• pie chart
• line chart
The most popular charts for quantitative data:
• histogram
• frequency polygon
• ogive
• box plot
• stem-and-leaf plot
Discrete vs Continuous
• Quantitative variables can be further classified as discrete or continuous.
• Discrete variables are usually obtained by counting. Discrete data take a finite or
countable number of values. For example, you can't have 2.63 people in the room.
• Continuous variables are usually obtained by measuring. Length, weight, and
time are all examples of continuous variables.
Grouped vs Ungrouped Data
• Ungrouped/raw data: data that has not been organized into groups.
• Grouped data: data that has been organized into groups (into a
frequency distribution).
• Frequency distribution: a grouping of data into mutually exclusive
classes showing the number of observations in each class.
Ungrouped data
1.0, 1.1, 1.2, 1.0, 1.1, 1.3,
1.2, 1.1, 1.0, 1.2, 1.3, 1.4,
1.2, 1.2, 1.1, 1.0, 1.0, 1.2,
1.3, 1.4, 1.0
Grouped data
Class boundaries    Frequency
0.95 – 1.15         10
1.15 – 1.35         9
1.35 – 1.55         2
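As a minimal sketch, the grouping step can be reproduced in Python. The helper name `frequency_distribution` is hypothetical, and the sketch assumes each class contains values from its lower boundary up to (but not including) its upper boundary.

```python
# Ungrouped data from the example above.
data = [1.0, 1.1, 1.2, 1.0, 1.1, 1.3,
        1.2, 1.1, 1.0, 1.2, 1.3, 1.4,
        1.2, 1.2, 1.1, 1.0, 1.0, 1.2,
        1.3, 1.4, 1.0]

# Class boundaries from the grouped table (assumed convention: lower <= x < upper).
boundaries = [(0.95, 1.15), (1.15, 1.35), (1.35, 1.55)]


def frequency_distribution(values, classes):
    """Count how many values fall into each mutually exclusive class."""
    return [sum(1 for x in values if lower <= x < upper)
            for lower, upper in classes]


frequencies = frequency_distribution(data, boundaries)
for (lower, upper), f in zip(boundaries, frequencies):
    print(f"{lower:.2f} - {upper:.2f}: {f}")
```

Because the boundaries (0.95, 1.15, ...) never coincide with an observed value, every observation falls into exactly one class.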
Example:
• About 50 UniMAP students were asked about their background and the results are
as follows. Display your data in suitable form.
Code used:
Gender: 1 = male, 2 = female
Ethnic group: 1 = Malay, 2 = Chinese, 3 = Indian, 4 = others

Respondent  Gender  Ethnic group  Family income  CGPA
1           1       1             1000           3.00
2           1       1             1600           3.37
3           1       1             8000           3.59
4           1       4             1360           2.50
5           1       3             800            3.19
6           1       1             1250           2.96
7           1       2             1200           3.65
8           1       1             3000           3.04
9           1       2             4500           2.80
10          1       1             3000           3.39
11          2       1             2380           3.16
12          2       1             800            3.67
13          2       3             2000           3.40
14          1       1             2000           3.10
15          2       1             1000           3.31
16          2       2             3500           3.80
17          2       1             1600           3.16
18          2       1             1803           2.84
19          1       2             3000           3.35
20          2       1             1400           3.20
21          2       1             3000           3.00
22          2       1             4000           2.80
23          1       2             4780           2.78
24          2       1             4300           2.90
25          1       3             2500           3.02
26          2       1             1900           2.82
27          2       1             1000           3.02
28          2       1             1500           3.47
29          2       3             2000           3.60
30          2       2             8000           3.41
31          2       1             2000           3.23
32          2       1             2000           3.25
33          1       2             1000           3.39
34          1       1             1570           3.20
35          1       1             7000           3.01
36          2       1             1000           2.98
37          1       1             3000           3.45
38          2       1             1000           3.13
39          1       4             1980           3.30
40          2       1             1200           2.60
41          2       1             2670           2.89
42          1       1             2000           2.90
43          1       3             3596           3.70
44          1       1             2000           3.11
45          1       2             5000           3.34
46          1       1             1500           3.82
47          1       2             2500           3.61
48          1       1             2500           3.25
49          1       1             2500           3.85
50          1       3             1500           3.67
Graphical presentation of qualitative data
Frequency table
Observation   Frequency
Malay         33
Chinese       9
Indian        6
Others        2
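A short sketch of how the frequency table converts to relative frequencies, which are the shares a pie chart of these data would display. The variable names are illustrative only; the counts are taken from the frequency table above.

```python
# Frequencies of ethnic group from the frequency table above.
counts = {"Malay": 33, "Chinese": 9, "Indian": 6, "Others": 2}

total = sum(counts.values())  # number of respondents

# Relative frequency of each category, expressed as a percentage.
percentages = {group: 100 * f / total for group, f in counts.items()}

for group, pct in percentages.items():
    print(f"{group}: {pct:.0f}%")
```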
• Bar chart: used to display the frequency distribution in graphical form.
• Pie chart: used to display the frequency distribution. It displays the ratio of
the observations.
[Pie chart: Malay 66%, Chinese 18%, Indian 12%, Others 4%]
• Line chart: used to display the trend of observations. It is a very popular
display for data recorded over time.
Month   Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
Value    10    7    5   10   39    7  260  316  142   11    4    9
Graphical presentation of quantitative data
• Histogram: looks like a bar chart except that the horizontal axis represents
data that are quantitative in nature. There are no gaps between the bars.
• Frequency polygon: looks like a line chart except that the horizontal axis
represents the class marks of the data, which are quantitative in nature.
• Ogive: a line graph in which the horizontal axis represents the upper limit of each
class interval and the vertical axis represents the cumulative frequencies.
Data Summary
• Summary statistics are used to summarize a set of observations.
a) Measures of Central Tendency
  - Mean
  - Median
  - Mode
b) Measures of Dispersion
  - Range
  - Variance
  - Standard deviation
c) Measures of Position
  - z scores
  - Percentiles
  - Quartiles
  - Outliers
a) Measures of Central Tendency
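Mean
• The mean of a sample is the sum of the sample data divided by the
number of observations in the sample.
• The mean for ungrouped data is given by:
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \quad \text{or} \quad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \quad \text{for } i = 1, 2, \dots, n$$
• The mean for grouped data is given by:
$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \quad \text{or} \quad \bar{x} = \frac{\sum fx}{\sum f}$$
where $x_i$ is the class mark and $f_i$ is the frequency of class $i$.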
Example 2 (Ungrouped data):
Mean for the set of data 3, 5, 2, 6, 5, 9, 5, 2, 8, 6
Solution:
$$\bar{x} = \frac{3 + 5 + 2 + 6 + 5 + 9 + 5 + 2 + 8 + 6}{10} = \frac{51}{10} = 5.1$$
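The two mean formulas can be checked with a few lines of Python. This is only a sketch: `grouped_mean` is a hypothetical helper, and the class marks 1.05, 1.25 and 1.45 are assumed to be the midpoints of the class boundaries in the grouped-data table shown earlier.

```python
from statistics import mean

# Ungrouped data from Example 2.
data = [3, 5, 2, 6, 5, 9, 5, 2, 8, 6]
x_bar = sum(data) / len(data)  # sum of the observations divided by n


def grouped_mean(class_marks, frequencies):
    """Mean of grouped data: sum(f * x) / sum(f), with x the class mark."""
    weighted_sum = sum(f * x for x, f in zip(class_marks, frequencies))
    return weighted_sum / sum(frequencies)


# Grouped data from the earlier frequency distribution (midpoints assumed).
g_mean = grouped_mean([1.05, 1.25, 1.45], [10, 9, 2])

print(x_bar, mean(data), round(g_mean, 4))
```

The standard-library `statistics.mean` gives the same result as the hand formula for the ungrouped data.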
Median of ungrouped data:
• The median depends on the number of observations in the data,
n. If n is odd, then the median is the (n+1)/2 th observation of the
ordered observations.
• If n is even, then the median is the arithmetic mean of the
n/2 th observation and the (n/2 + 1) th observation of the ordered observations.
Median of grouped data:
$$\text{Median} = L + c\left(\frac{\frac{\sum f}{2} - F_{j-1}}{f_j}\right)$$
where
L = the lower class boundary of the median class
c = the size of the median class interval
$F_{j-1}$ = the sum of frequencies of all classes lower than the median class
$f_j$ = the frequency of the median class
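A minimal sketch of the grouped-median formula, applied to the frequency distribution from the grouped-data example earlier (class boundaries 0.95 to 1.55). The function name `grouped_median` is hypothetical; the median class is taken to be the first class whose cumulative frequency reaches half of the total frequency.

```python
def grouped_median(boundaries, frequencies):
    """Median of grouped data: L + c * ((sum(f) / 2 - F) / f_median)."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("frequency distribution is empty")
    cumulative = 0  # F: total frequency of the classes below the current one
    for (lower, upper), f in zip(boundaries, frequencies):
        if cumulative + f >= n / 2:  # this is the median class
            c = upper - lower        # class size
            return lower + c * ((n / 2 - cumulative) / f)
        cumulative += f


boundaries = [(0.95, 1.15), (1.15, 1.35), (1.35, 1.55)]
frequencies = [10, 9, 2]
print(round(grouped_median(boundaries, frequencies), 4))
```

Here the total frequency is 21, so the median class is 1.15 – 1.35 with L = 1.15, c = 0.2, F = 10 and f = 9.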
Example 4 (Ungrouped data):
n is odd
Find the median for data 4,6,3,1,2,5,7 ( n = 7)
Rearrange the data : 1,2,3,4,5,6,7
(median = (7+1)/2=4th place)
Median = 4
n is even
Find the median for data 4,6,3,2,5,7 (n = 6)
Rearrange the data : 2,3,4,5,6,7
Median = (4+5)/2 = 4.5
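The odd and even cases can be sketched in Python as follows. The function name `median_ungrouped` is illustrative; positions in the comments are 1-based, as in the text.

```python
def median_ungrouped(values):
    """Median of raw data: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the (n+1)/2 th observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of the n/2 th and (n/2 + 1) th


print(median_ungrouped([4, 6, 3, 1, 2, 5, 7]))  # n = 7 (odd)
print(median_ungrouped([4, 6, 3, 2, 5, 7]))     # n = 6 (even)
```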
Mode
• Mode of ungrouped data:
  - The value with the highest frequency in a data set.
  - It is important to note that there can be more than one
mode, and if no number occurs more than once in the set,
then there is no mode for that set of numbers.
Find the mode for the set of data 3, 5, 2, 6, 5, 9, 5, 2, 8, 6
Mode = number occurring most frequently = 5
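A small sketch of the mode rule, including the "more than one mode" and "no mode" cases described above. The helper name `modes` is hypothetical; it returns a list so that several modes can be reported.

```python
from collections import Counter


def modes(values):
    """Return every value with the highest frequency, or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no value occurs more than once, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([3, 5, 2, 6, 5, 9, 5, 2, 8, 6]))  # data from the example
print(modes([1, 1, 2, 2, 3]))                 # two modes
print(modes([1, 2, 3]))                       # no mode
```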
b) Measures of Dispersion
• Range = largest value – smallest value
• Variance: measures the variability (differences) existing in a set of data.
The variance for ungrouped data:
  - For a sample:
$$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
  - For a population:
$$\sigma^2 = \frac{\sum (x - \mu)^2}{n}$$
The variance for grouped data:
  - For a sample:
$$S^2 = \frac{\sum fx^2 - n\bar{x}^2}{n - 1} \quad \text{or} \quad S^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n - 1}$$
  - For a population:
$$\sigma^2 = \frac{\sum fx^2 - n\mu^2}{n} \quad \text{or} \quad \sigma^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n}$$
Standard deviation: the positive square root of the variance.
$$S = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \ \text{(ungrouped data)}, \qquad S = \sqrt{\frac{\sum fx^2 - n\bar{x}^2}{n - 1}} \ \text{(grouped data)}$$
• A large variance means that the individual scores (data) of
the sample deviate a lot from the mean.
• A small variance indicates that the scores (data) deviate little
from the mean.
Example 8 (Ungrouped data)
Find the variance and standard deviation of the
sample data: 3, 5, 2, 6, 5, 9, 5, 2, 8, 6
$$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
$$s^2 = \frac{(3-5.1)^2 + (5-5.1)^2 + (2-5.1)^2 + (6-5.1)^2 + (5-5.1)^2 + (9-5.1)^2 + (5-5.1)^2 + (2-5.1)^2 + (8-5.1)^2 + (6-5.1)^2}{9}$$
$$s^2 = \frac{48.9}{9} = 5.43$$
$$s = \sqrt{s^2} = \sqrt{5.43} = 2.33$$
Exercise 4 (submit on Thursday)
The following sample data give the number of iPads sold by a mail
order company on each of 30 days. (Hint: use 5 classes.)
8 25 11 15 29 22 10 5 22 13
26 16 18 12 9 23 14 19 23 20
16 27 9 17 21 26 20 16 21 14
a) Construct a frequency distribution table.
b) Find the mean, median, mode, variance and standard deviation.
c) Construct a histogram.
Rules of Data Dispersion
By using the mean $\bar{x}$ and the standard deviation $s$, we can find the
percentage of total observations that fall within a given interval
about the mean.
Empirical Rule
Applicable to a symmetric, bell-shaped distribution (normal
distribution).
There are 3 rules:
i. About 68% of the data lie within one standard deviation of the
mean, $(\bar{x} \pm s)$.
ii. About 95% of the data lie within two standard deviations of the
mean, $(\bar{x} \pm 2s)$.
iii. About 99.7% of the data lie within three standard deviations of the
mean, $(\bar{x} \pm 3s)$.
Example 10
The age distribution of a sample of 5000 persons is bell shaped with a
mean of 40 yrs and a standard deviation of 12 yrs. Determine the
approximate percentage of people who are 16 to 64 yrs old.
Solution:
$$\bar{x} \pm s = 40 \pm 12 = [28, 52]$$
$$\bar{x} \pm 2s = 40 \pm 2(12) = [16, 64]$$
$$\bar{x} \pm 3s = 40 \pm 3(12) = [4, 76]$$
Approximately 68% of the measurements fall between 28 and
52, approximately 95% fall between 16 and 64, and approximately 99.7%
fall between 4 and 76. Hence approximately 95% of the people are 16 to 64 yrs old.
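The three intervals in Example 10 follow mechanically from the mean and standard deviation. A minimal sketch, assuming a bell-shaped distribution so that the empirical rule applies; the function name `empirical_intervals` is hypothetical.

```python
def empirical_intervals(mean, std):
    """Return {k: (lower, upper, approximate % of data)} for k = 1, 2, 3 standard deviations."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}  # empirical-rule percentages
    return {k: (mean - k * std, mean + k * std, pct)
            for k, pct in coverage.items()}


intervals = empirical_intervals(40, 12)  # mean 40 yrs, standard deviation 12 yrs
for k, (lower, upper, pct) in intervals.items():
    print(f"within {k} s: [{lower}, {upper}] holds about {pct}% of the data")
```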
c) Measures of Position
• To describe the relative position of a certain data value
within the entire set of data.
  - z scores
  - Percentiles
  - Quartiles
  - Outliers
Quartiles
• Quartiles divide a data set into four equal parts, where each part accounts for
about 25% of the data distribution.
Minimum value | 25% of data | Q1 | 25% of data | Q2 | 25% of data | Q3 | 25% of data | Maximum value
Find Q1, Q2, and Q3 for the following data: 15, 13, 6, 5, 12, 50, 22, 18
Step 1: Arrange the data in order.
5, 6, 12, 13, 15, 18, 22, 50
Step 2: Find the median (Q2).
5, 6, 12, 13, 15, 18, 22, 50
Q2 = (13+15)/2 = 14
Step 3: Find the median of the data values less than 14.
5, 6, 12, 13
Q1 = (6+12)/2 = 9
Step 4: Find the median of the data values greater than 14.
15, 18, 22, 50
Q3 = (18+22)/2 = 20
Example: 5, 8, 4, 4, 6, 3, 8 (n = 7)
1. Arrange the data in order:
3, 4, 4, 5, 6, 8, 8
Q2 = median = 5
2. Q1: Find the median of the data values less than 5.
3, 4, 4
Q1 = 4
3. Q3: Find the median of the data values greater than 5.
6, 8, 8
Q3 = 8
Therefore, Q1 = 4, Q2 = 5, Q3 = 8
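The median-of-halves procedure used in the two quartile examples can be sketched as below. The function names are hypothetical, and one assumption is made explicit: the halves are split by position in the ordered data, and for odd n the middle observation is left out of both halves. Software packages often use other quartile conventions and may give slightly different values.

```python
def _median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]        # observations below the median position
    upper_half = ordered[(n + 1) // 2:]   # observations above the median position
    return _median(lower_half), _median(ordered), _median(upper_half)


print(quartiles([15, 13, 6, 5, 12, 50, 22, 18]))  # first example (n = 8)
print(quartiles([5, 8, 4, 4, 6, 3, 8]))           # second example (n = 7)
```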
Exercise:
The following data represent the number of inches of rain in Chicago during
the month of April for 10 randomly selected years.
2.47, 3.97, 3.94, 4.11, 5.22, 1.14, 4.02, 3.41, 1.85, 0.97
Determine the quartiles.
Answer:
Q1 = 1.85, Q2 = 3.675, Q3 = 4.02
Outliers
• Extreme observations.
• Can occur because of errors in the measurement of a variable,
during data entry, or in sampling.
Checking for outliers by using quartiles
Step 1:
Determine the first and third quartiles of the data.
Step 2:
Compute the interquartile range (IQR), IQR = Q3 - Q1.
Step 3:
Determine the fences. Fences serve as cut-off points for identifying
extreme values in the tails of the distribution:
Lower fence = Q1 - 1.5(IQR)
Upper fence = Q3 + 1.5(IQR)
Lower outer fence = Q1 - 3(IQR)
Upper outer fence = Q3 + 3(IQR)
Step 4:
If a data value is less than the lower fence or greater than the upper fence, it is
considered an outlier. A point beyond an outer fence is considered an extreme outlier.
Example
2.47, 3.97, 3.94, 4.11, 5.22, 1.14, 4.02, 3.41, 1.85, 0.97
Determine whether there are outliers in the data set.
Arrange the data in ascending order:
0.97, 1.14, 1.85, 2.47, 3.41, 3.94, 3.97, 4.02, 4.11, 5.22
Follow the steps to find the quartiles:
Q2 = (3.41 + 3.94)/2 = 3.675
Lower half: 0.97, 1.14, 1.85, 2.47, 3.41, so Q1 = 1.85
Upper half: 3.94, 3.97, 4.02, 4.11, 5.22, so Q3 = 4.02
IQR = Q3 - Q1 = 4.02 - 1.85 = 2.17
Lower fence = Q1 - 1.5(IQR)
= 1.85 - 1.5(2.17)
= -1.405
Upper fence = Q3 + 1.5(IQR)
= 4.02 + 1.5(2.17)
= 7.275
Since no data value is less than -1.405 or greater than 7.275,
there are no outliers in the data.
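The fence computation can be sketched in Python as follows. The helper names `fences` and `find_outliers` are hypothetical. For the rainfall data the sketch uses Q1 = 1.85 and Q3 = 4.02, the quartiles given in the answer to the exercise above; for the earlier data set 5, 6, 12, 13, 15, 18, 22, 50 it uses Q1 = 9 and Q3 = 20.

```python
def fences(q1, q3):
    """Inner and outer fences computed from the first and third quartiles."""
    iqr = q3 - q1  # interquartile range
    return {
        "lower": q1 - 1.5 * iqr,
        "upper": q3 + 1.5 * iqr,
        "lower_outer": q1 - 3 * iqr,
        "upper_outer": q3 + 3 * iqr,
    }


def find_outliers(values, q1, q3):
    """Values below the lower fence or above the upper fence."""
    f = fences(q1, q3)
    return [x for x in values if x < f["lower"] or x > f["upper"]]


rain = [2.47, 3.97, 3.94, 4.11, 5.22, 1.14, 4.02, 3.41, 1.85, 0.97]
rain_fences = fences(1.85, 4.02)
print(round(rain_fences["lower"], 3), round(rain_fences["upper"], 3))
print(find_outliers(rain, 1.85, 4.02))

# Earlier quartile example: 50 lies above the upper fence of 36.5.
print(find_outliers([5, 6, 12, 13, 15, 18, 22, 50], 9, 20))
```

In the second data set the value 50 is flagged as an outlier but is still below the upper outer fence (20 + 3(11) = 53), so it would not count as an extreme outlier.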
Boxplot (Graphical presentation for quantitative data)
• The five-number summary can be used to create a simple graph
called a boxplot.
Minimum
Q1
Median
Q3
Maximum
• From the boxplot, you can quickly detect any skewness in the shape of
the distribution and see whether there are any outliers in the data set.
[Figure: boxplot with the lower fence and upper fence marked; points beyond the fences are shown as outliers]
The Five Number Summary
• Compute the five-number summary and construct the box plot of
the data
2.47, 3.97, 3.94, 4.11, 5.22, 1.14, 4.02, 3.41, 1.85, 0.97
min = 0.97, Q1 = 1.85, Q2 = 3.675, Q3 = 4.02, max = 5.22
IQR = 2.17
[Figure: boxplot with min = 0.97, Q1 = 1.85, Q2 = 3.675, Q3 = 4.02, max = 5.22]
- The distribution is skewed to the left
Interpreting Boxplot
- Symmetric
- Left skewed or negatively skewed: the tail extends to the left
- Right skewed or positively skewed: the tail extends to the right
Mean/Median Versus Skewness
Left skewed: Mean < Median < Mode
Right skewed: Mean > Median > Mode
Symmetric: Mean = Median = Mode
STEM-AND-LEAF
• Another technique that is used to present quantitative data is the stem-and-leaf
plot.
• An advantage of a stem-and-leaf plot over a frequency distribution is that by
preparing a stem-and-leaf plot, we do not lose information on individual
observations.
• A stem-and-leaf plot is used only for quantitative data.
• In a stem-and-leaf display of quantitative data, each value is divided into two
portions: a stem and a leaf. The leaves for each stem are shown separately in a
display.
• A stem-and-leaf plot displays a set of data, usually a large data set.
• Stem-and-leaf plots emphasize place value. The stem is the largest
place value(s) of a number and the leaf is the smallest place value of
a number in your data set.
• Step 1:
Find the least and the greatest number in the set of data.
• Step 2:
Make two columns with the titles STEM and LEAF.
• Step 3:
Write the digits that form the stem in the STEM column.
• Step 4:
Write the digits that form the leaf for each number in the
LEAF column across from the STEM of the number.
Example:
The following are the scores of 30 college students on a statistics test.
75 52 80 96 65 79 71 87 93 95
69 72 81 61 76 86 79 68 50 92
83 84 77 64 71 87 72 92 57 98
• For the score of the first student, which is 75, 7 is the stem and 5 is the leaf.
For the score of the second student, which is 52, 5 is the stem and 2 is the leaf.
• From the data, the stems for all scores are 5, 6, 7, 8 and 9 because all
scores lie in the range 50 to 98.
• After we have listed the stems, we read the leaves for all scores and record
them next to the corresponding stems on the right side of the vertical line.
Now we read all the scores and write the leaves on the right side of the vertical line
in the rows of the corresponding stems. By looking at the stem-and-leaf display of test
scores, we can observe how the data values are distributed. For example, stem
7 has the highest frequency, followed by stems 8, 9, 6 and 5. The leaves for each stem
of the stem-and-leaf display of test scores are ranked in increasing order and presented
below:
Stem | Leaf
5    | 0 2 7
6    | 1 4 5 8 9
7    | 1 1 2 2 5 6 7 9 9
8    | 0 1 3 4 6 7 7
9    | 2 2 3 5 6 8
Ranked stem-and-leaf display of test scores.
The distribution of the data appears skewed to the left.
* Analyze: 9 out of 30 college students scored between 71 and 79.
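The ranked display can be reproduced with a brief Python sketch. The function name `stem_and_leaf` is hypothetical, and the sketch assumes non-negative whole numbers whose last digit is the leaf and whose remaining digits form the stem, as in the test-score data.

```python
from collections import defaultdict

# Scores of the 30 college students from the example.
scores = [75, 52, 80, 96, 65, 79, 71, 87, 93, 95,
          69, 72, 81, 61, 76, 86, 79, 68, 50, 92,
          83, 84, 77, 64, 71, 87, 72, 92, 57, 98]


def stem_and_leaf(values):
    """Map each stem (tens and above) to its leaves (units digit), ranked in increasing order."""
    plot = defaultdict(list)
    for value in sorted(values):         # sorting first keeps the leaves ranked
        stem, leaf = divmod(value, 10)   # e.g. 75 -> stem 7, leaf 5
        plot[stem].append(leaf)
    return dict(plot)


display = stem_and_leaf(scores)
for stem in sorted(display):
    print(stem, "|", " ".join(str(leaf) for leaf in display[stem]))
```

The length of each leaf list is the frequency of that stem, so stem 7 with nine leaves has the highest frequency.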
What you MUST know
• Define statistics and its application in engineering.
• Explain the concept of population and sample.
• Compute and interpret the measures of central tendency (MCT), measures of dispersion (MD)
and measures of position (MP).
• Construct and interpret several graphical presentations (histogram, box plot, stem-and-leaf
plot).
• Explain how graphical presentations are used to compare two or more sets of data.
• Compare MCT, MD and MP for two or more sets of data.