Engineering Statistics
Mnge 417
Introduction
©Dr. B. C. Paul 2003
Why Should Engineers Even
Care?
• Engineers Design and Plan
– All heard of significant figures
15.232176451234 gets called 15.25
– Is everything built actually 15.25 ft
• Are our roof bolt spacings in the field really 5 feet?
– Much of the profession is built around engineering
tolerances: how close do I have to be to make
it work?
• reality is then (hopefully) a bunch of minor
variations around acceptable answer
Building to Tolerance
• Design says shafts are machined to 1.25 inches +/- some tolerance
• Reality says there is actually a bunch of very
similar sizes that are close to 1.25 inches
– Every so often we will get a dud (we accept the reality
but want to minimize frequency)
• Often can’t check all parts for tolerance but can
check every so many to make sure process is
under control
– Sample is a few values collected from a larger
population
• Statistical Probability Distribution is a model of
this process
Engineers Make Changes as a
Means to Improve Things
• Mining Engineering Example
– Coal production from a face area is critical to
costs and competitiveness
– Make a policy or equipment change - does it
work
• Will Joy’s new high voltage miner really improve
coal production?
• Will change in a ventilation pattern really reduce
dust violations that limit production?
Cause and Effect Relationships
• Very few real results of anything are just
one value
– Coal production has up and down days
• Does the new policy, equipment or practice
result in more up days of higher value?
– How many good results do you actually need to
see before you can feel confident that it’s not
just coincidental higher (or lower) values?
• Varying effects can be modeled as
probability distribution
Engineering Design Practice
• You have a bunch of equations and formulas
that tell you whether something should
work.
• Next design step is often to consider that
things don’t always work exactly as they
should
– Mining Truck or a Water Treatment Plant
processing train do not work all the time
• Thus real production is different than the design
equation
Modeling
• We build a mathematical model of the
situation and then do the math to see if it is
going to work for us in the real world
• We may not think of it but most of our
engineering design equations are
mathematical models that were fit to actual
data long ago
– Newtonian physics (we call them laws now)
– Darcy’s law and the Bernoulli Equation
How do You Decide if a
Mathematical Model Fits What
You See?
• Because you usually can’t measure with 100%
accuracy, or don’t think of or can’t consider
every minor effect
– Real results tend to be distributed around our
potential mathematical models
• Statistical models consider a distribution of
answers around an underlying trend
Sometimes you don’t know what
is driving a result
• Is absenteeism being driven by work
assignments, health, deer season etc.
• Statistical models can compare variations to
possible causes and help identify what is
driving things.
Spatial Relationships
• We take samples of an ore body
– Do the results mean we have a certain tonnage of ore at
a certain grade?
• We use samples to tell us what material to take to
the processing plant or waste dump
• We may want to tell our mill operator how much a
grade of ore may go up or down
• We can have statistical models built with a spatial
or location relationship.
How Statistics Works
• Often trained to think that answer to real
world problems comes out of an equation
• We actually create mathematical models
that approximately fit reality and then work
off of something predictable
– math that actually is used to study mathematical
models may be something only a French
Mathematician could love
– A lot of the basic ideas are fairly intuitive
Example
• If I have a random number generator that
produces numbers between 1 and 100, what
value is most likely?
• If I take 25 of those random numbers what
will the average value most likely be close
to?
What Did You Assume to Get
Those Answers?
• You assumed how those values were
distributed
– You considered what was called a uniform
distribution (all numbers are equally likely to
come up)
– Statistics begins with a series of standard
mathematical distributions
• We try to pick one that most nearly matches our
reality
Getting Your Answers
• You also assumed that the numbers were
taken from that distribution at random
– i.e. no one is cherry picking any values
in preference to any other
– One of the reasons that statisticians get so crazy
if they think someone is Cherry Picking the
sample
• Root of all Statistics is that you assume
reality follows a standard mathematical
distribution and the part we see was picked
at random from that distribution
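The random number example can be tried out with a short simulation. This is only a sketch: the seed, the number of repeats, and the helper name average_of_draws are assumptions made for illustration and are not part of the lecture.

```python
import random

def average_of_draws(rng, n_draws=25, low=1, high=100):
    """Average n_draws integers picked uniformly at random from low..high."""
    draws = [rng.randint(low, high) for _ in range(n_draws)]
    return sum(draws) / n_draws

# Fixed seed (an arbitrary choice) so the run is repeatable.
rng = random.Random(417)

# Repeat the "take 25 random numbers and average them" experiment many times.
averages = [average_of_draws(rng) for _ in range(2000)]
grand_mean = sum(averages) / len(averages)

# With a uniform distribution on 1..100 the averages cluster near 50.5.
print(f"typical average of 25 draws: {grand_mean:.1f}")
```

Both assumptions from the slides are visible in the code: randint treats every value as equally likely (uniform distribution), and nothing filters the draws (random sampling).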
How Do We Come Up With What
Distribution Closely Resembles Our
Reality?
• Process Starts with Figuring Out Which of Our
Standard Model Distributions it is
• Three Levels of Effort
• Say “I Believe” and assume one
– Most commonly done with “Normal Distribution” “Bell Curve”
– Many things tend to be normally distributed
– Strength of past experience becomes rationale
• Also have people who do it without having any
idea what they have done
– Standard statistics is built around normal distribution
Levels of Effort
• Level 2
– Study the distribution to see if we are doing
something terrible
– Common approach is called a “Histogram”
• it’s a bar graph that we plot our data on so we can
look at it
– Also have things like probability paper where
you plot your data and see if you get a straight
line
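A histogram does not need special software to sketch. The example below is a minimal text version, assuming made-up sample data and a hypothetical helper called text_histogram; it only counts how many values fall in each bin and prints a bar for each.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Count values per bin and return (bin_start, count) pairs in order."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return sorted(counts.items())

# Hypothetical sample data, made up for illustration only.
sample = [2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8]
bin_width = 2

bins = text_histogram(sample, bin_width)
for start, count in bins:
    # One '#' per observation in the bin, so the bar length is the frequency.
    print(f"{start:>2}-{start + bin_width - 1:<2} {'#' * count}")
```

Bars that pile up in the middle and tail off on both sides suggest a central tendency; a flat or lopsided picture is a warning that assuming a bell curve may be doing something terrible.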
Effort Level 3
• Use statistical techniques to test whether
our sample data is like a set that could
reasonably be pulled from some standard
distribution
– These are often called goodness of fit tests
• All three levels of effort have some degree
of custom for their use in some practices
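As one illustration of a Level 3 check, the sketch below computes a chi-square goodness of fit statistic against a uniform distribution. The choice of test, the observed counts, and the five-bin layout are all assumptions made for this example; the lecture does not say which test it has in mind.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts of 100 samples falling into 5 equal-width bins.
observed = [18, 22, 20, 25, 15]

# A uniform distribution would put the same count in every bin.
expected = [sum(observed) / len(observed)] * len(observed)

statistic = chi_square_statistic(observed, expected)

# 9.488 is the tabulated 5% critical value for 4 degrees of freedom
# (5 bins minus 1). A statistic below it gives no reason to reject the fit.
CRITICAL_5PCT_DF4 = 9.488
consistent_with_uniform = statistic < CRITICAL_5PCT_DF4
print(statistic, consistent_with_uniform)
```

The test never proves the data is uniform; it only says whether a sample like this could reasonably have been pulled from that distribution.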
Measuring Properties of
Distributions
• Put sample data into a standard equation
that generates a number
– Often actually call that number a statistic
– Measures some property of the distribution that
the data was taken from
• Some statistics have obvious tangible
meaning
– Example - Mean - mathematical average value
of the sample or population
Calculating a Mean (or simple
average)
• Add up all the numbers and then divide by however many
numbers you added
• Example
– Numbers 5, 10, 15, 20, 25
– What is the Mean?
• Calculate
– (5 + 10 + 15 + 20 + 25)/5
– Numerator totals to 75
– Denominator is the number of values I put in
– Divide the total by the number of values put in
– Answer is 15 (the Mean or Average Value)
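The same calculation as a few lines of Python, as a minimal sketch (the function name mean is just a label chosen here):

```python
def mean(values):
    """Add up all the numbers, then divide by how many there are."""
    return sum(values) / len(values)

numbers = [5, 10, 15, 20, 25]
total = sum(numbers)       # the numerator
average = mean(numbers)    # total divided by the number of values
print(total, average)
```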
Statisticians Need Confusing Ways
to Write Equations
• $X_i$ means a sample value
– The i subscript tells you whether it was the first, second, third etc. sample
• From example on last slide we know $X_2$ was the second number we looked at, which was 10
• Σ means the sum of a series of values
• n means the number of samples considered
• Thus we write the formula for mean as
– $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
• We of course also have a special symbol for a mean
– $\bar{X}$
Can Do Problems with Software
(in this case SPSS)
Type in Data
Ready to Enter Data
Type in the Data
Command to Analyze
Pull Down Analyze Menu
Highlight Descriptive Statistics
Highlight Frequencies
Click on Frequencies
It gives me a list of
Variables to use
This list is tough with
Only one variable
Click Statistics
Highlight the variable
And push the arrow
To move it into the
Use area
Choose Your Statistics
Check off Mean
And push continue
Click OK on the Frequencies Screen
Read Off Our
Mean at 2.89
More Measurements
• Mode
– The value that has the greatest chance of
coming up
• Example
– If I have 10 people who are 5’10”
– 2 people who are 4’3”
– 2 people who are 6’10”
– If you pick a person at random from my group what height will that person most likely be?
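The height example can be counted directly. In this sketch the heights are converted to inches, which is a convenience assumed here and not something the slide does.

```python
from collections import Counter

# Heights from the example, in inches: 5'10" = 70, 4'3" = 51, 6'10" = 82.
heights = [70] * 10 + [51] * 2 + [82] * 2

counts = Counter(heights)
# most_common(1) returns the single value that occurs most often.
mode_height, mode_count = counts.most_common(1)[0]

# Chance that a person picked at random has the modal height.
chance = mode_count / len(heights)
print(mode_height, round(chance, 2))
```

The mode is the 5’10” group because 10 of the 14 people share that height.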
More Measures
• Median
– Half of the values are higher - half are lower
• Mean, Median, and Mode all seem to have
somewhat obvious physical meanings
• Other statistics are less obvious
– Variance
– A number that comes out of a formula that tells
you how spread out the distribution is
• Square root of variance is Standard
Deviation
– Roughly the average difference between a sample and the
mean value
The Standard Deviation
• Standard Deviation is roughly the average difference
between individual samples and the mean
$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
What does it mean?
Take each sample number, subtract the average sample value from it, and square the result.
Do this for every number and add up the results.
Then divide by one less than the number of samples you took, and take the square root of that value.
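The recipe above translates almost word for word into code. This is a sketch that reuses the numbers 5, 10, 15, 20, 25 from the mean example; the function name sample_std_dev is a label chosen here, and statistics.stdev from the Python standard library is used only as a cross-check.

```python
import math
import statistics

def sample_std_dev(values):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(values)
    average = sum(values) / n
    # Subtract the average from each number and square the result.
    squared_diffs = [(x - average) ** 2 for x in values]
    # Add them up, divide by one less than the sample count, take the root.
    return math.sqrt(sum(squared_diffs) / (n - 1))

numbers = [5, 10, 15, 20, 25]   # data reused from the mean example
s = sample_std_dev(numbers)
library_value = statistics.stdev(numbers)   # cross-check
print(round(s, 3), round(library_value, 3))
```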
As a Practical Matter That’s a Pain
• I have to compute the average before I can do the
math for standard deviation
• Alternative Formula
$s = \sqrt{\frac{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n(n - 1)}}$
Tells you to keep track of two numbers
1- Take each number, square it, and then add the squares up
2- Take each number and add them up, and then square
the total
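The two running totals can be kept in a single pass over the data, which is the point of the alternative formula. A minimal sketch, with variable names (sum_x, sum_x2) chosen here for illustration:

```python
import math

def std_dev_from_running_totals(values):
    """Sample standard deviation using only running totals (one pass)."""
    n = 0          # how many numbers have been entered
    sum_x = 0.0    # total of the numbers
    sum_x2 = 0.0   # total of the squared numbers
    for x in values:
        n += 1
        sum_x += x
        sum_x2 += x * x
    # n * (sum of squares) minus (square of the sum), over n * (n - 1).
    return math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

numbers = [5, 10, 15, 20, 25]
s_one_pass = std_dev_from_running_totals(numbers)
print(round(s_one_pass, 3))
```

No average is needed up front, so each number can be processed as it arrives. One caution that is general knowledge and not from the slides: with very large values and tiny spreads this form can lose precision to rounding.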
Getting Standard Deviation
• Statistical Calculators have multiple
memories
– They add up numbers in one memory
– They square and add up numbers in another
– They total entries in another
– They then apply the standard deviation formula
• Of course can also use SPSS
Doing Standard Deviation with
SPSS
Pull Down Analyze
Highlight Descriptive
Statistics
Highlight and click
frequencies
Check Off Standard Deviation
Push Continue
Push Ok on the
Frequencies
menu
Read Off the Output
Std is 1.12
Variance is also a measure of how
much things differ from their
average
• Variance is just the standard deviation
squared
• To calculate a variance just do the standard
deviation thing without taking the square
root at the end
• Of course I could also check off variance
instead of Standard Deviation in SPSS
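For readers without SPSS, the relationship between the two numbers can be checked with the Python standard library. This sketch reuses the numbers 5, 10, 15, 20, 25 from the mean example, not the SPSS data set shown on the slides.

```python
import statistics

numbers = [5, 10, 15, 20, 25]

# Both functions use the sample (n - 1) divisor.
variance = statistics.variance(numbers)
std_dev = statistics.stdev(numbers)

# Squaring the standard deviation gives back the variance.
print(variance, round(std_dev ** 2, 6))
```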
Types of Distributions
• Idea is that we try to approximate reality with a
mathematically defined distribution
– Then we can use mathematical operations to predict our
answers
• Distributions that often fit reality
– Normal Distribution (developed in 1733)
• Bell Curve
– Uniform Distribution
– Binomial Distribution
– T Distribution
– Chi Square Distribution
– Lognormal Distribution
Derived Distributions
• T distribution, Chi Squared, and Lognormal
Distributions are all derived from the
Normal Distribution for specific types of
situations
Normal Distribution
• Shaped Like
Formula
$Y = f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
Symmetric Distributions with a
Central Tendency
• Normal Distribution is classic example
– Most of the chances are right near the center of the
distribution
• Frequency drops off to sides
• Mode is at the Center of the Distribution
– Distribution is mirror image about its center
• Allows us to compute just one side
• Median, Mean, and Mode are all the same value
• A lot of reality has central tendency with relatively
symmetric sides
– T distribution like that too
• Sides slope off a little differently
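The mirror-image property and the peak at the center can be checked numerically from the normal distribution formula. This is a sketch; the function name normal_pdf and the default mean of 0 and standard deviation of 1 are assumptions chosen for the example.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x for mean mu and std deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

peak = normal_pdf(0.0)      # height at the center (the mode)
left = normal_pdf(-1.0)     # one standard deviation below the mean
right = normal_pdf(1.0)     # one standard deviation above the mean

# The two sides match and both sit below the peak.
print(round(peak, 4), round(left, 4), round(right, 4))
```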
Why the Normal Distribution
• One of the first mathematically defined
distributions that was a real good fit
– People developed other formulas and
distributions from calculations done on the
normal distribution
• T distribution and Chi Square Distribution both
result from performing mathematical operations on
samples of a normal distribution
– Normal Distribution was first to press with a
distribution that was heavy at the center and
symmetric
Reality 101 for Statistical
Distributions
• Probably no such thing as a real normal
distribution in life
• Even if there were we almost never count
each and every member of the population so
you’d never know if it was
• Statistical Distributions let us take limited
data – see what it approximately is
– Then use the defined mathematical model to
suddenly know everything about it
Back to Why the Normal
Distribution
• A big part of the real world has central tendency
and is symmetric
• Found that calculations done with a normal
distribution were robust
– Minor lack of fit in real world data doesn’t
change the answers much
– Thus works on almost anything with central
tendency and near symmetry
Most Common Lack of Fit
• Not Symmetric
– Robustness covers a little skewness
– Taking the square root will normalize a few others
– This type of shape can be fit with a distribution adapted from normal called lognormal
• If you take averages of about 25 samples from this, the averages will be approximately normal (averaging normalizes)
• Taking logarithms of the data will make the transformed distribution normal
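Both fixes for a skewed shape can be seen in a small simulation. This is a sketch under stated assumptions: the lognormal parameters, the seed, the sample size, and the simple moment-based skewness helper are all choices made here for illustration.

```python
import math
import random

def skewness(values):
    """Simple moment-based skewness: about 0 for a symmetric distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(417)   # arbitrary fixed seed for repeatability

# Skewed raw data drawn from a lognormal distribution.
raw = [rng.lognormvariate(0.0, 0.5) for _ in range(20000)]

# Fix 1: take logarithms of the data.
logged = [math.log(x) for x in raw]

# Fix 2: average the data in groups of 25.
group_means = [sum(raw[i:i + 25]) / 25 for i in range(0, len(raw), 25)]

skew_raw = skewness(raw)
skew_logged = skewness(logged)
skew_means = skewness(group_means)
print(round(skew_raw, 2), round(skew_logged, 2), round(skew_means, 2))
```

The raw data shows a clear positive skew, the logged data sits close to zero, and the group averages fall in between: averaging pulls the shape toward normal without making it exactly normal.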
Multi-Modal Distributions
These types of distributions are often 3 different normally distributed families overlying each other.
Finding what is causing the three families often helps us to better understand our world.
Uniform Distribution
• All values within some range (which may or
may not be plus or minus infinity) are
equally likely
• Distribution has no central tendency
• Tends to be associated with truly random
events (or at least events where the
underlying cause is eluding our
mathematical modeling)
Characteristics of Uniform
Distribution
• Because all values are equally likely it has
no mode
• Mean is at the center of the range
• Uniform is still symmetric about Mean so
the Median and Mean are the same
• Standard Deviation is the range divided by the
square root of 12, about 0.29 times the range (if
range is infinite obviously that’s not defined)
• Variance is Standard Deviation Squared
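These characteristics can be checked by simulation. For a continuous uniform distribution the standard deviation works out to the range divided by the square root of 12, which is what the sketch compares against. The range of 0 to 100, the seed, and the sample size are assumptions chosen for illustration.

```python
import math
import random
import statistics

low, high = 0.0, 100.0
rng = random.Random(417)   # arbitrary fixed seed for repeatability
draws = [rng.uniform(low, high) for _ in range(50000)]

sample_mean = sum(draws) / len(draws)
sample_median = statistics.median(draws)
sample_sd = statistics.stdev(draws)

# Theoretical values for a continuous uniform distribution.
center = (low + high) / 2
theoretical_sd = (high - low) / math.sqrt(12)

print(round(sample_mean, 1), round(sample_median, 1))
print(round(sample_sd, 2), round(theoretical_sd, 2))
```

The mean and median both land near the center of the range, and the sample standard deviation lands near the theoretical value.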
Binomial Distribution
• Outcomes that are either off or on
– Clearly describes computers and digital data
• Many things either work or they don’t
– Mining dealing with whether our trucks are in
working order
– Water treatment plant – water purification train
is working or not working
– Coin tosses are heads or tails
New Problem
• Can’t talk about means, modes, and medians
because outcome has no continuous distribution
• Want to know what fraction of the outcomes are
“yes”
– P = 0.85 means 85% of members of a binomial population are
positive
• Usually interested in what chances are that we can
take 5 members out of the population and have
them all positive
– Example if I have 5 mining trucks how much of the
time will all 5 be running?
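The truck question can be worked with the binomial probability formula. This sketch assumes each truck is running 85% of the time (borrowing the P = 0.85 figure above) and that trucks break down independently of one another; both are assumptions made for illustration.

```python
import math

def binomial_probability(k, n, p):
    """Chance of exactly k positives out of n independent yes/no trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

p_running = 0.85   # assumed chance that any one truck is running
n_trucks = 5

all_five = binomial_probability(5, n_trucks, p_running)
at_least_four = all_five + binomial_probability(4, n_trucks, p_running)

print(round(all_five, 4), round(at_least_four, 4))
```

With these assumptions all 5 trucks are running a bit under half the time, even though each truck individually is available 85% of the time.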
The Ordinate Problem
• How continuously distributed are our
outcomes?
– Our number line is continuous so at first glance
we almost assumed everything was continuous
• When and what if they are not
• This usually doesn’t take a very smart
statistician to figure out
• Some things are yes or no distributed
– Use binomial distribution model Da!
Some Things are Integer
Distributed
• Continuity really is a function of
observational scale
– According to quantum physics everything is
made of integer numbers of discrete quanta
– At our observation scale the little integer jumps
are perhaps so small we cannot even measure
them
– Many times integer continuity is negligible
What If Integer Continuity is Not
Negligible?
• Happens when we have small numbers or
integer distributed data
– How does one deal with teacher rankings in
classes of 5 students?
• Our scale of observation is integer
• Our sample size is small enough we can’t mask it
• If it was a class of 500 students we could probably
model outcomes rather well as if continuous
• Non-Parametric Statistical Models
Summary of Ideas
• Real world data comes as distributions of
answers, not as single numbers out of one equation
• We can represent these distributions with
mathematical models that fully define how
the data is distributed
– Allows us to approximate things we could
never get enough data to count
• We work on these models and call our work
Statistics