Engineering Statistics
Mnge 417
Introduction
©Dr. B. C. Paul 2003

Why Should Engineers Even Care?
• Engineers Design and Plan
  – All heard of significant figures: 15.232176451234 gets called 15.23
  – Is everything built actually 15.23 ft?
    • Are our roof bolt spacings in the field really 5 feet?
  – Much of the profession is built around engineering tolerances: how close do I have to be to make it work?
    • Reality is then (hopefully) a bunch of minor variations around an acceptable answer

Building to Tolerance
• Design says shafts are machined to 1.25 inches +/- some tolerance
• Reality says there is actually a bunch of very similar sizes that are close to 1.25 inches
  – Every so often we will get a dud (we accept the reality but want to minimize frequency)
• Often can’t check all parts for tolerance but can check every so many to make sure the process is under control
  – Sample is a few values collected from a larger population
• Statistical Probability Distribution is a model of this process

Engineers Make Changes as a Means to Improve Things
• Mining Engineering Example
  – Coal production from a face area is critical to costs and competitiveness
  – Make a policy or equipment change: does it work?
    • Will Joy’s new high voltage miner really improve coal production?
    • Will a change in a ventilation pattern really reduce dust violations that limit production?

Cause and Effect Relationships
• Very few real results of anything are just one value
  – Coal production has up and down days
    • Does the new policy, equipment or practice result in more up days of higher value?
  – How many good results do you actually need to see before you can feel confident that it’s not just coincidentally higher (or lower) values?
• Varying effects can be modeled as a probability distribution

Engineering Design Practice
• You have a bunch of equations and formulas that tell you whether something should work.
• Next design step is often to consider that things don’t always work exactly as they should
  – A Mining Truck or a Water Treatment Plant processing train does not work all the time
    • Thus real production is different than the design equation

Modeling
• We build a mathematical model of the situation and then do the math to see if it is going to work for us in the real world
• We may not think of it but most of our engineering design equations are mathematical models that were fit to actual data long ago
  – Newtonian physics (we call them laws now)
  – Darcy’s law and the Bernoulli Equation

How do You Decide if a Mathematical Model Fits What You See?
• Because you usually can’t measure with 100% accuracy, or don’t think of or can’t consider every minor effect
  – Real results tend to be distributed around our potential mathematical models
• Statistical models consider a distribution of answers around an underlying trend

Sometimes you don’t know what is driving a result
• Is absenteeism being driven by work assignments, health, deer season, etc.?
• Statistical models can compare variations to possible causes and help identify what is driving things.

Spatial Relationships
• We take samples of an ore body
  – Do the results mean we have a certain tonnage of ore at a certain grade?
• We use samples to tell us what material to take to the processing plant or waste dump
• We may want to tell our mill operator how much a grade of ore may go up or down
• We can have statistical models built with a spatial or location relationship.

How Statistics Works
• Often trained to think that the answer to real world problems comes out of an equation
• We actually create mathematical models that approximately fit reality and then work off of something predictable
  – The math that is actually used to study mathematical models may be something only a French Mathematician could love
  – A lot of the basic ideas are fairly intuitive

Example
• If I have a random number generator that produces numbers between 1 and 100, what value is most likely?
• If I take 25 of those random numbers what will the average value most likely be close to?

What Did You Assume to Get Those Answers?
• You assumed how those values were distributed
  – You considered what is called a uniform distribution (all numbers are equally likely to come up)
  – Statistics begins with a series of standard mathematical distributions
    • We try to pick one that most nearly matches our reality

Getting Your Answers
• You also assumed that the numbers were taken from that distribution at random
  – i.e. no one is cherry picking any values preferentially to any other
  – One of the reasons that statisticians get so crazy if they think someone is cherry picking the sample
• Root of all Statistics is that you assume reality follows a standard mathematical distribution and the part we see was picked at random from that distribution

How Do We Come Up With What Distribution Closely Resembles Our Reality?
• Process Starts with Figuring Out Which of Our Standard Model Distributions it is
• Three Levels of Effort
• Say “I Believe” and assume one
  – Most commonly done with “Normal Distribution” (“Bell Curve”)
  – Many things tend to be normally distributed
  – Strength of past experience becomes rationale
    • Also have people who do it without having any idea what they have done
  – Standard statistics is built around normal distribution

Levels of Effort
• Level 2
  – Study the distribution to see if we are doing something terrible
  – Common approach is called a “Histogram”
    • It’s a bar graph that we plot our data on so we can look at it
  – Also have things like probability paper where you plot your data and see if you get a straight line

Effort Level 3
• Use statistical techniques to test whether our sample data is like a set that could reasonably be pulled from some standard distribution
  – Often these are our goodness of fit tests
• All three levels of effort have some degree of custom for their use in some practices

Measuring Properties of Distributions
• Put sample data into a standard equation that generates a number
  – Often actually call that number a statistic
  – Measures some property of the distribution that the data was taken from
• Some statistics have obvious tangible meaning
  – Example: Mean, the mathematical average value of the sample or population

Calculating a Mean (or simple average)
• Add up all the numbers and then divide by however many numbers you added
• Example
  – Numbers 5, 10, 15, 20, 25
  – What is the Mean?
• Calculate
  – (5 + 10 + 15 + 20 + 25)/5
  – Numerator totals to 75
  – Denominator is the number of values I put in
  – Divide the total by the number of values put in
  – Answer is 15 (the Mean or Average Value)

Statisticians Need Confusing Ways to Write Equations
• $X_i$ means a sample value
  – The i subscript tells you whether it was the first, second, third etc. sample
    • From the example on the last slide we know $X_2$ was the second number we looked at, which was 10
• Σ means the sum of a series of values
• n means the number of samples considered
• Thus we write the formula for mean as
  – $\frac{\sum_{i=1}^{n} X_i}{n}$
• We of course also have a special symbol for a mean
  – $\bar{X}$

Can Do Problems with Software (in this case SPSS)
• Type in Data
  – Ready to Enter Data
  – Type in the Data
• Command to Analyze
  – Pull Down Analyze Menu
  – Highlight Descriptive Statistics
  – Highlight Frequencies
• Click on Frequencies
  – It gives me a list of Variables to use (this list is tough with only one variable)
  – Click Statistics
  – Highlight the variable and push the arrow to move it into the Use area
• Choose Your Statistics
  – Check off Mean and push Continue
• Click OK on the Frequencies Screen
• Read Off Our Mean at 2.89

More Measurements
• Mode
  – The value that has the greatest chance of coming up
• Example: if I have
  – 10 people who are 5’10”
  – 2 people who are 4’3”
  – 2 people who are 6’10”
  – If you pick a person at random from my group what height will that person most likely be?

More Measures
• Median
  – Half of the values are higher, half are lower
• Mean, Median, and Mode all seem to have somewhat obvious physical meanings
• Other statistics are less obvious
  – Variance
    • A number that comes out of a formula that tells you how spread out the distribution is
  – Square root of variance is Standard Deviation
    • Roughly the typical difference between a sample and the mean value

The Standard Deviation
• Standard Deviation is roughly the typical difference between individual samples and the mean (strictly, a root-mean-square difference)

$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$

What does it mean?
Take each sample number, subtract the average sample value from it, square the result, do this for every number and add up the results, then divide by one less than the number of samples you took, and then take the square root of that value.

As a Practical Matter That’s a Pain
• I have to compute the average before I can do the math for standard deviation
• Alternative Formula

$s = \sqrt{\frac{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n(n - 1)}}$

Tells you to keep track of two numbers
1 - Take each number, square it, and then add the squares up
2 - Take each number and add them up, and then square the total

Getting Standard Deviation
• Statistical Calculators have multiple memories
  – They add up numbers in one memory
  – They square and add up numbers in another
  – They total entries in another
  – They then apply the standard deviation formula
• Of course can also use SPSS

Doing Standard Deviation with SPSS
• Pull Down Analyze
• Highlight Descriptive Statistics
• Highlight and click Frequencies
• Check Off Standard Deviation
• Push Continue
• Push OK on the Frequencies menu
• Read Off the Output: Std is 1.12

Variance is also a measure of how much things differ from their average
• Variance is just the standard deviation squared
• To calculate a variance just do the standard deviation thing without taking the square root at the end
• Of course I could also check off Variance instead of Standard Deviation in SPSS

Types of Distributions
• Idea is that we try to approximate reality with a mathematically defined distribution
  – Then we can use mathematical operations to predict our answers
• Distributions that often fit reality
  – Normal Distribution (developed in 1733)
    • Bell Curve
  – Uniform Distribution
  – Binomial Distribution
  – T Distribution
  – Chi Square Distribution
  – Lognormal Distribution

Derived Distributions
• T distribution, Chi Square, and Lognormal Distributions are all derived from the Normal Distribution for specific types of situations

Normal Distribution
• Shaped Like [figure: bell curve]
• Formula (μ is the mean and σ is the standard deviation)

$Y = f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

Symmetric Distributions with a Central Tendency
• Normal Distribution is the classic example
  – Most of the chances are right near the center of the distribution
    • Frequency drops off to the sides
    • Mode is at the Center of the Distribution
  – Distribution is a mirror image about its center
    • Allows us to just compute one side
    • Median is the Mean is the Mode
• A lot of reality has central tendency with relatively symmetric sides
  – T distribution is like that too
    • Sides slope off a little differently

Why the Normal Distribution
• One of the first mathematically defined distributions that was a real good fit
  – People developed other formulas and distributions from calculations done on the normal distribution
    • T distribution and Chi Square Distribution both result from performing mathematical operations on samples of a normal distribution
  – Normal Distribution was first to press with a distribution that was heavy at the center and symmetric

Reality 101 for Statistical Distributions
• Probably no such thing as a real normal distribution in life
• Even if there were, we almost never count each and every member of the population so you’d never know if it was
• Statistical Distributions let us take limited data
  – See what it approximately is
  – Then use the defined mathematical model to suddenly know everything about it

Back to Why the Normal Distribution
• Big part of Real World is Central Tendency and Symmetric
• Found that calculations done with a normal distribution were robust
  – Minor lack of fit in real world data doesn’t change the answers much
  – Thus works on almost anything with central tendency and near symmetry

Most Common Lack of Fit
• Not Symmetric
  – Robustness covers a little skewness
  – Taking the square root will normalize a few others
  – This type of shape can be fit with a distribution adapted from normal called lognormal
  – If you take averages of about 25 samples from this, the averages will be approximately normal (averaging normalizes)
  – Taking logarithms of the data will make the transformed distribution normal

Multi-Modal Distributions
• These types of distributions are often 3 different normally distributed families overlying each other
• Finding what is causing the three families often helps us to better understand our world

Uniform Distribution
• All values within some range (which may or may not be plus or minus infinity) are equally likely
• Distribution has no central tendency
• Tends to be associated with truly random events (or at least events where the underlying cause is eluding our mathematical modeling)

Characteristics of Uniform Distribution
• Because all values are equally likely it has no mode
• Mean is at the center of the range
• Uniform is still symmetric about the Mean so the Median and Mean are the same
• Standard Deviation is the range divided by √12, about 0.29 times the range (if range is infinite obviously that’s not defined)
• Variance is Standard Deviation Squared

Binomial Distribution
• Outcomes that are either off or on
  – Clearly describes computers and digital data
• Many things either work or they don’t
  – Mining: dealing with whether our trucks are in working order
  – Water treatment plant: water purification train is working or not working
  – Coin tosses are heads or tails

New Problem
• Can’t talk about means, modes, and medians because outcome has no continuous distribution
• Want to know what fraction of the outcomes are “yes”
  – P = 0.85: 85% of members of a binomial population are positive
• Usually interested in what chances are that we can take 5 members out of the population and have them all positive
  – Example: if I have 5 mining trucks how much of the time will all 5 be running?

The Ordinate Problem
• How continuously distributed are our outcomes?
  – Our number line is continuous so at first glance we almost assumed everything was continuous
    • When and what if they are not?
• This usually doesn’t take a very smart statistician to figure out
• Some things are yes or no distributed
  – Use binomial distribution model. Duh!
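
The five-truck question from the binomial slides can be worked as a short calculation. The sketch below is only an illustration: it takes P = 0.85 from the slide and assumes, beyond what the slides state, that each truck runs or breaks down independently of the others. The function name `binomial_pmf` is made up for this example.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Chance of exactly k 'yes' outcomes in n independent yes/no trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_running = 0.85  # fraction of the time a single truck is running (from the slide)
n_trucks = 5

# All five running is the k = n case, which reduces to p ** n.
all_five = binomial_pmf(n_trucks, n_trucks, p_running)

# A related question: at least four of the five running.
at_least_four = sum(binomial_pmf(k, n_trucks, p_running) for k in (4, 5))

print(f"All 5 running:      {all_five:.4f}")
print(f"At least 4 running: {at_least_four:.4f}")
```

Under these assumptions the whole fleet is available a bit less than half the time, even though each truck on its own is available 85% of the time.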

Some Things are Integer Distributed
• Continuity really is a function of observational scale
  – According to quantum physics everything is made of integer numbers of discrete quanta
  – At our observation scale the little integer jumps are perhaps so small we cannot even measure them
  – Many times integer continuity is negligible

What If Integer Continuity is Not Negligible?
• Happens when we have small numbers or integer distributed data
  – How does one deal with teacher rankings in classes of 5 students?
    • Our scale of observation is integer
    • Our sample size is small enough we can’t mask it
    • If it was a class of 500 students we could probably model outcomes rather well as if continuous
• Non-Parametric Statistical Models

Summary of Ideas
• Real world data comes as distributions of answers, not single numbers out of one equation
• We can represent these distributions with mathematical models that fully define how the data is distributed
  – Allows us to approximate things we could never get enough data to count
• We work on these models and call our work Statistics
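
As a closing illustration, the mean, variance, and standard deviation measures from this deck can be computed for the example numbers 5, 10, 15, 20, 25. This is a minimal sketch, not the SPSS procedure shown on the slides: it computes the variance both from the definition (squared differences from the mean, divided by n - 1) and from the running-totals form that only needs the sum and the sum of squares. Variable names are chosen for this example only.

```python
import math
import statistics

samples = [5, 10, 15, 20, 25]  # the numbers from the mean example
n = len(samples)

# Mean: add up the values and divide by how many there are.
mean = sum(samples) / n

# Definition form: squared differences from the mean, divided by n - 1.
variance_definition = sum((x - mean) ** 2 for x in samples) / (n - 1)

# Running-totals form: keep only the sum and the sum of squares,
# the two numbers a statistical calculator tracks in its memories.
sum_x = sum(samples)
sum_x_squared = sum(x * x for x in samples)
variance_shortcut = (n * sum_x_squared - sum_x**2) / (n * (n - 1))

# Standard deviation is the square root of the variance.
std_dev = math.sqrt(variance_definition)

print("mean:", mean)
print("variance (definition):", variance_definition)
print("variance (running totals):", variance_shortcut)
print("standard deviation:", round(std_dev, 3))
print("library cross-check:", round(statistics.stdev(samples), 3))
```

Both variance forms give the same value, which is why the running-totals version is convenient when the data arrive one number at a time.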