Introductory Statistical Concepts

Disclaimer
– I am not an expert SAS programmer.
– Nothing that I say is confirmed or denied by Texas A&M University.

Why Are We Here?
• Deming
  – To Learn
  – To Have Fun
Question: Who was Deming?

Poll: What type of organization do you work for?
• Business
• Government
• Education
• Nonprofit
• Other

Purpose of These Lectures
• A review of the statistical concepts used in most of the SAS Analytics Lecture Series.
• We will look at questions such as the following:
  – What is the nature of statistical analyses?
  – Why are population parameters so important?
  – What is really being tested when you see a p-value?
  – Why does regression handle missing data so well?
  – What are residual analyses?

Descriptive Statistics

The Population (Very important concepts)
• Variable of Interest
• The Distribution
• Parameters: Mean, Median, Mode, Range, Variance, etc.

Learning Outcomes
• You will learn
  – basic statistical concepts
  – the definitions of mean, median, mode, and standard deviation
  – the difference between populations and samples
  – the difference between parameters and estimates
  – about confidence intervals
  – how to test a statistical hypothesis
  – how to run a regression analysis

Parameters
• Characteristics of the variable of interest
• It is how we describe the variable of interest
• Parameters are unknown

Parameters (Characteristics)
• Central Tendency
  – Mode
  – Median
  – Mean
• Measures of Variability
  – Range
  – Variance
  – Standard Deviation

Variability
Change in the Data

What is an Index?
How SUNNY is SUNNY?
The UV Index

Air Quality Index
What Does It Mean?

DOW JONES INDUSTRIAL AVERAGE INDEX
What does 10,971.16 really mean?
Which is "better": a DJIA of 10,000 or a DJIA of 12,000?
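The characteristics listed on the Parameters slide (mode, median, mean, range, variance, standard deviation) can be computed for a sample with Python's standard `statistics` module. This is only a minimal sketch: the four data values are made up for illustration, and the results are sample estimates, not the unknown population parameters.

```python
import statistics

# Hypothetical sample, chosen only so the arithmetic is easy to follow.
sample = [1, 4, 4, 7]

summary = {
    # Central tendency
    "mean": statistics.mean(sample),
    "median": statistics.median(sample),
    "mode": statistics.mode(sample),
    # Variability
    "range": max(sample) - min(sample),
    "variance": statistics.variance(sample),  # sample variance, divides by n - 1
    "std_dev": statistics.stdev(sample),      # square root of the sample variance
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that `statistics.variance` and `statistics.stdev` divide by n - 1, which is the convention for estimating a population variance from a sample.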
Variability Index
• A Simple One
• Find the Largest Value
• Find the Smallest Value
• Let Range = R = Largest – Smallest

A More Complex Variation Index
• The Standard Deviation, or S or s
• Statisticians use this index to indicate variability
• You will see it written as σ (for a population) or s (for a sample)
• Widely available from SAS, Excel, and other statistical packages

Details of the More Complex Index
• Example: Suppose that we observe the following three numbers: 1, 4, 7
• The mean of these numbers is (1+4+7)/3 = 4
• We now subtract the mean from each number, square each difference, and add up the squares:
  (1-4)*(1-4) + (4-4)*(4-4) + (7-4)*(7-4) = 18
• The Standard Deviation = sqrt(18/2) = 3, where the divisor 2 is the number of observations minus one

What does this Mean?
• By itself, it may be confusing to some.
• When comparing populations, we can use it to say which population varies the most.
• Let us look at an applet

Using Graphs to Determine Variability
• Box Plot
[Figure: box plots of Total Violent Crime by State (CALIFORN, NEW_YORK); N = 35 for each state]

Distributions

Known Distribution
• With a known distribution, we know the following:
  – the shape
  – the mean
  – the variability (standard deviation)
  – and/or some other information

Classical Distributions: Normal

Normal: Overlay

Classical Distributions: Uniform

Survey
• The following are called parameters of the population:
  – mean, median, mode
  – variance, standard deviation, range, inter-quartile range (IQR)
• In general, are these known or unknown?
  – Known = yes (select using your seat indicator)
  – Unknown = no (select using your seat indicator)

MPG: Histogram
Compare with "true" values!

Simulated Sample
• In this example, we simulated taking a sample of size 1000 from one population of cars weighing 3000 pounds, whose mpg is normally distributed with mean = 24 and standard deviation = 1.
• You can practice this after class.

Section 1.2 Populations and Samples

Objectives
– Understand the relationships between
  • populations and samples
  • parameters and estimates.
– Look at an overview of hypothesis testing.

Population
Parameters: Mean, Variance, Median, Mode, Distribution, …

Example
• Mpg of American-made cars that weigh between 2000 and 3500 pounds and were built in the 1970s.
• Parameters: mean, variance, and so on
• In general, we do not know the parameters.

Purpose of Statistical Analyses
– Estimate the parameters. (Make guesses.)
  • Example: What is the population mean?
– Test hypotheses about the parameters. (Ask questions.)
  • Example: Is the population mean = 30 mpg?

Role of Samples
– Taking a sample of the population enables you to
  • make estimates of the population parameters
  • answer the questions about the population parameters.

Population and Sample
• Population parameters: Mean, Variance, Median, Mode, Distribution, …
• Sample: Sample mean, Sample variance
• Inference: Estimates, Tests of hypotheses

Example: cars_american
• This is a sample of American-made cars that weigh between 2000 and 3500 pounds and that were built in the 1970s.
• We are interested in the mpg.
• Use summary statistics to analyze the data.

Results of Summary Statistics

Results of Histogram

Results of Histogram (continued)

Sampling Distribution Applet
sampling_dist
• This demonstration illustrates how to estimate and plot the sampling distribution of various statistics.

Demo: Sampling Distributions Applet
http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.h...

Confidence Intervals on the Population Mean
• Level of Comfort
  – 50% {21.57 to 22.21}
  – 95% {20.96 to 22.82}
  – 99.9% {20.30 to 23.48}
• What does this mean?

Test That the Population Mean = 30 mpg
• Use the one-sample t-test
• Requirements for running this test:
  – Large n (n > 35)
  – Or the leftovers are normally distributed
• What is the p-value or sig value?
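The confidence interval and the test of mean = 30 mpg described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the SAS procedure used in the lecture: the 40 mpg values are made up, the name `one_sample_test` is hypothetical, and both the p-value and the interval use a large-sample normal approximation in place of the exact t distribution, which is only reasonable under the "large n" requirement.

```python
import math
import statistics
from statistics import NormalDist

def one_sample_test(sample, mu0, alpha=0.05):
    """Test H0: population mean == mu0 (two-sided), large-sample approximation."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)           # sample standard deviation (n - 1)
    std_err = s / math.sqrt(n)             # standard error of the sample mean
    t_stat = (x_bar - mu0) / std_err       # test statistic
    # Assumption: n is large, so the standard normal stands in for the t distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (x_bar - z * std_err, x_bar + z * std_err)
    return {"mean": x_bar, "t": t_stat, "p": p_value,
            "ci": ci, "reject": p_value < alpha}

# Hypothetical sample of 40 mpg readings (values 20..24, eight of each).
sample_mpg = [20 + (i % 5) for i in range(40)]

result = one_sample_test(sample_mpg, mu0=30)
print(result)
```

If `result["p"]` is below the chosen alpha, the hypothesis that the population mean is 30 mpg is rejected; equivalently, 30 falls outside the confidence interval at that level.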
Testing Mean = 30
$H_0: \mu_{mpg} = 30$
$H_A: \mu_{mpg} \neq 30$

Conclusions of the Test
• Choose an alpha level, usually alpha = .05.
• If sig < alpha, then reject the null hypothesis.
• Otherwise, fail to reject.

Sig and p-values
• When you see a sig value or p-value:
  – You know that some hypothesis is being tested.
  – You know whether or not the hypothesis is being rejected.
  – You probably do not know what the hypothesis really is.
• Ask yourself these questions:
  – What are the population parameters being tested?
  – How is what is being tested related to those parameters?

Requirements for Doing This Test
• Large n (n > 35)
• Or the leftovers are normally distributed.
• Use a histogram to check for normality.

Populations: Which Ones are Similar?
• Take samples.

Take Samples
• Use the samples to answer this question:
• "Which populations are similar?"
• Statistical translation:
• "Which populations are similar?" is the same as asking…
• Are the following the same:
  – distribution?
  – mean?
  – variance?

Background/Requirements
• Before we jump into the analysis, we must ask the following questions:
  – How many populations are there?
  – How many population parameters are we interested in and what are they?
  – What tests do we want to do, and what are the requirements for doing those?
  – Are we using everything we "know"?

Example
• Suppose that we are interested in the mpg of American and European cars. How many populations are there?
• American Cars: Mpg (Distribution, Mean, Variance)
• European Cars: Mpg (Distribution, Mean, Variance)

Poll: How many populations are there?
• One - MPG
• Two - American and European
• Depends on the sample size

Parameters
                       Population 1     Population 2
                       American Cars    European Cars
Variable of interest   mpg              mpg
Distribution           Normal?          Normal?
Mean                   $\mu_A$          $\mu_E$
Variance               $\sigma_A^2$     $\sigma_E^2$

Analyses
1. We want to look at the distributions.
2. We want to estimate the parameters.
3. We want to answer these questions:
  • Are the population means the same?
  • Are the population variances the same?

Example: Our Data Set
car_am_eu
• Suppose that we are interested in the mpg of American and European cars.
• American Cars: Mpg (Distribution, Mean, Variance), Sample
• European Cars: Mpg (Distribution, Mean, Variance), Sample

Results from the Sample

Results: Tests of Normality (Miles per Gallon)
                     Kolmogorov-Smirnov (a)
Country of Origin    Statistic    df     Sig.
American             .110         248    .000
European             .111         70     .033
a. Lilliefors Significance Correction

Box Plots
[Figure: box plots for American and European cars]

Histograms
[Figure: histograms for American and European cars]

Poll: Are the populations the same?
• Yes
• No

Conclusion Based on Sample Numbers and Graphs
• Easy: based on the samples, the populations are different (no statistical jargon needed).
• But I must have a p-value for my boss, for my paper, and so on.

Formal Tests
• The classical approach to determining whether two populations are the same is to test whether the two population means are equal: $H_0: \mu_A = \mu_E$
• But first we check whether the two population variances are equal: $H_0: \sigma_A^2 = \sigma_E^2$

Formal Tests
• We use the two-sample t-test.
[Figure: two-sample t-test output, annotated with "Test 1" and "Test 2"]

Section 1.3 Simple Linear Regression

Objectives
– Identify the following:
  • the population parameters
  • the appropriate model
  • the number of populations sampled
  • the correct hypotheses
  • what should be tested for normality
  • what "equal variances" means.

MPG Example
[Figure: four populations of cars, with Weight = 2300, 2600, 2900, and 3000, each with its own mean $\mu_i$ and variance $\sigma_i^2$]
Take a sample of size 1 from each population!

Data
• We would be in deep trouble with a sample of size 1 from each population.
• We have eight unknown population parameters.
• Can you name them?
• But what do we "know"?

Survey
• Name the population parameters.
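One way to write out the count behind "eight unknown population parameters", assuming (as the MPG example suggests) that each of the four weight groups has its own mean and its own variance:

```latex
\underbrace{\mu_1,\ \mu_2,\ \mu_3,\ \mu_4}_{\text{4 means}}
\quad + \quad
\underbrace{\sigma_1^2,\ \sigma_2^2,\ \sigma_3^2,\ \sigma_4^2}_{\text{4 variances}}
\quad = \quad 8 \ \text{unknown parameters, but only 4 observations}
```

With one observation per population there are more unknowns than data points, which is why extra "known" structure is needed before anything can be estimated.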
Essential Part and Leftovers
• We want to "model" the data as follows:
• MPG = Essential Part + Leftover
• or
• MPG = Mean + Leftover

"Know" or Assumptions
• First, we "know" that $\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma_4^2 = \sigma^2$.
• Second, each population mean is related to weight by the following: $\mu_i = a + b \cdot \text{weight}_i$
• The population means fall on a straight line!
• How many unknowns are there now?

Poll: How many unknowns are there?
• 1
• 2
• 3
• 4
• 5
• n

Graph

Observed, Essential Part, Leftover

The Official Regression Model
• mpg = a + b*weight + leftover
• or mpg = a + b*weight + error
• or $\text{mpg} = a + b \cdot \text{weight} + \varepsilon$
• or $\text{mpg} = \beta_0 + \beta_1 \cdot \text{weight} + \varepsilon$
• The errors are "known" to be normal with mean 0 and variance $\sigma^2$.

Main Assumptions
• The means of the populations fall on a straight line.
• All of the variances are equal ($\sigma^2$).
• The errors are "known" to be normal with mean 0 and variance $\sigma^2$.

Assumptions for Simple Linear Regression
Appendix A
• This demonstration illustrates the fundamental concepts of simple linear regression.

Demo: Linear.doc

The Principle of Least Squares: How Can We Estimate the Unknown Parameters?
• Let $\text{leftover}_i = \text{mpg}_i - (\text{essential part})_i$
• or $\text{leftover}_i = \text{mpg}_i - (a + b \cdot \text{weight}_i)$
• or $r_i = \text{mpg}_i - (a + b \cdot \text{weight}_i)$
• Now, choose a and b so that $r_1^2 + r_2^2 + r_3^2 + r_4^2$ is as small as possible.
• or Minimize $(r_1^2 + r_2^2 + r_3^2 + r_4^2)$.

[Output slides: OUTPUT_0, OUTPUT, OUTPUT_1, OUTPUT_2, OUTPUT_3, OUTPUT_4]

Missing Values
• Suppose that we want to estimate the mean mpg when weight = 2500.
• Predicted (Estimated) Mean MPG = 44.05 − .0078*weight
• Why does this work?

Survey
• Can anyone explain why this works?

Conclusion
– Simple linear regression is very powerful.
– But it is based on assumptions (what we "know").
– We need to check assumptions (residual analyses).
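The least-squares principle and the "missing values" prediction can be illustrated with a short Python sketch. It is not the SAS output from the lecture: the four mpg values are hypothetical (only the four weights come from the MPG example), and the helper names `fit_line` and `sum_sq_leftovers` are made up here. The fit uses the standard closed-form solution for a straight line.

```python
def fit_line(x, y):
    """Return (a, b) that minimize the sum of squared leftovers
    r_i = y_i - (a + b * x_i)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

def sum_sq_leftovers(a, b, x, y):
    """Sum of squared leftovers (residuals) for a candidate line."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

weights = [2300, 2600, 2900, 3000]   # weights from the MPG example
mpg = [26.0, 24.0, 21.5, 20.5]       # hypothetical: one observation per weight

a, b = fit_line(weights, mpg)
leftovers = [yi - (a + b * xi) for xi, yi in zip(weights, mpg)]

# No car of weight 2500 was sampled, but the fitted line still gives an
# estimate of the mean mpg for that population.
predicted_2500 = a + b * 2500

print(f"a = {a:.2f}, b = {b:.4f}")
print(f"estimated mean mpg at weight 2500: {predicted_2500:.2f}")
```

The prediction at weight 2500 works only because of the straight-line assumption: every population mean is tied to the same two unknowns a and b, so data from the observed weights carry information about the unobserved one. Plotting or summarizing `leftovers` is the starting point for the residual analyses mentioned in the conclusion.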