Transcript

Introductory Statistics

Learning Objectives
After the session the students should be able to:
• Distinguish between different data types
• Evaluate the central tendency of realistic business data
• Evaluate the dispersion of data
• Evaluate test statistics
• Use a test statistic to formulate a business decision using regression analysis

Types of data
• Discrete data (a variable restricted to a fixed set of values)
• Continuous data (a variable measured on a continuous scale)
• These data may be collected (ungrouped) and then grouped together in a particular form so that they can be easily inspected
• But how would we collect data?

Sampling Techniques
• Simple random sampling
• Stratified sampling
• Cluster sampling
• Quota sampling
• Systematic sampling
• Mechanical sampling
• Convenience sampling

Frequency distributions
The following are the ages of a sample of managers:

42 30 53 50 52 30 55 49 61 74
26 58 40 40 40 28 36 30 31 37
32 37 30 32 23 32 58 43 30 29
34 50 47 31 35 26 64 46 40 43
57 30 49 40 25 50 52 32 60 54

How could we represent these data effectively?

Scattering the data
[Figure: scatter diagram of the data]
[Figure: bar diagram of the data]

The histogram
We could group the data into convenient class intervals, thus:

Class range            20-29  30-39  40-49  50-59  60-69  70-79
Central value           24.5   34.5   44.5   54.5   64.5   74.5
Frequency                  6     17     12     11      3      1
Cumulative frequency       6     23     35     46     49     50

and plot these to produce a histogram.

[Figure: histogram of frequency against class central value]

What measures of the central tendency do we have?

Measures of the central tendency
• Mode: the maximum value of the distribution, i.e. the most frequently occurring value (in practice this can be evaluated using a standard formula)
• Median: the central value of a set of data or a distribution.
  It can be evaluated by a standard method using the cumulative frequency distribution (CDF)
• Arithmetic mean: the central value assuming the data are distributed according to an arithmetic progression
• Geometric mean: the central value assuming the data are distributed according to a geometric progression

The mode
• For our data this occurs between 30 and 39 (the modal range)
• The construction shown can be employed to home in on the exact value
• Or the formula:

$$x_{\text{mode}} = L + c\,\frac{l}{l + u}$$

where $L$ = lower boundary, $l$ = lower frequency difference, $u$ = upper frequency difference and $c$ = the class boundary width

[Figure: histogram with the construction locating $x_{\text{mode}}$ in the modal class]

The mode
• Here, for our data, $L = 29.5$, $l = 5$, $u = 1$ and the class boundary width $c = 10$:

$$x_{\text{mode}} = 29.5 + 10 \times \frac{5}{5 + 1} = 29.5 + \frac{50}{6} = 37.83$$

Note: taking $l$ and $u$ from the grouped frequencies of the histogram (6, 17 and 12 for the classes around the mode) gives $l = 17 - 6 = 11$ and $u = 17 - 12 = 5$, hence $x_{\text{mode}} = 29.5 + 10 \times 11/16 \approx 36.4$.

The Median
For our data we could evaluate this quantity in two ways:
• Approximately, by plotting the cumulative frequency diagram
• Via logical inference

[Figure: cumulative frequency diagram (CDF)]
[Figure: frequency histogram with class frequencies 6, 17, 12, 11, 3, 1]

The median is the 25th of the 50 values. Since 23 values lie below 39.5, it falls 2 values into the 40-49 class, which holds 12 values:

$$m = 39.5 + 10 \times \frac{2}{12} \approx 41.17$$

Measures of Dispersion
• The range: largest value minus smallest value

$$R = L - S$$

• Variance: mean square variation from the mean

$$\sigma^2 = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}$$

• Standard deviation: square root of the variance

$$\sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$$

NOTE: $n = \sum f_i$

Use of Computer packages
Example: Given the following data, use a spreadsheet to produce a grouped histogram using 9 bins; also produce a CFD. Hence or otherwise evaluate:
a) Three measures of the central tendency, and
b) Three measures of the dispersion

Decision Processes
This is all very well and good; however, how does this allow us to make research and managerial decisions? To answer this we need to consider the pattern of the data, thus:

[Figure: histogram of the data, horizontal axis from 20.445 to 21.15]

The Normal distribution
• Many sets of data adhere to the normal distribution.
• It is the most important distribution of them all
• It is largely this property that allows us to make (research) management decisions
• The normal distribution is usually written $N(\mu, \sigma^2)$, with $\mu$ the population mean and $\sigma^2$ the variance

Properties of $N(\mu, \sigma^2)$
For any normal curve with mean $\mu$ and standard deviation $\sigma$:
• About 68 percent of the observations fall within one standard deviation of the mean.
• About 95 percent of observations fall within 2 standard deviations of the mean.
• About 99.7 percent of observations fall within 3 standard deviations of the mean.

The Z-Score
This is a formula that allows us to evaluate the probability of an event if we know that a particular population is normally distributed:

$$Z = \frac{X - \mu}{\sigma}$$

Example: If a population is $N(48, 12^2)$, find the probability that some value of $X < 20$.

Solution Protocol
1. Establish hypothesis
2. Evaluate the Z-score
3. Sketch the distribution
4. Evaluate probability

$$P(X < 20 \mid N(48, 12^2)) = p$$
$$Z = \frac{20 - 48}{12} = -2.333$$
$$p = 0.5 - 0.4901 = 0.0099 \approx 1\%$$

[Figure: sketch of the distribution]

Spreadsheet Solution Protocol
1. Establish hypothesis
2. Use normal distribution function
3. Perform check, i.e. use Z-function

$$P(X < 20 \mid N(48, 12^2)) = p, \qquad p = 0.009815 \approx 1\%$$
$$Z = \frac{20 - 48}{12} = -2.333, \qquad p = 0.009815 \approx 1\%$$

Exercise
• Using a z-score: if a population is $N(111, 33.8^2)$, find the probability that some value of $100 < X < 150$.

$$P(a < X < b \mid N(\mu, \sigma^2)) = p, \qquad Z = \frac{X - \mu}{\sigma}$$

Exercise
• Using a z-score and given that the population is $N(37, 4.35^2)$, find the probability that some value of $X > 150$.

$$P(a < X < b \mid N(\mu, \sigma^2)) = p, \qquad Z = \frac{X - \mu}{\sigma}$$

Samples
If we are using a sample of values then, as a consequence of the central limit theorem, the z-score changes to:

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

Example
The mean expenditure per customer at a tire store is £60 and the sd is £6. It is known that the nominal number of customers per day is 40. A new product costs £64; what is the probability of selling such a product per customer?

$$P(a < X < b \mid N(\mu, \sigma^2)) = p$$
$$Z = \frac{64 - 60}{6 / \sqrt{40}} \approx 4.22$$

Try one
In a store, the average number of shoppers is 448, with an sd of 21. What is the probability that 49 shopping hours have a mean between 441 and 446?
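The worked z-score examples above can be checked with a few lines of Python. This is a minimal sketch rather than the spreadsheet method the slides use: it assumes Python 3.8 or later for `statistics.NormalDist`, and the helper name `z_score` is introduced here purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_score(x, mu, sigma, n=1):
    """Z-score of x for a population N(mu, sigma^2).

    With n > 1, x is treated as the mean of a sample of size n,
    so the standard deviation is replaced by sigma / sqrt(n).
    (Hypothetical helper, not part of the slides.)
    """
    return (x - mu) / (sigma / sqrt(n))


# Example from the slides: population N(48, 12^2), find P(X < 20).
z = z_score(20, mu=48, sigma=12)
p_from_z = NormalDist().cdf(z)                  # standard normal, via Z
p_direct = NormalDist(mu=48, sigma=12).cdf(20)  # same probability without Z

print(f"Z = {z:.3f}")
print(f"P(X < 20) = {p_from_z:.6f} (direct: {p_direct:.6f})")

# Tire store example: mean 60, sd 6, sample size n = 40, value 64.
z_sample = z_score(64, mu=60, sigma=6, n=40)
print(f"Z for the sample mean = {z_sample:.3f}")
```

Computing the probability both through Z and directly from the distribution mirrors the "perform check" step of the spreadsheet protocol: the two routes should agree.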
$$P(441 < \bar{X} < 446 \mid N(\mu, \sigma^2)) = p$$
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}, \qquad \frac{\sigma}{\sqrt{n}} = \frac{21}{\sqrt{49}}$$

Regression & Correlation analysis
• A scatter diagram can be used to show the relationship between two variables
• Correlation analysis is used to measure the strength of the association (linear relationship) between two variables
• Correlation is only concerned with the strength of the relationship
• No causal effect is implied with correlation
• Scatter diagrams were presented in the last sessions, as was correlation

Introduction to Regression Analysis
o Regression analysis is used to:
  • Predict the value of a dependent variable based on the value of at least one independent variable
  • Explain the impact of changes in an independent variable on the dependent variable
o Dependent variable: the variable we wish to predict or explain
o Independent variable: the variable used to explain the dependent variable

Simple Linear Regression Model
o Only one independent variable, X
o Relationship between X and Y is described by a linear function
o Changes in Y are assumed to be caused by changes in X

Types of Relationships
[Figure: scatter plots of Y against X illustrating linear relationships and curvilinear relationships]

Types of relationships cont…
[Figure: scatter plots of Y against X illustrating weak relationships and strong relationships]

Types of Relationships
[Figure: scatter plots of Y against X illustrating no relationship]

The regression model

$$y_i = A + B x_i + \varepsilon_i$$

where $y_i$ is the dependent variable, $A$ the population Y intercept, $B$ the population slope coefficient, $x_i$ the independent variable and $\varepsilon_i$ the random error term. $A + B x_i$ is the linear component and $\varepsilon_i$ the random error component.

The regression model
[Figure: the line $y_i = A + B x_i + \varepsilon_i$ on X-Y axes, marking the observed value of Y for $X_i$, the predicted value of Y for $X_i$, the random error $\varepsilon_i$ for this $X_i$ value, the slope ($\beta_1$) and the intercept ($\beta_0$)]

The Least Squares approach
The estimates of the intercept and slope ($b_0$ and $b_1$, written $A$ and $B$ here) are obtained by finding the values that minimize the sum of the squared differences between $Y$ and $\hat{Y}$:

$$\min \sum (y_i - \hat{y}_i)^2 = \min \sum \bigl(y_i - (A + B x_i)\bigr)^2$$

Rendering:

$$B = \frac{S_{XY}^2}{S_{XX}^2}, \qquad A = \bar{y} - B\bar{x}$$

The proof of these requires the calculus.

Regression Formulae
Thus the formulae can be summarized as:

$$y_i = A + B x_i + \varepsilon_i$$
$$S_{XY}^2 = \overline{xy} - \bar{x}\,\bar{y} \quad \text{(the covariance)}$$
$$S_{XX}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2 \quad \text{(the variance)}$$
$$B = \frac{S_{XY}^2}{S_{XX}^2}, \qquad A = \bar{y} - B\bar{x}$$

Where:
• $\overline{xy}$ = mean of $x \times y$
• $\bar{x}\,\bar{y}$ = mean of $x$ × mean of $y$
• $\overline{x^2}$ = mean of $x^2$
• $\bar{x}^2$ = square of the mean of $x$
and of course $\bar{x}$ = mean of $x$, $\bar{y}$ = mean of $y$.

Regression Example
An estate agent wishes to find the relationship between house prices and size. It is suspected that a linear relationship exists between the house price (the dependent variable Y) and the house size in square metres (the independent variable X). Using linear regression, find the relationship and make a prediction of the price of a house measuring 200 m². The following data have been collected by the estate agent.

Regression data

House Price in £k (Y)   Area in m sqr (X)
123                     156
156                     178
140                     189
154                     208
100                     122
110                     172
203                     261
162                     272
160                     158
128                     189

Regression Solution
It is usual to set up a table of results, using an appropriate Excel spreadsheet:

              Area in m sqr (X)   House Price in £k (Y)   X×Y       X×X = X²
              156                 123                     19188     24336
              178                 156                     27768     31684
              189                 140                     26460     35721
              208                 154                     32032     43264
              122                 100                     12200     14884
              172                 110                     18920     29584
              261                 203                     52983     68121
              272                 162                     44064     73984
              158                 160                     25280     24964
              189                 128                     24192     35721
Mean values:  190.5               143.6                   28308.7   38226.3

Regression Solution Cont…
Now we simply apply the formulae as follows; first the regression coefficient, i.e. the gradient:

$$B = \frac{S_{XY}^2}{S_{XX}^2}$$
$$S_{XX}^2 = \overline{x^2} - \bar{x}^2 = 38226.3 - 190.5^2 = 1936.05$$
$$S_{XY}^2 = \overline{xy} - \bar{x}\,\bar{y} = 28308.7 - 190.5 \times 143.6 = 952.9$$
$$B = \frac{952.9}{1936.05} = 0.492$$
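The tabulated means and the gradient above can be reproduced in a few lines of Python. This is a sketch under stated assumptions rather than one of the Excel methods from the handout: the variable names are chosen here for illustration, and the last lines carry the same quantities through $A = \bar{y} - B\bar{x}$ to the 200 m² prediction that the example asks for.

```python
from statistics import fmean

# Data from the estate agent example (areas in m^2, prices in £k).
area = [156, 178, 189, 208, 122, 172, 261, 272, 158, 189]
price = [123, 156, 140, 154, 100, 110, 203, 162, 160, 128]

# Column means of the results table.
x_bar = fmean(area)
y_bar = fmean(price)
xy_bar = fmean(x * y for x, y in zip(area, price))
x2_bar = fmean(x * x for x in area)

# Variance of X and covariance of X and Y (S_XX^2 and S_XY^2 in the slides).
s_xx = x2_bar - x_bar ** 2
s_xy = xy_bar - x_bar * y_bar

# Regression coefficient (gradient) and regression constant (intercept).
B = s_xy / s_xx
A = y_bar - B * x_bar

# Predicted price, in £k, for a 200 m^2 house.
predicted = A + B * 200

print(f"means: x = {x_bar}, y = {y_bar}, xy = {xy_bar:.1f}, x^2 = {x2_bar:.1f}")
print(f"S_XX^2 = {s_xx:.2f}, S_XY^2 = {s_xy:.1f}")
print(f"B = {B:.4f}, A = {A:.3f}")
print(f"predicted price at 200 m^2 = {predicted:.2f} £k")
```

Working from the four column means keeps the script in step with the hand calculation, so each printed value can be compared against the corresponding line of the worked solution.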
Regression Solution Cont…
Then we evaluate the regression constant:

$$A = \bar{y} - B\bar{x} = 143.6 - 0.4922 \times 190.5 \approx 49.838$$

(carrying the unrounded value of $B = 952.9 / 1936.05$).

There are various computer methods available which do these calculations for you; these are detailed in the handout.

Regression computer solution
There are three methods to evaluate the regression coefficient and constant using an Excel spreadsheet. These being:
• Graphical
• Calculation
• Functions

[Figure: scatter plot of House Price (£k) against House size (m sqr) with linear trendline $y = 0.4922x + 49.838$, $R^2 = 0.5784$]

Regression computer solution Cont…
This is an example of the graphical method, which is required for a pass grade in the forthcoming assignment! If you want higher grades, however, you will have to check these answers using the other two methods shown in the handout.

Summary
Have we met our learning objectives? Specifically, are you able to:
• Distinguish between different data types
• Evaluate the central tendency of realistic business data
• Evaluate the dispersion of data
• Evaluate test statistics
• Use a test statistic to formulate a business decision using regression analysis