Statistics in Research
June 01, 2013
Institute of Space Technology
By Asma Ali, Ph.D.

Define Statistics
Statistics is concerned with scientific methods for
• Collecting
• Organizing
• Summarizing
• Presenting, and
• Analyzing data
so as to draw valid conclusions and make reasonable decisions.

Population & Sample
• Population: the entire group
• Sample: a subset of the population
A population can be finite or infinite. For example:
• The number of bolts produced in a day is a finite population.
• All possible outcomes (heads, tails) in successive tosses of a coin form an infinite population.

Population & Sample (Contd.)
Collecting data for an entire population is often impossible or impractical. For example:
• Average height of men in Pakistan
• Average speed of vehicles on roadways
• Median household income

Variables
A variable is a symbol, such as x, y, or z, that can assume any of a set of values. If a variable can assume only one value, it is called a "constant". If a variable can assume any value between two given values, it is called a "continuous" variable; otherwise it is called a "discrete" variable.
For example, the number of children in a family can assume values such as 0, 1, 2, 3, and so on, so it is a discrete variable. The height of an individual can be 62", 63.8", 65.7", etc., so it is a continuous variable.

Organizing Data
When data is collected in large quantity, it can be looked at as individual pieces of data or grouped into classes for easier comprehension.
Data can be put into an array format, sorted in ascending or descending order, to get a sense of the data. If the data set is large, it can be organized into groups.

Organizing Data (Contd.)
[Table: Frequency Distribution of Heights]

Recap
• Statistics is about collecting, organizing, visualizing, and analyzing data.
• Population: the larger dataset
• Sample: a subset of the larger dataset
• Organizing data: by sorting in order, or by organizing in groups
• There are two types of variables
  • Discrete (1, 2, 3, …)
  • Continuous (1.0, 1.1, 1.02, …)

Data Visualization
Visualizing the pattern in the data.
Using different statistical analysis techniques.

Plotting Graphs
• Scatter Plot
• Line Graph
• Bar Chart
• Pie Chart
• Box Plot
[Figures: scatter plot, line graph, and bar chart with axes Year and Crashes; pie chart; box plot]

Recap
• Data visualization
• Scatter Plot
• Bar Chart
• Pie Chart
• Box Plot

Measures of Central Tendency
Mean (Arithmetic Mean): The mean of 8, 3, 5, 12, and 10 is
Mean = (8 + 3 + 5 + 12 + 10)/5 = 7.6
Mode: For the data set 8, 8, 8, 4, 6, 7, 7, 2, 2, 2, 2, 5, the most frequently appearing number is 2. Therefore the mode of this data is 2.
Median: The median of a set of numbers arranged in order is either the middle value or the mean of the two middle values.
2, 2, 2, 2, 4, 5, 6, 7, 7, 8, 8
Here the middle (sixth) value is 5.

Measures of Dispersion
Variance: Defines the variation, or dispersion, in the data. It measures the magnitude of variation around the mean.
$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
where
S^2 = variance
n = sample size (number of observations)
x̄ = mean
[Figures: a plot against Fitted Value (sec/mile); Running Time Per Mile (sec/mile) against Traffic Flow Rate (veh/hr/lane) for TOD = AM, MD, PM]
Standard Deviation: The square root of the variance.

Recap
• Measures of Central Tendency
  • Mean
  • Mode
  • Median
• Dispersion

Frequency Distribution
A frequency distribution, or histogram, of the data is plotted to see whether the data follows a normal distribution.

Normal Distribution
One of the most common statistical distributions. The normal distribution is known by its characteristic bell-shaped curve.
The normal distribution is a very important statistical pattern that occurs in many natural phenomena, such as height, blood pressure, and the lengths of objects produced by machines.

Normal Distribution (Contd.)
Normal distributions are symmetrical, with a single central peak at the mean of the data. Fifty percent of the distribution lies to the left of the mean and fifty percent lies to the right of the mean.
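The measures of central tendency and dispersion from the slides above can be checked with a few lines of Python. This is a minimal sketch using the standard `statistics` module and the example data sets from the slides. The variance and standard deviation are computed on the five values used for the mean, which is a choice made here for illustration, and the sample (n - 1) form of the variance is assumed.

```python
import statistics

# Data sets taken from the slides.
mean_data = [8, 3, 5, 12, 10]
mode_data = [8, 8, 8, 4, 6, 7, 7, 2, 2, 2, 2, 5]
median_data = [2, 2, 2, 2, 4, 5, 6, 7, 7, 8, 8]

# Central tendency.
mean = statistics.mean(mean_data)        # (8 + 3 + 5 + 12 + 10) / 5
mode = statistics.mode(mode_data)        # most frequent value
median = statistics.median(median_data)  # middle value of the sorted list

# Dispersion (assumption: sample variance, dividing by n - 1).
variance = statistics.variance(mean_data)
std_dev = statistics.stdev(mean_data)    # square root of the variance

print(f"mean={mean}, mode={mode}, median={median}")
print(f"variance={variance:.2f}, standard deviation={std_dev:.2f}")
```

If the population form of the variance (dividing by n) is wanted instead, `statistics.pvariance` and `statistics.pstdev` are the corresponding functions.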
Source: http://en.wikipedia.org/wiki/Standard_deviation

Normal Distribution (Contd.)
Source: http://www.regentsprep.org/Regents/math/algtrig/ATS2/NormalLesson.htm

Standard Error of the Mean
The answers to the two questions are based on the standard error of the mean, σx = σ/√n, where σ is the standard deviation and n is the sample size.
[Figure: population and sample, showing population mean, sample mean, and sample size]
In practice we usually do not know the standard deviation of the population (or of a very large sample), so we may have to estimate it from the sample.
For a confidence interval of
• 90%, Std Error = Interval/1.645
• 95%, Std Error = Interval/1.96
• 99%, Std Error = Interval/2.58
where Interval is taken as the half-width (margin of error) of the confidence interval.

Shape of the Distribution
Kurtosis: Refers to the peakedness or flatness of the distribution.

Shape of the Distribution (Contd.)
Analysts are frequently concerned about how data is distributed between the extremes.
• Symmetrical Distribution (A)
• Skewed Right Distribution (B)
• Skewed Left Distribution (C)
Skewness = 3(mean - median)/standard deviation

Recap
• Histogram
• Normal Curve (about 68% of the data lies within one standard deviation of the mean)
• Standard Error of the Mean (a measure of how far the sample mean is likely to be from the population mean)
• Sample Size
• Shape of Distribution
  • Skewness
  • Kurtosis

Hypothesis Testing
Null Hypothesis, H0: The hypothesis that we formulate or want to test. For example:
• The coin is fair, so the probability of a head is p = 0.5
• All vehicles on a certain road travel < 40 km/hr
• There is no difference between medicine 1 and 2
Alternative Hypothesis, H1: The hypothesis that differs from, or is the opposite of, the null hypothesis. For example:
• The coin is not fair, so the probability of a head is p ≠ 0.5
• Not all vehicles on a certain road travel < 40 km/hr
• There is a difference between medicine 1 and 2

Type I and Type II Errors
• If we reject a hypothesis when it should be accepted, then we commit a Type I error.
• If we accept a hypothesis when it should be rejected, then we commit a Type II error.
• In either case a wrong decision has been made.
• For a test of hypothesis to be good, it should be designed to minimize errors.
• An attempt to decrease one type of error causes an increase in the other type of error.
• The only way to reduce both types of error is to increase the sample size, which is not always possible.

Level of Significance
• In testing a hypothesis, the maximum risk that we are willing to take of making a Type I error is called the Level of Significance or Significance Level.
• The significance level is denoted by α.
• Commonly, a significance level of 0.05 or 0.01 is used.
• If, for example, an α of 0.05 is chosen in designing a decision, then there are about 5 chances in 100 that we would reject the hypothesis when it should be accepted; that is, we are 95% confident that we have made the right decision.
• In such a case we say that the hypothesis has been rejected at the 0.05 significance level, meaning that we could be wrong with probability 0.05.

Standard Normal Distribution
It is a normal distribution with mean 0 and standard deviation 1. The standard normal distribution is denoted z: N[0, 1].
Any value of x on any normal distribution, denoted x: N[µ, σ^2], can be converted to an equivalent value of z on the standard normal distribution.
z = (x - µ)/σx
where
z = equivalent statistic on the standard normal distribution
x = statistic on any arbitrary normal distribution
µ = mean
σx = standard error of the sampling distribution

Hypothesis Testing for Large Sample
If the sample size is greater than 30, the sampling distribution is considered to approximate a normal distribution.
For a large sample we conduct a one-tailed or two-tailed z-test, depending upon the hypothesis.
One-Tailed Test: looks for a definite increase or decrease in the parameter. In a one-tailed test the critical region is on only one side of the curve.

Hypothesis Testing for Large Sample (Contd.)
If our calculated value lies within the red region (the critical region on the curve), we reject the null hypothesis in favor of the alternative hypothesis.
Example: According to an article in a magazine, the work force's average age is decreasing.
A finance company wants to find out whether the average age of its shareholders is also decreasing. A survey conducted a few years ago showed that the average age of the company's shareholders was 55.
The null hypothesis is H0: µ ≥ 55
The alternative hypothesis is H1: µ < 55

Hypothesis Testing for Large Sample (Contd.)
A sample of 250 people was drawn from the shareholders' list and they were contacted for their age. The sample mean was found to be 53 years and the sample standard deviation 12 years. The confidence level for the test was 90% (10% significance level).
σx = σ/√n = 12/√250 = 0.76
z = (53 - 55)/0.76 = -2.63
For a 90% confidence level or 10% significance level, the one-tailed critical value is z = -1.28.
Since -2.63 < -1.28, the null hypothesis is rejected in favor of the alternative hypothesis.

Hypothesis Testing for Large Sample (Contd.)
Critical z-values
Level of significance, α    0.1              0.05             0.01
One-tailed test             -1.28 or 1.28    -1.645 or 1.645  -2.33 or 2.33
Two-tailed test             -1.645 and 1.645 -1.96 and 1.96   -2.58 and 2.58

Hypothesis Testing for Large Sample (Contd.)
The two-tailed test is a statistical test used in inference, in which a given statistical hypothesis, H0 (the null hypothesis), is rejected when the value of the test statistic is either sufficiently small or sufficiently large. H0 is not rejected when -z < x < z.
Source: http://en.wikipedia.org/wiki/Two-tailed_test

Hypothesis Testing for Large Sample (Contd.)
• A telephone mail order company wants to determine whether the average time between telephone orders has changed from 3.8 minutes.
• The company collects a sample of 100 calls. The mean time between these calls is found to be 4.0 minutes, with a standard deviation of 0.5 minutes. The significance level for this hypothesis test is selected to be 2%.
• The null hypothesis is H0: µ = 3.8 minutes
• The alternative hypothesis is H1: µ ≠ 3.8 minutes

Hypothesis Testing for Large Sample (Contd.)
σx = σ/√n = 0.5/√100 = 0.05
For a 98% confidence level or 2% significance level (2/2 = 1% in each tail), z = -2.33 and 2.33
If z < -2.33 or z > 2.33, reject the null hypothesis
z = (4.0 - 3.8)/0.05 = 4.0
Since 4.0 > 2.33, the null hypothesis is rejected in favor of the alternative hypothesis.

Hypothesis Testing for Small Sample, t-test
• If the sample size is less than 30 and the sample follows the normal distribution, conduct a t-test for hypothesis testing.
• For a one-sample t-test, t = (x - μ)/(s/√n), where x is the sample mean, μ is the hypothesized mean, s is the sample standard deviation, and n is the sample size.
• Degrees of freedom: the degrees of freedom (DF) equal the sample size (n) minus one. Thus, DF = n - 1.
• If the magnitude of the calculated t value is greater than the t value from the table, reject the null hypothesis.

Correlation Analysis
Determines the strength of the relationship between variables.
The coefficient of correlation, R, indicates the strength of the relationship.
The sign shows the direction of the relationship.
• A + sign shows a positive relationship, i.e., as the independent variable increases, the dependent variable increases as well.
• A - sign shows a negative relationship, i.e., as the independent variable increases, the dependent variable decreases.
The magnitude of R ranges between 0 and 1. "0" means no relationship exists. "1" means a very strong relationship exists.
Correlation analysis can be conducted between dependent and independent variables, and between independent variables.

Correlation Analysis (Contd.)
Pearson Correlation Analysis
                  Speed 85th  Posted Speed  Lane Width  Median Width  Access Density  Segment Length
                  (mph)       (mph)         (ft)        (ft)          (ft)            (mile)
Speed 85th        1.0         0.97*         0.35*       0.58**        -0.10           0.65**
Posted Speed      0.97*       1.0           0.32        0.55**        -0.068          0.71**
Lane Width        0.35*       0.32          1.0         0.34*         -0.17           0.29
Median Width      0.58**      0.55**        0.33*       1.0           -0.25           0.075
Access Density    -0.10       -0.068        -0.17       -0.25         1.0             0.14
Segment Length    0.66        0.71**        0.29        0.075         0.14            1.0

Regression Analysis
Regression is a statistical modeling technique that determines the relationship between a dependent variable and one or more independent variables.
The independent variable explains the variation in the dependent variable.
A regression model can be linear or non-linear, depending upon the relationship between the dependent and the independent variables.
A regression model can be a simple single variable model or a multiple variable model (Chaterjee, 2000).

Regression Analysis (Contd.)
In a single variable model, the variation in the dependent variable is explained by a single independent variable.
In a multiple variable regression model, two or more explanatory variables are used to predict the response variable. A multiple regression equation takes the following form:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$
where ε is the residual, or error, that cannot be explained by the model.

Regression Analysis (Contd.)
[Figure: observed value Y, estimated value Ŷ, and the residual between them, plotted against Fitted Values of 85th Percentile Speed (mph)]

Regression Analysis (Contd.)
The amount of variation in the dependent variable explained by the independent variable(s) is given by the coefficient of determination, denoted R^2.
The value of R^2 ranges between 0 and 1.
The closer the value of R^2 is to 1.0, the stronger the predictive power of the regression model.
A value of "0" for R^2 means the model has no explanatory power.
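The coefficient of correlation and the coefficient of determination described above can be illustrated with a single variable regression fitted by ordinary least squares. This is only a sketch: the posted speed and 85th percentile speed values below are hypothetical numbers invented for the example, not data from the study in the slides, and the variable names are illustrative.

```python
import math

# Hypothetical observations (illustrative only, not from the slides).
posted_speed = [30, 35, 40, 45, 50, 55]   # independent variable X (mph)
speed_85th = [36, 41, 44, 51, 54, 60]     # dependent variable Y (mph)

n = len(posted_speed)
x_bar = sum(posted_speed) / n
y_bar = sum(speed_85th) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - x_bar) ** 2 for x in posted_speed)
s_yy = sum((y - y_bar) ** 2 for y in speed_85th)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(posted_speed, speed_85th))

# Pearson coefficient of correlation: sign gives direction, magnitude gives strength.
r = s_xy / math.sqrt(s_xx * s_yy)

# Least-squares estimates for Y = b0 + b1 * X + error.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual = observed value minus estimated (fitted) value.
fitted = [b0 + b1 * x for x in posted_speed]
residuals = [y - y_hat for y, y_hat in zip(speed_85th, fitted)]

# Coefficient of determination: share of the variation in Y explained by the model.
ss_res = sum(e ** 2 for e in residuals)
r_squared = 1 - ss_res / s_yy

print(f"R = {r:.3f}, R^2 = {r_squared:.3f}")
print(f"fitted model: Y = {b0:.2f} + {b1:.3f} X")
```

For a single variable linear model with an intercept, R^2 equals the square of the correlation coefficient R, which is a quick consistency check on the calculation.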
Regression Analysis (Contd.)
A regression model is developed with the following assumptions (Crawley, 2003):
• Errors (ε) are normally distributed
• Errors have constant variance
• Explanatory variables are measured without error
• All of the unexplained variation is confined to the response variable
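Looking back at the two large-sample hypothesis tests worked earlier in these slides, the z statistics and critical values can be reproduced with the short sketch below. The helper name `z_statistic` is made up for this example. The critical values come from the standard library's `NormalDist` rather than from the z table.

```python
import math
from statistics import NormalDist

def z_statistic(sample_mean, hypothesized_mean, std_dev, n):
    """Return z = (sample mean - hypothesized mean) / (std_dev / sqrt(n))."""
    standard_error = std_dev / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

std_normal = NormalDist()  # mean 0, standard deviation 1

# One-tailed test (shareholder age): H0: mu >= 55, H1: mu < 55, alpha = 0.10.
z_age = z_statistic(53, 55, 12, 250)
z_crit_age = std_normal.inv_cdf(0.10)           # lower-tail critical value, about -1.28
reject_age = z_age < z_crit_age

# Two-tailed test (time between orders): H0: mu = 3.8, H1: mu != 3.8, alpha = 0.02.
z_calls = z_statistic(4.0, 3.8, 0.5, 100)
z_crit_calls = std_normal.inv_cdf(1 - 0.02 / 2)  # upper critical value, about 2.33
reject_calls = abs(z_calls) > z_crit_calls

print(f"one-tailed: z = {z_age:.2f}, critical = {z_crit_age:.2f}, reject H0: {reject_age}")
print(f"two-tailed: z = {z_calls:.2f}, critical = ±{z_crit_calls:.2f}, reject H0: {reject_calls}")
```

Without intermediate rounding the one-tailed statistic comes out near -2.64. The slides round the standard error to 0.76 first and therefore report -2.63. The decision to reject the null hypothesis is the same either way.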