Statistics Overview
Some New, Some Old… Some to come

Science of Statistics
• Descriptive Statistics – methods of summarizing or describing a set of data (tables, graphs, numerical summaries)
• Inferential Statistics – methods of making inference about a population based on the information in a sample

Levels of Measurement
• Nominal: The numerical values just "name" the attribute uniquely; no ordering of the cases is implied.
• Ordinal: Attributes can be rank-ordered; here, distances between attributes do not have any meaning.
• Interval: The distance between attributes does have meaning.
• Ratio: There is always an absolute zero that is meaningful; this means that you can construct a meaningful ratio.
It's important to recognize that there is a hierarchy implied in the level of measurement idea. At each level up the hierarchy, the current level includes all of the qualities of the one below it and adds something new. In general, it is desirable to have a higher level of measurement.

Variables
• Individuals are the objects described by a set of data; may be people, animals or things
• Variable is any characteristic of an individual
• Categorical variable places an individual into one of several groups or categories
• Quantitative variable takes numerical values for which arithmetic operations make sense
• Distribution of a variable tells us what values it takes and how often it takes these values

Correlation
Correlation can be used to summarize the amount of linear association between two continuous variables x and y. A positive association between the x and y variables is indicated by an increase in x accompanied by an increase in y. A negative association is indicated by an increase in x accompanied by a decrease in y.
For more information see http://www.anu.edu.au/nceph/surfstat/surfstat-home/1-4-2.html

Chi-square
A chi square statistic is used to investigate whether distributions of categorical variables differ from one another.
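As a rough illustration, the statistic can be computed from a cross-tabulation of observed counts by comparing each count with the count expected if the two variables were unrelated. The sketch below assumes the usual Pearson form, the sum of (observed - expected)^2 / expected over all cells, and the usual (rows - 1) × (columns - 1) degrees of freedom for a table. The function name and the 2 x 2 table are hypothetical and exist only for this example.

```python
def chi_square_statistic(observed):
    """Return (statistic, df) for a table of observed counts given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count expected in this cell if the row and column variables were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    # Degrees of freedom for an r x c table: (r - 1) * (c - 1).
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Hypothetical 2 x 2 table of counts (illustrative data, not from the slides).
table = [[20, 30],
         [30, 20]]
stat, df = chi_square_statistic(table)
print(f"chi-square = {stat:.2f}, df = {df}")
```

A larger statistic means the observed counts sit further from what independence would predict. Turning it into a p-value requires the chi-square distribution with the matching degrees of freedom, which this sketch leaves out.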
The chi-square distributions, like the t distributions, form a family described by a single parameter, the degrees of freedom. For a table with r rows and c columns:
df = (r - 1) × (c - 1)
For a detailed example, see http://math.hws.edu/javamath/ryan/ChiSquare.html

Hypothesis Testing
Hypothesis testing in science is a lot like the criminal court system in the United States. Consider: how do we decide guilt?
• Assume innocence until "proven" guilty.
• Proof has to be "beyond a reasonable doubt."
• Two possible decisions: guilty or not guilty.
• The jury cannot declare someone innocent.

Statistical Hypotheses
Statistical hypotheses are statements about population parameters. Hypotheses are not necessarily true.
• The hypothesis that we want to prove is called the alternative hypothesis, Ha.
• The hypothesis that contradicts Ha is called the null hypothesis, Ho.
After taking the sample, we must either:
• Reject Ho and believe Ha, or
• Fail to reject Ho because there was not sufficient evidence to reject it.

Type I and II Error
Consider the jury trial:
• If a person is really innocent, but the jury decides (s)he's guilty, then they've sent an innocent person to jail. This is a Type I error.
• If a person is really guilty, but the jury finds him/her not guilty, a criminal is walking free on the streets. This is a Type II error.
In our criminal court system, a Type I error is considered more serious than a Type II error, so we protect against a Type I error to the detriment of a Type II error. This is typically the same in statistics.

Truth          Decision: Reject Ho    Decision: Fail to Reject Ho
Ho is true     Type I Error           OK
Ho is false    OK                     Type II Error

P-value
The choice of alpha, the significance level, is subjective. The smaller alpha is, the smaller the critical region, and thus the harder it is to reject Ho.
The p-value of a hypothesis test is the smallest value of alpha at which Ho would be rejected.
• If the p-value is less than or equal to alpha, reject Ho.
• If the p-value is greater than alpha, do not reject Ho.

Confidence Intervals
Statisticians prefer interval estimates.
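An interval estimate is commonly built as a point estimate plus or minus a critical value times a standard error. The sketch below applies that recipe to a sample mean. It is a minimal illustration under stated assumptions: the function name and the measurements are hypothetical, and it uses standard normal critical values, although t critical values would be the more careful choice for a sample this small.

```python
import math
import statistics


def confidence_interval(sample, critical_value):
    """Point estimate +/- critical value * standard error, for a sample mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    margin = critical_value * standard_error
    return mean - margin, mean + margin


# Hypothetical measurements (illustrative data, not from the slides).
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]

# Standard normal critical values for common confidence levels (an assumption;
# see the note above about t critical values for small samples).
z_critical = {"90%": 1.645, "95%": 1.960, "99%": 2.576}

for level, z in z_critical.items():
    low, high = confidence_interval(sample, z)
    print(f"{level}: ({low:.3f}, {high:.3f}), width = {high - low:.3f}")
```

Running the loop shows the interval getting wider as the confidence level rises, because the critical value grows while the standard error stays the same.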
The degree of certainty that we are correct is known as the level of confidence. Common levels are 90%, 95%, and 99%.
Point Estimate +/- Critical Value * Standard Error
Increasing the level of confidence:
• decreases the probability of error
• increases the critical value
• widens the interval
Increasing n decreases the width of the interval.

Gamma
This is a statistic used in cross-tabulation tables. It is typically viewed as a nonparametric statistic. The Gamma statistic is preferable to Spearman R or Kendall tau when the data contain many tied observations. Gamma is defined in terms of probabilities: specifically, it is computed as the probability that the rank orderings of the two variables agree minus the probability that they disagree, divided by 1 minus the probability of ties. It is basically equivalent to Kendall tau, except that ties are explicitly taken into account. Detailed discussions of the Gamma statistic can be found in Goodman and Kruskal (1954, 1959, 1963, 1972), Siegel (1956), and Siegel and Castellan (1988).
Gamma also tells us about the strength of a relationship. It can be used with ordinal or higher levels of data.
For a more detailed discussion of Lambda, Gamma and Tau, see http://72.14.209.104/search?q=cache:8ZS4_FvVqrgJ:ms.cc.sunysb.edu/~mlebo/_private/Classes/POL501/Lecture%252012.pdf+gamma+AND+lambda+AND+tau+AND+statistics&hl=en&gl=us&ct=clnk&cd=39

Considering Bias
A sample is expected to mirror the population from which it comes; however, there is no guarantee that any sample will be precisely representative of that population. The difference between the sample and the population is referred to as bias.

Sampling Bias
Sampling bias is a tendency to favor selecting people who have a particular characteristic or set of characteristics. It is usually the result of a poor sampling plan. The most notable form is nonresponse bias, which arises when people with specific characteristics have no chance of appearing in the sample.

Non-Sampling Error
In surveys of personal characteristics, unintended errors may result from:
• The manner in which the response is elicited
• The desire of the persons surveyed to give socially desirable answers
• The purpose of the study
• The personal biases of the interviewer or survey writer

Enjoy the exploration!
Questions or comments?