Statistics Overview
Some New, Some Old… Some to come
Science of Statistics

Descriptive Statistics – methods of summarizing or
describing a set of data: tables, graphs, numerical
summaries

Inferential Statistics – methods of making inference
about a population based on the information in a
sample
Levels of Measurement

Nominal: The numerical values just "name" the attribute
uniquely; no ordering of the cases is implied.

Ordinal: Attributes can be rank-ordered; here, distances
between attributes do not have any meaning.

Interval: The distance between attributes does have meaning.

Ratio: There is always an absolute zero that is meaningful;
this means that you can construct a meaningful ratio.

It's important to recognize that there is a hierarchy implied
in the level of measurement idea. At each level up the
hierarchy, the current level includes all of the qualities of
the one below it and adds something new. In general, it is
desirable to have a higher level of measurement.
Variables

Individuals are the objects described by a set of data;
may be people, animals or things

Variable is any characteristic of an individual

Categorical variable places an individual into one of
several groups or categories

Quantitative variable takes numerical values for
which arithmetic operations make sense

Distribution of a variable tells us what values it takes
and how often it takes these values
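
As a small illustration of a distribution, the sketch below tabulates which values a categorical variable takes and how often. The data and the function name `distribution` are hypothetical and are not part of the original slides.

```python
from collections import Counter

# Hypothetical categorical data: eye colour of ten individuals.
eye_colour = ["brown", "blue", "brown", "green", "brown",
              "blue", "brown", "hazel", "blue", "brown"]

def distribution(values):
    """Return {value: (count, relative frequency)} for a list of observations."""
    counts = Counter(values)
    n = len(values)
    return {value: (count, count / n) for value, count in counts.items()}

dist = distribution(eye_colour)
for value, (count, share) in sorted(dist.items()):
    print(f"{value:<6} count={count}  share={share:.2f}")
```

The same function works for a quantitative variable, although continuous values are usually grouped into intervals first.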
Correlation

Correlation can be used to summarize the amount of linear
association between two continuous variables x and y.

A positive association between the x and y variables is
indicated by an increase in x accompanied by an increase in y.

A negative association is indicated by an increase in x
accompanied by a decrease in y.

For more information see
http://www.anu.edu.au/nceph/surfstat/surfstat-home/1-4-2.html
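
One common way to measure linear association is the Pearson correlation coefficient r. The slides do not name a specific coefficient, so treating "correlation" as Pearson's r is an assumption here, and the function name `pearson_r` is illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# y rises with x: positive association, r close to +1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
# y falls as x rises: negative association, r close to -1.
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
```

Values of r near +1 or -1 indicate a strong linear association, and values near 0 indicate little linear association.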
Chi-square

A chi square statistic is used to investigate whether
distributions of categorical variables differ from one
another.

The chi-square distributions, like the t distributions,
form a family described by a single parameter,
degrees of freedom. For a table with r rows and c columns:
df = (r – 1) × (c – 1)

For a detailed example, see
http://math.hws.edu/javamath/ryan/ChiSquare.html
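
A minimal sketch of the chi-square statistic for a table of observed counts, assuming the usual test of independence in which each expected count is (row total × column total) / grand total. The table values and the function name `chi_square` are hypothetical.

```python
def chi_square(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Hypothetical 2 x 2 table: rows are groups, columns are response categories.
observed = [[20, 30],
            [30, 20]]
statistic, df = chi_square(observed)
print(statistic, df)
```

The statistic is then compared with the chi-square distribution that has the computed degrees of freedom.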
Hypothesis Testing

Hypothesis testing in science is a lot like the criminal court
system in the United States… consider – How do we decide
guilt?

Assume innocence until "proven" guilty.

Proof has to be "beyond a reasonable doubt."

Two possible decisions: guilty or not guilty
• Jury cannot declare someone innocent
Statistical Hypotheses

Statistical Hypotheses are statements about population
parameters.

Hypotheses are not necessarily true.

The hypothesis that we want to prove is called the
alternative hypothesis, Ha.

The hypothesis that contradicts Ha is called the null
hypothesis, Ho.

After taking the sample, we must either reject Ho and
believe Ha, or fail to reject Ho because there was not
sufficient evidence to reject it.
Type I and II Error

Consider the jury trial…

If a person is really innocent, but the jury decides (s)he's guilty, then
they've sent an innocent person to jail.
This is a Type I error.

If a person is really guilty, but the jury finds him/her not guilty, a criminal
is walking free on the streets.
This is a Type II error.

In our criminal court system, a Type I error is considered more
serious than a Type II error, so we protect against a Type I error at
the cost of a higher risk of a Type II error. This is typically the same in statistics.

                      Truth
Decision              Ho is true      Ho is false
Reject Ho             Type I Error    OK
Fail to Reject Ho     OK              Type II Error
P-value

The choice of alpha is subjective.

The smaller alpha is, the smaller the critical region. Thus,
the harder it is to Reject Ho.

The p-value of a hypothesis test is the smallest value of
alpha such that Ho would have been rejected.

If P-value is less than or equal to alpha, reject Ho.

If P-value is greater than alpha, do not reject Ho.
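
The decision rule above can be sketched as follows. The two-sided z test used to produce a p-value is an assumption chosen for illustration (the slides do not specify a test), and the function names are hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P-value of a two-sided z test: probability of a result at least as
    extreme as z under the standard normal distribution (assumed test)."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha):
    """Apply the rule: reject Ho when the p-value is at most alpha."""
    return "reject Ho" if p_value <= alpha else "fail to reject Ho"

p = two_sided_p_value(2.2)   # hypothetical observed test statistic
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha={alpha:.2f}  p={p:.4f}  ->  {decide(p, alpha)}")
```

Running the loop with several values of alpha shows the point made above: the smaller alpha is, the harder it is to reject Ho for the same data.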
Confidence Intervals

Statisticians prefer interval estimates.

Point Estimate +/- Critical Value * Standard Error

The degree of certainty that we are correct is
known as the level of confidence.

Common levels are 90%, 95%, and 99%.

Increasing the level of confidence:
• decreases the probability of error
• increases the critical point
• widens the interval

Increasing n decreases the width of the interval.
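
A minimal sketch of the interval formula for a sample mean. It assumes a critical value taken from the standard normal distribution, which is reasonable for large samples; for small samples a t critical value would normally be used instead. The data and the function name `mean_confidence_interval` are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, level=0.95):
    """Point estimate +/- critical value * standard error for a sample mean."""
    n = len(sample)
    point_estimate = mean(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    # Normal critical value (assumption): e.g. about 1.96 for level=0.95.
    critical_value = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = critical_value * standard_error
    return point_estimate - margin, point_estimate + margin

# Hypothetical measurements.
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
for level in (0.90, 0.95, 0.99):
    low, high = mean_confidence_interval(sample, level)
    print(f"{level:.0%} interval: ({low:.3f}, {high:.3f})")
```

The printed intervals widen as the level of confidence rises, and repeating the calculation with a larger sample narrows them.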
Gamma

This is a statistic used in cross-tabulation tables.

Typically viewed as a nonparametric statistic.

The Gamma statistic is preferable to Spearman R or Kendall
tau when the data contain many tied observations. Gamma
is defined in terms of probabilities: it is the probability
that the rank orderings of the two variables agree minus the
probability that they disagree, divided by 1 minus the
probability of ties.

It is basically equivalent to Kendall tau, except that ties are
explicitly taken into account.

Detailed discussions of the Gamma statistic can be found in
Goodman and Kruskal (1954, 1959, 1963, 1972), Siegel
(1956), and Siegel and Castellan (1988).
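
The definition above can be sketched by counting pairs of observations. With C concordant pairs (the orderings agree) and D discordant pairs (the orderings disagree), dividing by 1 minus the probability of ties reduces to Gamma = (C - D) / (C + D). The function name and the example data are hypothetical.

```python
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    """Goodman-Kruskal gamma for two equal-length sequences of ordinal scores."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1      # both variables order the pair the same way
        elif product < 0:
            discordant += 1      # the variables order the pair in opposite ways
        # product == 0: the pair is tied on x or y and is left out
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical ordinal ratings of the same four cases on two scales.
print(goodman_kruskal_gamma([1, 2, 3, 4], [1, 3, 2, 4]))
```

This pairwise loop takes time proportional to the square of the number of observations, which is fine for illustration but slow for large data sets.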
Gamma

This statistic also tells us about the strength of a
relationship.

Can be used with ordinal or higher level of data.

For a more detailed discussion of Lambda, Gamma and
Tau, see
http://72.14.209.104/search?q=cache:8ZS4_FvVqrgJ:ms.cc.sunysb.edu/~mlebo/_private/Classes/POL501/Lecture%252012.pdf+gamma+AND+lambda+AND+tau+AND+statistics&hl=en&gl=us&ct=clnk&cd=39
Considering Bias

A sample is expected to mirror the population from which it comes;
however, there is no guarantee that any sample will be precisely
representative of that population. A systematic difference
between the sample and the population is referred to as bias.

Sampling Bias
A tendency to favor selecting people who have a particular characteristic
or set of characteristics. Sampling bias is usually the result of a poor
sampling plan. The most notable is nonresponse bias, which occurs when
people with specific characteristics have no chance of appearing in the
sample.

Non-Sampling Error
In surveys of personal characteristics, unintended errors may result from:
• The manner in which the response is elicited
• The social desirability of the persons surveyed
• The purpose of the study
• The personal biases of the interviewer or survey writer
Enjoy the exploration!

Questions or comments?