Adam Zaborski – handouts for Afghans
STATISTICS
Statistics is the science of the collection, organization and interpretation of data. Statistics also
provides tools for prediction and forecasting using data and statistical models.
“Statistics”, when it refers to the discipline, is singular.
“Statistics”, when it refers to quantities calculated from a set of data, is plural.
There are three kinds of lies: lies, damned lies, and statistics.
By choosing a certain sample, results can be manipulated. Such manipulations need not be
malicious or devious; they can arise from unintentional biases of the researcher. The graphs
used to summarize data can also be misleading. A difference that is highly statistically
significant can still be of no practical significance.
For instance, suppose we want to start a car engine. We observe that each time the engine fails
to start, the lights also appear to be out. A possible conclusion is that the breakdown of the
lights causes the engine failure. In reality, both phenomena are caused by a low battery level
(and possibly a malfunctioning generator).
A chosen subset of the population is called a sample.
For a sample to be used as a guide to an entire population, it is important that it is truly
representative of that overall population.
Most studies sample only part of a population, and the result is then used to interpret the
null hypothesis in the context of the whole population.
Descriptive statistics summarize the population data by describing what was observed in the
sample, numerically or graphically (e.g. the mean and standard deviation for continuous data
types, the frequency and percentage for categorical data).
Inferential statistics draws inferences about the population: answering yes/no questions,
estimating numerical characteristics, describing associations (correlations), modeling
relationships (regression), extrapolation, interpolation and other modeling techniques.
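As an illustration of the descriptive summaries mentioned above, a minimal Python sketch; the sample values and category labels are made up for illustration only:

```python
import statistics

# Hypothetical sample of a continuous variable (e.g. measured lengths in mm)
lengths = [10.2, 10.4, 10.1, 10.3, 10.2, 10.5, 10.3]

# Descriptive statistics for continuous data: mean and standard deviation
mean_length = statistics.mean(lengths)
sample_sd = statistics.stdev(lengths)   # sample standard deviation, n - 1 in the denominator

# Descriptive statistics for categorical data: frequency and percentage
categories = ["pass", "pass", "fail", "pass", "pass"]
freq = {c: categories.count(c) for c in set(categories)}
perc = {c: 100.0 * n / len(categories) for c, n in freq.items()}

print(mean_length, sample_sd)
print(freq, perc)
```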
A common mistake is to treat a statistical frequency (expressed in % as the unit) as a
probability (also expressed in %). The same unit for both quantities does not mean they are
the same quantity. For example, in statistics 95 % means that something happened 95 times in
100 repetitions. This is slightly different from a 95 % probability of the event.
Accuracy and precision
Do we always need precise measurements of quantities? Of course not. Sometimes qualitative
information or a rough assessment is sufficient.
As an example, consider a case of settlement of a structure. Let us suppose it was caused by
some excavation works nearby. We want to know whether the settlement continues. The answer
can be given by a simple experiment rather than a measurement strictly speaking. For this we can
use marble triangles as the most basic device to monitor wall movement:
Marble triangles as an indicator of a continuously growing crack
If we want a more precise answer, we can use another simple device:
The device installed on a lighthouse.
If we need greater precision, we may use dial gauges or other devices.
The accuracy of a measurement system is the degree of closeness of measurement of a
quantity to its actual (true) value.
The precision, also called reproducibility or repeatability, is the degree to which repeated
measurements under unchanged conditions show the same results.
Although the two words can be synonymous in colloquial use, they are deliberately contrasted
in the context of the scientific method.
A measurement system can be accurate but not precise, precise but not accurate, neither, or
both.
For example, if an experiment contains a systematic error, the accuracy is low. A
measurement system is called valid if it is both accurate and precise. Related terms are bias (a
non-random, systematic effect) and error (random variability).
In addition to accuracy and precision, measurements may also have a measurement resolution,
which is the smallest change in the underlying physical quantity that produces a response in
the measurement. Attention: it is not allowed to increase the resolution of the device beyond
that prescribed; usually, unless stated otherwise (by a caption such as “can be interpolated”), it is
the smallest scale tick.
To better explain accuracy and precision we can use the target analogy:
High accuracy, low precision, and high precision, low accuracy
In this analogy, repeated measurements are compared to arrows that are shot at a target.
Accuracy describes the closeness of the arrows to the bullseye at the target center: arrows that
strike closer to the bullseye are considered more accurate. Precision corresponds to the
clustering of the arrows: a tight group indicates high precision, even if it lies far from the
bullseye. The closer a system's measurements are to the accepted value, the more accurate the
system is considered to be.
A common convention in science and engineering is to express accuracy and/or precision by
means of significant digits. For example:
8 × 10³ m stands for eight thousand meters (no margin stated)
8.0 × 10³ m stands for a margin of 50 meters
8.00 × 10³ m stands for a margin of 5 meters
8.000 × 10³ m stands for a margin of 50 centimeters.
The accuracy can be said to be the “correctness” of a measurement, while precision could be
identified as the ability to resolve smaller differences.
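A minimal sketch of the distinction in Python, using made-up repeated readings: the bias of the mean reading from the assumed true value indicates (lack of) accuracy, while the spread of the readings indicates (lack of) precision.

```python
import statistics

true_value = 8000.0                                    # assumed true length in meters (illustrative)
readings = [8053.0, 8049.0, 8051.0, 8048.0, 8052.0]    # made-up repeated measurements

mean_reading = statistics.mean(readings)
bias = mean_reading - true_value      # systematic offset -> poor accuracy if large
spread = statistics.stdev(readings)   # scatter of repeated readings -> precision

print(f"bias = {bias:.1f} m (accuracy), spread = {spread:.1f} m (precision)")
```

Here the readings are tightly grouped (high precision) but systematically offset by about 50 m (low accuracy).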
Normal distribution
(Gaussian distribution) is a continuous, “bell”-shaped probability distribution with its peak at the
mean. Its probability density function is
$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
where:
$\mu$ – the mean, specifies the position of the curve's central peak
$\sigma^2$ – the variance, specifies the “width” of the curve; some authors use its reciprocal $1/\sigma^2$,
which is called the precision; when the variance is equal to zero, the density function does not
exist: it becomes the Dirac delta function, equal to infinity at the mean and zero elsewhere.
The distribution with mean 0 and variance 1 is called the standard normal distribution: the total
area under the curve is equal to 1, and the ½ in the exponent makes the width of the curve (half
the distance between the inflection points) also equal to one.
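A small sketch evaluating this density in Python (standard library only):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of the normal distribution N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

# Peak of the standard normal density is at the mean, with value 1/sqrt(2*pi) ~ 0.3989
print(normal_pdf(0.0))   # ~0.3989
print(normal_pdf(1.0))   # density one standard deviation from the mean, ~0.2420
```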
Normal distribution – probability density function
The normal distribution is often used to describe, at least approximately, variables that tend
to cluster around the mean. The observational error in an experiment is usually assumed to
follow a normal distribution, and the propagation of uncertainty is computed using this
assumption.
By the central limit theorem, under certain conditions the sum of a number of random
variables with finite means and variances approaches a normal distribution as the number of
variables increases.
Of all probability distributions with a given mean and variance, the normal distribution is the one with the maximum entropy.
The two estimators σ and s differ by having (n − 1) instead of n in the denominator.
When the outcome is produced by a large number of small effects acting additively and
independently, its distribution will be close to normal.
Measurement errors in physical experiments are often assumed to be normally distributed.
Central limit theorem
It states that under certain conditions the sum of a large number of random variables will have
an approximately normal distribution.
The importance of the central limit theorem cannot be overemphasized.
Another practical consequence of the central limit theorem is that certain other distributions
can be approximated by the normal distribution when the sample is large: the binomial, Poisson,
chi-squared and Student's t-distributions.
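A minimal simulation sketch of the theorem in Python: sums of independent uniform variables are compared with the normal 68 % rule (the number of terms and of trials are arbitrary choices).

```python
import random
import statistics

random.seed(1)

n_terms = 30      # number of uniform variables added together
n_sums = 10_000   # number of simulated sums

sums = [sum(random.random() for _ in range(n_terms)) for _ in range(n_sums)]

m = statistics.mean(sums)
s = statistics.stdev(sums)

# For a normal distribution about 68 % of values fall within one standard deviation of the mean
within_one_sd = sum(1 for x in sums if abs(x - m) <= s) / n_sums
print(f"mean = {m:.2f}, sd = {s:.2f}, fraction within 1 sd = {within_one_sd:.3f}")
```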
Mean
- arithmetic mean: the arithmetic average of the values; it is not necessarily the same as the
middle value (median) or the most likely value (mode); for example, mean income is skewed
upwards by a small number of people with very large incomes, so the majority have an income
lower than the mean; the median income is the level at which half the population is below and
half is above; the mode income is the most likely income and favors the larger number of
people with lower incomes
- geometric mean: an average useful for sets of positive numbers that are interpreted
according to their product and not their sum, e.g. rates of growth
- harmonic mean: an average useful for sets of numbers that are defined in relation to some
unit, for example speed
AM, GM, HM satisfy these inequalities:
AM >= GM >= HM.
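A short Python check of the three means and the inequality, for an arbitrary set of positive numbers (math.prod requires Python 3.8+):

```python
import math

values = [2.0, 4.0, 8.0]                        # any set of positive numbers

am = sum(values) / len(values)                   # arithmetic mean
gm = math.prod(values) ** (1.0 / len(values))    # geometric mean
hm = len(values) / sum(1.0 / v for v in values)  # harmonic mean

print(am, gm, hm)       # 4.666..., 4.0, 3.428...
assert am >= gm >= hm   # AM >= GM >= HM
```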
The mean of a population and the sample mean:
- The sample mean of a population is a random variable, not a constant.
- The sample mean may differ from the population mean, especially for small samples, but the
law of large numbers dictates that the larger the size of the sample, the more likely it is that
the sample mean will be close to the population mean.
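A brief Python sketch of this behaviour, drawing samples of increasing size from a known population distribution (uniform on [0, 1], whose true mean is 0.5):

```python
import random
import statistics

random.seed(2)

population_mean = 0.5   # true mean of the uniform [0, 1] distribution

for n in (10, 100, 1000, 10000):
    sample = [random.random() for _ in range(n)]
    sample_mean = statistics.mean(sample)
    print(f"n = {n:5d}  sample mean = {sample_mean:.4f}  |error| = {abs(sample_mean - population_mean):.4f}")
```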
Variance
It describes how far values lie from the mean. A variance has units that are the square of the
units of the variable itself. For this reason the standard deviation is more frequently used.
Standard deviation
It is the square root of the variance and is a widely used measure of variability or dispersion,
showing whether the data points tend to be very close to the mean or are spread out over a
large range of values.
Standard deviation is commonly used to measure confidence in statistical conclusions. SD is,
unlike variance, expressed in the same units as the data.
Population SD is valid for the complete population. Sample SD is valid for a random sample
drawn from a larger population.
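A minimal Python sketch of the two estimators (population SD with n in the denominator, sample SD with n − 1), on a few illustrative readings:

```python
import statistics

data = [1.06, 1.08, 1.03, 1.04, 1.06, 1.07]   # illustrative readings

population_sd = statistics.pstdev(data)   # divides by n      (complete population)
sample_sd = statistics.stdev(data)        # divides by n - 1  (random sample)

print(population_sd, sample_sd)           # the sample SD is slightly larger
```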
The bell curve. Each colored band has a width of one standard deviation (68%, 95%, 99.7%)
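The 68 %, 95 % and 99.7 % figures in the caption can be checked with the normal cumulative distribution function; a sketch using only the standard library:

```python
import math

def standard_normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for k in (1, 2, 3):
    prob = standard_normal_cdf(k) - standard_normal_cdf(-k)
    print(f"P(|X - mu| <= {k} sd) = {prob:.4f}")   # ~0.6827, 0.9545, 0.9973
```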
Confidence interval
Confidence intervals are used to indicate the reliability of an estimate. Increasing the desired
confidence level will widen the confidence interval.
A confidence interval is always qualified by a particular confidence level, usually expressed
as a percentage; thus one speaks of a “95% confidence interval”. The endpoints of this
interval are called confidence limits.
Formally, a 95% confidence interval means that, if the sampling and analysis were repeated
under the same conditions, the interval would include the true value 95% of the time. This
does not imply that the probability that the true value lies in a particular confidence interval is
95%. (That is true for the so-called credible interval of Bayesian statistics.)
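A small simulation sketch of this interpretation in Python: confidence intervals for the mean of a normal population are built repeatedly, and the fraction of intervals that cover the true mean is counted (1.96 is used as the approximate 95 % coefficient for a reasonably large sample).

```python
import math
import random
import statistics

random.seed(3)

true_mean, true_sd = 10.0, 2.0
n, trials = 50, 2000
covered = 0

for _ in range(trials):
    sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
    xm = statistics.mean(sample)
    sx = statistics.stdev(sample)
    delta = 1.96 * sx / math.sqrt(n)          # approximate 95 % half-width
    if xm - delta <= true_mean <= xm + delta:
        covered += 1

print(f"coverage = {covered / trials:.3f}")    # close to 0.95
```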
The calculation of confidence interval requires assumptions about the nature of the estimation
process, for instance, that the distribution of the population from which the sample came is
normal.
The desirable properties are:
- validity (confidence interval should hold)
- optimality (the rule for constructing the confidence interval should make as much use of the
data as possible)
- invariance (independence of data-presentation coordinates)
An example: If one were to roll two dice and get double six (which happens 1/36th of the time,
or about 3%), few would claim this as proof that the dice were fixed, although statistically
speaking one could have 97% confidence that they were. Similarly, the finding of a statistical
link at 95% confidence is not proof, nor even very good evidence, that there is any real
connection between the things linked.
Skewness
It is a measure of the asymmetry of the probability distribution of a variable.
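A minimal sketch of a sample skewness estimate as the standardized third central moment, consistent with the usual definition (the data are illustrative):

```python
import statistics

def sample_skewness(data):
    """Skewness estimated as the mean cubed deviation divided by the cubed standard deviation."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum((x - m) ** 3 for x in data) / (len(data) * s ** 3)

print(sample_skewness([1, 2, 2, 3, 3, 3, 10]))   # positive: long tail to the right
```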
Student’s t-distribution
It is a continuous probability distribution that arises in the problem of estimating the mean of
a normally distributed population when the sample is small.
Student’s distribution arises when (as in nearly all practical statistical work) the population
standard deviation is unknown and has to be estimated from the data.
The problems are generally of two kinds:
1. those in which the sample size is so large that one may treat the data-based estimate of
the variance as if it were certain
2. those in which the sample size is small, so that the uncertainty of the estimated standard
deviation cannot be ignored
The probability density function of the t-distribution resembles the bell shape of a normally
distributed variable with mean 0 and variance 1, except that it is a bit lower and wider. As the
number of degrees of freedom grows, the t-distribution approaches the normal distribution
with mean 0 and variance 1.
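A sketch comparing the two densities in Python; the t density uses the gamma function from the standard library, so no external packages are assumed:

```python
import math

def t_pdf(x, df):
    """Probability density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# At the peak the t density is lower than the normal one and approaches it as df grows
for df in (1, 5, 30, 100):
    print(df, round(t_pdf(0.0, df), 4), "normal:", round(normal_pdf(0.0), 4))
```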
Probability density function
It is often necessary to fix the degrees of freedom at a fairly low value and estimate the other
parameters taking this as given. Some authors report that values between 3 and 9 are often
good choices.
Practical formulae
mean value: $x_m = \dfrac{1}{n}\sum_i x_i$
standard deviation for the sample of the population: $S_x = \sqrt{\dfrac{\sum_i (x_i - x_m)^2}{n-1}}$
confidence level: $\Delta = t(n-1)\,\dfrac{S_x}{\sqrt{n}}$,
where $t(n-1)$ is the coefficient of the Student's t-distribution.
The result, with the expected frequency (in %), lies in the interval $x = x_m \pm \Delta$.
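A direct transcription of these formulas into Python; the 95 % coefficient t(n − 1) is taken from the table further below and passed in as an argument, and the readings are an illustrative subset:

```python
import math

def mean_value(x):
    return sum(x) / len(x)

def sample_std(x):
    xm = mean_value(x)
    return math.sqrt(sum((xi - xm) ** 2 for xi in x) / (len(x) - 1))

def confidence_half_width(x, t_coeff):
    """Delta = t(n-1) * S_x / sqrt(n), with t_coeff taken from the 95 % table."""
    return t_coeff * sample_std(x) / math.sqrt(len(x))

readings = [1.06, 1.08, 1.03, 1.04, 1.06]                 # illustrative subset of readings
xm = mean_value(readings)
delta = confidence_half_width(readings, t_coeff=2.776)    # t for r = n - 1 = 4
print(f"x = {xm:.3f} +/- {delta:.3f}")
```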
The formulae and computation order:
Direct measurement:
mean value $x_m$
standard deviation $S_x$
confidence level $\Delta$
final result $x = x_m \pm \Delta$

Indirect measurement of one variable $y = f(x)$:
mean value $y_m = f(x_m)$
standard deviation $S_y = f'(x_m)\,S_x$
confidence level $\Delta_y$
final result $y = y_m \pm \Delta_y$
Indirect measurement of multiple variables.
For a function of many variables $y = f(x_1, \dots, x_k)$:
mean values $x_{mi}$ for $i = 1, \dots, k$, and $y_m = f(x_{m1}, \dots, x_{mk})$
standard deviations $S_{x_i}$ for $i = 1, \dots, k$
confidence levels $\Delta_{x_i}$ for $i = 1, \dots, k$
confidence interval $\Delta_y = \sqrt{\left(\dfrac{\partial f}{\partial x_1}\,\Delta_{x_1}\right)^2 + \dots + \left(\dfrac{\partial f}{\partial x_k}\,\Delta_{x_k}\right)^2}$
final result $y = y_m \pm \Delta_y$
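A sketch of this propagation rule in Python, using a numerical partial derivative (central difference); the function f and the input values are placeholders for illustration:

```python
import math

def propagate(f, means, deltas, h=1e-6):
    """Delta_y = sqrt(sum_i (df/dx_i * Delta_x_i)^2), derivatives taken numerically at the means."""
    total = 0.0
    for i in range(len(means)):
        plus = list(means); plus[i] += h
        minus = list(means); minus[i] -= h
        dfdxi = (f(plus) - f(minus)) / (2 * h)   # central-difference partial derivative
        total += (dfdxi * deltas[i]) ** 2
    return math.sqrt(total)

# Illustrative function of two variables, e.g. y = x1 * x2
f = lambda x: x[0] * x[1]
means = [2.0, 3.0]
deltas = [0.05, 0.10]

ym = f(means)
dy = propagate(f, means, deltas)
print(f"y = {ym:.3f} +/- {dy:.3f}")   # 6.000 +/- 0.250
```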
TABLE OF STUDENT'S t-DISTRIBUTION (FOR THE CONFIDENCE LEVEL 95 %)

 r   t(r)      r   t(r)     r   t(r)     r   t(r)     r    t(r)
 1   12.706    9   2.262   17   2.110   25   2.060    60   2.000
 2    4.303   10   2.228   18   2.101   26   2.056
 3    3.182   11   2.201   19   2.093   27   2.052   100   1.980
 4    2.776   12   2.179   20   2.086   28   2.048
 5    2.571   13   2.160   21   2.080   29   2.045    ∞    1.960
 6    2.447   14   2.145   22   2.074   30   2.042
 7    2.365   15   2.131   23   2.069
 8    2.306   16   2.120   24   2.064   40   2.021

r = n − 1
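If SciPy is available, the table values can be reproduced from the two-sided 95 % quantile of the t-distribution (a sketch, assuming scipy is installed):

```python
from scipy import stats

# Two-sided 95 % coefficient: t such that P(-t <= T <= t) = 0.95 for r degrees of freedom
for r in (1, 4, 9, 16, 30, 60):
    t_r = stats.t.ppf(0.975, df=r)
    print(f"r = {r:3d}  t = {t_r:.3f}")
```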
Example
Figure: loading scheme of the beam (forces P, cross-section dimensions b and h, lengths a, l, a, bending moment M)
We have 4 measurements during loading and 4 measurements during unloading for each of the
two dial gauges, 16 measurements in total. We calculate the differences (the changes of the
gauges' indications) and note the results:
1.06, 1.08, 1.03, 1.04, 1.06, 1.07, 1.08, 1.09, 1.05, 1.06, 1.05, 1.07, 1.07, 1.07, 1.05, 1.04, 1.06
We have the formulae:
We calculate:
- mean deflection: … mm
- mean Young's modulus: … GPa
- standard deviation for the deflection measurement: S_x = 0.01633 mm
- standard deviation for the Young's modulus measurement: S_E = 3.208 GPa
- confidence interval: … GPa
- final result: … GPa
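The deflection part of this example can be reproduced with the practical formulae above; a sketch in Python using the readings exactly as listed (the conversion to the Young's modulus needs the beam formula, which is not reproduced here, so only the deflection statistics are computed):

```python
import math

readings = [1.06, 1.08, 1.03, 1.04, 1.06, 1.07, 1.08, 1.09,
            1.05, 1.06, 1.05, 1.07, 1.07, 1.07, 1.05, 1.04, 1.06]   # mm, as listed above

n = len(readings)
xm = sum(readings) / n                                          # mean deflection
sx = math.sqrt(sum((x - xm) ** 2 for x in readings) / (n - 1))  # sample standard deviation

t_95 = 2.120                        # t(r) for r = n - 1 = 16 from the 95 % table
delta = t_95 * sx / math.sqrt(n)    # confidence half-width

print(f"x = {xm:.3f} +/- {delta:.3f} mm   (S_x = {sx:.4f} mm)")
```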