RA Fisher and Statistics
Paul R. Earl
[email protected]
Facultad de Ciencias Biológicas
Universidad Autónoma de Nuevo León
San Nicolás, NL, 66450, Mexico
A view of the famous biometrician
Sir Ronald Aylmer Fisher (1890-1962) is the
architect of multivariate analysis. BIOMETRICS
of June 1964, Vol 20, No 2 is dedicated to him.
Several books are strongly recommended:
An Introduction to Probability Theory and Its
Applications by William Feller, and Time Series
Analysis: Forecasting and Control by George EP
Box and Gwilym M Jenkins. See also JP Benzecri
(1982) Bordas, Paris. This much information is
enough for a start. Be certain to read R. A. Fisher:
The Life of a Scientist by Joan Fisher Box (Wiley,
New York, 1978). Another enjoyable book is
Choice and Chance by WA Whitworth, 1901,
reprinted in 1942 by GE Stechert, New York.
Statistics poses a severe problem for
many Latin students. The fault may lie in poor or scant
teaching of arithmetic and algebra. Students who
fear mathematics cannot succeed in probability,
statistics and so forth. On the other hand,
students in engineering are likely to do very well with
the internet and math. Those students want to learn!
Even so, the original public school problem
seems to be slow reading.
The first problem is data management. The
students may not know how to enter variables X1,
X2, X3... They might use 10 columns (cols) for sex,
which, coded as 1 = male and 2 = female, uses up ONLY one col.
Suppose X5, written in letters, takes 10-25 cols. Assign
each value a number. Then this variable will need only 1-3 cols.
This all seems very simple, but it's not!
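The coding idea above can be sketched in a few lines of Python. This is only an illustration: the records, the field names and the SEX_CODES codebook are hypothetical, chosen to match the 1 = male, 2 = female coding in the text.

```python
# Hypothetical records: sex stored as free text wastes columns.
records = [
    {"id": 1, "sex": "male"},
    {"id": 2, "sex": "female"},
    {"id": 3, "sex": "Female "},
]

# Codebook from the text: 1 = male, 2 = female (one column is enough).
SEX_CODES = {"male": 1, "female": 2}

def encode(record):
    """Return (id, sex code), tolerating stray spaces and capital letters."""
    return record["id"], SEX_CODES[record["sex"].strip().lower()]

coded = [encode(r) for r in records]
for row in coded:
    print(*row)
```

Keeping the codebook in one place means the same text value always maps to the same number, which is most of what good data entry requires.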
Biostatistics lecture references
See
http://www.pitt.edu/~super1/lecture/lec25191/001.htm
on analysis of variance.
See http://statgen.iop.kcl.ac.uk/bgim/mle/sslike_1.html
on maximum likelihood.
Analysis of variance (Fisher's ANOVA)
Three conceptual models of ANOVA are
1) Fixed-effects model,
2) Random-effects model and
3) Mixed-effects model.
The type of ANOVA used depends mostly on the
number of treatments and levels of the
experiment. One-way ANOVA tests for differences
among independent groups. Factorial ANOVA can help
to explain the effects of 2 or more treatment
variables, often using a 2 x 2 design.
Multivariate analysis of variance (MANOVA) is used with
more than one dependent variable.
The total sum of squares is partitioned into components
related to the effects in the model. The number of
degrees of freedom (df) can also be partitioned, and
specifies the chi-squared distribution that describes
the associated sum of squares.
Source         Sum of squares   df            Mean squares   F
----------------------------------------------------------------------------
Factor A       SSA              a-1           MSA            MSA/MSE
Factor B       SSB              b-1           MSB            MSB/MSE
A x B          SSAB             (a-1)(b-1)    MSAB           MSAB/MSE
Error          SSE              a*b*(r-1)     MSE
----------------------------------------------------------------------------
Total          SST              a*b*r-1
r = repetitions; a and b = number of levels of Factors A and B
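A minimal sketch of how the sums of squares and F ratios of a balanced two-factor design with r repetitions might be computed by hand. The data are hypothetical, and the A x B interaction term is included so that the degrees of freedom add up to a*b*r - 1.

```python
# Hypothetical balanced data: data[i][j] holds the r replicates
# for level i of Factor A and level j of Factor B.
data = [
    [[4.0, 5.0], [6.0, 7.0]],
    [[8.0, 9.0], [13.0, 12.0]],
]
a, b, r = len(data), len(data[0]), len(data[0][0])

def mean(values):
    values = list(values)
    return sum(values) / len(values)

grand = mean(x for row in data for cell in row for x in cell)
mean_a = [mean(x for cell in data[i] for x in cell) for i in range(a)]
mean_b = [mean(x for i in range(a) for x in data[i][j]) for j in range(b)]
mean_ab = [[mean(data[i][j]) for j in range(b)] for i in range(a)]

# Partition of the total sum of squares.
ss_a = b * r * sum((m - grand) ** 2 for m in mean_a)
ss_b = a * r * sum((m - grand) ** 2 for m in mean_b)
ss_ab = r * sum(
    (mean_ab[i][j] - mean_a[i] - mean_b[j] + grand) ** 2
    for i in range(a) for j in range(b)
)
ss_e = sum(
    (x - mean_ab[i][j]) ** 2
    for i in range(a) for j in range(b) for x in data[i][j]
)
ss_t = sum((x - grand) ** 2 for row in data for cell in row for x in cell)

# Matching partition of the degrees of freedom.
df_a, df_b, df_ab, df_e = a - 1, b - 1, (a - 1) * (b - 1), a * b * (r - 1)

# Each F ratio is a mean square divided by the error mean square.
ms_e = ss_e / df_e
f_a = (ss_a / df_a) / ms_e
f_b = (ss_b / df_b) / ms_e
f_ab = (ss_ab / df_ab) / ms_e

print("SS:", ss_a, ss_b, ss_ab, ss_e, "total", ss_t)
print("F:", f_a, f_b, f_ab)
```

Each F ratio would then be compared with the F distribution on its own df and the error df to obtain p.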
Is the F ratio an error term? Let's add probability = p.
What does p = 0.01 mean?
The idea here is merely to give you a start.
Simplicity is a great blessing!
What do you know about quality control,
loss functions and experimentation?
Maximum likelihood estimation (MLE)
MLE is a method of fitting statistical models to
observed data. Assuming that each observation is
statistically independent, the joint probability of the
observed data x1, ..., xn is given by the product of the
individual probabilities:
P(x1, ..., xn | θ) = P(x1 | θ) × P(x2 | θ) × ... × P(xn | θ)
When considered as a function of the model
parameter(s) θ, this is called the likelihood function of
the observed data. The MLEs of the model
parameter(s) maximize the likelihood function or,
equivalently, maximize the log-likelihood function,
and can be calculated by any suitable method.
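As a hedged illustration of the definition above, the sketch below finds the MLE of a success probability p from independent yes/no observations. The data and the crude grid search are assumptions made for the example. Any suitable optimizer could replace the grid.

```python
import math

# Hypothetical data: 1 = success, 0 = failure, assumed independent.
observations = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

def log_likelihood(p, data):
    """Log of the product of the individual Bernoulli probabilities."""
    return sum(math.log(p if x == 1 else 1.0 - p) for x in data)

# Crude numerical maximization over candidate values strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, observations))

# Closed-form MLE for this model: the sample proportion of successes.
p_closed = sum(observations) / len(observations)
print(p_hat, p_closed)
```

Working with the log-likelihood turns the product into a sum, which is easier to compute and has its maximum at the same parameter value.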
PAML
PAML is a package of programs for
phylogenetic analyses of DNA or protein
sequences using MLE. It is maintained and
distributed for academic use free of charge
by Ziheng Yang. ANSI C source code is
distributed for UNIX/Linux/Mac OS X and
executables are provided for MS Windows.
PAML may be useful if you are interested in
the process of sequence evolution.
Some other biostatistics
There are 2 kinds of statistical error,
depending on which hypothesis has been
incorrectly identified as the true state of nature.
A null hypothesis is a speculation to be
nullified or supported. An alternative hypothesis
may prevail. The null hypothesis is presumed
true until tested to be otherwise.
A chi-squared test is any hypothesis test
where the test statistic has a chi-squared
distribution when the null hypothesis is
true, or any in which the probability
distribution of the test statistic (assuming
the null hypothesis is true) can be made
to approximate a chi-squared distribution
as closely as desired by MAKING THE
SAMPLE SIZE LARGE ENOUGH.
Important!! A popular test for small samples is Student's t-test.
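A minimal sketch of a chi-squared test of independence on a 2 x 2 table of counts. The counts are hypothetical. The p-value line relies on the fact that, with 1 degree of freedom, the upper tail of the chi-squared distribution equals erfc(sqrt(chi2 / 2)).

```python
import math

# Hypothetical 2 x 2 table of counts: rows = groups, columns = outcomes.
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts if the null hypothesis of independence is true.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Pearson's statistic: sum of (observed - expected)^2 / expected.
chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2) for j in range(2)
)

# Upper tail probability for 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))
print(chi2, p_value)
```

The approximation behind this test improves as the sample grows, so very small expected counts call for a different test.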
Type I error, or error of the first kind or alpha error
(α error), is a false positive: the null hypothesis was
rejected when it was actually true.
Type II error, or error of the second kind or beta
error (β error), is a false negative: the null hypothesis
was not rejected when the alternative
hypothesis is true.
                     Actual condition
                     True                        False
Test     Positive    True Positive               False Positive (Type I)
result               (i.e. correct result)       (i.e. wrong result)
         Negative    False Negative (Type II)    True Negative
                     (i.e. wrong result)         (i.e. correct result)
The Type I error rate (α) is usually set at 0.05 or 0.01. Refer to the F ratio in
ANOVA.
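What a Type I error rate of 0.05 means can be checked by simulation. In this sketch the null hypothesis is true in every simulated experiment, so every rejection is a false positive. For brevity a two-sided z-test with known standard deviation is used in place of the F test. The sample size, number of trials and seed are arbitrary assumptions.

```python
import math
import random

random.seed(1)  # fixed seed so the run is repeatable

CUTOFF = 1.96   # two-sided 5% critical value of the standard normal
N = 20          # observations per simulated experiment
TRIALS = 4000   # number of simulated experiments

def rejects_null():
    """Sample from N(0, 1), where the null (mean = 0) is true, and z-test it."""
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]
    z = (sum(sample) / N) * math.sqrt(N)
    return abs(z) > CUTOFF

# Fraction of experiments that wrongly reject the true null hypothesis.
type_i_rate = sum(rejects_null() for _ in range(TRIALS)) / TRIALS
print(type_i_rate)  # should land near 0.05
```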
Fisher information
The Fisher information is the variance of the
score, where the score is the derivative of the
log of the likelihood function with respect to θ.
It is the amount of information that a
random variable X carries about an
unobservable parameter θ on
which the likelihood function of X, L(θ) = f(X; θ),
depends. The likelihood function is the
joint probability of the data. As the
expectation of the score is zero, its
variance is simply the second moment of the score.
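For a concrete case, take a single yes/no trial with success probability p (a Bernoulli model, chosen here only as an illustration). The sketch computes the expectation and variance of the score exactly over the two outcomes and compares the variance with the known closed form 1 / (p(1 - p)).

```python
def score(x, p):
    """Derivative of log f(x; p) with respect to p for one Bernoulli trial."""
    return x / p - (1 - x) / (1 - p)

def score_moments(p):
    """Return (expectation, variance) of the score over the outcomes 0 and 1."""
    probs = {1: p, 0: 1 - p}
    mean = sum(prob * score(x, p) for x, prob in probs.items())
    var = sum(prob * (score(x, p) - mean) ** 2 for x, prob in probs.items())
    return mean, var

p = 0.3
expected_score, info = score_moments(p)
closed_form = 1 / (p * (1 - p))
print(expected_score, info, closed_form)
```

The information is largest when p is near 0 or 1 and smallest at p = 0.5, where a single trial tells the least about p.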
The following modified biography of Fisher, mostly
by PC Mahalanobis, appeared in Sankhyā, 4, 1938.
Fisher was born on the 17th of February 1890 in East
Finchley, one of the northern suburbs of London. Love
of mathematics dominated his education. Harrow was a
model school available to Fisher since his family was
rich. The only difficulty was his very bad eyesight, which kept
him out of World War I.
He entered Gonville and Caius College, Cambridge in
1909, and passed the Mathematical Tripos Part II in 1912
as a Wrangler. Under the negative influence of Bateson
against Darwinism, Fisher became keenly interested in
Mendelism. He thus wrote The Genetical Theory of Natural
Selection by 1930, arguing for the survival of Darwinism.
Fundamentals
Exact measurements, compactly tabled, should
make up the experiment. Replication,
randomization and block division are fundamental
statistical elements. Accuracy is the deviation
between the experimental and true results.
Precision is related to the spread of the sample.
This dispersion is given by the standard deviation.
Replication is essential because it is the sole
source of the estimate of error, while
randomization is necessary to guarantee the
validity of the estimate, i.e., to ensure that the
estimate will be unbiased.
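The difference between accuracy and precision can be shown with a handful of replicated measurements. The numbers and the known true value below are hypothetical.

```python
import statistics

# Hypothetical replicated measurements of a quantity whose true value is known.
true_value = 10.0
replicates = [10.2, 9.9, 10.4, 10.1, 10.4]

sample_mean = statistics.mean(replicates)

# Accuracy: deviation between the experimental and the true result.
accuracy_error = sample_mean - true_value

# Precision: spread of the replicates (sample standard deviation, n - 1 divisor).
precision_sd = statistics.stdev(replicates)

print(sample_mean, accuracy_error, precision_sd)
```

Without replication there would be only one value, and no standard deviation could be estimated at all.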
The genetical theory of natural selection
Fisher had been working for a long time on
Mendelism and genetics. His book The
Genetical Theory of Natural Selection was
published in 1930 and constituted a milestone.
The 3 men who founded the theory of
population genetics and thereby Neo-Darwinism
were Sewall Wright, JBS Haldane and RA
Fisher.
More biography
He was appointed Galton professor in the University
of London on the retirement of Karl Pearson in 1933.
Fisher was always interested in eugenics (human
genetics) and was associated with the Eugenics Society, first
as Honorary Secretary and later as Vice-President.
He also took over from Karl Pearson the editorial
charge of the Annals of Eugenics in 1933.
In 1935, Fisher published The Design of Experiments. He
visited the USA in 1936 and received an honorary
degree from Harvard. In 1937, he accepted a
fellowship from the Indian Statistical Institute.
The main currents of Fisher's
work are shown next.
Fisher's work falls naturally into
3 main streams:
-Contributions to the mathematical theory of
statistics
-Application of statistical theory to
agriculture and the design of experiments
-Contributions to genetics.
The theory of sampling distribution
The idea of the random sampling distribution of
statistics is fundamental. Introduce the algebra
of point sets. The probability of the entire
sample space is unity, or P = 1. The earliest
example of the modern type of distribution was
that of χ² (chi-squared) by Karl Pearson in
1900. Several years later Student gave the
correct distribution of the sample variance and
his now famous t-statistic. Student was the
pen name of William S Gosset.
The theory of estimates and statistical inference
Fierce controversy has raged over this subject
since the publication in 1763 of Bayes' An Essay towards Solving a
Problem in the Doctrine of Chances (Phil Trans liii,
p 370), in which he proposed to solve this problem
with the help of the principle of equal distribution
of ignorance. Fisher clarified the problem in
his notable memoir On the Mathematical
Foundations of Theoretical Statistics (1921). This
work laid the foundations of statistical inference
by emphasizing the importance of exact solutions
of sampling problems.
Human blood groups by RR Race of the
Lister Institute, London in Biometrics
One outcome of Fisher's long interest in
human genetics was his setting up in 1935
a blood grouping department in the Galton
Laboratory. This was made possible by a
grant from the Rockefeller Foundation and
by the willingness of the late Dr GL Taylor,
an expert serologist, to leave the
Department of Pathology at Cambridge and
devote himself to blood groups, as reviewed by
RR Race in the 1964 memorial issue of
Biometrics.
Fisher saw in the blood groups a hopeful tool
for the improvement of knowledge of human
genetics. WL Stevens and DJ Finney were
then at the Galton Lab. Stevens applied
Fisher's method of maximum likelihood to
calculations of the A1A2BO and MN
frequencies (Edwards AWF (1972) Likelihood.
Cambridge University Press; Sham PC
(1998) Statistics in Human Genetics. Arnold,
London). Finney extended and applied
Fisher's u-statistics to linkage tests involving
blood groups.