Introduction to Probability Theory in
Machine Learning: A Bird's-Eye View
Mohammed Nasser
Professor, Dept. of Statistics, RU, Bangladesh
Email: [email protected]
Content of Our Present Lecture
• Introduction
• Problem of Induction and Role of Probability
• Techniques of Machine Learning: Density Estimation, Data Reduction
• Classification and Regression Problems
• Probability in Classification and Regression
• Introduction to Kernel Methods
Introduction
The problem of searching for patterns in data is the basic
problem of science.
For example, the extensive astronomical observations of Tycho Brahe in the 16th century allowed Johannes Kepler to discover the empirical laws of planetary motion, which in turn provided a springboard for the development of classical mechanics.
(Johannes Kepler, 1571-1630; Tycho Brahe, 1546-1601)
Introduction
Darwin's (1809-1882) study of nature during his five-year voyage on HMS Beagle revolutionized biology.
The discovery of regularities in atomic spectra played a key role in the development and verification of quantum physics in the early twentieth century.
Of late, the field of pattern recognition has become concerned with the automatic discovery of regularities in data through the use of computer algorithms, and with the use of these regularities to take actions such as classifying the data into different categories.
Problem of Induction
The inductive inference process:
• Observe a phenomenon
• Construct a model of the phenomenon
• Make predictions
→ This is more or less the definition of the natural sciences!
→ The goal of Machine Learning is to automate this process.
→ The goal of Learning Theory is to formalize it.
Problem of Induction
Let us suppose we somehow have measurements x1, x2, ..., xn, where n is very large, e.g. n = 10^10000000. Each of the measurements x1, x2, ..., xn satisfies a proposition P.
Can we say that the (n+1)th, i.e. the (10^10000000 + 1)th, observation satisfies P? Certainly not.
Problem of Induction
Let us consider P(n) = 10^(10^10) − n.
The question: Is P(n) > 0?
It is positive up to a very, very large number, but after that it becomes negative.
What can we do now?
Probabilistic framework to the
rescue!
Problem of Induction
What is the probability p that the sun will rise tomorrow?
• p is undefined, because there has never been an experiment that tested the existence of the sun tomorrow.
• p = 1, because the sun rose in all past experiments.
• p = 1 − ε, where ε is the proportion of stars that explode per day.
• p = (d + 1)/(d + 2), which is Laplace's rule derived from Bayes' rule (d = number of past days on which the sun rose).
Conclusion: We predict that the sun will rise tomorrow with high probability, independent of the justification.
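A minimal numeric sketch of Laplace's rule of succession as stated above; the value of d used here is a made-up illustration, not data from the lecture:

```python
def laplace_rule(d: int) -> float:
    """Laplace's rule of succession: p = (d + 1) / (d + 2) after d successes in d trials."""
    return (d + 1) / (d + 2)

# Hypothetical d: roughly 5000 years of recorded sunrises (illustration only).
d = 5000 * 365
print(f"Laplace estimate: {laplace_rule(d):.10f}")  # very close to, but less than, 1
```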
The Sub-Fields of ML
• Supervised Learning: Classification, Regression
• Unsupervised Learning: Clustering, Density estimation, Data reduction
• Reinforcement Learning
Unsupervised Learning: Density
Estimation
What is the weight of an elephant?
What is the weight of / distance to the sun?
What is the weight/size of a baby in the womb?
What is the weight of a DNA molecule?
Solution of the Classical Problem
Let us suppose we somehow have measurements x1, x2, ..., xn.
The million-dollar question: how can we choose the optimum one among the infinitely many possible ways of combining these n observations to estimate the target μ? And what is the optimum n?
We need the concepts of probability distributions and probability measures. For the ith observation we assume the model
X_i = μ + ε_i,   ε_i ~ F(x/σ),
where μ is the target that we want to estimate.
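A minimal simulation sketch of this setup, assuming (purely for illustration) that F is a normal scale family and that the sample mean is used to combine the observations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed model (illustration only): X_i = mu + eps_i, eps_i ~ N(0, sigma^2)
mu_true, sigma, n = 5.0, 2.0, 1_000
x = mu_true + sigma * rng.standard_normal(n)

# One simple way to combine the n observations: the sample mean.
mu_hat = x.mean()
print(f"sample mean = {mu_hat:.3f}  (target mu = {mu_true})")

# Whether the sample mean is optimal depends on F -- e.g. for heavy-tailed
# models such as the Cauchy distribution it is not even consistent.
```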
Probability Measures: Meaning of Measure
A measure μ is countably additive:
μ( ⋃_{n≥1} A_n ) = Σ_{n≥1} μ(A_n)   whenever the A_n are pairwise disjoint.
Probability measures on any sample space fall into two broad classes:
• Discrete: P(A) = 1 for some set A with #(A) finite or countable.
• Continuous (P{x} = 0 for all x):
  – Absolutely continuous
  – Non-absolutely continuous
Discrete distributions can be defined on any sample space; for continuous distributions on R^k we have special concepts.
Models
Is the sample mean appropriate for all the models? Population distributions come in very different shapes; knowing only the population means tells us far less than knowing Pr(a < X < b) for every a and b.
Approaches to Model Estimation
• Bayesian vs. Non-Bayesian
• Parametric, Semiparametric, Nonparametric
• CDF estimation, Density estimation
Infinite-dimensional Ignorance
Generally any function space is infinite-dimensional
Parametric modeling assumes our ignorance is
finite-dimensional
Semi-parametric modeling assumes our
ignorance has two parts: one finite-dimensional and
the other, infinite-dimensional
Non-parametric modeling assumes our
ignorance is infinite-dimensional
Parametric Density Estimation
Nonparametric Density Estimation
Semiparametric / Robust Density Estimation
(Figures: a parametric model fit vs. a nonparametric model fit.)
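A minimal sketch contrasting a parametric fit with a nonparametric one, assuming (for illustration) a Gaussian parametric model and a Gaussian kernel density estimate; the data here are simulated, not from the lecture:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated data from a bimodal mixture -- illustration only.
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 1.0, 700)])

# Parametric density estimation: assume a single normal model and fit mu, sigma.
mu_hat, sigma_hat = x.mean(), x.std(ddof=1)
parametric_pdf = stats.norm(mu_hat, sigma_hat).pdf

# Nonparametric density estimation: Gaussian kernel density estimate.
kde = stats.gaussian_kde(x)

grid = np.linspace(-5, 6, 5)
print("parametric:", np.round(parametric_pdf(grid), 3))
print("kde       :", np.round(kde(grid), 3))
# The parametric fit misses the bimodality; the KDE captures it.
```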
Application of Density Estimation
Picture of Three Objects / Distribution of Three Objects
Curse of Dimension
(Figures courtesy: Bishop, 2006)
(Scatter plot omitted.) If the population model is MVN with high correlation, it works well.
Unsupervised Learning: Data Reduction
(Scatter plots omitted: the data concentrate around a one-dimensional manifold.)
Problem-2
• Fisher's Iris Data (1936): This data set gives the measurements in cm (or mm) of the variables
– Sepal length
– Sepal width
– Petal length
– Petal width and
– Species (setosa, versicolor, and virginica)
There are 150 observations, 50 from each species.
We want to predict the class of a new observation.
What method is available to do the job?
LOOK! THE DEPENDENT VARIABLE IS CATEGORICAL***
THE INDEPENDENT VARIABLES ARE CONTINUOUS***
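A minimal sketch of one available method for this job, assuming scikit-learn is installed; linear discriminant analysis is used purely as an illustration, not as the lecture's prescribed method:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # 150 observations, 4 continuous predictors, 3 classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))

# Predict the class of a new observation (sepal/petal measurements in cm).
print("predicted class:", clf.predict([[5.1, 3.5, 1.4, 0.2]]))
```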
Problem-3
• BDHS (2004): The dependent variable is
–childbearing risk with two values (High Risk and
Low Risk).
The target is to predict the childbearing risk based on some socio-economic and demographic variables. The complete list of the variables is given on the next slide.
Again we are in a situation where the dependent variable is categorical and the independent variables are mixed.
Problem-4
Face Authentication (/ Identification)
• Face Authentication/Verification (1:1 matching)
• Face Identification/Recognition (1:N matching)
Applications
• Access Control
www.viisage.com
www.visionics.com
Applications
• Video Surveillance (on-line or off-line)
• Face Scan at Airports
www.facesnap.de
Why is Face Recognition Hard?
Inter-class Similarity
Twins
Father and son
Intra-class Variability
Handwritten digit recognition
We want to recognize postal codes automatically.
Problem 6: Credit Risk Analysis
• Typical customer: bank.
• Database:
– Current clients' data, including:
  – basic profile (income, house ownership, delinquent accounts, etc.)
  – basic classification.
• Goal: predict/decide whether to grant
credit.
Problem 7: Spam Email
Detection, Search Engine etc
traction.tractionsoftware.com
www.robmillard.com
Problem 9: Genome-wide data
• mRNA expression data
• hydrophobicity data
• protein-protein interaction data
• sequence data (gene, protein)
Problem 10: Robot control
• Goal: Control a robot in an unknown environment.
• Needs both
  – to explore (new places and actions)
  – to use acquired knowledge to gain benefits.
• The learning task “controls” what is observed!
Problem-11
• Wisconsin Breast Cancer Database (1992): This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg, or you can get it from (http://www.potschi.de/svmtut/breast-cancer-wisconsin.data). The variables are:
– Clump Thickness: 1-10
– Uniformity of Cell Size: 1-10
– Uniformity of Cell Shape: 1-10
– Marginal Adhesion: 1-10
– Single Epithelial Cell Size: 1-10
– Bare Nuclei: 1-10
– Bland Chromatin: 1-10
– Normal Nucleoli: 1-10
– Mitoses: 1-10
– Status: (benign, and malignant)
There are 699 observations available.
Now we want to predict the status of a patient: whether it is benign or malignant.
THE DEPENDENT VARIABLE IS CATEGORICAL.
Independent variables???
Problem 12: Data Description
Cardiovascular diseases affect the heart and blood vessels and include shock, heart failure, heart valve disease, congenital heart disease, etc.
• Despres et al. pointed out that the topography of adipose tissue (AT) is considered a risk factor for cardiovascular disease.
• It is important to measure the amount of intra-abdominal AT as part of the evaluation of an individual's cardiovascular-disease risk.
Data Description
• Problem: Computed tomography of AT is
  – very costly,
  – requires irradiation of the subject,
  – not available to many physicians.
• Materials: Simple anthropometric measurements, such as waist circumference, which can be obtained cheaply and easily.
• Variables:
  Y = deep abdominal AT
  X = waist circumference (in cm)
• Total observations: 109 (men).
• Question: How well can we predict and estimate the deep abdominal AT from knowledge of the waist circumference?
Data source: W. W. Daniel (2003)
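A minimal sketch of the simple linear regression this problem calls for, using simulated (X, Y) pairs in place of Daniel's 109 observations, which are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated stand-in for the 109 (waist circumference, deep abdominal AT) pairs.
waist = rng.uniform(65, 120, 109)                       # X, in cm
at = -216 + 3.5 * waist + rng.normal(0, 33, 109)        # Y, hypothetical linear relation + noise

# Ordinary least squares fit of Y on X.
slope, intercept = np.polyfit(waist, at, deg=1)
print(f"AT_hat = {intercept:.1f} + {slope:.2f} * waist")

# Predicted deep abdominal AT for a new subject with a 100 cm waist.
print("prediction at 100 cm:", intercept + slope * 100)
```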
Complex Problem 13
Hypothesis: The infant’s size at birth is associated with the
maternal characteristics and SES
Variables: X
Maternal & SES
1. Age (x1)
2. Parity (x2)
3. Gestational age (x3)
4. Mid-Upper Arm
Circumference MUAC (x4)
5. Supplementation
group (x5)
6. SES index (x6)
Infant’s Size at birth: Y
1. Weight (y1)
2. Length (y2)
3. MUAC (y3)
4. Head circumference (HC) (y4)
5. Chest Circumference (CC) (y5)
CCA, KCCA, MR, PLS, etc. give us some solutions to this complex problem.
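A minimal sketch of one of the listed approaches (PLS regression via scikit-learn), using randomly generated X and Y blocks in place of the actual maternal/SES and infant-size variables:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)

# Hypothetical stand-ins: 6 maternal/SES predictors (X) and 5 infant-size responses (Y).
n = 200
X = rng.normal(size=(n, 6))
Y = X @ rng.normal(size=(6, 5)) * 0.5 + rng.normal(size=(n, 5))

# Partial least squares with 2 latent components relates the two blocks jointly.
pls = PLSRegression(n_components=2).fit(X, Y)
Y_hat = pls.predict(X)
print("fitted Y shape:", Y_hat.shape)   # (200, 5): all five responses predicted together
```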
Data
• Vectors: collections of features, e.g. height, weight, blood pressure, age, ...; categorical variables can be mapped into vectors.
• Matrices: images, movies, remote sensing and satellite data (multispectral).
• Strings: documents, gene sequences.
• Structured objects: XML documents, graphs.
Let Us Summarize!!
Classification (reminder)
Y = g(X), g : X → Y
X can be anything:
• continuous (R, R^d, ...)
• discrete ({0,1}, {1,...,k}, ...)
• structured (tree, string, ...)
• ...
Y is discrete:
– {0,1}: binary
– {1,...,k}: multi-class
– tree, etc.: structured
Classification (reminder)
X can be anything: continuous (R, R^d, ...), discrete ({0,1}, {1,...,k}, ...), structured (tree, string, ...), ...
Methods: Perceptron, Logistic Regression, Support Vector Machine (via the kernel trick), Decision Tree, Random Forest.
Regression
Y = g(X), g : X → Y
X can be anything:
• continuous (R, R^d, ...)
• discrete ({0,1}, {1,...,k}, ...)
• structured (tree, string, ...)
• ...
Y is continuous (R, R^d), though not always.
Regression
X can be anything: continuous (R, R^d, ...), discrete ({0,1}, {1,...,k}, ...), structured (tree, string, ...), ...
Methods: Perceptron, Normal Regression, Support Vector Regression (via the kernel trick), GLM.
Which is better?
Some Necessary Terms
• Training data: the (X, Y) pairs we are given.
• Testing data: the (X, Y) pairs we will see in the future.
• Training error: the average value of the loss on the training data.
• Test error: the average value of the loss on the test data.
How do we calculate them?
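A minimal sketch of how these two quantities are computed in practice, assuming scikit-learn and using the iris data from Problem-2 with 0-1 loss; the particular classifier is an arbitrary illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import zero_one_loss

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Training error: average 0-1 loss on the data used for fitting.
print("training error:", zero_one_loss(y_tr, clf.predict(X_tr)))
# Test error: average 0-1 loss on held-out data, a proxy for future unseen data.
print("test error    :", zero_one_loss(y_te, clf.predict(X_te)))
```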
Our Real Goal
What is our real goal? To do well on the data we
have seen already?
Usually not. We already have the answers for that
data.
We want to perform well on future unseen data.
So ideally we would like to minimize the test error.
How to do this if we don't have test data?
Probabilistic framework to rescue us!
Idea: look for regularities in the observed phenomenon
These can be generalized from the observed past to the
future
No Free Lunch
If there is no assumption on how the past is related
to the future, prediction is impossible
If there is no restriction on the possible phenomena,
generalization is impossible
We need to make assumptions
• Simplicity is not absolute
• Data will never replace knowledge
• Generalization = data + knowledge
Two types of assumptions:
• Future observations are related to past ones: stationarity of the phenomenon.
• Constraints on the phenomenon: notion of simplicity.
Probabilistic Model
Relationship between past and future observations:
Sampled independently from the same distribution
Independence: each new observation yields maximum
information
Identical distribution: the observations give
information about the underlying phenomenon (here a
probability distribution)
We consider an input space X and output space Y.
Assumption: The pairs (X, Y) ∈ X × Y are distributed according to P (unknown).
Data: We observe a sequence of n i.i.d. pairs (Xi, Yi) sampled according to P.
Goal: construct a function g : X → Y which predicts Y from X.
Problem Revisited
• The distribution of baby weights at a hospital is ~ N(3400, 360000).
Your “best guess” at a random baby's weight, given no information about the baby, is what? 3400 grams???
But what if you have relevant information? Can you make a better guess?
At 30 weeks…
(Scatter plot: X = gestation time in weeks, Y = birth weight in grams; the point (x, y) = (30, 3000) lies on the fitted curve g.)
At 30 weeks…
• The babies that gestate for 30 weeks appear
to center around a weight of 3000 grams.
• In Math-Speak…
E(Y | X = 30 weeks) = 3000 grams
Note: this is a conditional expectation.
Exercise: Show that V(Y) > V[E(Y | X)] if Y and X are not independent.
R(g) = ∫∫ (y − g(x))² p(x, y) dx dy is minimized when g(x) = E(Y | X = x).
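A short derivation of the last claim, sketched in the standard way by conditioning on X and minimizing pointwise in x:

```latex
R(g) = \iint \bigl(y - g(x)\bigr)^2 p(x,y)\,dx\,dy
     = \int \Bigl[\int \bigl(y - g(x)\bigr)^2 p(y \mid x)\,dy\Bigr] p(x)\,dx .
% For each fixed x, write m(x) = E(Y | X = x) and expand the inner integral:
\int \bigl(y - g(x)\bigr)^2 p(y \mid x)\,dy
     = \int \bigl(y - m(x)\bigr)^2 p(y \mid x)\,dy + \bigl(m(x) - g(x)\bigr)^2 ,
% since the cross term 2(m(x) - g(x)) \int (y - m(x)) p(y|x) dy vanishes.
% The first term does not depend on g, so R(g) is minimized by g(x) = m(x) = E(Y | X = x).
```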
Risk Functional
Risk functional:
R_{L,P}(g) = ∫_{X×Y} L(x, y, g(x)) dP(x, y) = ∫_X ∫_Y L(x, y, g(x)) dP(y | x) dP_X
Population regression functional / classifier g*:
R_{L,P}(g*) = inf_{g : X → Y} R_{L,P}(g)
• P is chosen by nature; L is chosen by the scientist.
• Both R_{L,P}(g*) and g* are unknown.
• From a sample D, we will select g_D by a learning method (???).
Empirical Risk Minimization
Empirical risk functional:
R_{L,P_n}(g) = ∫_{X×Y} L(x, y, g(x)) dP_n(x, y) = ∫_X ∫_Y L(x, y, g(x)) dP_n(y | x) dP_{n,X} = (1/n) Σ_{i=1}^{n} L(x_i, y_i, g(x_i))
There are problems with plain empirical risk minimization (see the next slide).
What Can We Do?
• We can restrict the set of functions over which we minimize the empirical risk functional (Structural Risk Minimization).
• We can modify the criterion to be minimized, e.g. by adding a penalty for ‘complicated’ functions (Regularization).
• We can combine the two.
Probabilistic vs ERM Modeling
We Do Need More
• We want R_{L,P}(g_D) to be very close to R_{L,P}(g*).
• Does closeness of R_{L,P}(g_D) to R_{L,P}(g*) imply closeness of g_D to g*?
• Both R_{L,P}(g_D) and g_D should be smooth in P.
• To measure closeness of R_{L,P}(g_D) to R_{L,P}(g*) we need limit theorems and inequalities of Probability Theory.
• To ensure convergence of g_D to g* we need a very strong form of convergence of R_{L,P}(g_D) to R_{L,P}(g*).
We Do Need More
To check smoothness of g_D w.r.t. P we need tools of Functional Calculus.
How do we find g_D? Does g* exist? It is a problem of optimization in function space.
Universally consistent:
R_{L,P}(g_D) → R_{L,P}(g*) in probability as n → ∞, for every distribution P.
SVMs are often universally consistent. Rate of convergence??
Consistency of ERM
Consistency of ERM means that both the empirical risk R_emp(α_n*) of the ERM solution and its true risk R(α_n*) converge to the minimal achievable risk as n → ∞.
Key Theorem of VC-theory
• For bounded loss functions, the ERM principle is consistent
if and only if the empirical risk Remp ( ) converges uniformly
to the true risk R ( ) in the following sense
lim n  P[sup  R( )  Remp ( )   ]  0,   0
• consistency is determined by the worst-case approximating
function, providing the largest discrepancy btwn the
empirical risk and true risk
Note: this condition is not useful in practice. Need conditions
for consistency in terms of general properties of a set of
loss functions (approximating functions)
68
Empirical Risk Minimization vs Structural Risk Minimization
• Empirical risk minimization is not good from the viewpoint of generalization error.
• Vapnik and Chervonenkis (1995, 1998) studied under what conditions uniform convergence of the empirical risk to the expected risk takes place. The results are formulated in terms of three important quantities (the VC entropy, the annealed VC entropy and the growth function), related to two topological concepts, the ε-net and the covering number. These concepts lead to the working ideas of the VC dimension and SRM.
Empirical Risk Minimization vs Structural Risk Minimization
• Principle: minimize an upper bound on the true risk:
True Risk <= Empirical Risk + Complexity Penalty
With probability 1 − δ,
R(g) ≤ R_emp(g) + √[ ( h(ln(2n/h) + 1) − ln(δ/4) ) / n ],
where h is the VC dimension of the function class.
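A minimal numeric sketch of the complexity penalty in this bound, evaluated for a few hypothetical values of the VC dimension h (the n and δ values are arbitrary illustrations):

```python
import numpy as np

def vc_penalty(n: int, h: int, delta: float) -> float:
    """Complexity term sqrt((h*(ln(2n/h) + 1) - ln(delta/4)) / n) from the VC bound."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(delta / 4)) / n)

n, delta = 10_000, 0.05
for h in (10, 100, 1000):
    print(f"h = {h:4d}  penalty = {vc_penalty(n, h, delta):.3f}")
# The bound degrades as the VC dimension h grows relative to the sample size n,
# which is what motivates structural risk minimization.
```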
Structural Risk Minimization
Kernel methods: Heuristic View
Traditional or non-traditional?
Steps for Kernel Methods
DATA MATRIX → Kernel matrix K = [k(x_i, x_j)], a positive semi-definite matrix → Pattern function f(x) = Σ_i α_i k(x_i, x).
Which k???? Why p.s.d.??
Kernel methods: Heuristic View
(Figure: a nonlinear map from the original space to the feature space.)
Kernel methods: basic ideas
• The kernel methods approach is to stick with linear
functions but work in a high dimensional feature
space:
• The expectation is that the feature space has a much
higher dimension than the input space.
• The feature space has an inner product, and
k(xi, xj) = ⟨Φ(xi), Φ(xj)⟩.
Kernel methods: Heuristic View
Form of functions
• So kernel methods use linear functions in a feature space, f(x) = ⟨w, Φ(x)⟩ + b.
• For regression this could be the function itself.
• For classification we require thresholding, e.g. sign(f(x)).
Kernel methods: Heuristic View
Feature spaces
Φ : x ↦ Φ(x), R^d → F, a non-linear mapping to F, where F can be
1. a high-dimensional space,
2. an infinite-dimensional countable space such as ℓ2,
3. a function space (Hilbert space).
Example: Φ(x, y) = (x², y², √2·xy).
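A minimal sketch verifying, for this example feature map, that the inner product in feature space equals a kernel evaluated in the original space (here the degree-2 polynomial kernel k(u, v) = ⟨u, v⟩², an assumption consistent with the map above):

```python
import numpy as np

def phi(u):
    """Feature map (x, y) -> (x^2, y^2, sqrt(2)*x*y) for a 2-D input u."""
    x, y = u
    return np.array([x**2, y**2, np.sqrt(2) * x * y])

def k(u, v):
    """Polynomial kernel of degree 2: k(u, v) = <u, v>^2."""
    return np.dot(u, v) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(u), phi(v)))   # inner product in feature space
print(k(u, v))                  # same number, computed without ever forming phi
```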
Kernel methods: Heuristic View
Kernel trick
Note: In the dual representation we used the Gram matrix to express the solution.
Kernel trick: replace x by Φ(x), so the Gram matrix entries
G_ij = ⟨x_i, x_j⟩ become G_ij = ⟨Φ(x_i), Φ(x_j)⟩ = K(x_i, x_j).
If we use algorithms that only depend on the Gram matrix G, then we never have to know (compute) the actual features Φ(x).
Gist of Kernel methods
• Choice of a kernel function.
• Through the choice of a kernel function we choose a Hilbert space.
• We then apply the linear method in this new space without increasing computational complexity, using the mathematical niceties of this space.
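A minimal sketch of this gist, assuming a Gaussian (RBF) kernel as the chosen kernel function and kernel ridge regression as the linear method applied in the induced Hilbert space; the data are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated 1-D regression data with a nonlinear trend.
X = rng.uniform(-3, 3, (80, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 80)

def rbf_kernel(A, B, gamma=0.5):
    """Gram matrix K_ij = exp(-gamma * ||a_i - b_j||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Kernel ridge regression: everything depends on the data only through the Gram matrix.
lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual coefficients

X_new = np.array([[0.0], [1.5]])
y_pred = rbf_kernel(X_new, X) @ alpha                  # f(x) = sum_i alpha_i k(x_i, x)
print(y_pred)   # should be close to sin(0) = 0 and sin(1.5) ≈ 0.997
```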
Acknowledgement
Alexander J. Smola
Statistical Machine Learning Program
Canberra, ACT 0200 Australia
[email protected]
Statistical Learning Theory
Olivier Bousquet
Department of Empirical Inference
Max Planck Institute of Biological Cybernetics
[email protected]
Machine Learning Summer School, August 2003