Neural networks in data analysis
Michal Kreps, KIT
ISAPP Summer Institute 2009
Outline
What are neural networks
Basic principles of NN
For what they are useful
Training of NN
Available packages
Comparison to other methods
Some examples of NN usage
[Figure: feed-forward network with input layer, hidden layer and output layer]
I'm not an expert on neural networks, only an advanced user who understands the principles.
Goal: Help you to understand the principles so that:
→ neural networks don't look like a black box to you
→ you have a starting point to use neural networks yourself
Tasks to solve
Different tasks in nature
Some easy to solve algorithmically, some difficult or impossible
Reading handwritten text
Lots of personal specifics
Recognition often needs to work with only partial information
Practically impossible in an algorithmic way
But our brains can solve these tasks extremely fast
Use the brain as inspiration
Tasks to solve
Why would an (astro-)particle physicist care about recognition of handwritten text?
Because it is a classification task on not very regular data
Do we now see the letter A or H?
→ Classification tasks appear very often in our analyses
Is this event signal or background?
→ Neural networks have some advantages in this task:
They can learn from known examples (data or simulation)
They can evolve to complex non-linear models
They can easily deal with complicated correlations among inputs
Going to more inputs does not cost additional examples
Down to principles
A detailed look shows a huge number of neurons
Neurons are interconnected, each having many inputs from other neurons
To build an artificial neural network:
Zoom in
Build a model of the neuron
Connect many neurons together
Model of neuron
Get inputs
Combine inputs into a single quantity: S = Σ_i w_i x_i
Pass S through an activation function
Most common: A(x) = 2/(1 + e^{-x}) − 1
Give output
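A minimal sketch of this neuron model in Python (an illustration, not code from the lecture; the inputs and weights below are made up):

```python
import numpy as np

def activation(x):
    # Symmetric sigmoid A(x) = 2/(1 + exp(-x)) - 1, output in (-1, 1);
    # equivalent to tanh(x/2).
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def neuron(inputs, weights):
    # Combine inputs into a single quantity S = sum_i w_i x_i,
    # then pass S through the activation function.
    S = np.dot(weights, inputs)
    return activation(S)

# Hypothetical example with three inputs
x = np.array([0.5, -1.2, 2.0])
w = np.array([0.8, 0.1, -0.3])
print(neuron(x, w))
```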
Building neural network
Complicated structure [figure]
Even more complicated structure [figure]
Building neural network
Or simple structures
[Figure: feed-forward network with input layer, hidden layer and output layer]
Output of Feed-forward NN
Output of node k in second layer: o_k = A( Σ_i w^{1→2}_{ik} x_i )
Output of node j in output layer: o_j = A( Σ_k w^{2→3}_{kj} x_k )
Putting both together: o_j = A( Σ_k w^{2→3}_{kj} A( Σ_i w^{1→2}_{ik} x_i ) )
More layers usually not needed
If more layers involved, output follows same logic
In following, stick to only three layers as here
Knowledge is stored in weights
[Figure: feed-forward network with input layer, hidden layer and output layer]
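The same output formula as a short sketch, assuming for illustration an 8-input, 2-hidden-node, 1-output architecture with random placeholder weights:

```python
import numpy as np

def activation(x):
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def feed_forward(x, w12, w23):
    # Hidden layer: o_k = A( sum_i w12[i, k] * x_i )
    hidden = activation(x @ w12)
    # Output layer: o_j = A( sum_k w23[k, j] * hidden_k )
    return activation(hidden @ w23)

rng = np.random.default_rng(1)
x = rng.normal(size=8)          # 8 input nodes
w12 = rng.normal(size=(8, 2))   # input -> hidden weights
w23 = rng.normal(size=(2, 1))   # hidden -> output weights
print(feed_forward(x, w12, w23))
```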
Tasks we can solve
Classification
Binary decision whether an event belongs to one of the two classes
Is this particle a kaon or not?
Is this event a candidate for the Higgs?
Will Germany become soccer world champion in 2010?
→ Output layer contains a single node
Conditional probability density
Get a probability density for each event
What is the decay time of a decay?
What is the energy of a particle given the observed shower?
What is the probability that a customer will cause a car accident?
→ Output layer contains many nodes
NN work flow
Neural network training
After assembling the neural network, the weights w^{1→2}_{ik} and w^{2→3}_{kj} are unknown
Initialised to a default value
Or to random values
Before using the neural network for the planned task, need to find the best set of weights
Need some measure which tells which set of weights performs best ⇒ loss function
Training is the process where, based on known examples of events, we search for the set of weights which minimises the loss function
Most important part in using neural networks
Bad training can result in bad or even wrong solutions
Many traps in the process, so need to be careful
Loss function
Practically any function which allows you to decide which neural network is better
→ In particle physics it can be the one which results in the best measurement
In practical applications, two commonly used functions (T_ji is the true value, −1 for one class and 1 for the other; o_ji is the actual neural network output):
Quadratic loss function: χ² = Σ_j w_j χ²_j with χ²_j = ½ Σ_i (T_ji − o_ji)²
→ χ² = 0 for correct decisions and χ² = 2 for wrong ones
Entropy loss function: E_D = Σ_j w_j E_D^j with E_D^j = Σ_i −log( ½ (1 + T_ji o_ji) )
→ E_D = 0 for correct decisions and E_D = ∞ for wrong ones
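A small sketch of both loss functions for a single output node (so the node weight w_j is dropped); the targets and outputs below are placeholder numbers:

```python
import numpy as np

def quadratic_loss(T, o):
    # chi^2_j = 1/2 * sum_i (T_i - o_i)^2 ; a fully wrong decision contributes 2
    return 0.5 * np.sum((T - o) ** 2)

def entropy_loss(T, o):
    # E_D^j = sum_i -log( 1/2 * (1 + T_i * o_i) ); diverges for fully wrong decisions
    return np.sum(-np.log(0.5 * (1.0 + T * o)))

T = np.array([1.0, -1.0, 1.0])    # true classes (+1 / -1)
o = np.array([0.9, -0.8, 0.2])    # network outputs in (-1, 1)
print(quadratic_loss(T, o), entropy_loss(T, o))
```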
Conditional probability density
Classification is the easy case: we need a single output node, and if the output is above a chosen value it is class 1 (signal), otherwise class 2 (background)
Now the question is how to obtain a probability density
Use many nodes in the output layer, each answering the question whether the value is larger than some threshold
As an example, let's have n = 10 nodes and a true value of 0.63
Vector of true values for the output nodes: (1, 1, 1, 1, 1, 1, −1, −1, −1, −1)
Node j answers the question whether the value is larger than j/10
The output vector then represents the integrated probability density
After differentiation we obtain the desired probability density
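A sketch of how the target vector and the density could be built for the 10-node example (illustration only; the actual NeuroBayes encoding may differ in details):

```python
import numpy as np

n_nodes = 10
thresholds = np.arange(1, n_nodes + 1) / n_nodes     # node j asks: value > j/10 ?

def target_vector(true_value):
    # +1 if the true value is above the node's threshold, -1 otherwise
    return np.where(true_value > thresholds, 1, -1)

print(target_vector(0.63))        # six +1 entries followed by four -1 entries

# Toy network outputs in [-1, 1]; rescaled they approximate P(value > j/10),
# and differences between neighbouring thresholds give the density per bin.
outputs = np.array([0.98, 0.95, 0.9, 0.8, 0.6, 0.3, -0.2, -0.6, -0.85, -0.97])
survival = 0.5 * (1.0 + outputs)                      # map [-1, 1] -> [0, 1]
density = -np.diff(np.concatenate(([1.0], survival))) # probability content per 0.1 bin
print(density, density.sum())                         # sums to ~1
```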
Interpretation of NN output
In classification tasks, in general a higher NN output means a more signal-like event
→ NN output rescaled to the interval [0, 1] can be interpreted as a probability, provided:
Quadratic/entropy loss function is used
NN is well trained ⇔ loss function minimised
→ Purity P(o) = N_selected signal(o) / N_selected(o) should be a linear function of the NN output
→ NN is a function which maps the n-dimensional input space to a 1-dimensional space
Looking at the expression for the output: o_j = A( Σ_k w^{2→3}_{kj} A( Σ_i w^{1→2}_{ik} x_i ) )
[Figure: purity vs. network output]
How to check quality of training
Define quantities:
Signal efficiency: ε_s = N(selected signal) / N(all signal)
Efficiency: ε = N(selected) / N(all)
Purity: P = N(selected signal) / N(selected)
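A sketch computing these quantities from arrays of NN outputs and true labels for a given cut (the function name and toy data are assumptions):

```python
import numpy as np

def metrics(output, is_signal, cut):
    sel = output > cut                                         # events passing the cut
    eff_signal = np.sum(sel & is_signal) / np.sum(is_signal)   # N(sel. signal) / N(all signal)
    eff = np.sum(sel) / len(output)                            # N(selected) / N(all)
    purity = np.sum(sel & is_signal) / np.sum(sel)             # N(sel. signal) / N(selected)
    return eff_signal, eff, purity

rng = np.random.default_rng(0)
is_signal = rng.random(10000) < 0.3
# toy NN output: signal peaks towards +1, background towards -1
output = np.where(is_signal,
                  rng.normal(0.5, 0.4, 10000),
                  rng.normal(-0.5, 0.4, 10000))
print(metrics(output, is_signal, cut=0.0))
```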
How to check quality of training
Distribution of NN output for the two classes
[Figure: network output histograms for signal and background]
How good is the separation?
How to check quality of training
Purity as a function of NN output
[Figure: purity vs. network output]
Are we at the minimum of the loss function?
How to check quality of training
Signal purity vs. signal efficiency, where each point corresponds to a requirement NN output > cut
[Figure: curves for different training S/B]
Best point is (1, 1)
How to check quality of training
Signal efficiency vs. efficiency, where each point corresponds to a requirement NN output > cut
[Figure: best achievable curve and random-decision diagonal]
Issues in training
Main issue: finding the minimum in a high dimensional space
5 input nodes with 5 nodes in the hidden layer and one output node gives 5×5 + 5 = 30 weights
Imagine the task of finding the deepest valley in the Alps (only 2 dimensions)
Being put to some random place, it is easy to find some minimum
But what is in the next valley?
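The lecture does not spell out the minimisation algorithm; as a hedged illustration of the "next valley" problem, here is plain gradient descent on a toy 1-dimensional loss with two minima, started from several random points (all names and numbers are assumptions):

```python
import numpy as np

def loss(w):
    # toy loss with a shallow local minimum and a deeper one
    return 0.05 * w**4 - 0.5 * w**2 + 0.3 * w

def grad(w):
    return 0.2 * w**3 - 1.0 * w + 0.3

rng = np.random.default_rng(2)
for start in rng.uniform(-4, 4, size=5):
    w = start
    for _ in range(2000):
        w -= 0.01 * grad(w)      # gradient descent step
    # depending on the starting point we end up in different valleys
    print(f"start {start:+.2f} -> minimum at w = {w:+.2f}, loss = {loss(w):.3f}")
```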
Issues in training
In training, we start with:
A large number of weights which are far from their optimal values
Input values in unknown ranges
→ Large chance for numerical issues in the calculation of the loss function
Statistical issues coming from low statistics in training
In many places in physics not that much of an issue, as the training statistics can be large
In industry it can be quite a big issue to obtain a large enough training dataset
But also in physics it can sometimes be very costly to obtain a large enough training sample
Learning by heart (overtraining), when the NN has enough weights to do that
How to treat issues
All issues mentioned can in principle be treated by the neural network package
Regularisation for numerical issues at the beginning of training
Weight decay - forgetting (a small sketch follows after this list)
Helps against overtraining on statistical fluctuations
The idea is that effects which are not significant are not shown often enough to keep the weights that remember them
Pruning of connections - remove connections which don't carry significant information
Preprocessing
Helps with numerical issues
Search for a good minimum by a good starting point and decorrelation
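A hedged sketch of the weight-decay idea (not the NeuroBayes implementation): every update shrinks each weight a little, so only weights that are regularly reinforced by the data survive.

```python
import numpy as np

def update(weights, gradient, lr=0.01, decay=1e-3):
    # usual gradient step plus "forgetting": each weight decays towards zero
    return weights - lr * gradient - decay * weights

# Equivalent view: minimising  loss + (decay / (2 * lr)) * sum(w**2)
w = np.array([0.5, -2.0, 0.01])
g = np.array([0.1, -0.3, 0.0])     # hypothetical gradients of the loss
print(update(w, g))
```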
NeuroBayes preprocessing
[Figure: distribution of input node 5, flat after binning]
Bin into bins of equal statistics
Calculate the purity for each bin and do a spline fit
[Figure: purity vs. bin number with spline fit]
Use the purity instead of the original value, plus transform to µ = 0 and RMS = 1 (a rough sketch of this chain follows below)
[Figure: final transformed input distribution]
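A rough sketch of this preprocessing chain (equal-statistics binning, per-bin purity, transform to mean 0 and RMS 1); it omits the spline fit and is an assumption for illustration, not the NeuroBayes code:

```python
import numpy as np

def preprocess(x, is_signal, n_bins=100):
    # Bin the input into bins of equal statistics using quantiles
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)

    # Purity of each bin: fraction of signal events falling into it
    purity = np.array([is_signal[idx == b].mean() if np.any(idx == b) else 0.5
                       for b in range(n_bins)])
    # (NeuroBayes additionally smooths purity vs. bin number with a spline fit)

    # Replace each value by its bin purity, then transform to mean 0, RMS 1
    t = purity[idx]
    return (t - t.mean()) / t.std()

rng = np.random.default_rng(3)
is_signal = rng.random(20000) < 0.5
x = np.where(is_signal, rng.normal(1.0, 1.0, 20000), rng.normal(0.0, 1.0, 20000))
print(preprocess(x, is_signal)[:5])
```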
Electron ID at CDF
Comparing two different NN packages with exactly the same inputs
Preprocessing is important and can improve the result significantly
Available packages
Not an exhaustive list
NeuroBayes
http://www.phi-t.de
Best I know
Commercial, but the fee is small for scientific use
Developed originally in a high energy physics experiment
ROOT/TMVA
Part of the ROOT package
Can be used for playing, but I would advise using something else for serious applications
SNNS: Stuttgart Neural Network Simulator
http://www.ra.cs.uni-tuebingen.de/SNNS/
Written in C, but a ROOT interface exists
Best open source package I know about
NN vs. other multivariate techniques
Several times I saw statements of the type "my favourite method is better than other multivariate techniques"
Usually a lot of time is spent to understand and properly train the favourite method
Other methods are tried just for a quick comparison, without understanding or tuning
Often you will not even learn any details of how the other methods were set up
With TMVA it is now easy to compare different methods - often I saw that the Fisher discriminant won in those comparisons
From some tests we found that the NN implemented in TMVA is relatively bad
No wonder that it is usually worse than the other methods
Usually it is useful to understand the method you are using; none of them is a black box
Practical tips
No need for several hidden layers, one should be sufficient for you
Number of nodes in the hidden layer should not be very large
→ Risk of learning by heart
Number of nodes in the hidden layer should not be very small
→ Not enough capacity to learn your problem
→ Usually O(number of inputs) nodes in the hidden layer
Understand your problem: the NN gets numbers and can dig information out of them, but you know the meaning of those numbers
Practical tips
Don't throw away knowledge you acquired about the problem before starting to use a NN
→ The NN can do a lot, but you don't need it to discover what you already know
If you already know that a given event is background, throw it away
If you know about good inputs, use them rather than the raw data from which they are calculated
Use the neural network to learn what you don't know yet
Physics examples
Only mentioning work done by the Karlsruhe HEP group
"Stable" particle identification at Delphi, CDF and now Belle
Properties (E, φ, θ, Q-value) of inclusively reconstructed B mesons
Decay time in semileptonic Bs decays
Heavy quark resonance studies: X(3872), B**, Bs**, Λc*, Σc
B flavour tagging and Bs mixing (Delphi, CDF, Belle)
CPV in Bs → J/ψ φ decays (CDF)
B tagging for high pT physics (CDF, CMS)
Single top and Higgs search (CDF)
B hadron full reconstruction (Belle)
Optimisation of the KEKB accelerator
Bs** at CDF
B**_s → B+ K- → [J/ψ K+] K-
[Figure: neural network output for signal and background, CDF Run 2, 1.0 fb^-1]
[Figure: M(B+ K-) − M(B+) − M(K-) spectrum (candidates per 1.25 MeV/c²) with Bs1 and B*_s2 signals]
First observation of Bs1
Most precise masses
Background from data
Signal from MC
X(3872) mass at CDF
X(3872) is the first of the charmonium-like states
Lot of effort to understand its nature
[Figure: neural network output for signal and background, CDF II Preliminary, 2.4 fb^-1]
[Figure: J/ψ ππ mass spectrum (candidates per 2 MeV/c²) showing total, rejected and selected candidates]
Powerful selection at CDF of the largest sample, leading to the best mass measurement
Single top search
Tagging whether a jet is a b-quark jet or not using a NN
[Figure: KIT flavour separator output, normalized to unit area, for single top, tt, Wbb+Wcc, Wc, Wqq, diboson, Z+jets, QCD]
Complicated analysis with many different multivariate techniques
Final data plot
[Figure: NN output for candidate events, all channels, CDF II Preliminary 3.2 fb^-1, MC normalized to SM prediction]
Part of a big program leading to the discovery of single top production
B flavour tagging - CDF
Flavour tagging at hadron colliders is a difficult task
Lot of potential for improvements
In the plot, concentrate on the continuous lines
Example of not that good separation
Training with weights
[Figure: Mass(Λc π) − Mass(Λc) distribution, CDF Run II preliminary, L = 2.4 fb^-1, with ≈ 23300 signal events and ≈ 472200 background events]
One knows statistically whether an event is signal or background
Or one has a sample with a mixture of signal and background and a pure sample for one class
Can use data only; shown here as an example of the application
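A hedged sketch of the idea: each event enters the loss with a statistical signal probability (for example from a mass fit), so the network can be trained on data alone. The per-event probabilities and the loss shape below are assumptions for illustration, not the CDF analysis code:

```python
import numpy as np

def weighted_quadratic_loss(outputs, signal_prob):
    # Each event counts once as signal (target +1) with weight p
    # and once as background (target -1) with weight 1 - p.
    return np.sum(signal_prob * 0.5 * (1.0 - outputs) ** 2
                  + (1.0 - signal_prob) * 0.5 * (-1.0 - outputs) ** 2)

outputs = np.array([0.7, -0.4, 0.1])       # NN outputs for three data events
signal_prob = np.array([0.9, 0.05, 0.5])   # hypothetical per-event signal probabilities
print(weighted_quadratic_loss(outputs, signal_prob))
```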
Training with weights
[Figure: Mass(Λc π) − Mass(Λc) distribution before selection, CDF Run II preliminary, L = 2.4 fb^-1, ≈ 23300 signal and ≈ 472200 background events]
[Figure: Mass(Λ+c π+) − Mass(Λ+c) after selection (candidates per 1.30 MeV), with fit function, Σc(2455)++ → Λ+c π+ and Σc(2520)++ → Λ+c π+ signals, and (data − fit)/data residuals]
→ Use only data in the selection
→ Can compare data to simulation
→ Can reweight simulation to match data
Examples outside physics
Medicine and pharma research
e.g. effects and undesirable effects of drugs, early tumour recognition
Banks
e.g. credit scoring (Basel II), financial time series prediction, valuation of derivatives, risk-minimised trading strategies, client valuation
Insurances
e.g. risk and cost prediction for individual clients, probability of contract cancellation, fraud recognition, fairness in tariffs
Retail chains:
turnover prognosis for individual articles/stores
Data Mining Cup
World's largest student competition: Data-Mining-Cup
Very successful in competition with other data-mining methods
2005: fraud detection in internet trading
2006: price prediction in eBay auctions
2007: coupon redemption prediction
2008: lottery customer behaviour prediction
NeuroBayes Turnover prognosis
Individual health costs
Pilot project for a large private health insurance
Prognosis of the costs in the following year for each insured person, with confidence intervals
4 years of training, test on the following year
Results: probability density for each customer/tariff combination
Very good test results!
Potential for an objective cost reduction in health management
Prognosis of sport events
Results: probabilities for home - tie - guest
Conclusions
Neural networks are a powerful tool for your analysis
They are not geniuses, but can do marvellous things for a genius
A lecture of this kind cannot really make you an expert
Hope that I was successful with:
Giving you enough insight so that a neural network is not a black box anymore
Teaching you enough not to be afraid to try
Feel free to talk to me, I'm at Campus Süd, building 30.23, room 9-11