Machine Learning Techniques to Identify Higgs Boson Events from CERN Data
Department of Mathematics and Statistics, University of Wisconsin-La Crosse, WI 54601
Elizabeth R. McMahon, Chad Vidden
The discovery of the Higgs boson, the theorized particle associated with the field that
gives other elementary particles their mass, was announced by CERN physicists on
July 4th, 2012.[1,2] The existence of the Higgs particle (the quantum excitation of the
Higgs field) had been hypothesized since the 1960s; however, its large mass
(125.09 GeV) is associated with an extremely short lifetime of 1.56 × 10^-22 s. Thus, a
Higgs boson can only be identified by observing the decay products of high-energy
collisions, a difficult process. Scientists at CERN created a simulated data set that
mimicked the results from the ATLAS experiments, and released this data to the public
in the form of a Machine Learning Challenge on the popular data science website
Kaggle.com. Machine learning methods (decision trees, random forests, logistic
regression, and Naïve Bayes) were tested for their ability to accurately distinguish
Higgs events from background data in the CERN dataset. Accuracy ranged from
74.90% (Naïve Bayes) to 82.27% (logistic regression).
Figure 1. The Standard Model of Particle Physics, with the addition of the Higgs Boson (in yellow).
Eleven variables had some degree of missing data (indicated by the yellow
color in Figure 8). A missing value means that the variable either was not
observed in the p+-p+ collision (PRI variables) or was a meaningless
calculation for that event (DER variables). Missing values were filled in
using an Amelia function that predicts them from the existing data. To
analyze how well this worked, the same function was also used to predict
values that were already known (Figure 9).
Figure 8. Visualization of missing data.
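The check described here (predict values that are already known, then compare them with the truth) can be sketched in Python. This is not the Amelia procedure: the sketch swaps in a one-predictor least-squares fit, and the -999.0 missing-value code, the toy numbers, and every function name are assumptions made for illustration.

```python
MISSING = -999.0  # assumed sentinel marking a missing value


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept on rows where y is observed."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi != MISSING]
    mean_x = sum(xi for xi, _ in pairs) / len(pairs)
    mean_y = sum(yi for _, yi in pairs) / len(pairs)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs) / sum(
        (xi - mean_x) ** 2 for xi, _ in pairs
    )
    return slope, mean_y - slope * mean_x


def impute(x, y, slope, intercept):
    """Replace missing entries of y with the fitted prediction."""
    return [slope * xi + intercept if yi == MISSING else yi for xi, yi in zip(x, y)]


def mean_abs_error_on_observed(x, y, slope, intercept):
    """Predict the values that are already known and average the absolute misses."""
    errors = [abs(slope * xi + intercept - yi) for xi, yi in zip(x, y) if yi != MISSING]
    return sum(errors) / len(errors)


# Toy data (hypothetical): y follows 2 * x + 1, with two entries missing.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.0, 5.0, MISSING, 9.0, MISSING, 13.0]

slope, intercept = fit_line(x, y)
filled = impute(x, y, slope, intercept)
error = mean_abs_error_on_observed(x, y, slope, intercept)
```

A small error on the known points suggests the fill-in can be trusted for that variable; a large error corresponds to the "poor fit" case.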
• DER pt ratio lep tau
• DER met phi centrality
• DER lep eta centrality
• PRI tau
• PRI tau
• PRI tau
• PRI lep
• PRI lep
• PRI lep phi
• PRI met
• PRI met phi
• Decision Tree (Figure 10)
• CI Decision Tree (Figure 11)
• Random Forest (Figure 12)
• Logistic Regression (Figure 13)
• Naïve Bayes (Figure 14)

• PRI met sumet
• PRI jet num
• PRI jet leading pt
• PRI jet leading eta*
• PRI jet leading phi*
• PRI jet subleading pt
• PRI jet subleading eta*
• PRI jet subleading phi*
• PRI jet all pt
• Weight*˘
• Label˘
Figure 3 depicts a p+-p+ collision,
illustrating the complexity of the
measurements and the difficulty of
interpreting such events.[3] Figure 4 shows
the Higgs signal found in 13 TeV
collisions (the small bump at ~125 GeV),
emphasizing the need to optimize
data analysis.[4]
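Equations 1. Mathematical formulas for determining precision, recall, and the F1 Score.
1a. Precision: $P = \frac{\#\,\text{true positives}}{\#\,\text{predicted positives}}$
1b. Recall: $R = \frac{\#\,\text{true positives}}{\#\,\text{actual positives}}$
1c. F1 Score: $F_1 = 2\,\frac{P R}{P + R}$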
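As a worked illustration of Equations 1, the sketch below computes precision, recall, and the F1 Score from raw counts. The counts are hypothetical; the final call plugs in the logistic regression precision and recall reported in Table 2.

```python
def precision(true_pos, false_pos):
    """Share of predicted positives that are true positives."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Share of actual positives that were found."""
    return true_pos / (true_pos + false_neg)


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r)


# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
p = precision(40, 10)   # 40 / 50
r = recall(40, 20)      # 40 / 60
f1 = f1_score(p, r)     # 8 / 11

# Precision and recall for logistic regression as reported in Table 2.
f1_logistic = f1_score(0.97514, 0.66473)
```

The second call gives a value that agrees with the tabulated 0.79055 to within rounding.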
Figure 12. N decision trees are
created from random subsets of the data. Each
tree begins with a randomly chosen variable, after which
the same “biggest splitter” procedure is
followed. The tree outputs are then averaged to
produce the probability that a collision resulted
in the formation of a Higgs boson.
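The averaging idea in Figure 12 can be sketched as follows. To stay short, each "tree" here is a single split (a stump) on one randomly chosen variable, fitted to a bootstrap sample; the Gini impurity criterion, the toy data, and all names are assumptions for illustration and are not taken from the random forest package used in this work.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def fit_stump(rows, labels, feature):
    """Find the threshold on one feature that minimises the weighted Gini impurity."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            p_left = sum(left) / len(left)
            p_right = sum(right) / len(right) if right else p_left
            best = (score, threshold, p_left, p_right)
    _, threshold, p_left, p_right = best
    return feature, threshold, p_left, p_right


def fit_forest(rows, labels, n_trees, rng):
    """Fit n_trees stumps, each on a bootstrap sample and a random variable."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))           # random starting variable
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest


def predict_signal_probability(forest, row):
    """Average the per-tree signal probabilities."""
    votes = [p_left if row[f] <= t else p_right for f, t, p_left, p_right in forest]
    return sum(votes) / len(votes)


# Toy data (hypothetical): two variables, label 1 = signal, 0 = background.
rows = [(1.0, 0.2), (1.5, 0.1), (2.0, 0.3), (3.0, 0.9), (3.5, 0.8), (4.0, 0.7)]
labels = [0, 0, 0, 1, 1, 1]

forest = fit_forest(rows, labels, n_trees=25, rng=random.Random(0))
p_high = predict_signal_probability(forest, (3.8, 0.85))
p_low = predict_signal_probability(forest, (0.5, 0.05))
```

Averaging many weak, randomized trees is what distinguishes this from the single decision tree: no one bootstrap sample or starting variable dominates the final probability.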
Figure 15. Example of how variable importance might be determined for tree-based models.
Table 1. Model Accuracy
Model                  Accuracy (%)
Logistic Regression    82.27
Random Forest          81.88
Decision Tree          80.40
CI Decision Tree       76.20
Naïve Bayes            74.90

Table 2. F1 Score
Model                  Precision   Recall    F1 Score
Logistic Regression    0.97514     0.66473   0.79055
Random Forest          0.77589     0.66375   0.71545
Decision Tree          0.69231     0.72449   0.70799
CI Decision Tree       0.65037     0.66667   0.65842
Naïve Bayes            0.63961     0.61539   0.62727
Decision
Tree
DER_mass_transverse_met_lep
Figure 10. Decision tree model. The first node
is the variable (and corresponding value) that produces
the biggest split between the signal and background
classes. At each subsequent node, the search for
the biggest splitting variable is repeated.
The output of each branch path is the probability that a
Higgs boson (signal) was produced.
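The "biggest split" search in Figure 10 can be sketched as a small recursive procedure. The source does not state which purity measure the tree package used, so Gini impurity is an assumption here; the variable names come from the dataset, but the values and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (1 = signal)."""
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels):
    """Return (impurity, variable, threshold) of the biggest split, or None."""
    best = None
    for var in rows[0]:
        # Skip the largest value: splitting there would leave the right side empty.
        for threshold in sorted({row[var] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, labels) if row[var] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[var] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, var, threshold)
    return best


def build_tree(rows, labels, depth=0, max_depth=2):
    """Split recursively; leaves store the probability of signal."""
    p_signal = sum(labels) / len(labels)
    split = best_split(rows, labels) if depth < max_depth and 0.0 < p_signal < 1.0 else None
    if split is None:
        return {"p_signal": p_signal}
    _, var, threshold = split
    left = [i for i, row in enumerate(rows) if row[var] <= threshold]
    right = [i for i, row in enumerate(rows) if row[var] > threshold]
    return {
        "variable": var,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Follow the branch path for one collision and return its signal probability."""
    while "variable" in tree:
        tree = tree["left"] if row[tree["variable"]] <= tree["threshold"] else tree["right"]
    return tree["p_signal"]


# Toy collisions (hypothetical values).
rows = [
    {"DER_mass_MMC": 90.0, "PRI_tau_pt": 30.0},
    {"DER_mass_MMC": 95.0, "PRI_tau_pt": 45.0},
    {"DER_mass_MMC": 100.0, "PRI_tau_pt": 25.0},
    {"DER_mass_MMC": 120.0, "PRI_tau_pt": 40.0},
    {"DER_mass_MMC": 125.0, "PRI_tau_pt": 35.0},
    {"DER_mass_MMC": 130.0, "PRI_tau_pt": 50.0},
]
labels = [0, 0, 0, 1, 1, 1]

tree = build_tree(rows, labels)
p_new = predict(tree, {"DER_mass_MMC": 126.0, "PRI_tau_pt": 33.0})
```

In this toy set only DER_mass_MMC separates the classes cleanly, so it becomes the first node.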
Figure 2. Variable correlation. Bolder
colors indicate stronger relationships.
Figure 13. A logistic regression model is useful
for discrete classification problems (e.g. Signal vs.
Background).
The signal:background (S:B) ratio of the training set was approximately 1:2 (see Figure 5). The numerical spread of
all variables is depicted in Figure 6. Variables with more spread have greater potential to influence a model, because
a wider range of values leaves more room to separate the classes. The stacked quartile graphs (as in Figure 7) show
the S:B ratio within each quartile of a variable's values. A greater change in the S:B ratio across the quartiles
indicates a more influential variable.
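The quartile comparison can be sketched as follows: cut one variable at its quartiles and compute the S:B ratio inside each bin. The values and labels are hypothetical, and the function name is an assumption for illustration.

```python
from statistics import quantiles


def sb_ratio_by_quartile(values, labels):
    """Return the signal:background ratio in each quartile bin of one variable."""
    q1, q2, q3 = quantiles(values, n=4)
    bins = [[0, 0] for _ in range(4)]  # [signal count, background count] per bin
    for value, label in zip(values, labels):
        index = (value > q1) + (value > q2) + (value > q3)
        bins[index][0 if label == "s" else 1] += 1
    return [s / b if b else float("inf") for s, b in bins]


# Toy variable (hypothetical): signal becomes more common as the value grows.
values = [float(v) for v in range(1, 17)]
labels = ["b", "b", "b", "b",
          "s", "b", "b", "b",
          "s", "s", "b", "b",
          "s", "s", "s", "b"]

ratios = sb_ratio_by_quartile(values, labels)
spread = max(ratios) - min(ratios)
```

A variable whose ratios barely change from bin to bin would give a `spread` near zero, which is the "less influential" case described in the text.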
Figure 11. A CI (conditional inference) tree works in much the same way as a
regular decision tree. However, the model selects
variables and their split values based on the statistical
significance of their association with the classification
(whether a Higgs boson was produced in a collision). The most
significant variable (the one with the smallest p-value)
begins the tree.
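The "smallest p-value begins the tree" rule can be illustrated with a simplified test. This sketch does not reproduce the tests used by the ‘party’ package: it swaps in a plain chi-square test of independence on a 2x2 table (variable above or below its median versus signal or background), and the data and function names are hypothetical.

```python
import math
from statistics import median


def chi_square_p_value(table):
    """p-value of a chi-square independence test on a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        return 1.0  # no variation, so no evidence of association
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def most_significant_variable(rows, labels):
    """Cut each variable at its median; return (variable, p_value) with the smallest p-value."""
    p_values = {}
    for var in rows[0]:
        cut = median(row[var] for row in rows)
        table = [[0, 0], [0, 0]]  # rows: below/above the cut, columns: background/signal
        for row, label in zip(rows, labels):
            table[int(row[var] > cut)][label] += 1
        p_values[var] = chi_square_p_value(table)
    best = min(p_values, key=p_values.get)
    return best, p_values[best]


# Toy collisions (hypothetical values); label 1 = signal, 0 = background.
masses = [90.0, 95.0, 100.0, 105.0, 120.0, 125.0, 130.0, 135.0]
angles = [-1.0, 2.0, -2.0, 1.0, 1.5, -1.5, 0.5, -0.5]
rows = [{"DER_mass_MMC": m, "PRI_met_phi": a} for m, a in zip(masses, angles)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

root_variable, root_p_value = most_significant_variable(rows, labels)
```

Here the angle variable is unrelated to the label (p-value of 1), so the mass variable is selected as the first node.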
Figure 14. A Naïve Bayes model uses base (prior),
evidence, and likelihood probabilities to
determine the appropriate classification of a
collision.
$P(s \mid V_1, V_2, \ldots, V_n) = \frac{P(s)\,P(V_1 \mid s)\,P(V_2 \mid s) \cdots P(V_n \mid s)}{P(V_1)\,P(V_2) \cdots P(V_n)}$
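A minimal Naïve Bayes sketch on binned (categorical) variables is given here to show how the base, likelihood, and evidence terms combine. One substitution is made and should be read as an assumption: the evidence is obtained by summing the numerator over both classes rather than by multiplying the marginals P(V_i), which keeps the result between 0 and 1. The data and names are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class value frequencies for each variable."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (variable, label) -> Counter of values
    for row, label in zip(rows, labels):
        for var, value in row.items():
            value_counts[(var, label)][value] += 1
    return class_counts, value_counts


def posterior_signal(model, row):
    """P(s | V1..Vn): prior times likelihoods, normalised over both classes."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # base (prior) probability
        for var, value in row.items():
            score *= value_counts[(var, label)][value] / count  # likelihood P(Vi | label)
        scores[label] = score
    evidence = sum(scores.values())
    return scores["s"] / evidence if evidence else 0.0


# Toy collisions (hypothetical), with a signal:background ratio of 1:2.
rows = [
    {"mass": "high", "pt": "high"}, {"mass": "high", "pt": "high"}, {"mass": "high", "pt": "low"},
    {"mass": "low", "pt": "low"}, {"mass": "low", "pt": "high"}, {"mass": "low", "pt": "low"},
    {"mass": "high", "pt": "low"}, {"mass": "low", "pt": "low"}, {"mass": "low", "pt": "low"},
]
labels = ["s", "s", "s", "b", "b", "b", "b", "b", "b"]

model = train_naive_bayes(rows, labels)
p_signal = posterior_signal(model, {"mass": "high", "pt": "high"})  # equals 12/13
```

The "naïve" part is the product of per-variable likelihoods, which treats the variables as independent given the class.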
1. ATLAS; CMS (26 March 2015). "Combined Measurement of the Higgs Boson Mass in pp Collisions at √s = 7 and 8 TeV with the ATLAS and CMS Experiments". arXiv:1503.07589
Variable                  Ranks           Mean    Median
(name not recoverable)    2, 2, 1, x      1.67    2
DER_mass_MMC              1, 1, 19, 4     6.25    2.5
DER_met_phi_centrality    3, 4, 5, 2      3.50    3.5
DER_mass_vis              4, 3, 4, 10     5.25    4
DER_pt_ratio_lep_tau      5, 8, 3, x      5.33    5
PRI_tau_pt                x, 6, 21, 1     9.33    6
DER_deltar_tau_lep        7, 5, 2, 17     7.75    6
DER_pt_h                  8, 9, 16, 5     9.50    8.5
(name not recoverable)    6, 11, 22, 3    10.50   8.5
(name not recoverable)    11, 20, 6, 9    11.50   10
(name not recoverable)    10, 14, 7, x    10.33   10
Future Work
References
Figure 7. A quartile depiction of the S:B ratio suggests that DER mass transverse met lep is an influential variable.
Logistic
Regression
2
PRI_met_sumet
Figure 4. Higgs signal of 13 TeV
collisions.
Random
Forest
2
PRI_jet_num
Figure 6. Distribution of numerical
variables.
2
DER_sum_pt
Figure 3. Particle decay of a p+p+ collision (ATLAS detector).
1
Variable
Figure 5. Signal (Higgs) to Background
Ratio.
The importance of variables in each model was determined using the built-in functions of the
various model packages (see Figure 15 for an example). The package (‘party’) used to construct
the CI decision tree did not have a variable importance function, and thus there is no variable
ranking for that model. Variables with a median rank of 1 to 10 are listed in Table 3.
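The mean and median summary used for Table 3 can be sketched as follows. The first two rank lists are the values reported for DER_deltar_tau_lep and DER_pt_h; which model produced which rank is not assumed. The third variable, the use of `None` for a model without a rank, and the function name are hypothetical.

```python
from statistics import mean, median


def summarize_ranks(ranks):
    """Mean (rounded to 2 places) and median of the ranks, skipping missing ones."""
    observed = [r for r in ranks if r is not None]
    return round(mean(observed), 2), median(observed)


ranks_by_variable = {
    "DER_deltar_tau_lep": (7, 5, 2, 17),
    "DER_pt_h": (8, 9, 16, 5),
    "hypothetical_variable": (12, 15, None, 20),  # made up, to show the cut-off
}

summary = {name: summarize_ranks(ranks) for name, ranks in ranks_by_variable.items()}

# Keep only variables whose median rank falls between 1 and 10.
reported = [name for name, (_, med) in summary.items() if 1 <= med <= 10]
```

The median is less sensitive than the mean to a single model that ranks a variable very low, which is one plausible reason to filter on it.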
Table 3. Variable Importance
Training Set: 250K collisions
Test Set: 500K collisions
*Indicates variables removed from the dataset for model building/testing
˘Indicates variables only contained in the training set
Figure 9. Graphs of existing data points vs. predicted
values (for the same points) indicate how well filling
in data worked on a variable-by-variable basis (good fit
on left, poor fit on right).
Five models were built and tested using 5000 randomly selected
collisions from the training set and 25 variables
(indicated in Data Description/Exploration).
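Drawing a reproducible random subset and dropping the excluded variables might look like the sketch below. The event records are synthetic stand-ins, the seed and function names are assumptions, and only two of the starred variables are listed to keep the example short.

```python
import random


def sample_events(events, n, seed=0):
    """Draw n distinct events at random; a fixed seed makes the draw repeatable."""
    return random.Random(seed).sample(events, n)


def drop_variables(event, removed):
    """Return a copy of one event without the removed variables."""
    return {name: value for name, value in event.items() if name not in removed}


# Synthetic stand-in for the training set (hypothetical values).
events = [
    {"EventId": i, "DER_mass_vis": 80.0 + i % 50, "Weight": 1.0, "PRI_jet_leading_eta": 0.1}
    for i in range(20000)
]
removed = {"Weight", "PRI_jet_leading_eta"}  # two of the starred variables

subset = [drop_variables(event, removed) for event in sample_events(events, 5000)]
```

Sampling without replacement guarantees that no collision appears twice in the modeling subset.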
The numerical variables are composed of energies, angles, momenta, and similar measurements of the particles
formed in proton-proton (p+-p+) collisions. The variables are listed below; their definitions can be found at
https://higgsml.lal.in2p3.fr/files/2014/04/documentation_v1.8.pdf. Figure 2 shows the relationships between
variables.
“PRI” designates a primitive (raw) detection, typically related to the momenta of particles. “DER” indicates a
derived variable, one created by CERN physicists using the current understanding of particle physics to analyze the
data. The Label variable classifies a collision as either signal, “s” (a Higgs-producing collision), or background, “b.”
• EventId
• DER mass MMC
• DER mass transverse met lep
• DER mass vis
• DER pt h
• DER deltaeta jet jet*
• DER mass jet jet
• DER mass jet jet
• DER deltar tau lep
• DER pt tot
• DER sum pt
Model performance was analyzed using two metrics: accuracy and F1 Score. Accuracy is the
percentage of correctly classified collisions; Table 1 lists the accuracy of
each model. The F1 Score of each model was computed from its precision and recall
(see Equations 1 for the formulas). Table 2
contains the results.
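The sketch below computes accuracy from predicted and actual labels and shows one reason a second metric is useful. With a signal:background ratio of 1:2, a model that labels every collision as background still scores about 67% accuracy while finding no signal at all. The label lists and function names are hypothetical.

```python
def confusion_counts(predicted, actual, positive="s"):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    tn = sum(1 for p, a in pairs if p != positive and a != positive)
    return tp, fp, fn, tn


def accuracy(predicted, actual):
    """Percentage of correctly classified collisions."""
    tp, _, _, tn = confusion_counts(predicted, actual)
    return 100.0 * (tp + tn) / len(actual)


# Hypothetical labels with S:B = 1:2.
actual = ["s"] * 4 + ["b"] * 8

all_background = ["b"] * 12
better_model = ["s", "s", "s", "b"] + ["b"] * 7 + ["s"]

acc_trivial = accuracy(all_background, actual)   # about 66.7, yet zero true positives
acc_better = accuracy(better_model, actual)      # about 83.3, with 3 true positives
```

The trivial model has no true positives, so its recall is zero even though its accuracy looks respectable; the F1 Score exposes that difference.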
Models
Data Description/Exploration
Results
Missing Data
Abstract
2. LHC Higgs Cross Section Working Group; Dittmaier; Mariotti; Passarino; Tanaka; Alekhin; Alwall; Bagnaschi; Banfi (2012). "Handbook of LHC Higgs Cross
Sections: 2. Differential Distributions". CERN Report 2 (Tables A.1 – A.20). arXiv:1201.3084
• Focus on azeotropic trends and other interesting chemical phenomena.
• Use machine learning techniques to analyze trends and make predictions
about various chemical processes.
Acknowledgments
• Dr. Vidden for both machine learning and coding assistance.
• Dr. Ragan for insight on the Higgs mechanism and the particle physics
background.