Machine Learning Techniques to Identify Higgs Boson Events from CERN Data

Elizabeth R. McMahon, Chad Vidden
Department of Mathematics and Statistics, University of Wisconsin-La Crosse, WI 54601

Abstract

The discovery of the Higgs boson, the theorized particle that gives all other particles mass, was announced by CERN physicists on July 4th, 2012 [1,2]. The existence of the Higgs particle (the quantum excitation of the Higgs field) had been hypothesized since the 1960s. However, its large mass (125.09 GeV) is associated with an extremely short lifetime of 1.56 × 10^-22 s. Thus, the Higgs boson can only be identified by observing the decay particles of high-energy collisions, a difficult process. Scientists at CERN created a simulated data set that mimicked the results from the ATLAS experiments and released it to the public as a Machine Learning Challenge on the popular data science website Kaggle.com. Machine learning methods such as decision trees, random forests, logistic regression, and Naïve Bayes were tested for their ability to accurately distinguish Higgs events from background data in the CERN dataset, with varying results.

Figure 1. The Standard Model of particle physics, with the addition of the Higgs boson (in yellow).

Missing Data

Eleven variables had some degree of missing data (indicated by the yellow color in Figure 8). A missing value means that the variable was either not observed in the proton-proton (p+p+) collision (PRI variables) or was a meaningless calculation for that collision (DER variables). Missing data was filled in using an Amelia function that predicted values based on existing data. To analyze how well this worked, the same function was also used to predict values that were already known (Figure 9).

Figure 8. Visualization of missing data.
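The check described in the Missing Data section, predicting values that are already known to see how well the fill-in works, can be sketched in Python. This is not the Amelia routine used on the poster: it swaps in a much simpler single-predictor least-squares imputation, and the -999.0 sentinel, the function names, and the toy data are all assumptions made for illustration.

```python
import random
from statistics import mean

MISSING = -999.0  # assumption: sentinel value marking a missing measurement


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def impute(xs, ys):
    """Copy ys, replacing MISSING entries with predictions made from xs."""
    observed = [(x, y) for x, y in zip(xs, ys) if y != MISSING]
    a, b = fit_line([x for x, _ in observed], [y for _, y in observed])
    return [a + b * x if y == MISSING else y for x, y in zip(xs, ys)]


def holdout_rmse(xs, ys, frac=0.2, seed=0):
    """Hide a fraction of the known ys, impute them, and score against the truth."""
    rng = random.Random(seed)
    known = [i for i, y in enumerate(ys) if y != MISSING]
    hidden = rng.sample(known, max(1, int(frac * len(known))))
    masked = list(ys)
    for i in hidden:
        masked[i] = MISSING
    predicted = impute(xs, masked)
    squared_errors = [(predicted[i] - ys[i]) ** 2 for i in hidden]
    return (sum(squared_errors) / len(hidden)) ** 0.5


# Toy data (hypothetical): y depends linearly on x, with every 10th y missing.
rng = random.Random(1)
x = [rng.uniform(0.0, 10.0) for _ in range(200)]
y = [2.0 * xi + 1.0 + rng.gauss(0.0, 0.5) for xi in x]
for i in range(0, 200, 10):
    y[i] = MISSING

filled = impute(x, y)
rmse = holdout_rmse(x, y)
print(f"hold-out RMSE of imputed values: {rmse:.3f}")
```

A small hold-out error corresponds to the "good fit" case in Figure 9 and a large one to the "poor fit" case. A variable that is weakly related to its predictor would score poorly under this simple scheme.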
Figure 9. Graphs of existing data points vs. predicted values (for the same points) indicate how well filling in data worked on a variable-by-variable basis (good fit on left, poor fit on right).

Data Description/Exploration

Training Set: 250K collisions
Test Set: 500K collisions

The numerical variables are composed of energies, angles, momenta, and similar measurements of the particles formed in proton-proton (p+p+) collisions. The variables are listed below; their definitions can be found at https://higgsml.lal.in2p3.fr/files/2014/04/documentation_v1.8.pdf. "PRI" designates a primitive (raw) detection, typically related to the momenta of particles. "DER" indicates a derived variable, one created by CERN physicists using the current understanding of particle physics to analyze the data. The Label variable classifies a collision as either signal, "s" (a Higgs-producing collision), or background, "b."

• EventId
• DER mass MMC
• DER mass transverse met lep
• DER mass vis
• DER pt h
• DER deltaeta jet jet*
• DER mass jet jet
• DER mass jet jet
• DER deltar tau lep
• DER pt tot
• DER sum pt
• DER pt ratio lep tau
• DER met phi centrality
• DER lep eta centrality
• PRI tau
• PRI tau
• PRI tau
• PRI lep
• PRI lep
• PRI lep phi
• PRI met
• PRI met phi
• PRI met sumet
• PRI jet num
• PRI jet leading pt
• PRI jet leading eta*
• PRI jet leading phi*
• PRI jet subleading pt
• PRI jet subleading eta*
• PRI jet subleading phi*
• PRI jet all pt
• Weight*˘
• Label˘

*Indicates variables removed from the dataset for model building/testing
˘Indicates variables only contained in the training set

Figure 2 shows the relationships between variables. Figure 3 [3] depicts a proton-proton collision, illustrating the complexity of the measurements and the difficulty of interpreting such events. Figure 4 [4] shows the Higgs signal found in 13 TeV collisions (the small bump at ~125 GeV), emphasizing the necessity of optimizing data analysis.

The signal:background (S:B) ratio of the training set was approximately 1:2 (see Figure 5). The numerical spread of all variables is depicted in Figure 6. Variables with more spread have greater potential to influence the classification, because they vary more from collision to collision. The stacked quartile graphs (as in Figure 7) show the S:B ratio within each quartile of a variable's data. A greater change in the S:B ratio across the quartiles indicates a more influential variable.

Figure 2. Variable correlation. Bolder colors indicate stronger relationships.
Figure 3. Particle decay of a p+p+ collision (ATLAS detector).
Figure 4. Higgs signal of 13 TeV collisions.
Figure 5. Signal (Higgs) to background ratio.
Figure 6. Distribution of numerical variables.
Figure 7. A quartile depiction of the S:B ratio suggests DER mass transverse met lep is an influential variable.

Models

Five models were built and tested using 5000 random collisions from the training set and 25 variables (indicated in Data Description/Exploration).

• Decision Tree (Figure 10)
• CI Decision Tree (Figure 11)
• Random Forest (Figure 12)
• Logistic Regression (Figure 13)
• Naïve Bayes (Figure 14)

Figure 10. Decision tree model. The beginning node is the variable (and corresponding value) that results in the biggest split between signal and background classes. At each subsequent node, the process of determining the biggest splitting variable is repeated. The output of each branch path is the probability that a Higgs boson (signal) was produced.

Figure 11. A CI (conditional inference) tree works in much the same way as a regular decision tree. However, the model selects variables and associated values based on their statistical significance for classifying a collision as Higgs-producing. The most significant variable (the one with the smallest p-value) begins the tree.

Figure 12. N decision trees are created from random subsets of the data. Each tree begins with a random variable, after which the same "biggest splitter" procedure is followed. The outputs are then averaged to produce the probability that a collision resulted in the formation of a Higgs boson.

Figure 13. A logistic regression model is useful for discrete classification problems (i.e., signal vs. background).

Figure 14. A Naïve Bayes model looks at base, evidence, and likelihood probabilities when determining the appropriate classification of a collision.

$P(s \mid V_1, V_2, \ldots, V_n) = \frac{P(s)\,P(V_1 \mid s)\,P(V_2 \mid s) \cdots P(V_n \mid s)}{P(V_1)\,P(V_2) \cdots P(V_n)}$

Results

Model performance was analyzed using two metrics, accuracy and F1 Score. Accuracy is the percentage of correctly classified collisions. See Table 1 for the accuracy of each model. F1 Scores were determined by first calculating the precision and recall and then combining them (see Equations 1 for the mathematical formalism). Table 2 contains the results of this test.

Equations 1. Mathematical formulas for determining precision, recall, and the F1 Score.
1a. Precision: $P = \frac{\#\,\text{true positives}}{\#\,\text{predicted positives}}$
1b. Recall: $R = \frac{\#\,\text{true positives}}{\#\,\text{actual positives}}$
1c. F1 Score: $F_1 = 2\,\frac{PR}{P + R}$

Table 1. Model Accuracy
Model                  Accuracy (%)
Logistic Regression    82.27
Random Forest          81.88
Decision Tree          80.40
CI Decision Tree       76.20
Naïve Bayes            74.90

Table 2. F1 Score
Model                  Precision   Recall    F1 Score
Logistic Regression    0.97514     0.66473   0.79055
Random Forest          0.77589     0.66375   0.71545
Decision Tree          0.69231     0.72449   0.70799
CI Decision Tree       0.65037     0.66667   0.65842
Naïve Bayes            0.63961     0.61539   0.62727

The importance of variables in each model was determined using built-in functions from the various model packages (see Figure 15 for an example). The package ('party') used to construct the CI decision tree did not have a variable importance function, and thus there is no variable ranking for that model. Variables with a median score of 1-10 are found in Table 3.

Figure 15. Example of how variable importance might be determined for tree-based models.

Table 3. Variable Importance. Importance ranks from the decision tree, random forest, logistic regression, and Naïve Bayes models, with the mean and median rank for each variable. Variables listed: DER_mass_transverse_met_lep, DER_mass_MMC, DER_met_phi_centrality, DER_mass_vis, DER_pt_ratio_lep_tau, PRI_tau_pt, DER_deltar_tau_lep, DER_pt_h, PRI_met_sumet, PRI_jet_num, DER_sum_pt.

Future Work

• Focus on azeotropic trends and other interesting chemical phenomena.
• Use machine learning techniques to analyze trends and make predictions about various chemical processes.

References

[1] ATLAS; CMS (26 March 2015). "Combined Measurement of the Higgs Boson Mass in pp Collisions at √s=7 and 8 TeV with the ATLAS and CMS Experiments". arXiv:1503.07589
[2] LHC Higgs Cross Section Working Group; Dittmaier; Mariotti; Passarino; Tanaka; Alekhin; Alwall; Bagnaschi; Banfi (2012). "Handbook of LHC Higgs Cross Sections: 2. Differential Distributions". CERN Report 2 (Tables A.1 – A.20). 1201: 3084. arXiv:1201.3084

Acknowledgments

• Dr. Vidden for both machine learning and coding assistance.
• Dr. Ragan for insight on the Higgs mechanism and the particle physics background.
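As a small illustration of Equations 1, the Python sketch below computes precision, recall, and the F1 Score, then recomputes the F1 column of Table 2 from the reported precision and recall. The function names and the example counts are hypothetical; the table values are copied from the poster, and reading its three numeric columns as precision, recall, and F1 Score is an assumption that the recomputation supports.

```python
def precision(true_pos, false_pos):
    """Equation 1a: true positives divided by all predicted positives."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Equation 1b: true positives divided by all actual positives."""
    return true_pos / (true_pos + false_neg)


def f1_score(p, r):
    """Equation 1c: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical counts: 3 signal events found, 1 false alarm, 3 signal events missed.
p_example = precision(3, 1)
r_example = recall(3, 3)
print(f"example: P = {p_example:.2f}, R = {r_example:.2f}, "
      f"F1 = {f1_score(p_example, r_example):.2f}")

# (precision, recall, reported F1) for each model, as listed in Table 2.
TABLE_2 = {
    "Logistic Regression": (0.97514, 0.66473, 0.79055),
    "Random Forest": (0.77589, 0.66375, 0.71545),
    "Decision Tree": (0.69231, 0.72449, 0.70799),
    "CI Decision Tree": (0.65037, 0.66667, 0.65842),
    "Naïve Bayes": (0.63961, 0.61539, 0.62727),
}

for model, (p, r, reported) in TABLE_2.items():
    print(f"{model}: recomputed F1 = {f1_score(p, r):.5f}, reported = {reported:.5f}")
```

Because the F1 Score is a harmonic mean, it is pulled toward the smaller of precision and recall. This is why logistic regression, despite a precision near 0.98, ends up with an F1 Score below 0.80.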