Event Metadata Records as a Testbed for Scalable Data Mining
David Malon, Peter van Gemmeren (Argonne National Laboratory)
1. Abstract
At a data rate of 200 Hz, event metadata records ("TAGs," in ATLAS parlance) provide fertile ground for the development and evaluation of tools for scalable data mining. It is easy, of course, to apply HEP-specific selection or classification rules to event records and to label such an exercise "data mining," but our interest is different.

A fixed and relatively simple schema makes TAG export to other storage technologies such as HDF5 straightforward. This simplifies the task of exploiting very-large-scale parallel platforms such as Argonne National Laboratory's BlueGene/P, currently the largest supercomputer in the world for open science, in the development of scalable tools for data mining. Using a domain-neutral scientific data format may also enable us to take advantage of existing data mining components from other communities.

Advanced statistical methods and tools such as classification, association rule mining, and cluster analysis are common outside the high energy physics community. These tools can prove useful, not for discovery physics, but for learning about our data, our detector, and our software.

There is, further, a substantial literature on the topic of one-pass algorithms and stream mining techniques, and such tools may be inserted naturally at various points in the event data processing and distribution chain.

This paper describes early experience with event metadata records from ATLAS simulation and commissioning as a testbed for scalable data mining tool development and evaluation.
2. Details
TAG scale is attractive as a data mining testbed
• 2 x 10^9 events per year (real data)
• an equivalent amount of Monte Carlo
  ‒ more than 0.5 x 10^9 TAGs coming from Monte Carlo already in the next months
• also millions of TAGs from cosmic ray commissioning
  ‒ content simpler, smaller than that of physics TAGs

Figure 1: Data flow at ATLAS, from the Event Filter through Raw, ESD, and AOD data to TAG production, TAG extract/copy, and DPD making, across Tiers 0, 1, and 2.
First steps: a neutral data format (HDF5)
• the ATLAS TAG format is relatively simple, readily mapped to other technologies
  ‒ a few TAG attributes depend upon run-dependent encoding, though, so using every attribute for potential mining poses challenges
  ‒ but mining every attribute is not essential for most purposes
• current TAG storage technology in ATLAS is ROOT via POOL Collections, Oracle via POOL, and MySQL via POOL (limited)
  ‒ not well suited to use of off-the-shelf mining tools, or to porting to high-performance hardware
• HDF5 is a unique technology suite that makes possible the management of extremely large and complex data collections
• some data mining tools, both open-source and proprietary, can read HDF5 (e.g., Rattle, MATLAB)
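To illustrate how a fixed, flat TAG schema maps onto a domain-neutral format, here is a minimal sketch using a NumPy structured array; the attribute names and values are hypothetical, not the actual ATLAS TAG attribute list. A compound dtype like this is what an HDF5 writer such as h5py turns directly into an HDF5 compound datatype.

```python
import numpy as np

# Hypothetical flat TAG-like schema; the attribute names are
# illustrative, not the actual ATLAS TAG attribute list.
tag_dtype = np.dtype([
    ("run_number",   np.uint32),
    ("event_number", np.uint64),
    ("n_electrons",  np.uint16),
    ("missing_et",   np.float32),
])

# A few fake records standing in for exported TAGs.
tags = np.array(
    [(1000, 1, 2, 35.5),
     (1000, 2, 0, 12.0),
     (1001, 1, 1, 88.25)],
    dtype=tag_dtype,
)

# Because the schema is fixed and flat, HDF5 export is a one-liner
# with h5py (not imported here), which maps a NumPy compound dtype
# onto an HDF5 compound datatype:
#
#   with h5py.File("tags.h5", "w") as f:
#       f.create_dataset("tags", data=tags)

# Off-the-shelf tools can then operate column-wise on the attributes:
mean_met = float(tags["missing_et"].mean())
```

Because the dataset is a simple table of fixed-width records, partitioning it across nodes or reading single attribute columns requires no HEP-specific software.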
Standard TAG processing provides opportunities for mining

Example: one-pass algorithms
• learn from the data as they go by; no iterative processing
  ‒ distributional characteristics, patterns
• can detect outliers and anomalies, and shifts in machine or detector behavior
  ‒ not for discovery physics, but we have already detected software bugs by anomaly detection in TAG data
• this is how network traffic is mined
• TAG processing provides several loci for nonintrusive insertion of stream mining tools
  ‒ e.g., TAG upload or transfer

TAG content and mining potential

Example: association rules
• physics data contain known associations
  ‒ e.g., between trigger and physics content
• testbed and tool validation
  ‒ what associations are found by statistical tools without domain-specific knowledge?

Example: clustering
• we already cluster data when we do streaming
  ‒ e.g., by trigger
• testbed and tool validation:
  ‒ if we treat, e.g., a subset of event attributes (say, N physics attributes, no trigger information) as points in N-dimensional space, what clusters are found by statistical tools?

Figure 2: K-means clustering example.

TAG mining and BlueGene/P

Figure 3: BlueGene/P.
• With TAGs in HDF5, we do not need to port LCG or ATLAS software to BlueGene in order to begin
• First steps: statistical tools for distributional characteristics, outliers, and anomalies run on each node, processing disjoint subsets of data
  ‒ no inter-process communication until an aggregation phase at the end
  ‒ the aggregation stage can be done hierarchically, e.g., as a binary tree
• Next steps: use large-scale parallelism for mining problems that have a large search space
  ‒ e.g., O(N^2) possible pair-wise associations in N attributes
  ‒ each node may process the entire sample
  ‒ parallel I/O challenges are different from those for partitioned samples
• Later steps: address mining problems that require inter-process communication in parallel implementations
  ‒ iterative process, exchange, process, exchange models
  ‒ e.g., clustering and classification
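The per-node, one-pass scheme above can be sketched as follows: each node keeps running (count, mean, M2) statistics over its disjoint subset in a single pass (Welford's algorithm), and the partial results are then merged pairwise, exactly as in a binary-tree aggregation stage. The attribute values and the two-node setup are illustrative, not ATLAS data.

```python
from dataclasses import dataclass

@dataclass
class RunningStats:
    """One-pass (Welford) running statistics for a single attribute."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        # Single pass: no need to revisit earlier records.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Combine two partial results; this is the aggregation step,
        # applicable hierarchically, e.g., up a binary tree of nodes.
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self) -> float:
        return self.m2 / self.n if self.n else 0.0

# Two "nodes" each scan a disjoint subset of a fake TAG attribute.
node_a, node_b = RunningStats(), RunningStats()
for x in [10.0, 11.0, 9.0, 10.5]:
    node_a.update(x)
for x in [9.5, 10.0, 30.0]:  # 30.0 plays the role of an anomaly
    node_b.update(x)

# End-of-scan aggregation phase: no communication until now.
total = node_a.merge(node_b)
```

The same merge applies at every level of a binary tree, so the aggregation stage described above needs no inter-node communication during the scan itself; outliers can then be flagged against the merged mean and variance.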
We are only beginning.