Event Metadata Records as a Testbed for Scalable Data Mining
David Malon, Peter van Gemmeren (Argonne National Laboratory)
1. Abstract
At a data rate of 200 Hz, event metadata records ("TAGs," in ATLAS parlance) provide fertile ground for the development and evaluation of tools for scalable data mining. It is easy, of course, to apply HEP-specific selection or classification rules to event records and to label such an exercise "data mining," but our interest is different.

A fixed and relatively simple schema makes TAG export to other storage technologies such as HDF5 straightforward. This simplifies the task of exploiting very-large-scale parallel platforms such as Argonne National Laboratory's BlueGene/P, currently the largest supercomputer in the world for open science, in the development of scalable tools for data mining. Using a domain-neutral scientific data format may also enable us to take advantage of existing data mining components from other communities.

Advanced statistical methods and tools such as classification, association rule mining, and cluster analysis are common outside the high energy physics community. These tools can prove useful, not for discovery physics, but for learning about our data, our detector, and our software.

There is, further, a substantial literature on the topic of one-pass algorithms and stream mining techniques, and such tools may be inserted naturally at various points in the event data processing and distribution chain.

This paper describes early experience with event metadata records from ATLAS simulation and commissioning as a testbed for scalable data mining tool development and evaluation.
2. Details
TAG scale is attractive as a data mining testbed
• 2 x 10^9 events per year (real data)
• an equivalent amount of Monte Carlo
  ‒ more than 0.5 x 10^9 TAGs coming from Monte Carlo already in the next months
• also millions of TAGs from cosmic ray commissioning
  ‒ content simpler, smaller than that of physics TAGs

Figure 1: Data flow at ATLAS, from the Event Filter through Raw, ESD, and AOD data to TAG production, TAG extract/copy, and DPD making, across Tiers 0, 1, and 2.
First steps: a neutral data format (HDF5)
• the ATLAS TAG format is relatively simple, readily mapped to other technologies
  ‒ a few TAG attributes depend upon run-dependent encoding, though, so using every attribute for potential mining poses challenges
  ‒ but mining every attribute is not essential for most purposes
• current TAG storage technology in ATLAS is ROOT via POOL Collections, Oracle via POOL, and MySQL via POOL (limited)
  ‒ not well suited to use of off-the-shelf mining tools, or to porting to high-performance hardware
• HDF5 is a unique technology suite that makes possible the management of extremely large and complex data collections
• some data mining tools, both open-source and proprietary, can read HDF5 (e.g., Rattle, MATLAB)
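To illustrate how a fixed, flat TAG schema maps onto a domain-neutral format, here is a minimal sketch using a NumPy structured array; the attribute names and values are hypothetical, not the actual ATLAS TAG attribute list. A compound dtype like this is what an HDF5 writer such as h5py turns directly into an HDF5 compound datatype.

```python
import numpy as np

# Hypothetical flat TAG-like schema; the attribute names are
# illustrative, not the actual ATLAS TAG attribute list.
tag_dtype = np.dtype([
    ("run_number",   np.uint32),
    ("event_number", np.uint64),
    ("n_electrons",  np.uint16),
    ("missing_et",   np.float32),
])

# A few fake records standing in for exported TAGs.
tags = np.array(
    [(1000, 1, 2, 35.5),
     (1000, 2, 0, 12.0),
     (1001, 1, 1, 88.25)],
    dtype=tag_dtype,
)

# Because the schema is fixed and flat, HDF5 export is a one-liner
# with h5py (not imported here), which maps a NumPy compound dtype
# onto an HDF5 compound datatype:
#
#   with h5py.File("tags.h5", "w") as f:
#       f.create_dataset("tags", data=tags)

# Off-the-shelf tools can then operate column-wise on the attributes:
mean_met = float(tags["missing_et"].mean())
```

Because the dataset is a simple table of fixed-width records, partitioning it across nodes or reading single attribute columns requires no HEP-specific software.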
Standard TAG processing provides opportunities for mining

Example: one-pass algorithms
• learn from the data as they go by; no iterative processing
  ‒ distributional characteristics, patterns
• can detect outliers and anomalies, and shifts in machine or detector behavior
  ‒ not for discovery physics, but we have already detected software bugs by anomaly detection in TAG data
• this is how network traffic is mined
• TAG processing provides several loci for nonintrusive insertion of stream mining tools
  ‒ e.g., TAG upload or transfer

TAG content and mining potential

Example: association rules
• physics data contain known associations
  ‒ e.g., between trigger and physics content
• testbed and tool validation
  ‒ what associations are found by statistical tools without domain-specific knowledge?

Example: clustering
• we already cluster data when we do streaming
  ‒ e.g., by trigger
• testbed and tool validation:
  ‒ if we treat, e.g., a subset of event attributes (say, N physics attributes, no trigger information) as points in N-dimensional space, what clusters are found by statistical tools?

Figure 2: K-means clustering example.

TAG mining and BlueGene/P

Figure 3: BlueGene/P.
• With TAGs in HDF5, we do not need to port LCG or ATLAS software to BlueGene in order to begin
• First steps: statistical tools for distributional characteristics, outliers, and anomalies run on each node, processing disjoint subsets of data
  ‒ no inter-process communication until an aggregation phase at the end
  ‒ the aggregation stage can be done hierarchically, e.g., as a binary tree
• Next steps: use large-scale parallelism for mining problems that have a large search space
  ‒ e.g., O(N^2) possible pair-wise associations in N attributes
  ‒ each node may process the entire sample
  ‒ parallel I/O challenges are different from those for partitioned samples
• Later steps: address mining problems that require inter-process communication in parallel implementations
  ‒ iterative process, exchange, process, exchange models
  ‒ e.g., clustering and classification
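The per-node, one-pass scheme above can be sketched as follows: each node keeps running (count, mean, M2) statistics over its disjoint subset in a single pass (Welford's algorithm), and the partial results are then merged pairwise, exactly as in a binary-tree aggregation stage. The attribute values and the two-node setup are illustrative, not ATLAS data.

```python
from dataclasses import dataclass

@dataclass
class RunningStats:
    """One-pass (Welford) running statistics for a single attribute."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        # Single pass: no need to revisit earlier records.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Combine two partial results; this is the aggregation step,
        # applicable hierarchically, e.g., up a binary tree of nodes.
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self) -> float:
        return self.m2 / self.n if self.n else 0.0

# Two "nodes" each scan a disjoint subset of a fake TAG attribute.
node_a, node_b = RunningStats(), RunningStats()
for x in [10.0, 11.0, 9.0, 10.5]:
    node_a.update(x)
for x in [9.5, 10.0, 30.0]:  # 30.0 plays the role of an anomaly
    node_b.update(x)

# End-of-scan aggregation phase: no communication until now.
total = node_a.merge(node_b)
```

The same merge applies at every level of a binary tree, so the aggregation stage described above needs no inter-node communication during the scan itself; outliers can then be flagged against the merged mean and variance.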
We are only beginning.