Grid Computing in Data Mining and Data Mining on Grid Computing
David Cieslak ([email protected])
Advisor: Nitesh Chawla ([email protected])
University of Notre Dame

Grid Computing in Data Mining
How you help me

Data Mining Primer
- Data Mining: "The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." (Fayyad, Piatetsky-Shapiro & Smyth, 1996)
- Classifier: a learning algorithm that trains a predictive model from data.
- Ensemble: a set of classifiers working together to improve prediction.

Applications of Data Mining
- Network intrusion detection
- Categorizing adult income
- Finding calcifications in mammography
- Looking for oil spills
- Identifying handwritten digits
- Predicting job failure on a computing grid
- Anticipating successful companies

Condor Makes DM Tractable
- I use a small set of algorithms in high volume. Ex: run the same classifier on many datasets.
- A single data mining operation may have easily parallelized segments. Ex: learn an ensemble of 10 classifiers on a dataset.
- Introducing simple parallelism into data mining saves significant time.

Common DM Task: 10-Fold CV
- Original data: a network traffic dataset, ~30 MB.
- The data is split into 10 training folds (~27 MB each) and 10 testing folds (~3 MB each).
- For fold i: the learning algorithm (RIPPER) trains a classifier on the ~27 MB training fold (~2 hours), then the classifier is evaluated on the ~3 MB testing fold (< 1 min).
- Finally, various statistics and measures are averaged and aggregated across the folds.

Using Condor on 10 Folds
- Local host: splits the data, uploads data and tasks to the pool (~5 mins).
- Condor pool: learns and evaluates a classifier for each fold, returns results (~2 hours).
- Local host: receives, aggregates, and averages the results (~5 mins).
- With roughly 2 hours of learn/eval time per fold, running all 10 folds in parallel on Condor saves up to 18 hours of real time.

A More Complex DM Task: Over/Under Sampling Wrapper
1. Split data into 50 folds (single)
2. Generate 10 undersamplings and 20 oversamplings per fold (pool)
3. Learn a classifier on each undersampling (pool)
4. Evaluate and select the best undersampling (single)
5. Learn a classifier combining the best undersampling with each oversampling (pool)
6. Evaluate the best combination (single)
7. Obtain results on the test folds (pool)
8. Aggregate/average the results (single)

Condor Speed-Ups & Usage
- 10-fold CV evaluation: single machine, roughly one day; using Condor, under one hour.
- Over/under sampling wrapper: single machine, days to weeks; using Condor, under a day.
- In 2006, I used 471,126 CPU hours via Condor. I am "slacking" in 2007: 13,235 CPU hours.

A Data Miner's Wishlist
- The user specifies a task to the system and outlines its serial phases.
- The system "smartly" divides the labor. What is the logical task granule, based on:
  - Condor pool performance
  - upload/download latency
  - data size
  - algorithm complexity

Data Mining on Grid Computing
How I help you

It's Ugly in the Real World
- Machine related failures: power outages, network outages, faulty memory, corrupted file system, bad config files, expired certs, packet filters...
- Job related failures: crash on some args, bad executable, missing input files, mistake in args, missing components, failure to understand dependencies...
- Load related failures: slow actions induce timeouts; kernel tables: files, sockets, procs; router tables: addresses, routes, connections; competition with other users...
- Incompatibilities between jobs and machines: missing libraries, not enough disk/cpu/mem, wrong software installed, wrong version installed, wrong memory layout...
- Non-deterministic failures: multi-thread/CPU synchronization, event interleaving across systems, random number generators, interactive effects, cosmic rays...

A "Grand Challenge" Problem
A user submits one million jobs to the grid. Half of them fail. Now what?
- Examine the output of every failed job?
- Log in to every site to examine the logs?
- Resubmit and hope for the best?
We need some way of getting the big picture, and some way to identify problems not seen before.

An Idea
We have lots of structured information about the components of a grid. Can we perform some form of data mining to discover the big picture of what is going on?
- User: "Your jobs work fine on RH Linux 12.1 and 12.3, but they always seem to crash on version 12.2."
- Admin: "User 'joe' is running 1000s of jobs that transfer 10 TB of data and fail immediately; perhaps he needs help."
Can we act on this information to improve the system?
- User: avoid resources that are not working for you.
- Admin: assist the user in understanding and fixing the problem.

Job ClassAd:
  MyType = "Job"
  TargetType = "Machine"
  ClusterId = 11839
  QDate = 1150231068
  CompletionDate = 0
  Owner = "dcieslak"
  JobUniverse = 5
  Cmd = "ripper-cost-can-9-50.sh"
  LocalUserCpu = 0.000000
  LocalSysCpu = 0.000000
  ExitStatus = 0
  ImageSize = 40000
  DiskUsage = 110000
  NiceUser = FALSE
  NumCkpts = 0
  NumRestarts = 0
  NumSystemHolds = 0
  CommittedTime = 0
  ExitBySignal = FALSE
  PoolName = "ccl00.cse.nd.edu"
  CondorVersion = "6.7.19 May 10 2006"
  ...

Machine ClassAd:
  MyType = "Machine"
  TargetType = "Job"
  Name = "ccl00.cse.nd.edu"
  CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
  MachineGroup = "ccl"
  MachineOwner = "dthain"
  CondorVersion = "6.7.19 May 10 2006"
  CondorPlatform = "I386-LINUX_RH9"
  VirtualMachineID = 1
  ExecutableSize = 20000
  JobUniverse = 1
  VirtualMemory = 962948
  Memory = 498
  Cpus = 1
  Disk = 19072712
  CondorLoadAvg = 1.000000
  LoadAvg = 1.130000
  ...

User Job Log:
  Job 1 submitted.
  Job 2 submitted.
  Job 1 placed on ccl00.cse.nd.edu.
  Job 1 evicted.
  Job 1 placed on smarty.cse.nd.edu.
  Job 1 completed.
  Job 2 placed on dvorak.helios.nd.edu.
  Job 2 suspended.
  Job 2 resumed.
  Job 2 exited normally with status 1.
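To make the join between these three inputs concrete, here is a minimal sketch (not the authors' actual tooling) of parsing ClassAd-style "Attr = value" text and a user job log into labeled records. The attribute names come from the sample ads above; the parsing rules and the exact failure triggers are assumptions for illustration.

```python
import re

def parse_classad(text):
    """Parse simple 'Attr = value' ClassAd lines into a dict (sketch only)."""
    ad = {}
    for line in text.splitlines():
        m = re.match(r'\s*(\w+)\s*=\s*(.+)', line)
        if not m:
            continue
        key, raw = m.group(1), m.group(2).strip()
        if raw.startswith('"') and raw.endswith('"'):
            ad[key] = raw.strip('"')       # string attribute
        else:
            try:
                ad[key] = float(raw)       # numeric attribute
            except ValueError:
                ad[key] = raw              # expressions kept verbatim
    return ad

def label_jobs(log_lines):
    """Label each job id from a user job log: failure if evicted,
    suspended, or exited with nonzero status (assumed criteria)."""
    labels = {}
    for line in log_lines:
        m = re.match(r'Job (\d+) (.*)', line)
        if not m:
            continue
        job, event = m.group(1), m.group(2)
        if 'completed' in event:
            labels.setdefault(job, 'success')
        if ('evicted' in event or 'suspended' in event
                or re.search(r'status [1-9]', event)):
            labels[job] = 'failure'
    return labels

machine = parse_classad('Name = "ccl00.cse.nd.edu"\nMemory = 498\nCpus = 1')
log = ["Job 1 submitted.", "Job 1 completed.",
       "Job 2 submitted.", "Job 2 exited normally with status 1."]
print(machine['Memory'], label_jobs(log))
# 498.0 {'1': 'success', '2': 'failure'}
```

Joining the labels with the machine-ad attributes of the execution site produces exactly the success/failure training set the next slides mine.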
From Logs and Ads to Classes
- Job ClassAds joined with the user job log yield a success class and a failure class; machine ClassAds supply the features for data mining.
- Failure criteria: exit != 0, core dump, evicted, suspended, bad output.
- The output of data mining is advice such as: "Your jobs work fine on RH Linux 12.1 and 12.3, but they always seem to crash on version 12.2."

Sample rule-learner output:

------------------------- run 1 -------------------------
Hypothesis:
exit1 :- Memory>=1930, JobStart>=1.14626e+09, MonitorSelfTime>=1.14626e+09 (491/377).
exit1 :- Memory>=1930, Disk<=555320 (1670/1639).
default exit0 (11904/4503).
Error rate on holdout data is 30.9852%
Running average of error rate is 30.9852%

------------------------- run 2 -------------------------
Hypothesis:
exit1 :- Memory>=1930, Disk<=541186 (2076/1812).
default exit0 (12090/4606).
Error rate on holdout data is 31.8791%
Running average of error rate is 31.4322%

------------------------- run 3 -------------------------
Hypothesis:
exit1 :- Memory>=1930, MonitorSelfImageSize>=8.844e+09 (1270/1050).
exit1 :- Memory>=1930, KeyboardIdle>=815995 (793/763).
exit1 :- Memory>=1927, EnteredCurrentState<=1.14625e+09, VirtualMemory>=2.09646e+06, LoadAvg>=30000, LastBenchmark<=1.14623e+09, MonitorSelfImageSize<=7.836e+09 (94/84).
exit1 :- Memory>=1927, TotalLoadAvg<=1.43e+06, UpdatesTotal<=8069, LastBenchmark<=1.14619e+09, UpdatesLost<=1 (77/61).
default exit0 (11940/4452).
Error rate on holdout data is 31.8111%
Running average of error rate is 31.5585%

Unexpected Discoveries
- Purdue TeraGrid (91,343 jobs on 2,523 CPUs): jobs fail on machines with Memory > 1920 MB.
  Diagnosis: Linux machines with > 3 GB have a different memory layout that breaks some programs that do inappropriate pointer arithmetic.
- UND & UW (4,005 jobs on 1,460 CPUs): jobs fail on machines with less than 4 MB of disk.
  Diagnosis: Condor failed in an unusual way when the job transfers input files that do not fit.

Many Open Problems
- Strengths and weaknesses of the approach:
  - Correlation != causation -> could it be enough?
  - Limits of reported data -> increase resolution?
  - Not enough data points -> direct job placement?
- Acting on information:
  - Steering by the end user.
  - Applying learned rules back to the system.
  - Evaluating (and sometimes abandoning) changes.
  - Creating tools that assist with "digging deeper."
- Data mining research:
  - Continuous intake + incremental construction.
  - Creating results that non-specialists can understand.

Acknowledgements
- Dr. Thain (University of Notre Dame): local Condor expert; use of some slides for this presentation.
- Cooperative Computing Lab: maintains and improves the local Condor pool; provides computing resources.

Condor-Related Publications
- D. Cieslak, D. Thain, N. Chawla, "Troubleshooting Distributed Systems via Data Mining," HPDC-15, June 2006.
- N. Chawla, D. Cieslak, "Evaluating Calibration of Probability Estimation Trees," AAAI Workshop on Evaluation Methods in Machine Learning, July 2006.
- N. Chawla, D. Cieslak, L. Hall, A. Joshi, "Killing Two Birds with One Stone: Countering Cost and Imbalance," Data Mining and Knowledge Discovery, under revision.

Questions?
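As a closing sketch: the kind of single-attribute threshold rule behind the "Memory > 1920 MB" discovery above can be imitated in a few lines. This is a toy stand-in for a real rule learner such as RIPPER, and the data below is synthetic, not from the study.

```python
# Toy stand-in for rule learning: for one numeric machine attribute,
# find the threshold that best separates job failures from successes.
# Real input would come from machine ClassAds joined with job outcomes.

def best_threshold(values, labels):
    """Return (threshold, accuracy) for the rule 'fail if value >= t'."""
    best = (None, 0.0)
    for t in sorted(set(values)):
        correct = sum(
            (v >= t) == (lab == 'failure')
            for v, lab in zip(values, labels)
        )
        acc = correct / len(values)
        if acc > best[1]:
            best = (t, acc)
    return best

# Synthetic sample: machines with Memory >= 1930 tend to fail jobs.
memory = [498, 1024, 1930, 2048, 3960, 512, 1927, 4096]
labels = ['success', 'success', 'failure', 'failure', 'failure',
          'success', 'success', 'failure']

t, acc = best_threshold(memory, labels)
print(f"rule: fail if Memory >= {t}  (training accuracy {acc:.2f})")
# rule: fail if Memory >= 1930  (training accuracy 1.00)
```

A real learner also prunes rules against holdout data, as the error rates in the sample runs above show; this sketch only fits the training set.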