Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions
Wei Fan, Kun Zhang, and Xiaojing Yuan
Presenter: Prof. Longbin Cao

What is the business problem, and what are the broad-based areas?
• Problem: ozone pollution day detection
• Ground ozone level is a sophisticated chemical and physical process and "stochastic" in nature
• Ozone levels above some threshold are rather harmful to human health and our daily life
• Two standards: 8-hour peak and 1-hour peak
  - 8-hour average > 80 ppb (parts per billion)
  - 1-hour average > 120 ppb
• It happens from 5 to 15 days per year
• Broad area: environmental pollution detection and protection

Drawbacks of alternative approaches
• Simulation: consumes high computational power; customized for a particular location, so solutions are not portable to different places
• Physical model approach: hard to come up with good equations when there are many parameters, and the equations change from place to place

What are the research challenges that cannot be handled by the state-of-the-art?
• The dataset is sparse, skewed, stochastic, biased, and streaming at the same time:
  - High dimensional
  - Very few positives
  - Stochastic: under similar conditions, sometimes an ozone day happens and sometimes it doesn't
  - P(x) differs between training and testing: training data comes from the past, yet we predict the future
• The physical model is not well understood and cannot be customized easily from location to location

What is the main idea of your approach?
• Non-parametric models are easier to use when the "physical or generative mechanism" is unknown
• Reliable estimation of "conditional probabilities" under skewed, biased, high-dimensional data with possibly irrelevant features
• Estimation of a "decision threshold" to predict on the unknown distribution of the future
• Random Decision Tree (RDT): a super fast implementation
• Formal analysis:
  - Bound analysis
  - MSE reduction
  - Bias and bias reduction
  - P(y|x) order-correctness proof

A CV-based procedure for decision threshold selection
• Run 10-fold cross-validation on the training set: for each fold, train on the remaining folds and record the estimated probability P(y = "ozone day" | x, θ) together with the true label of each held-out day, e.g.:
    7/1/98   0.1316   Normal
    7/2/98   0.6245   Ozone
    7/3/98   0.5944   Ozone
• Pool the estimates from all folds, draw the precision-recall plot, and select the decision threshold VE, since P(x) is different between training and testing and P(y|x) is non-deterministic

Random Decision Tree
• Features are chosen randomly at each node: e.g., B1 ∈ {0,1} is picked at the root (B1 == 0?), B2 ∈ {0,1} at one child (B2 == 0?), and the continuous feature B3 at others with random thresholds (B3 < 0.3?, B3 < 0.6?)

RDT vs. Random Forest
1. Original data vs. bootstrap
2. Random pick vs. random subset + info gain
3. Probability averaging vs. voting
4. RDT: super fast

Optimal Decision Boundary
• From Tony Liu's thesis (supervised by Kai Ming Ting)

What is the main advantage of your approach, and how do you evaluate it?
• Fast and reliable
• Compared with state-of-the-art data mining algorithms:
  - Decision tree, naïve Bayes, logistic regression
  - SVM (linear and RBF kernels)
  - Boosted NB and decision tree, bagging, random forest
  - Physical equation-based model
• Evaluated in an actual streaming environment on a daily basis

What impact has been made, in particular in changing real-world business?
• From 4-year studies on actual data, the proposed data mining approach consistently outperforms the physical model-based method

Can your approach be widely expanded to other areas, and how easy would it be?
• Other known applications using the proposed approach:
  - Fraud detection
  - Manufacturing process control
  - Congestion prediction
  - Marketing
  - Social tagging
• The proposed method is general enough and doesn't need any tuning or re-configuration

Limitation of GUIDE (hidden variables)
• Need to decide grouping variables and independent variables, which is a non-trivial task
• If all variables are categorical, GUIDE becomes a single CART regression tree
• Strong assumptions and greedy-based search can sometimes lead to very unexpected results

Data Mining Challenges
• Application: a more accurate solution for predicting "ozone days" than physical models
• An interesting and difficult data mining problem:
  - High dimensionality, with possibly irrelevant features: 72 continuous features, only 10 verified by scientists to be relevant
  - Skewed class distribution: either 2% or 5% "ozone days", depending on the "ozone day criteria" (1-hr average peak or 8-hr average peak)
  - Streaming: data collected in the "past" is used to train a model that predicts the "future"
  - "Feature sample selection bias": hard to find many days in the training data that are very similar to a day in the future
  - Stochastic true model: given the measurable information, sometimes the target event happens and sometimes it doesn't

Key Solution Highlights
• The physical model is not known, and we do not know well which factors are really contributing
• Non-parametric models are easier to use when the "physical or generative mechanism" is unknown
• Reliable conditional probability estimation under skewed, high-dimensional data with possibly irrelevant features
• Estimation of a decision threshold to predict on the unknown distribution of the future

Seriousness of the Ozone Problem
• Ground ozone level is a sophisticated chemical and physical process and "stochastic" in nature
• Ozone levels above some threshold are rather harmful to human health and our daily life

Drawbacks of current ozone forecasting systems
• Traditional simulation systems:
  - Consume high computational power
  - Customized for a particular location, so solutions are not portable to different places
• Regression-based methods (e.g., regression trees, parametric regression equations, and ANNs):
  - Limited prediction performance
• Physical model: hard to come up with good equations when there are many parameters, and the equations change from place to place

Challenges as a Data Mining Problem
1. Rather skewed and relatively sparse distribution
  - 3500+ examples collected over a 10-year period
  - 72 continuous features with missing values
  - Huge instance space: if the features were binary and uncorrelated, 2^72 is an astronomical number
  - 2% and 5% true positive "ozone days" for the 1-hour and 8-hour peaks respectively
2. A stochastic true model
  - Many factors contribute to ozone pollution; some we know and some we do not know well
  - It is suspected that the true model for ozone days is stochastic in nature: given all relevant features Xr, P(Y = "ozone day" | Xr) < 1
  - Predictive mistakes are inevitable
3. A large number of unverified physical features
  - Only about 10 out of the 72 features are verified to be relevant; there is no information on the relevancy of the other 62 features
  - For a stochastic problem, given irrelevant features Xir, where X = (Xr, Xir), P(Y|X) = P(Y|Xr) only if the data is exhaustive
  - Irrelevant features may introduce overfitting and change the probability distribution represented in the data, driving estimates toward the extremes: P(Y = "ozone day" | Xr, Xir) → 1 and P(Y = "normal day" | Xr, Xir) → 0
4. "Feature sample selection bias"
  - Given 72 continuous features, it is hard to find many days in the training data that are very similar to a day in the future

Given these, two closely-related challenges:
1. How to train an accurate model
2. How to effectively use the model to predict a future with a different and yet unknown distribution (training distribution vs. testing distribution)

Probability distribution estimation
• Parametric methods: use a family of given functional forms to match the data, e.g., logistic regression, linear regression, naïve Bayes, RBF networks, Gaussian mixture models
  - Highly accurate if the data is indeed generated by the model you use; but what if you don't know which one to choose, or use the wrong one?
• Non-parametric methods: use "free-form" functions to match the data, e.g., decision trees, kernel methods, the RIPPER rule learner, CBA association rule models, clustering-based methods

Decision threshold determination
• Optimize some given preference criteria, i.e., a compromise between precision and recall

Addressing the challenges
• Either family works only if the chosen free-form function or preference criteria is appropriate

Reliable probability estimation under irrelevant features
• Recall that due to irrelevant features, estimates are driven toward the extremes: P(Y = "ozone day" | Xr, Xir) → 1 and P(Y = "normal day" | Xr, Xir) → 0
• Solution: construct multiple models and average their predictions
  - P("ozone" | Xr): the true probability
  - P("ozone" | Xr, Xir, θ): the probability estimated by model θ
  - MSE_SingleModel: difference between "true" and "estimated"
  - MSE_Average: difference between "true" and "the average of many models"
• Formally shown: MSE_Average ≤ MSE_SingleModel
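The MSE_Average ≤ MSE_SingleModel claim can be checked with a small simulation. Everything here is a made-up illustration, not the paper's analysis: the synthetic true probabilities and the noise model standing in for imperfect single models are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: true conditional probabilities P("ozone day" | x)
# for 1000 days, and K noisy estimators standing in for the multiple
# models whose predictions are averaged.
p_true = rng.uniform(0.0, 0.3, size=1000)   # skewed: ozone days are rare
K = 50
estimates = np.clip(p_true + rng.normal(0.0, 0.2, size=(K, 1000)), 0.0, 1.0)

# MSE of each single model vs. MSE of the averaged prediction
mse_single = ((estimates - p_true) ** 2).mean(axis=1)   # one value per model
mse_average = ((estimates.mean(axis=0) - p_true) ** 2).mean()

print(f"mean single-model MSE: {mse_single.mean():.4f}")
print(f"averaged-model MSE:    {mse_average:.4f}")
assert mse_average <= mse_single.mean()   # MSE_Average <= MSE_SingleModel
```

The inequality is not an accident of this seed: by convexity of the squared error, the squared error of the average never exceeds the average of the squared errors.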
A CV-based procedure for decision threshold selection
• Run 10-fold cross-validation on the whole training set: for each fold, train on the remaining folds and record the estimated probability P(y = "ozone day" | x, θ) together with the true label of each held-out day, e.g.:
    7/1/98   0.1316   Normal
    7/2/98   0.6245   Ozone
    7/3/98   0.5944   Ozone
• Pool the estimates from all folds, draw the precision-recall plot, and select the decision threshold VE

Addressing the challenges: prediction with feature sample selection bias
• Train θ on the whole training set; for a future day, predict "ozone day" if P(Y = "ozone day" | X, θ) ≥ VE

Probabilistic Tree Models
• Single-tree estimators:
  - C4.5 (Quinlan '93), C4.5Up, C4.5P
  - C4.4 (Provost '03)
• Ensembles:
  - RDT (Fan et al. '03): member trees trained randomly; probabilities averaged
  - Bagging probabilistic trees (Breiman '96): bootstrap samples; member trees are C4.5 or C4.4; probabilities computed per tree

Illustration of RDT
• Features are chosen randomly at each node: e.g., B1 ∈ {0,1} is picked at the root (B1 == 0?), B2 ∈ {0,1} at one child (B2 == 0?), and the continuous feature B3 at others with random thresholds (B3 < 0.3?, B3 < 0.6?)

RDT vs. Random Forest
1. Original data vs. bootstrap
2. Random pick vs. random subset + info gain
3. Probability averaging vs. voting
4. RDT: super fast

Optimal Decision Boundary
• From Tony Liu's thesis (supervised by Kai Ming Ting): decision boundaries on the target distribution for an SVM with RBF kernel (1 day to train), a single decision tree (5 sec to train), RDT (5 sec), and an SVM with linear kernel (overnight)

Limitation of GUIDE (hidden variables)
• Need to decide grouping variables and independent variables, which is a non-trivial task
• If all variables are categorical, GUIDE becomes a single CART regression tree
• Strong assumptions and greedy-based search can sometimes lead to very unexpected results
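A minimal sketch of the random-tree construction in the "Illustration of RDT" slide. The class names, depth limit, and toy data are hypothetical; only the design choices named on the slides (random feature pick, random thresholds for continuous features, probability averaging over trees) are taken from the deck.

```python
import random

class Node:
    def __init__(self):
        self.feature = None    # index of the randomly chosen feature
        self.threshold = None  # random threshold for continuous features
        self.children = {}     # branch outcome (True/False) -> Node
        self.counts = None     # leaf "node statistics": {label: count}

def build_rdt(X, y, feature_types, depth=0, max_depth=5):
    """feature_types[i] is 'binary' or 'continuous' (e.g. B1, B2 binary; B3 continuous)."""
    node = Node()
    labels = set(y)
    if depth >= max_depth or len(labels) <= 1 or not X:
        node.counts = {lab: y.count(lab) for lab in labels} or {0: 0}
        return node
    f = random.randrange(len(feature_types))        # random pick, no info gain
    node.feature = f
    if feature_types[f] == 'continuous':
        node.threshold = random.uniform(0.0, 1.0)   # random threshold, e.g. B3 < 0.3
        split = lambda row: row[f] < node.threshold
    else:
        split = lambda row: row[f] == 0             # e.g. B1 == 0
    for outcome in (True, False):
        part = [(row, lab) for row, lab in zip(X, y) if split(row) == outcome]
        Xp = [r for r, _ in part]
        yp = [l for _, l in part]
        node.children[outcome] = build_rdt(Xp, yp, feature_types, depth + 1, max_depth)
    return node

def predict_proba(node, row):
    """P(y = 1 | x, θ) read off one tree's leaf statistics."""
    while node.counts is None:
        if node.threshold is not None:
            branch = row[node.feature] < node.threshold
        else:
            branch = row[node.feature] == 0
        node = node.children[branch]
    total = sum(node.counts.values()) or 1
    return node.counts.get(1, 0) / total

def rdt_ensemble_proba(trees, row):
    # probability averaging (vs. voting in a random forest)
    return sum(predict_proba(t, row) for t in trees) / len(trees)

# toy data: two binary features and one continuous feature; label 1 = "ozone day"
X = [[0, 1, 0.2], [1, 0, 0.7], [0, 0, 0.9], [1, 1, 0.1]]
y = [1, 0, 0, 1]
trees = [build_rdt(X, y, ['binary', 'binary', 'continuous']) for _ in range(10)]
p = rdt_ensemble_proba(trees, [0, 1, 0.3])
assert 0.0 <= p <= 1.0
```

Note how each tree's structure is random but its leaf counts come from the training data, matching the later "Intuition" slide: the structure is random, the node statistics are not.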
Baseline Forecasting Parametric Model

  O3 = Upwind + EmFactor * (Tmax - Tb) * SRd / (WSa^0.1 * WSp^0.5)    (1)

in which,
• O3: local ozone peak prediction
• Upwind: upwind ozone background level
• EmFactor: precursor-emissions-related factor
• Tmax: maximum temperature in degrees F
• Tb: base temperature where net ozone production begins (50 F)
• SRd: solar radiation total for the day
• WSa: wind speed near sunrise (using the 09-12 UTC forecast mode)
• WSp: wind speed mid-day (using the 15-21 UTC forecast mode)

Model evaluation criteria
• Precision and recall
• At the same recall level, Ma is preferred over Mb if the precision of Ma is consistently higher than that of Mb
• Coverage under the PR curve, analogous to AUC

Some coverage results
• [Bar chart: coverage under the PR curve (8-hour criteria, recall in [0.4, 0.6]) for BC4.4, RDT, C4.4, and the parametric baseline Para]

System Results
• Setup:
  1. Annual test: previous years' data for training, the next year for testing
  2. 8-hour and 1-hour criteria: thresholds selected at recall = 0.6
  3. Repeated 6 times using 7 years of data
• [Bar charts: precision and recall for BC4.4, RDT, C4.4, and Para]
• Findings:
  1. C4.4 is the best among single trees
  2. BC4.4 and RDT are the best among tree ensembles
  3. BC4.4 and RDT are more accurate than the baseline Para, and show "less surprise" than a single tree
• [Plots: SVM and AdaBoost cross-validation under the 1-hr criteria]

Intuition
• The true distribution P(y|X) is never known: is it an elephant?
• Every random tree is not a random guess of P(y|X): its structure is random, but its "node statistics" are not
• Each tree looks at the elephant from a different angle
• Every tree is consistent with the training data, and each tree is quite strong

Expected Error Reduction
• Quadratic loss:
  - For probability estimation: (P(y|X) - P(y|X, θ))^2
  - For regression problems: (y - f(x))^2
• Theorem 1: the "expected quadratic loss" of RDT is less than that of any combined model chosen at random
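The "coverage under the PR curve" metric can be computed by sweeping the decision threshold over the (estimated probability, true label) pairs produced by the CV procedure. The scores and labels below are made-up stand-ins for real CV output, and the recall band is illustrative.

```python
# Sweep the decision threshold over (estimated probability, true label) pairs,
# as produced by cross-validation, and integrate precision over a recall band.
def pr_curve(scores, labels):
    """Return (recall, precision) points, one per candidate threshold."""
    pairs = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    tp = fp = 0
    points = []
    for score, label in pairs:
        tp += label
        fp += 1 - label
        points.append((tp / n_pos, tp / (tp + fp)))
    return points

def coverage(points, lo, hi):
    """Trapezoidal area under the PR curve restricted to recall in [lo, hi]."""
    pts = [(r, p) for r, p in points if lo <= r <= hi]
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area

# made-up CV output: estimated P(y = "ozone day" | x, θ) and true labels
scores = [0.13, 0.62, 0.59, 0.40, 0.05, 0.71, 0.22, 0.35]
labels = [0,    1,    1,    0,    0,    1,    0,    1]
points = pr_curve(scores, labels)
print(coverage(points, 0.2, 0.8))   # -> 0.5
```

Restricting the integral to a recall band is what allows the comparison above at "recall = [0.4, 0.6]" rather than over the whole curve.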
Bias and Variance Reduction

Summary
• When a physical model is hard to build, data mining is one of the top choices
• Procedures to formulate the task as a data mining problem:
  - How to collect data
  - Analysis of the combination of technical challenges: skewed problem, stochastic problem, many features, a lot of irrelevant features, and feature sample selection bias
  - A process to search for the most suitable solutions
• Model averaging of probability estimators can effectively approximate the true probability
• A CV-based guide for decision threshold determination for stochastic problems under sample selection bias
• Random Decision Tree (Fan et al. '03)
• ICDM'06 Best Application Award; ICDM'08 Data Mining Contest Championship

Thank you! Questions?