
Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions
Kun Zhang, Wei Fan, Xiaojing Yuan, Ian Davidson, and Xiangshang Li

[Figure: precision-recall curves for models Ma and Mb, with decision threshold VE marked; axes: Recall, Precision]

What this Paper Offers
• Application: a more accurate solution (higher recall and precision) for predicting "ozone days".
• An interesting and difficult data mining problem:
  • High dimensionality, with some possibly irrelevant features: 72 continuous features, of which 10 are verified by scientists to be relevant.
  • Skewed class distribution: either 2% or 5% "ozone days", depending on the "ozone day" criterion (1-hr peak or 8-hr peak).
  • Streaming: data collected in the "past" is used to train a model that predicts the "future".
  • "Feature sample selection bias": it is hard to find many days in the training data that are very similar to a day in the future.
  • Stochastic true model: given the measurable information, sometimes the target event happens and sometimes it doesn't.

Key Solution Highlights
• Non-parametric models are easier to use when the "physical or generative mechanism" is unknown.
• Reliable estimation of conditional probabilities under "skewed, high-dimensional, possibly irrelevant features", …
• Estimating a decision threshold for predicting under the unknown distribution of the future.

Seriousness of Ozone Problem
• The ground ozone level results from a sophisticated chemical and physical process and is "stochastic" in nature.
• An ozone level above some threshold is rather harmful to human health and our daily life.

Drawbacks of current ozone forecasting systems
• Traditional simulation systems
  • Consume high computational power
  • Customized for a particular location, so solutions are not portable to different places
• Regression-based methods, e.g.:
  • Regression trees, parametric regression equations, and ANN
  • Limited prediction performance

Ozone Level Prediction: Problems we are facing
• Daily summary maps of two datasets from the Texas Commission on Environmental Quality (TCEQ)

Challenges as a Data Mining Problem
1. Rather skewed and relatively sparse distribution
  • 2500+ examples over 7 years (1998-2004)
  • 72 continuous features with missing values
  • Huge instance space: if binary and uncorrelated, 2^72 is an astronomical number
  • 2% and 5% true positive ozone days for the 1-hour and 8-hour peak, respectively
2. The true model for ozone days is stochastic in nature.
  • Given all relevant features X_r, P(Y = "ozone day" | X_r) < 1
  • Predictive mistakes are inevitable
3. A large number of irrelevant features
  • Only about 10 out of 72 features are verified to be relevant
  • No information on the relevancy of the other 62 features
  • For a stochastic problem, given irrelevant features X_ir, where X = (X_r, X_ir), P(Y | X) = P(Y | X_r) only if the data is exhaustive.
  • May introduce an overfitting problem and change the probability distribution represented in the data:
    P(Y = "ozone day" | X_r, X_ir) → 1
    P(Y = "normal day" | X_r, X_ir) → 0
4. "Feature sample selection bias"
  • Given 7 years of data and 72 continuous features, it is hard to find many days in the training data that are very similar to a day in the future

Given these, two closely related challenges:
1. How to train an accurate model
2. How to effectively use a model to predict the future with a different and yet unknown distribution

[Figure: "+" and "-" examples under the Training Distribution and the Testing Distribution]

Addressing Challenges
Skewed and stochastic distribution
• Probability distribution estimation
  • Parametric methods
    • Highly accurate if the data is indeed generated from the model you use!
    • But what if you don't know which to choose, or use the wrong one?
    • List of methods: Logistic Regression, Naïve Bayes, Kernel Methods, Linear Regression, RBF, Gaussian mixture models
  • Non-parametric methods
    • Use a family of "free-form" functions to "match the data" given some "preference criteria".
    • The free-form function/criterion is appropriate.
    • The preference criterion is appropriate.
    • List of methods: Decision Trees, RIPPER rule learner, CBA: association rule, clustering-based methods, ……
• Decision threshold determination through optimization of some given criteria
  • Compromise between precision and recall

[Figure: precision-recall curves for Ma and Mb with decision threshold VE marked; axes: Recall, Precision]

Reliable probability estimation under irrelevant features
• Recall that, due to irrelevant features:
    P(Y = "ozone day" | X_r, X_ir) → 1
    P(Y = "normal day" | X_r, X_ir) → 0
• Construct multiple models
• Average their predictions
  • P("ozone" | x_r): true probability
  • P("ozone" | X_r, X_ir, θ): probability estimated by model θ
  • MSE_SingleModel: difference between "true" and "estimated".
  • MSE_Average: difference between "true" and "average of many models".
  • Formally show that MSE_Average ≤ MSE_SingleModel

A CV based procedure for decision threshold selection
• Training Set → Algorithm → estimated probability values for fold 1, fold 2, ….., fold 10
• The estimated probability values are collected in a "Probability-TrueLabel" file:

  Date      P(y="ozoneday"|x,θ)   Label
  7/1/98    0.1316                Normal
  7/2/98    0.6245                Ozone
  7/3/98    0.5944                Ozone
  ………

• "Probability-TrueLabel" file → PrecRec plot → decision threshold VE

[Figure: precision-recall curves for Ma and Mb with decision threshold VE marked; axes: Recall, Precision]

Addressing Data Mining Challenges
Prediction with feature sample selection bias
• Future prediction based on the decision threshold selected
• Whole Training Set → θ → classification on future days
• If P(Y = "ozonedays" | X, θ) ≥ VE, predict "ozonedays"

Probabilistic Tree Models
• Single tree estimators
  • C4.5 (Quinlan'93)
  • C4.5Up, C4.5P
  • C4.4 (Provost'03)
• Ensembles
  • RDT (Fan et al'03)
    • Member tree trained randomly
    • Average probability
  • Bagging Probabilistic Tree (Breiman'96)
    • Bootstrap
    • Compute probability
    • Member tree: C4.5, C4.4

RDT: Random Decision Tree (Fan et al'03)
• "Encoding data" in trees.
• At each node, an un-used feature is chosen randomly
  • A discrete feature is un-used if it has never been chosen previously on a given decision path starting from the root to the current node.
  • A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen
• Stop when one of the following happens:
  • A node becomes too small (<= 3 examples).
  • Or the total height of the tree exceeds some limits.
• Different from Random Forest:
  1. Original Data vs. Bootstrap
  2. Random pick vs. Random Subset + info gain
  3. Probability Averaging vs. Voting

Optimal Decision Boundary
[Figure from Tony Liu's thesis (supervised by Kai Ming Ting)]

Baseline Forecasting Parametric Model

  O3 = Upwind + EmFactor × (Tmax − Tb) × SRd / (WSa × 0.1 + WSp × 0.5 + 1)

in which,
• O3 - Local ozone peak prediction
• Upwind - Upwind ozone background level
• EmFactor - Precursor emissions related factor
• Tmax - Maximum temperature in degrees F
• Tb - Base temperature where net ozone production begins (50 F)
• SRd - Solar radiation total for the day
• WSa - Wind speed near sunrise (using 09-12 UTC forecast mode)
• WSp - Wind speed mid-day (using 15-21 UTC forecast mode)

Model evaluation criteria
• Precision and Recall
  • At the same recall level, Ma is preferred over Mb if the precision of Ma is consistently higher than that of Mb
• Coverage under PR curve, like AUC

[Figure: precision-recall curves for Ma and Mb; axes: Recall, Precision]

Some Coverage Results
• 8-hour: recall = [0.4, 0.6]

[Figure: bar chart of coverage under the PR curve for BC4.4, RDT, Para, and C4.4]

Some "Action" Results
• Annual test
  1. Previous years' data for training
  2. Next year for testing
  3. Repeated 6 times using 7 years of data
• 8-hour: thresholds selected at the recall = 0.6
• 1-hour: thresholds selected at the recall = 0.6

[Figure: bar charts of recall and precision for BC4.4, RDT, C4.4, and Para]

• C4.4 best among single trees
• BC4.4 and RDT best among tree ensembles
• BC4.4 and RDT more accurate than baseline Para
• BC4.4 and RDT "less surprise" than single tree

Summary
• Procedures to formulate as a data mining problem
• Analysis of combination of technical challenges
• Process to search for the most suitable solutions
• Model averaging of probability estimators can effectively approximate the true probability
  • A lot of irrelevant features
  • Feature sample selection bias
• A CV based guide for decision threshold determination for stochastic problems under sample selection bias

Choosing the Appropriate PET
Come to our other talk: 10:30, RM 402

[Flowchart: choosing a PET for a given dataset. Node labels: signal-noise separability estimation through RDT or BPET (AUC score < 0.9: low signal-noise separability; AUC score >= 0.9: high signal-noise separability); ensemble or single trees; evaluation criteria (AUC, MSE, ErrorRate); feature types and value characteristics (continuous features or categorical features with a large number of values; categorical features with limited values); candidate models: CFT, C4.5 or C4.4, RDT, BPET]

Thank you! Questions?
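The CV based threshold procedure from the slides (collect the estimated probabilities with their true labels, trace the precision-recall curve, read off a decision threshold VE) can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The names `select_threshold` and `predict_ozone_day` are hypothetical, and the selection rule used here (take the highest threshold whose recall reaches a target) is an assumption; the default target of 0.6 only mirrors the recall level quoted in the "Action" results. In the demo, the first three rows are the ones shown on the slide and the remaining rows are made up.

```python
from typing import List, Tuple


def select_threshold(
    scored: List[Tuple[float, int]], target_recall: float = 0.6
) -> Tuple[float, float, float]:
    """Return (threshold, precision, recall) for the highest threshold
    whose recall on the pooled CV estimates is >= target_recall.

    scored: (estimated probability, true label) pairs pooled over all
    CV folds, with label 1 for "ozone day" and 0 for "normal day".
    """
    if not 0.0 < target_recall <= 1.0:
        raise ValueError("target_recall must be in (0, 1]")
    total_pos = sum(label for _, label in scored)
    if total_pos == 0:
        raise ValueError("need at least one positive example")

    # Sweep thresholds from the highest estimated probability downwards.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Everything tied at this probability is predicted positive together.
        while i < len(ranked) and ranked[i][0] == threshold:
            true_pos += ranked[i][1]
            i += 1
        recall = true_pos / total_pos
        if recall >= target_recall:
            precision = true_pos / i  # i examples are predicted positive
            return threshold, precision, recall
    raise AssertionError("unreachable: recall is 1.0 at the lowest threshold")


def predict_ozone_day(probability: float, ve: float) -> bool:
    """Slide rule: predict "ozone day" if P(Y = ozone day | X, theta) >= VE."""
    return probability >= ve


if __name__ == "__main__":
    pooled = [
        (0.1316, 0), (0.6245, 1), (0.5944, 1),  # rows shown on the slide
        (0.55, 0), (0.45, 1), (0.30, 0),        # hypothetical extra rows
    ]
    ve, precision, recall = select_threshold(pooled)
    print(f"VE={ve} precision={precision:.2f} recall={recall:.2f}")
    print(predict_ozone_day(0.61, ve))
```

The threshold is chosen once on the training data and then applied unchanged to future days, which is the point of the procedure: it needs no labels from the unknown future distribution.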
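The RDT construction described in the slides (pick a feature at random at each node, pick a random threshold for continuous features, stop when a node has <= 3 examples or the tree gets too deep, then average the leaf probabilities over member trees) can be sketched as below. This is a simplified illustration for continuous features only and is not the reference implementation of Fan et al. The names `build_tree`, `tree_probability`, and `rdt_probability`, the depth limit of 8, and the way thresholds are drawn from values present in the node are all assumptions.

```python
import random
from typing import List, Optional, Sequence


class Node:
    """Tree node; a leaf has feature None and keeps class counts."""

    def __init__(self, pos: int, total: int) -> None:
        self.pos = pos          # number of "ozone day" examples in the node
        self.total = total      # number of examples in the node
        self.feature: Optional[int] = None
        self.threshold = 0.0
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None


def build_tree(X: Sequence[Sequence[float]], y: Sequence[int],
               rng: random.Random, depth: int = 0,
               max_depth: int = 8, min_size: int = 3) -> Node:
    if not X:
        raise ValueError("cannot build a tree from an empty node")
    node = Node(sum(y), len(y))
    # Stopping rules from the slides: node too small, or tree too tall.
    if len(y) <= min_size or depth >= max_depth:
        return node
    feature = rng.randrange(len(X[0]))  # random pick, no information gain
    values = sorted({row[feature] for row in X})
    if len(values) < 2:
        return node  # constant feature in this node: keep it as a leaf
    # Random threshold among observed values; dropping the largest value
    # guarantees that both children are non-empty.
    threshold = rng.choice(values[:-1])
    left = [i for i, row in enumerate(X) if row[feature] <= threshold]
    right = [i for i, row in enumerate(X) if row[feature] > threshold]
    node.feature, node.threshold = feature, threshold
    node.left = build_tree([X[i] for i in left], [y[i] for i in left],
                           rng, depth + 1, max_depth, min_size)
    node.right = build_tree([X[i] for i in right], [y[i] for i in right],
                            rng, depth + 1, max_depth, min_size)
    return node


def tree_probability(node: Node, x: Sequence[float]) -> float:
    """Fraction of positive training examples in the leaf that x reaches."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.pos / node.total


def rdt_probability(trees: List[Node], x: Sequence[float]) -> float:
    """Average the member trees' probabilities (averaging, not voting)."""
    return sum(tree_probability(t, x) for t in trees) / len(trees)


if __name__ == "__main__":
    # Toy data: feature 0 decides the label, feature 1 is irrelevant.
    X = [[i / 20, (i * 7 % 20) / 20] for i in range(20)]
    y = [1 if i >= 10 else 0 for i in range(20)]
    trees = [build_tree(X, y, random.Random(seed)) for seed in range(30)]
    print(rdt_probability(trees, [0.9, 0.5]), rdt_probability(trees, [0.1, 0.5]))
```

Because each member tree is built on the original data with purely random splits, the only learning step is counting labels in the leaves; the averaging across trees is what smooths out the effect of splits on irrelevant features.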
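The claim MSE_Average ≤ MSE_SingleModel can be seen from a standard bias-variance style decomposition. The sketch below is one common way to argue it and is not necessarily the proof used in the paper. It assumes a fixed example x, treats the model θ as a random draw, and idealizes the ensemble as the expectation over θ; the symbols p, \hat{p}_\theta, and \bar{p} are notation introduced here.

```latex
% Notation (assumed): p = P("ozone" | x_r) is the true probability,
% \hat{p}_\theta = P("ozone" | x_r, x_{ir}, \theta) is one model's estimate,
% \bar{p} = E_\theta[\hat{p}_\theta] is the averaged estimate.
\begin{align*}
\mathrm{MSE}_{\mathrm{SingleModel}}
  &= E_\theta\!\left[(p - \hat{p}_\theta)^2\right] \\
  &= E_\theta\!\left[\big((p - \bar{p}) + (\bar{p} - \hat{p}_\theta)\big)^2\right] \\
  &= (p - \bar{p})^2
     + 2\,(p - \bar{p})\,E_\theta\!\left[\bar{p} - \hat{p}_\theta\right]
     + E_\theta\!\left[(\bar{p} - \hat{p}_\theta)^2\right] \\
  &= (p - \bar{p})^2 + \mathrm{Var}_\theta(\hat{p}_\theta)
  \;\ge\; (p - \bar{p})^2
  = \mathrm{MSE}_{\mathrm{Average}} .
\end{align*}
```

The cross term vanishes because E_θ[\bar{p} - \hat{p}_\theta] = 0, so the gap between the two errors is exactly the variance of the individual estimates across models.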
