Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions
Wei Fan, Kun Zhang, and Xiaojing Yuan
Presenter: Prof. Longbin Cao
[Figure: precision-recall plot (x-axis: Recall, y-axis: Precision) with points Ma, Mb, and VE]

What is the business problem and the broad-based area?
* Problem: ozone pollution day detection
* Ground-level ozone formation is a sophisticated chemical and physical process that is “stochastic” in nature.
* Ozone levels above a certain threshold are harmful to human health and daily life.
* 8-hour peak and 1-hour peak standards:
  * 8-hour average > 80 ppb (parts per billion)
  * 1-hour average > 120 ppb
* Ozone days occur on 5 to 15 days per year.
* Broad area: Environmental Pollution Detection and Protection

Drawbacks of alternative approaches
* Simulation: consumes high computational power; customized for a particular location, so solutions are not portable to other places
* Physical model approach: hard to come up with good equations when there are many parameters, and the equations change from place to place

What are the research challenges that cannot be handled by the state of the art?
* The dataset is sparse, skewed, stochastic, biased, and streaming at the same time:
  * High dimensional
  * Very few positives
  * Under similar conditions, sometimes an ozone day happens and sometimes it doesn’t
  * P(x) differs between training and testing
  * Training data come from the past; predictions are for the future
* The physical model is not well understood and cannot be customized easily from location to location

What is the main idea of your approach?
* Non-parametric models are easier to use when the “physical or generative mechanism” is unknown.
* Reliable “conditional probabilities” estimation under “skewed, biased, high-dimensional, possibly irrelevant features”
* Estimate the “decision threshold” to predict on the unknown distribution of the future
* Random Decision Tree: super fast implementation
* Formal analysis:
  * Bound analysis
  * MSE reduction
  * Bias and bias reduction
  * P(y|x) order correctness proof

A CV-based procedure for decision threshold selection
[Figure: Training set -> Algorithm -> estimated probability values from fold 1, fold 2, ..., fold 10 -> “Probability-TrueLabel” file -> PrecRec plot (Precision vs. Recall, with points Ma, Mb, and VE) -> decision threshold VE]
Example of the “Probability-TrueLabel” file:
Date     P(y=“ozoneday”|x,θ)   Label
7/1/98   0.1316                Normal
7/2/98   0.6245                Ozone
7/3/98   0.5944                Ozone
………

Decision threshold when P(x) is different and P(y|x) is non-deterministic
[Figure: training distribution vs. testing distribution of positive (+) and negative (-) examples]

Random Decision Tree
[Figure: example random decision tree over features B1: {0,1}, B2: {0,1}, B3: continuous]
* B1 chosen randomly: test “B1 == 0?”
* B2 chosen randomly: test “B2 == 0?”
* B3 chosen randomly, random threshold 0.3: test “B3 < 0.3?”
* B3 chosen randomly, random threshold 0.6: test “B3 < 0.6?”

RDT vs Random Forest
1. Original Data vs. Bootstrap
2. Random pick vs. Random Subset + info gain
3. Probability Averaging vs. Voting
4. RDT: superfast

Optimal Decision Boundary
[Figure from Tony Liu’s thesis (supervised by Kai Ming Ting)]

What is the main advantage of your approach, and how do you evaluate it?
* Fast and reliable
* Compared with state-of-the-art data mining algorithms:
  * Decision tree
  * NB
  * Logistic Regression
  * SVM (linear and RBF kernel)
  * Boosted NB and Decision Tree
  * Bagging
  * Random Forest
* Physical equation-based model
* Actual streaming environment on a daily basis

What impact has been made, in particular in changing the real-world business?
* From 4-year studies on actual data, the proposed data mining approach consistently outperforms the physical model-based method.

Can your approach be widely expanded to other areas, and how easy would it be?
* Other known applications using the proposed approach:
  * Fraud Detection
  * Manufacturing Process Control
  * Congestion Prediction
  * Marketing
  * Social Tagging
* The proposed method is general enough and doesn’t need any tuning or re-configuration.
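
Appendix sketch: the two technical ideas in the slides, a Random Decision Tree ensemble that averages probabilities and a cross-validation procedure that picks a decision threshold from a precision-recall curve, can be illustrated with a small, standard-library-only Python example. This is a minimal sketch under stated assumptions, not the authors' implementation. All function names (`grow`, `rdt_fit`, `rdt_prob`, `cv_probs`, `pr_curve`, `pick_threshold`) are hypothetical, continuous thresholds are drawn uniformly from the range seen at the node, degenerate splits simply become leaves, and the "maximize F1" rule for reading a threshold off the curve is an assumption (the slides only say the threshold is taken from the PrecRec plot). The toy data in the demo are made up.

```python
import random

def grow(X, y, kinds, depth, rng, used=frozenset()):
    """Grow one random tree; kinds[j] is "bin" ({0,1}) or "cont" (continuous)."""
    # A binary feature is used at most once per path; a continuous one may recur.
    candidates = [j for j, k in enumerate(kinds) if k == "cont" or j not in used]
    if depth == 0 or len(y) < 2 or not candidates:
        return ("leaf", sum(y), len(y))            # class counts give P(y=1|x)
    j = rng.choice(candidates)                     # random pick, no info gain
    if kinds[j] == "bin":
        t, used = 0.5, used | {j}                  # x[j] < 0.5 means "x[j] == 0"
    else:
        values = [row[j] for row in X]
        t = rng.uniform(min(values), max(values))  # random threshold (assumed range)
    left = [i for i, row in enumerate(X) if row[j] < t]
    right = [i for i, row in enumerate(X) if row[j] >= t]
    if not left or not right:                      # degenerate split: stop here
        return ("leaf", sum(y), len(y))
    def sub(idx):
        return grow([X[i] for i in idx], [y[i] for i in idx],
                    kinds, depth - 1, rng, used)
    return ("node", j, t, sub(left), sub(right))

def tree_prob(node, x):
    """Follow x down one tree and return the positive fraction at its leaf."""
    while node[0] == "node":
        _, j, t, left, right = node
        node = left if x[j] < t else right
    _, pos, n = node
    return pos / n

def rdt_fit(X, y, kinds, n_trees=30, depth=4, seed=0):
    """Every tree sees the original data (no bootstrap sampling)."""
    rng = random.Random(seed)
    return [grow(X, y, kinds, depth, rng) for _ in range(n_trees)]

def rdt_prob(trees, x):
    """Probability averaging across trees (not voting)."""
    return sum(tree_prob(t, x) for t in trees) / len(trees)

def cv_probs(X, y, kinds, k=10, **fit_args):
    """Estimate P(y=1|x) for each row from a model trained on the other folds."""
    probs = [0.0] * len(y)
    for fold in range(k):
        train = [i for i in range(len(y)) if i % k != fold]
        trees = rdt_fit([X[i] for i in train], [y[i] for i in train],
                        kinds, **fit_args)
        for i in range(fold, len(y), k):
            probs[i] = rdt_prob(trees, X[i])
    return probs

def pr_curve(probs, labels):
    """Return (threshold, precision, recall) for each distinct probability."""
    total_pos = sum(labels)
    points = []
    for t in sorted(set(probs), reverse=True):
        flagged = [l for p, l in zip(probs, labels) if p >= t]
        tp = sum(flagged)
        precision = tp / len(flagged)
        recall = tp / total_pos if total_pos else 0.0
        points.append((t, precision, recall))
    return points

def pick_threshold(points):
    """Assumed selection rule: the threshold with the best F1 on the curve."""
    def f1(point):
        _, p, r = point
        return 2 * p * r / (p + r) if p + r else 0.0
    return max(points, key=f1)[0]

if __name__ == "__main__":
    # Toy data (made up): feature 0 is binary, feature 1 is continuous.
    rng = random.Random(1)
    X = [[rng.randint(0, 1), rng.random()] for _ in range(200)]
    y = [1 if row[1] > 0.8 and rng.random() < 0.7 else 0 for row in X]
    probs = cv_probs(X, y, ["bin", "cont"], k=10)
    print("chosen threshold:", pick_threshold(pr_curve(probs, y)))
```

The pooled `(probability, true label)` pairs returned by `cv_probs` play the role of the “Probability-TrueLabel” file in the slides; a future day would then be flagged as an ozone day when `rdt_prob` on a model trained on all past data is at or above the chosen threshold.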