Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Statistical Learning Introduction: Modeling Examples Visualization example: Fraud by customer type 60 Our goal is to build model to predict fraud in advance 50 % 40 Legitimate (n=5000) 30 Fraud (n=200) 20 We can see associations between customer type and fraudulent behavior. 10 Are they legitimate? Data leakage? 0 Type A Type B Type C • Predict whether someone will have a heart attack on the basis of demographic, diet and clinical measurements • Identify the risk factors for prostate cancer (lpsa), based on clinical and demographic variables. ESL Chap1 - Introduction • Classify a recorded phoneme, based on a log-periodogram. A restricted model (red) does much better than an unrestricted one (jumpy black) • Customize an email spam detection system. X = which words appear and how much Y = Spam or not? • Identify the numbers in a handwritten zip code, from a digitized image X = color of each pixel Y = which digit is it? • Classify a tissue sample into one of several cancer classes, based on a gene expression profile. X = expression levels of genes Y = which cancer? • Classify the pixels in a LANDSAT image, according to usage: Y = {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil, very damp gray soil} X = values of pixels in several wavelength bands October 2006 Announcement of the NETFLIX Competition USAToday headline: “Netflix offers $1 million prize for better movie recommendations” Details: • • • • • • • Beat NETFLIX current recommender model ‘Cinematch’ by 10% based on absolute rating error prior to 2011 $50K for the annual progress price (relative to baseline) Data contains a subset of 100 million movie ratings from NETFLIX including 480,189 users and 17,770 movies Performance is evaluated on holdout movies-users pairs NETFLIX competition has attracted 45878 contestants on 37660 teams from 180 different countries Tens of thousands of valid submissions from thousands of teams Conclusion: in 2009, an international team attained the goal and won the prize! More later… Data Overview: NETFLIX Internet Movie Data Base 17K Selection unclear 480 K At least 20 Ratings by end 2005 NETFLIX Competition Data All users (6.8 M) All movies (80K) Fields Title Year 100 M ratings 4 5 1 Actors Awards 3 2 Revenue 4 … NETFLIX data generation process User Arrival 17K movies Movie Arrival Training Data 1998 Time 2005 4 5 ? Qualifier Dataset 3M 3 2 ? Netflix and us • We will have a modeling challenge in our course which will use the Netflix data. The winners will get a grade boost! • The $1M was won in 2009 by a collaboration of several leading teams – The strongest team, which won both yearly $50K prizes, was founded at AT&T, with an Israeli participant (Yehuda Koren) – Yehuda was one of the major driving forces on the final winning team – He is now back in Israel, and may come give us a talk! • While I was at IBM Research, our team won a related competition in KDD-Cup 2007 (same data, more “standard” modeling tasks) – We may have a “case study” lecture on that as well Project evolution and relevance to our course Business problem definition Modeling problem definition Targeting, Sales force mgmt. Wallet / opportunity estimation Model generation & validation Programming, Simulation, IBM Wallets Statistical problem definition Quantile est., Latent variable est. Implementation & application development Modeling methodology design Quantile est., Graphical model Outside scope Keep in mind OnTarget, MAP This is our domain!