The Inductive Software Engineering Manifesto
Principles for Industrial Data Mining
Presentation By: Ebeid Soliman & Mason Schoolfield
Paper Authored By:
Menzies & Kocaguneli, Lane Dept of CS/EE, WVU
Bird, Zimmermann, & Schulte, Microsoft Research

Motivation
• This paper is a reflection of the authors' applied data mining work and their discussions with researchers and software engineering practitioners
• Document methods and experience from industrial practitioners
• The principal question is: what characterizes the difference between academic and industrial data mining?
• Motivation: successful data-mining projects in industry

Inductive Software Engineering
• "A branch of software engineering that focuses on the delivery of data mining based software applications to users"
• Understand user goals to inductively generate the models that most matter to the user
• Industrial practitioners are focused on users, whereas academic data mining research is focused on algorithms

Industrial Data Mining: 7 Principles
• Users before algorithms
• Plan for scale
• Early feedback
• Be open-minded
• Do smart learning
• Live with the data you have
• Broad skill set, big toolkit

Users before algorithms
• Guiding principle: users before algorithms
• Mining algorithms are only good if users fund their use in real-world applications

Users before algorithms
Hallmarks of good interaction meetings
• Users bring senior management to the meetings
• Users keep interrupting (you or each other) and debating your results
  • Indicates the users understand your explanation of the results
  • Your results are touching on issues that concern them
• Users begin to offer more data sources for analysis
• Users invite you to their workspace to show how to do part of the analysis

Plan for scale
Knowledge Discovery in Databases
• KDD: Knowledge Discovery in Databases
  • The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data
• Repetition required
[Figure: Steps that compose the KDD process (Fayyad)]

Plan for scale
• Most data mining is data pre-processing
• Gaining access to databases in business groups is time consuming
• To ensure repeatability, automate as many KDD steps as possible
• Data mining methods are repeated multiple times
  • Answer user questions
  • Enhance the data mining method or fix bugs
  • Deploy to different user groups

Plan for scale
• Observed phases
  • Scout: rapid prototyping, apply many methods to data, explore range of hypotheses, gain user interest (get feedback)
  • Survey: experiment to find stable models, focusing on user goals
  • Build: integrate models into a deployment framework suitable for the target user base
  • Team size doubles after scouting and doubles again after surveying: time implications!

Early feedback
• Simplicity first: before conducting very elaborate studies, try applying very simple tools to gain rapid early feedback
• Get feedback early and often
• Discretize continuous attributes (determine what is ignorable)

Be open-minded
• Avoid a fixed hypothesis
• Avoid a fixed approach, particularly for data that has not been mined before
• Initial results are important and can change goals

Smart Learning
• Inductive agents, human or otherwise, make errors
• Don't torture the data to meet preconceptions, but it can be ok to go "fishing"
• Important outcomes are riding on your conclusions: check & validate!
  • Check the variance before concluding; the conclusion may be based on statistical noise
  • Check conclusion stability against different sample sizes
  • Check conclusion support to avoid conclusions based on a small percent of the data

Smart Learning
• Prevent spurious conclusions by carefully controlling data collection and focusing on a small space of hypotheses (IF YOU CAN)
• Rule learners: RIPPER and INDUCT check against randomly generated alternatives (if probabilities are the same you can delete the rule)

Live with the data you have
• Collecting data comes at a cost!
• Remove spurious data: conduct instance or feature selection studies
• Go mining with the data you have, not the data you hope to have at a later date
• 80 to 90% of rows and all but the square root of columns can be deleted before compromising performance of the learned model
• Be respectful but doubtful to all user-suggested domain hypotheses

Broad skill set, big toolkit
• Try multiple inductive technologies
• Inductive Engineers generate novel and insightful feedback for users
• Researchers can work to perfect a single algorithm
• Big ecology: use tools supported by a large ecosystem of developers who are constantly building new modules (e.g. R, WEKA, MATLAB)

What does this mean for Industry?
• Implications for Project Management
  • Scouting takes weeks, Surveying takes months, and Building takes years
• Implications for Training
  • Communications skills
  • Results briefing
  • Scripting

Research to help Industry
• Research themes to benefit industrial data mining
  • Analysis patterns for inductive engineers (like design patterns for developers)
  • Design pattern for data miners
  • Optimizations of learning algorithms
  • Anomaly detectors
  • Business-aware learners

Final Notes
• Conclusion: be user-focused, keep these principles in mind
• Hopefully these generalities will be helpful
• Share your experiences and knowledge so that Industrial Inductive Engineering can mature
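
Appendix: a sketch of the Smart Learning checks
The three checks listed under Smart Learning (variance, stability against different sample sizes, and support) can be illustrated with a small script. This is only a minimal sketch, not a method from the paper. It assumes the conclusion under test has the form "group b scores higher than group a" on some numeric measure. The function names, the bootstrap resampling, the two-standard-deviation noise threshold, and the 10% support threshold are all hypothetical choices made for illustration.

```python
import random
import statistics


def bootstrap_effect(a, b, rounds=200, frac=1.0, seed=1):
    """Resample both groups with replacement and return one
    difference in means (b minus a) per round.
    `frac` shrinks the resample size to mimic a smaller data set."""
    rng = random.Random(seed)
    na = max(2, int(len(a) * frac))
    nb = max(2, int(len(b) * frac))
    diffs = []
    for _ in range(rounds):
        sample_a = [rng.choice(a) for _ in range(na)]
        sample_b = [rng.choice(b) for _ in range(nb)]
        diffs.append(statistics.mean(sample_b) - statistics.mean(sample_a))
    return diffs


def check_conclusion(a, b, total_rows, min_support=0.1,
                     fracs=(0.25, 0.5, 1.0)):
    """Run three sanity checks on the claim 'b scores higher than a'.
    Thresholds are illustrative assumptions, not recommendations."""
    report = {}

    # 1. Variance: is the average effect larger than the spread
    #    seen across resamples (here: more than 2 standard deviations)?
    diffs = bootstrap_effect(a, b)
    mean_diff = statistics.mean(diffs)
    spread = statistics.stdev(diffs)
    report["beyond_noise"] = abs(mean_diff) > 2 * spread

    # 2. Stability: does the direction of the effect stay the same
    #    when the resamples are a quarter, half, or all of the data?
    directions = {
        statistics.mean(bootstrap_effect(a, b, frac=f)) > 0 for f in fracs
    }
    report["stable"] = len(directions) == 1

    # 3. Support: what share of all rows does the conclusion rest on?
    report["support"] = (len(a) + len(b)) / total_rows
    report["enough_support"] = report["support"] >= min_support

    return report
```

Calling check_conclusion(a, b, total_rows) returns a dictionary with one entry per check, so a results briefing can state which checks a conclusion passed instead of reporting the raw effect alone. Resampling is used here because it needs no distributional assumptions; a standard significance test would be a reasonable substitute for the first check.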