The KDD Process for Extracting Useful Knowledge from Volumes of Data
Fayyad, Piatetsky-Shapiro, and Smyth
Ian Kim
SWHIG Seminar

Overview
• What can we gain from data?
  • Business and marketing applications
  • Public policy decision-making
  • Scientific research
• Why do we need the KDD process?
  • Increasing use of data analytics
  • Size of databases involved
  • Being able to access raw data isn’t enough

The KDD Process

Part 1: Selection
• Formulating the target dataset
  • What kinds of records to consider?
  • Desired fields?
• Incorporates domain knowledge
  • Background knowledge in relevant field
  • Goals of the dataset

Part 2: Pre-processing
• Preparing raw data for transformation
  • Removal of noise, outliers
  • Strategy for handling missing records
  • Missing/unknown value mappings

Part 3: Transformation
• Data reduction
  • Grouping to reduce number of variables considered
  • Aggregation to higher row unit
• Useful representations of data
  • Summary statistics

Part 4: Data Mining
• Selection of data model
  • Summarization, classification, clustering, regression analysis
• Searching for patterns in data

Part 5: Interpretation
• Interpreting the model used in the previous step
  • Check results if they make sense
  • Consider different models, returning to prior steps
• Utilize the obtained results

Challenges of KDD
• Massive datasets
  • Algorithmic efficiency, approximation, parallel processing
• Making interaction possible for analysts
  • Develop better tools that allow for human-computer interaction
• Overfitting, measures of significance
  • Testing on randomly chosen sections
• Missing or invalid data
  • Strategies to identify hidden variables and dependencies
• Making data understandable by humans
  • Improved data visualization methods

Challenges of KDD (continued)
• Rapidly changing data
  • Incrementally updating discovered patterns
• Integration
  • Coordinating database tools (OLAP) and data mining tools
• Nonstandard data (e.g. multimedia)
  • “Beyond the scope of current KDD technology”

Conclusion
• Emerging nature of KDD & data mining fields
• Human interaction still necessary
• Incorporating machines to cope with scale of data
• Improve tools to make better decisions using data
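
Appendix: A Toy Walk-Through of the Five Steps

The five parts of the KDD process (selection, pre-processing, transformation, data mining, interpretation) can be sketched as a small pipeline. This is only an illustration, not the method from the paper: the records, field names, the outlier threshold `max_spend`, and the "rank groups by mean spend" model are all assumptions made up for the example, and it uses only the Python standard library.

```python
from statistics import mean

# Hypothetical toy records (not from the paper); None marks a missing value.
RAW = [
    {"id": 1, "region": "north", "age": 34, "spend": 120.0},
    {"id": 2, "region": "north", "age": None, "spend": 80.0},
    {"id": 3, "region": "south", "age": 45, "spend": 9999.0},  # outlier
    {"id": 4, "region": "south", "age": 29, "spend": 60.0},
    {"id": 5, "region": "north", "age": 51, "spend": 200.0},
    {"id": 6, "region": "south", "age": 38, "spend": 40.0},
]


def select(records, fields):
    """Part 1: keep only the fields relevant to the analysis goal."""
    return [{f: r[f] for f in fields} for r in records]


def preprocess(records, max_spend=1000.0):
    """Part 2: drop outliers, then fill missing ages with the mean known age."""
    kept = [r for r in records if r["spend"] <= max_spend]
    fill = mean(r["age"] for r in kept if r["age"] is not None)
    return [dict(r, age=fill if r["age"] is None else r["age"]) for r in kept]


def transform(records):
    """Part 3: aggregate to one row per region with summary statistics."""
    groups = {}
    for r in records:
        groups.setdefault(r["region"], []).append(r)
    return {
        region: {
            "n": len(rows),
            "mean_age": mean(x["age"] for x in rows),
            "mean_spend": mean(x["spend"] for x in rows),
        }
        for region, rows in groups.items()
    }


def mine(summary):
    """Part 4: a deliberately trivial 'model' that ranks regions by mean spend."""
    return sorted(summary, key=lambda g: summary[g]["mean_spend"], reverse=True)


def interpret(ranking, summary, min_rows=2):
    """Part 5: sanity check that reports only groups backed by enough rows."""
    return [g for g in ranking if summary[g]["n"] >= min_rows]


selected = select(RAW, ["region", "age", "spend"])
clean = preprocess(selected)
summary = transform(clean)
ranking = mine(summary)
report = interpret(ranking, summary)
print(report)  # prints ['north', 'south']
```

In practice each step would be revisited: if the interpretation step rejects a result, the analyst returns to an earlier step (for example, a different outlier threshold or a different grouping) and reruns the pipeline.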