The KDD Process for Extracting Useful Knowledge from Volumes of Data
Fayyad, Piatetsky-Shapiro, and Smyth
Ian Kim
SWHIG Seminar
Overview
• What can we gain from data?
• Business and marketing applications
• Public policy decision-making
• Scientific research
• Why do we need the KDD process?
• Increasing use of data analytics
• Size of databases involved
• Being able to access raw data isn’t enough
The KDD Process
Part 1: Selection
• Formulating the target dataset
• What kinds of records to consider?
• Desired fields?
• Incorporates domain knowledge
• Background knowledge in relevant field
• Goals of the dataset
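The selection step can be sketched as a filter over records plus a projection onto the desired fields. This is a minimal illustration, not the authors' method: the records, the field names, and the helper `select_target` are all hypothetical, and the "western region" goal stands in for whatever the domain knowledge actually dictates.

```python
# Hypothetical raw records; the field names are illustrative only.
raw_records = [
    {"customer_id": 1, "region": "west", "age": 34, "purchase": 120.0, "notes": "call back"},
    {"customer_id": 2, "region": "east", "age": 51, "purchase": 80.0, "notes": ""},
    {"customer_id": 3, "region": "west", "age": 27, "purchase": 45.5, "notes": "vip"},
]


def select_target(records, keep_record, fields):
    """Keep only records that pass keep_record, projected onto the given fields."""
    return [{f: r[f] for f in fields} for r in records if keep_record(r)]


# Domain knowledge drives both choices: which records matter and which
# fields are relevant to the (assumed) goal of studying western purchases.
target = select_target(
    raw_records,
    keep_record=lambda r: r["region"] == "west",
    fields=["customer_id", "age", "purchase"],
)
```

Keeping the record filter and the field list as separate arguments mirrors the two questions on the slide: which records to consider, and which fields are desired.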
Part 2: Pre-processing
• Preparing raw data for transformation
• Removal of noise, outliers
• Strategy for handling missing records
• Missing/unknown value mappings
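A rough sketch of the pre-processing ideas above, under stated assumptions: the sentinel codes in `MISSING_CODES`, the median-absolute-deviation outlier rule with `k=5.0`, and median imputation are choices made for this example only. The slides do not prescribe any particular strategy.

```python
from statistics import median

# Assumed sentinel values that the raw source uses for "missing/unknown".
MISSING_CODES = {"", "N/A", -999}


def normalize_missing(value):
    """Map every known missing-value code to None."""
    return None if value in MISSING_CODES else value


def drop_outliers(values, k=5.0):
    """Drop values farther than k median absolute deviations from the median."""
    present = [v for v in values if v is not None]
    center = median(present)
    mad = median([abs(v - center) for v in present])
    if mad == 0:
        return list(values)  # no spread to measure against; keep everything
    return [v for v in values if v is None or abs(v - center) <= k * mad]


def fill_missing(values):
    """Replace None with the median of the values that are present."""
    fill = median([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


raw = [21.0, 22.5, "N/A", 20.0, -999, 23.0, 400.0, 21.5]
cleaned = fill_missing(drop_outliers([normalize_missing(v) for v in raw]))
```

The order matters in this sketch: missing codes are mapped first so that a sentinel such as -999 is never mistaken for a real (and extreme) measurement.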
Part 3: Transformation
• Data reduction
• Grouping to reduce number of variables considered
• Aggregation of rows to a higher-level unit
• Useful representations of data
• Summary statistics
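One way to picture aggregation and summary statistics is rolling transaction-level rows up to one row per customer. The data and the helper `aggregate_by` below are hypothetical, and the chosen statistics (count, total, mean) are only one possible representation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transaction-level rows (one row per purchase).
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
    {"customer_id": 2, "amount": 15.0},
    {"customer_id": 2, "amount": 10.0},
]


def aggregate_by(rows, key, value):
    """Group rows by key and summarize the value field for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {
        k: {"count": len(v), "total": sum(v), "mean": mean(v)}
        for k, v in groups.items()
    }


# One summary row per customer instead of one row per purchase.
per_customer = aggregate_by(transactions, "customer_id", "amount")
```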
Part 4: Data Mining
• Selection of data model
• Summarization, classification, clustering, regression analysis
• Searching for patterns in data
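As one concrete instance of the model families listed above, here is a small clustering sketch: Lloyd's k-means algorithm restricted to one-dimensional points. The points, the starting centroids, and the function names are assumptions for illustration. A real analysis would normally rely on an established library rather than this hand-rolled version.

```python
def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans_1d(points, centroids, max_iterations=10):
    """Lloyd's algorithm on 1-D points with caller-supplied starting centroids."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments are stable, so the search has converged
        centroids = updated
    return centroids, assign(points, centroids)


points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, groups = kmeans_1d(points, centroids=[0.0, 5.0])
```

The "pattern" found here is simply that the points fall into two groups; choosing a different model family (classification, regression) would search for a different kind of pattern in the same data.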
Part 5: Interpretation
• Interpreting the model used in the previous step
• Check whether the results make sense
• Consider different models, returning to prior steps
• Utilize the obtained results
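The interpretation loop can be sketched as comparing candidate models against a trivial baseline and going back to earlier steps when nothing beats it. Everything here is hypothetical: the data pairs, the two candidate models, and the use of mean absolute error as the yardstick.

```python
def mean_abs_error(model, data):
    """Average absolute gap between model predictions and observed values."""
    return sum(abs(model(x) - y) for x, y in data) / len(data)


# Hypothetical (x, y) observations and two hand-written candidate models.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
candidates = {
    "constant": lambda x: 5.0,      # baseline: always predict the same value
    "linear": lambda x: 2.0 * x,    # assumed fitted model
}

errors = {name: mean_abs_error(model, data) for name, model in candidates.items()}
best = min(errors, key=errors.get)

# If no candidate improves on the baseline, that is a signal to revisit
# selection, pre-processing, or transformation rather than use the result.
needs_rework = best == "constant"
```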
Challenges of KDD
• Massive datasets
• Algorithmic efficiency, approximation, parallel processing
• Making interaction possible for analysts
• Develop better tools that allow for human-computer interaction
• Overfitting, measures of significance
• Testing on randomly chosen subsets of the data
• Missing or invalid data
• Strategies to identify hidden variables and dependencies
• Making data understandable by humans
• Improved data visualization methods
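For the overfitting point above, a common safeguard is to hold out a randomly chosen part of the data and evaluate only on that part. The sketch below assumes a simple one-off split. The function name, the 25% test fraction, and the fixed seed are illustrative choices, not something the slides specify.

```python
import random


def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))  # stand-in for real records
train, test = holdout_split(rows)
```

A pattern that holds on `train` but disappears on `test` is a candidate for overfitting rather than a genuine discovery.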
Challenges of KDD (continued)
• Rapidly changing data
• Incrementally updating discovered patterns
• Integration
• Coordinating database tools (OLAP) and data mining tools
• Nonstandard data (e.g. multimedia)
• “Beyond the scope of current KDD technology”
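To illustrate incrementally updating a discovered pattern as data changes, here is a sketch that treats a running mean and variance as the "pattern" and updates it one observation at a time with Welford's online algorithm. The class name and the sample values are assumptions. The slides do not say which patterns or update rules are meant.

```python
class RunningStats:
    """Welford's online algorithm for the mean and sample variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the summary without rescanning old data."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; defined as 0.0 until at least two values are seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
```

Each update costs constant time and memory, which is the property that matters when the underlying database is too large or changes too quickly to re-mine from scratch.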
Conclusion
• Emerging nature of KDD & data mining fields
• Human interaction still necessary
• Incorporating machines to cope with scale of data
• Improve tools to make better decisions using data