Advanced Data Mining:
Introduction
http://delab.csd.auth.gr/courses/c_dm_pms
Material Covered
• Chapter 1 from Ullman’s book.
• Many slides are from the “Data Mining:
Concepts and Techniques” book.
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web,
computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated
analysis of massive data sets
Evolution of Sciences
• Before 1600, empirical science
• 1600-1950s, theoretical science
– Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
• 1950s-1990s, computational science
– Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
– Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
• 1990-now, data science
– The flood of data from new scientific instruments and simulations
– The ability to economically store and manage petabytes of data online
– The Internet and computing Grid that makes all these archives universally accessible
– Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes. Data mining is a major new challenge!
• Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
Comm. ACM, 45(11): 50-54, Nov. 2002
from “Data Mining: Concepts and Techniques”
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
– Data mining: a misnomer?
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, business intelligence, etc.
• Watch out: Is everything “data mining”?
• Negative examples:
– Simple search and query processing
– (Deductive) expert systems
from “Data Mining: Concepts and Techniques”
Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data
warehousing communities
• Data mining plays an essential role in the knowledge discovery
process
[Figure: KDD process, from source to result: Databases → Data Cleaning and
Data Integration → Data Warehouse → Selection → Task-relevant Data →
Data Mining → Pattern Evaluation]
from “Data Mining: Concepts and Techniques”
Data Mining in Business Intelligence
[Figure: pyramid of layers, listed top to bottom; the potential to support
business decisions increases toward the top]
• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles shown beside the pyramid, top to bottom: End User, Business Analyst,
Data Analyst, DBA
from “Data Mining: Concepts and Techniques”
Directions in modeling
• Pattern extraction → Model discovery
• Statistical modeling
– E.g., decide that the data comes from a Gaussian distribution,
estimate μ,σ parameters.
• Machine learning
– Train an algorithm, then apply to new data.
• Results of Complex Queries (computational approaches)
– E.g., summarization of the importance of a webpage in the form of
a “pagerank” value.
– E.g., prominent feature extraction, such as frequent itemsets and
similar items.
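The statistical-modeling bullet above can be illustrated with a minimal Python sketch. It is not part of the original slides: the function names and the sample values are hypothetical, and it assumes the data really are Gaussian, so that the maximum-likelihood estimates of μ and σ are the sample mean and the population standard deviation.

```python
import math
import statistics

def fit_gaussian(samples):
    """Maximum-likelihood estimates (mu, sigma) under a Gaussian model."""
    mu = statistics.fmean(samples)
    # pstdev divides by n (not n - 1), which is the maximum-likelihood estimate.
    sigma = statistics.pstdev(samples, mu)
    return mu, sigma

def gaussian_pdf(x, mu, sigma):
    """Density of the fitted model at point x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical measurements, chosen only for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = fit_gaussian(data)
print(f"mu = {mu}, sigma = {sigma}")
print(f"density at the mean = {gaussian_pdf(mu, mu, sigma):.4f}")
```

Once μ and σ are estimated, the two numbers stand in for the whole data set: that summary is the "model" in this view of data mining.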
Multi-Dimensional View of Data Mining
• Knowledge to be mined (or: Data mining functions)
– Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
– Descriptive vs. predictive data mining
– Multiple/integrated functions and mining at multiple levels
• Data to be mined
– Database data (extended-relational, object-oriented, heterogeneous,
legacy), data warehouse, transactional data, stream, spatiotemporal,
time-series, sequence, text and web, multi-media, graphs & social and
information networks
• Techniques utilized
– Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern
recognition, visualization, high-performance, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
from “Data Mining: Concepts and Techniques”
Meaningfulness of patterns
• A big data-mining risk is that you will
“discover” patterns that are
meaningless.
• Bonferroni’s principle: (roughly) if you look
in more places for interesting patterns than
your amount of data will support, you are
bound to find meaningless patterns
Rhine Paradox
• Joseph Rhine was a parapsychologist in the 1950s who hypothesized
that some people had Extra-Sensory Perception (ESP)
• He devised an experiment where subjects were asked to guess 10
hidden cards – red or blue
• He discovered that almost 1 in 1000 had ESP – they were able to get
all 10 right!
• He told these people they had ESP and called them in for another test
of the same type
• Alas, he discovered that almost all of them had lost their ESP
• What did he conclude?
• He concluded that you shouldn’t tell people they have ESP; it causes
them to lose it!
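The arithmetic behind the Rhine experiment can be sketched in a few lines of Python. This is an illustration, not part of the original slides: the subject count of 100,000 and the function names are hypothetical, and the sketch assumes every guess is an independent, fair choice between red and blue.

```python
import random

def perfect_score_probability(n_cards=10, n_colors=2):
    """Chance that pure guessing gets every one of n_cards right."""
    return (1.0 / n_colors) ** n_cards

def simulate_round(n_subjects, n_cards=10, rng=None):
    """Count subjects who guess all cards correctly by luck alone."""
    rng = rng or random.Random()
    hits = 0
    for _ in range(n_subjects):
        # Each guess matches the hidden colour with probability 1/2.
        if all(rng.random() < 0.5 for _ in range(n_cards)):
            hits += 1
    return hits

N_SUBJECTS = 100_000                    # hypothetical number of people tested
p = perfect_score_probability()         # 1/1024, i.e. "almost 1 in 1000"
expected_first = N_SUBJECTS * p         # lucky guessers expected in round one
expected_second = expected_first * p    # of those, expected to repeat the feat

first_round = simulate_round(N_SUBJECTS, rng=random.Random(0))
print(f"P(all 10 right by chance) = {p}")
print(f"expected 'ESP' subjects in round one: {expected_first:.1f}")
print(f"simulated 'ESP' subjects in round one: {first_round}")
print(f"expected to pass a second test: {expected_second:.3f}")
```

Under these assumptions, roughly 1 subject in 1024 passes the first test by chance, and almost none of them pass a second one. The "lost" ESP is what random guessing predicts, which is the point of Bonferroni's principle.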
Major Challenges in Data Mining
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, stream, and incremental mining methods
• Handling high-dimensionality
• Handling noise, uncertainty, and incompleteness of data
• Incorporation of constraints, expert knowledge, and background
knowledge in data mining
• Pattern evaluation and knowledge integration
• Mining diverse and heterogeneous kinds of data: e.g., bioinformatics,
Web, software/system engineering, information networks
• Application-oriented and domain-specific data mining
• Invisible data mining (embedded in other functional modules)
• Protection of security, integrity, and privacy in data mining
from “Data Mining: Concepts and Techniques”
KDnuggets polls - 1
KDnuggets polls - 2
KDnuggets polls - 3
Things Useful to Know
• Probability
• Linear Algebra basics
• Hash functions
• Indices
• Secondary storage
• Power laws
Big Data
Sizes:
• Tiny: 0s
• Small: 1000s (fitting in memory)
• Medium: 1000000 (may not fit in memory)
• Large: 1000000000
• Huge: 1000000000000 ++
From Graefe’s “New algorithms for join and grouping operations”, 2011.