DATA MINING II - 1DL460
Spring 2015
A second course in data mining
http://www.it.uu.se/edu/course/homepage/infoutv2/vt15
Kjell Orsborn
Uppsala Database Laboratory
Department of Information Technology, Uppsala University, Uppsala, Sweden
Kjell Orsborn - UDBL - IT - UU
20/04/15
1
Personnel
•  Kjell Orsborn, lecturer, examiner:
–  email: [email protected], phone: 471 5154, room: 116, ITC building 19
•  Tore Risch, lecturer
–  email: [email protected], phone 471 6342, room: 137, ITC building 19
•  Emil Jansson, course assistant,
–  email: [email protected], room: 138, ITC building 19
•  Michelle Brundin, course assistant,
–  email: [email protected], room: 138, ITC building 19
Preliminary course contents
•  Lecture topics:
–  Course intro - overview of topics in data mining 2
–  Web mining
–  Search engines
–  Sequential association analysis
–  Alt. association analysis
–  Visual data exploration
–  Cluster validation
–  Advanced clustering methods:
    •  Chameleon
    •  CURE
    •  BIRCH
    •  (SNN, ROCK, Jarvis-Patrick)
    •  large scale clustering methods
–  Stream data mining
–  Privacy preserving data mining
–  Outlier detection
–  Additional topics if time:
    •  Spatial data mining
    •  More on large scale data mining
•  Invited guest lectures
Course contents continued …
•  Assignments:
–  Assignment 1 – Web mining – HITS / PageRank
–  Assignment 2 – Implementation of Association Rule Mining
–  Assignment 3 – Implementation of scalable K-means
Examination
•  Written examination – grade 3, 4 and 5
•  Assignments – all 3 assignments should be passed with a passing grade
Introduction to Data Mining II (Tan, Steinbach, Kumar ch. 1)
Kjell Orsborn
Department of Information Technology
Uppsala University, Uppsala, Sweden
Data Mining
•  The process of extracting valid, previously unknown, comprehensible, and
actionable information from large databases and using it to make crucial
business decisions, (Simoudis, 1996).
–  Involves the analysis of data and the use of software techniques for finding hidden and
unexpected patterns and relationships in sets of data; in contrast to information and knowledge
that are already intuitive.
–  Patterns and relationships are identified by examining the underlying rules and features in the
data.
–  Tends to work from the data up and most accurate results normally require large volumes of
data to deliver reliable conclusions.
–  Data mining can provide huge paybacks for companies who have made a significant investment
in data warehousing.
–  Relatively new technology, however already used in a number of industries.
Historic view of data mining
[Figure: Han et al., 2006]
The data mining process
•  Data cleaning (to remove noise and inconsistent data)
•  Data integration (where multiple data sources may be combined)
•  Data selection (where data relevant to the analysis task are retrieved from the database)
•  Data transformation (where data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations)
•  Data mining (an essential process where intelligent methods are applied in order to extract
data patterns)
•  Pattern evaluation (to identify the truly interesting patterns representing knowledge based on
some interestingness measures)
•  Knowledge presentation (where visualization and knowledge representation techniques are used to
present the mined knowledge to the user)
[Figure: Databases and files -> Cleaning & Integration -> Data Warehouse -> Selection &
Transformation -> Data Mining -> Patterns -> Evaluation & Presentation -> Knowledge]
Why data mining?
•  "There was 5 exabytes of information created between the dawn of civilization
through 2003," Schmidt said, "but that much information is now created every 2 days,
and the pace is increasing... People aren't ready for the technology revolution that's
going to happen to them...."
(Eric Schmidt, Google)
The information explosion
•  The world’s information is doubling every two years.
•  In 2011 the world will create a staggering 1.8 Zettabytes.
•  By 2020 the world will generate 50 times the amount of information and 75 times
the number of "information containers" (files) while IT staff to manage it will grow
less than 1.5 times. [ref. IDC/EMC 2011]
Why data mining?
•  The explosive growth of data: from terabytes, through petabytes, to exabytes
–  Data collection from automated data collection tools, database systems, web, e-commerce,
transactions, stocks, remote sensing, bioinformatics, scientific simulation, computerized
society, news, digital cameras, …
–  Human analysts may take weeks to discover useful information
–  Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995, on an axis from 0 to 4,000,000, compared
with the number of analysts, 1995 to 1999. From: R. Grossman, C. Kamath, V. Kumar, “Data Mining
for Scientific and Engineering Applications”]
Why mine data (commercial viewpoint)?
•  Lots of data is being collected and warehoused
–  Web data, e-commerce
–  Purchases at department/grocery stores
–  Bank & credit card transactions
•  Computers have become cheaper and more powerful
•  Competitive pressure is strong
–  Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why mine data (scientific viewpoint)?
•  Data collected and stored at enormous speeds (GB/hour)
–  remote sensors on a satellite
–  telescopes scanning the skies
–  microarrays generating gene expression data
–  scientific simulations generating terabytes of data
•  Traditional techniques infeasible for raw data
•  Data mining may help scientists
–  in classifying and segmenting data
–  in hypothesis formation
Why not traditional data analysis?
•  Tremendous amount of data
–  Algorithms must be highly scalable to handle data volumes such as terabytes
•  High dimensionality of data
–  A microarray may have tens of thousands of dimensions
•  High complexity of data
–  Data streams and sensor data
–  Time-series data, temporal data, sequence data
–  Structured data, graphs, social networks and multi-linked data
–  Heterogeneous databases and legacy databases
–  Spatial, spatiotemporal, multimedia, text and Web data
–  Software programs, scientific simulations
•  New and sophisticated applications
Data mining tasks
•  Prediction methods
–  Use some variables to predict unknown or future values of other variables.
•  Description methods
–  Find human-interpretable patterns that describe the data.
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Classification - definition
•  Given a collection of records (training set)
–  Each record contains a set of attributes; one of the attributes is the class.
•  Find a model for the class attribute as a function of the values of the other attributes.
•  Goal: previously unseen records should be assigned a class as accurately as
possible.
–  A test set is used to determine the accuracy of the model. Usually, the given data set is
divided into training and test sets, with training set used to build the model and test set
used to validate it.
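The train/test procedure can be sketched in a few lines. This is only an illustration under stated assumptions: the toy records, the 70/30 split, the helper names and the choice of a 1-nearest-neighbour model are all hypothetical and not taken from the course material.

```python
import math
import random

def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle the records and divide them into a training set and a test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, attributes):
    """Model: assign the class of the closest training record (Euclidean distance)."""
    nearest = min(train, key=lambda rec: math.dist(rec[0], attributes))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attrs, label in test if predict_1nn(train, attrs) == label)
    return correct / len(test)

# Hypothetical toy data: each record is (attributes, class).
records = [((x, y), "low") for x in range(0, 5) for y in range(0, 5)] + \
          [((x, y), "high") for x in range(10, 15) for y in range(10, 15)]

train, test = split_train_test(records)
print(f"accuracy on {len(test)} unseen records: {accuracy(train, test):.2f}")
```

The test records are never shown to the model while it is built, so the measured accuracy estimates how well previously unseen records are classified.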
Clustering - definition
•  Given a set of data points, each having a set of attributes, and a similarity measure among
them, find clusters such that
–  Data points in one cluster are more similar to one another.
–  Data points in separate clusters are less similar to one another.
•  Similarity Measures:
–  Euclidean distance if attributes are continuous.
–  Other problem-specific measures.
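As one possible illustration of clustering with Euclidean distance, the sketch below implements plain k-means. It is not the scalable variant named in the assignments; the sample points and the choice of starting centroids are assumptions made for the example.

```python
import math

def assign(points, centroids):
    """Put every point into the cluster of its nearest centroid (Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[idx].append(p)
    return clusters

def kmeans(points, centroids, max_iterations=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        new_centroids = [
            # An empty cluster keeps its previous centroid.
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical continuous data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, [points[0], points[3]])
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

Points inside each resulting cluster are closer to their own centroid than to the other one, which is the "more similar within, less similar between" criterion stated above.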
Association rule discovery - definition
•  Given a set of records, each of which contains some number of items from a given
collection:
–  Produce dependency rules which will predict occurrence of an item based on
occurrences of other items.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
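The slide does not say how the discovered rules are scored. A common choice, assumed here, is support (the fraction of transactions containing all items of the rule) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both for the two rules over the five transactions in the table; the function names are illustrative.

```python
# The five transactions from the table, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions with the antecedent, the fraction that also hold the consequent."""
    return count(antecedent | consequent) / count(antecedent)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

For {Milk} --> {Coke}, Milk occurs in transactions 1, 3, 4 and 5 and Coke accompanies it in 1, 3 and 5, so the confidence is 3/4. For {Diaper, Milk} --> {Beer}, the antecedent occurs in 3, 4 and 5 and Beer accompanies it in 3 and 4, giving 2/3.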
Sequential pattern discovery - definition
•  Given a set of objects, with each object associated with its own timeline of
events, find rules that predict strong sequential dependencies among different
events.
(A B) (C) (D E)
•  Rules are formed by first discovering patterns. Event occurrences in the
patterns are governed by timing constraints.
[Figure: pattern (A B) (C) (D E) annotated with the timing constraints <= xg, > ng, <= ws, <= ms]
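The slide gives only the symbols xg, ng, ws and ms. The sketch below assumes the reading common in the sequential-pattern literature: xg is a maximum gap and ng a minimum gap between consecutive elements, ws is the window within which the events of one element must fall, and ms is the maximum span of the whole pattern. Both that reading and the timestamps are assumptions for illustration.

```python
def satisfies_timing(elements, xg, ng, ws, ms):
    """Check one occurrence of a pattern against the four timing constraints.

    elements: one list of event timestamps per pattern element, in pattern order,
              e.g. [[1, 2], [4], [6, 7]] for (A B) (C) (D E).
    """
    lows = [min(times) for times in elements]    # earliest event of each element
    highs = [max(times) for times in elements]   # latest event of each element

    # Window size: events of a single element lie within ws of each other.
    if any(high - low > ws for low, high in zip(lows, highs)):
        return False
    # Maximum span: the whole occurrence fits inside ms.
    if max(highs) - min(lows) > ms:
        return False
    for i in range(len(elements) - 1):
        # Maximum gap: the next element ends no later than xg after this one starts.
        if highs[i + 1] - lows[i] > xg:
            return False
        # Minimum gap: the next element starts more than ng after this one ends.
        if lows[i + 1] - highs[i] <= ng:
            return False
    return True

# Hypothetical occurrence: A at t=1, B at t=2, C at t=4, D at t=6, E at t=7.
occurrence = [[1, 2], [4], [6, 7]]
print(satisfies_timing(occurrence, xg=3, ng=1, ws=1, ms=6))
print(satisfies_timing(occurrence, xg=3, ng=1, ws=1, ms=5))
```

With ms=6 the occurrence spans exactly 6 time units and is accepted; tightening ms to 5 rejects it while leaving the other three constraints satisfied.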
Deviation or anomaly detection
•  Detect significant deviations from normal behavior
•  Applications:
–  Credit Card Fraud Detection
–  Network Intrusion Detection
Typical network traffic at University level may reach over 100 million connections per day
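One simple way to make "significant deviation from normal behavior" concrete is a z-score test: flag values that lie more than a chosen number of standard deviations from the mean. This is only one of many techniques and is not claimed to be the method used in the course; the threshold of 3 and the sample amounts are assumptions.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)   # population standard deviation
    if stdev == 0:
        return []                       # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical card transaction amounts with one unusual purchase.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 20, 19, 500]
print(zscore_outliers(amounts))
```

A single extreme value inflates the mean and standard deviation it is measured against, so on small samples this test is conservative; robust statistics such as the median are a common alternative.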
Challenges of data mining
•  Scalability
•  Dimensionality
•  Complex and heterogeneous data
•  Data quality
•  Data ownership and distribution
•  Privacy preservation
•  Streaming data