Introduction to data mining
Literature
Data mining in commerce
• About 13 million customers per month contact the West
Coast customer service call center of the Bank of America
• In the past, each caller would have listened to the same
marketing advertisement, whether or not it was relevant to
the caller’s interests.
• Chris Kelly, vice president and director of database
marketing: “rather than pitch the product of the week, we
want to be as relevant as possible to each customer”
• Thus, based on individual customer profiles, the customer
can be informed of new products that may be of greatest
interest.
• Data mining helps to identify the type of marketing
approach for a particular customer, based on the
customer’s individual profile.
Recommendation systems
Why mine data – commercial viewpoint
• Lots of data is being collected
– Web data, e-commerce
– purchases at department/grocery stores
– Bank/Credit Card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
– Provide better, customized services
R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
Why mine data – scientific viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in hypothesis formation
R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
Data mining in bioinformatics
• Brain tumors represent
the most deadly cancer
among children
• Gene expression
database for pediatric
brain tumors was built,
in an effort to develop
more effective
treatment.
• Clearly, a lot of data is being collected.
• However, what is being learned from all this
data? What knowledge are we gaining from all
this information?
• “we are drowning in information but starved
for knowledge”
• The problem today is not that there is not
enough data. Rather, the problem is that there
are not enough trained human analysts
available who are skilled at translating all of
this data into knowledge.
• Data mining is the process of discovering
meaningful new correlations, patterns and trends
by sifting through large amounts of data stored in
repositories, using pattern recognition
technologies as well as statistical and
mathematical techniques.
(www.gartner.com)
• Data mining is an interdisciplinary field bringing
together techniques from machine learning,
pattern recognition, statistics, databases, and
visualization to address the issue of information
extraction from large databases.
(Peter Cabena, Pablo Hadjinian, Rolf Stadler, Jaap Verhees, and Alessandro Zanasi, Discovering Data Mining:
From Concept to Implementation, Prentice Hall, Upper Saddle River, NJ, 1998.)
• The growth in this field has been fueled
by several factors:
– growth in data collection
– storing of the data in data warehouses
– availability of increased access to data from
Web
– competitive pressure to increase market
share
– development of data mining software suites
– tremendous growth in computing power
and storage capacity
Need for human direction of DM
• Don’t believe software vendors advertising
their analytical software as plug-and-play, out-of-the-box applications providing
solutions without the need for human
interaction!
• Data mining is not a product that can be
bought, it is a discipline that must be
mastered!
• Automation is no substitute for human input.
• Data mining is easy to do badly.
• Software always gives some result.
• A little knowledge is especially dangerous
– e.g. analysis carried out on unpreprocessed data
can lead to erroneous conclusions, the models
can be way off
– if deployed, the errors can lead to very expensive
failures
• The costly errors stem from the black-box
approach.
Data mining trap
• If we try hard enough, we always find some patterns.
• However, they may be just a matter of chance.
• They don’t have to be characteristic of the process that
generates the data.
• Google defines data mining as:
Data mining is the
equivalent to sitting a
huge number of monkeys
down at keyboards, and then
reporting on the monkeys
who happened to type actual
words.
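The trap can be demonstrated with a small simulation. The sketch below is an illustration only, not part of the original lecture: the sample size, the number of attributes, the seed, and the helper names `pearson` and `strongest_chance_correlation` are all arbitrary assumptions. It generates purely random attributes and reports the strongest correlation found with an equally random target.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def strongest_chance_correlation(n_samples=20, n_attributes=500, seed=0):
    """Search many random attributes for the one that best 'explains' a random target."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_attributes):
        # Every attribute is pure noise, so any correlation is due to chance.
        attribute = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(attribute, target)))
    return best


if __name__ == "__main__":
    print(f"strongest |r| among random attributes: {strongest_chance_correlation():.2f}")
```

With few samples and many candidate attributes, the best correlation is far from zero even though no attribute carries any information about the target.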
• Instead, apply a “white-box” methodology.
• i.e. understand the algorithms and
statistical model structures underlying the
software
• The white-box approach is the reason why
you are attending this lecture (apart from the
fact that the lecture is compulsory).
Data mining as a process
• One of the fallacies associated with DM is that
DM represents an isolated set of tools
• Instead, DM should be viewed as a process
• The process is standardized – CRISP-DM
framework (http://www.crisp-dm.org/)
– Cross-Industry Standard Process for Data Mining
– developed in 1996 by analysts from DaimlerChrysler,
SPSS, and NCR
– provides a nonproprietary and freely available
standard process for fitting data mining into the
general problem-solving strategy of a business or
research unit
CRISP-DM
[Figure: CRISP-DM process diagram; the cycle starts at the business understanding phase]
1. Business understanding phase
– Formulate the project objectives and requirements
2. Data understanding phase
– collect the data
– use EDA (exploratory data analysis) to familiarize
yourself with the data
– evaluate the quality of the data
3. Data preparation phase
– prepare from the initial raw data the final data set.
This phase is very labor intensive.
– select the cases and variables you want to analyze
– perform transformation of variables, if needed
– clean the raw data so they are ready for modelling
tools
4. Modeling phase
– select and apply appropriate modeling techniques
– calibrate model settings to optimize results
– often, several different techniques may be used
– if necessary, loop back to the data preparation phase
to bring the form of the data into line with the
specific requirements of a particular data mining
technique
5. Evaluation phase
– evaluate models for quality and effectiveness
– establish whether some important facet of the
business or research problem has not been
accounted for sufficiently
6. Deployment phase
– make use of the models created
– examples of deployment:
• report
• implement a parallel DM process in another
department
CRISP-DM example
Investigated patterns in the warranty claims
for DaimlerChrysler automobiles
• Business understanding
– Objectives: reduce costs associated with
warranty claims and improve customer
satisfaction
– Specific business problems can be formulated:
• Are there interdependencies among warranty claims?
• Are past warranty claims associated with similar
claims in the future?
Jochen Hipp and Guido Lindner, Analyzing warranty claims of automobiles: an application description following the CRISP–DM data mining process, in Proceedings of
the 5th International Computer Science Conference (ICSC ’99), pp. 31–40, Hong Kong, December 13–15, 1999
• Data understanding
– use of DaimlerChrysler’s Quality Information
System (QUIS)
– it contains information on over 7 million vehicles
and is about 40 gigabytes in size
– QUIS contains production details about how and
where a particular vehicle was constructed +
warranty claim information
– researchers stressed the fact that the database
was entirely unintelligible to domain nonexperts
• experts from different departments had to be located
and consulted, a task that turned out to be rather costly
• Data preparation
– the QUIS DB did not contain all information
needed for the modelling purposes
– e.g. the variable “number of days from selling date
until first claim” had to be derived from the
appropriate date attributes
– researchers then turned to DM software where
they ran into a common roadblock: data format
requirements varied from algorithm to algorithm
• result was further exhaustive preprocessing of the data
– researchers mention that the data preparation
phase took much longer than they had planned
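Deriving a variable such as “number of days from selling date until first claim” can be sketched as follows. This is a minimal illustration: the record layout, the field names, and the dates are hypothetical and do not come from the QUIS database.

```python
from datetime import date


def days_until_first_claim(selling_date, claim_dates):
    """Return the number of days from the selling date to the earliest claim,
    or None if the vehicle has no claims on record."""
    if not claim_dates:
        return None
    return (min(claim_dates) - selling_date).days


# Hypothetical vehicle record; the values are made up for illustration.
vehicle = {
    "selling_date": date(1998, 3, 1),
    "claim_dates": [date(1998, 9, 15), date(1998, 5, 10)],
}
derived = days_until_first_claim(vehicle["selling_date"], vehicle["claim_dates"])
print(derived)  # 70
```

Vehicles without any claim get `None` rather than 0, so that "no claim yet" is not confused with "claim on the day of sale".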
• Modeling
– to investigate dependencies, researchers used
• Bayesian networks
• Association rules mining
– the details of the results are confidential, but we
can get a general idea of the dependencies uncovered
by the models
• a particular combination of construction specifications
doubles the probability of encountering an automobile
electrical cable problem
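A finding of the form "this combination doubles the probability of a problem" corresponds to an association rule with a lift of 2. The sketch below computes support, confidence, and lift for one rule. The records and the item names (`spec_A`, `spec_B`, `cable_fault`) are hypothetical, since the actual results of the study are confidential.

```python
def rule_metrics(records, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent.

    Each record is a set of items; antecedent and consequent are sets too.
    """
    n = len(records)
    n_antecedent = sum(1 for r in records if antecedent <= r)
    n_consequent = sum(1 for r in records if consequent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    if n_antecedent == 0 or n_consequent == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_both / n
    confidence = n_both / n_antecedent        # P(consequent | antecedent)
    lift = confidence / (n_consequent / n)    # relative to the baseline P(consequent)
    return support, confidence, lift


# Hypothetical data: 20 vehicles, 4 of them with a cable fault.
records = (
    [{"spec_A", "spec_B", "cable_fault"}] * 2
    + [{"spec_A", "spec_B"}] * 3
    + [{"spec_A", "cable_fault"}]
    + [{"spec_B", "cable_fault"}]
    + [{"spec_A"}] * 6
    + [{"spec_B"}] * 7
)
support, confidence, lift = rule_metrics(
    records, antecedent={"spec_A", "spec_B"}, consequent={"cable_fault"}
)
print(support, confidence, lift)
```

Here 2 of the 5 vehicles with both specifications have the fault (confidence 0.4), against a baseline of 4 in 20 (0.2), which gives a lift of 2.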
• Evaluation
– The researchers were disappointed that
association rules models were found to be lacking
in effectiveness and to fall short of the objectives
set for them in the business understanding phase
• “In fact, we did not find any rule that our domain
experts would judge as interesting.”
– To account for this, the researchers point to the
“legacy” structure of the database, for which
automobile parts were categorized by garages and
factories for historic or technical reasons and not
designed for data mining.
– They suggest redesigning the database to make it
more amenable to knowledge discovery.
• Deployment
– It was a pilot project, without intention to deploy
any large-scale models from the first iteration.
– Product: report describing lessons learned from
this project
• e.g. change of the structure of the database (new
variables, different categorization of automobile parts)
Lessons learned
• uncovering hidden nuggets of knowledge in
databases is a rocky road
• intense human participation and supervision
is required at every stage of the data mining
process
• there is no guarantee of positive results
Connection to other fields
[Figure: Data Mining at the center, connected to Machine learning, Visualization, Pattern recognition, Statistics, and Database systems]
Machine learning
• A subfield of artificial intelligence.
• Discipline that is concerned with the design and
development of algorithms that allow
computers to evolve behavior based on
experience.
– experience – empirical data, such as from sensors or
databases
– evolve behavior – usually through search of patterns
in data
• ML has a similar goal to DM; DM uses algorithms from ML
Pattern recognition
• The problem of searching for patterns is a fundamental
one with a long and successful history.
• For instance, the extensive astronomical
observations of Tycho Brahe in the 16th century
allowed Johannes Kepler to discover the
empirical laws of planetary motion, which in
turn provided a springboard for the
development of classical mechanics.
Pattern recognition
• automatic discovery of regularities in data
through the use of computer algorithms and
with the use of these regularities to take
actions such as classifying the data into
different categories
Pattern recognition
[Figure: example data and the patterns found in it]
• if a train has 2 wagons, it goes to the left
More real patterns
[Figure: face detection]
Iris Sample Data Set
• Many of the exploratory data techniques are illustrated with
Fisher’s Iris Plant data set.
– From the statistician Ronald Fisher, mid-1930s
– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
based on WEKA tutorial
iris setosa
iris versicolor
iris virginica
Contains flower dimension measurements on 50
samples of each species.
Fisher, R.A. (1936). "The Use of Multiple Measurements in Taxonomic Problems". Annals of Eugenics 7: 179–188,
http://digital.library.adelaide.edu.au/coll/special//fisher/138.pdf.
These dimensions were measured:
• sepal length
• sepal width
• petal length
• petal width
Measurements on these iris species:
• setosa
• versicolor
• virginica
Data mining terminology
• The four iris dimensions are termed attributes, input attributes,
features
• The three iris species are termed classes, output attributes
• Each example of an iris is termed a sample, instance, object, data
point
                 Input attributes (numerical)                   Output attribute (nominal)
Inst.   Sepal Length   Sepal Width   Petal Length   Petal Width   Species
1       5.1            3.5           1.4            0.2           setosa
2       4.9            3             1.4            0.2           setosa
3       4.7            3.2           1.3            0.2           setosa
4       4.6            3.1           1.5            0.2           setosa
5       5              3.6           1.4            0.2           setosa
Each row is a sample; the Species column gives its class.
Statistics
• statistical analysis
– summary statistics (mean, median, standard
deviation)
• Exploratory Data Analysis (EDA)
– A preliminary exploration of the data to better
understand its characteristics.
– Created by statistician John Tukey
– A nice online introduction can be found in Chapter 1
of the NIST Engineering Statistics Handbook
http://www.itl.nist.gov/div898/handbook/index.htm
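As a small illustration of summary statistics, the sketch below uses Python's standard `statistics` module on the sepal lengths of the five setosa samples from the lecture's sample table. Five values are far too few for a real analysis; they only keep the example short.

```python
import statistics

# Sepal lengths of the five setosa samples from the lecture's sample table.
sepal_lengths = [5.1, 4.9, 4.7, 4.6, 5.0]

mean = statistics.mean(sepal_lengths)
median = statistics.median(sepal_lengths)
stdev = statistics.stdev(sepal_lengths)  # sample standard deviation (divides by n - 1)

print(f"mean={mean:.2f} median={median:.1f} stdev={stdev:.3f}")
```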
EDA
• Helps to select the right tool for preprocessing or
analysis
• People can recognize patterns not captured by
data analysis tools
• In EDA, as originally defined by Tukey
– The focus was on visualization
– Clustering and anomaly detection were viewed as
exploratory techniques
– In data mining, clustering and anomaly detection are
major areas of interest, and not thought of as just
exploratory
• In EDA, a human makes and validates hypotheses
– In DM, by contrast, the computer makes and validates hypotheses
[Figures: plots of the Iris data with the species setosa, versicolor, and virginica marked; one plot has sepal width and sepal length as its axes]
Visualization
• Can reveal hypotheses
Data warehouse
• A data warehouse is a repository of an
organization's electronically stored data.
• Data warehouses are designed to facilitate
reporting and analysis.
• Technology:
– relational database system
– multidimensional database system
Data warehousing
• the process of constructing and using a data
warehouse
• Data warehousing is the coordinated, periodic
copying of data from various sources, both
inside and outside the enterprise, into an
environment optimized for analytical and
informational processing.
Data warehousing includes
• business intelligence tools
• tools to extract, transform, and load data
• tools to manage and retrieve metadata
Business intelligence tools
• a type of application software designed to report,
analyze and present data (stored e.g. in a data
warehouse)
• they include
– reporting and querying software
• “Tell me what happened.”
• tools that extract, sort, summarize, and present selected data
– OLAP (On-Line Analytical Processing)
• “Tell me what happened and why.”
– data mining
• “Tell me what might happen.” (predict)
• “Tell me something interesting.” (relationships)
OLAP
• Query and report data is typically presented in
row after row of two-dimensional data.
• OLAP: “Tell me what happened and why.”
• To support this type of processing, OLAP
operates against multidimensional databases.
Example: Iris data
• We show how the attributes, petal length, petal
width, and species type can be converted to a
multidimensional array
– First, we discretized the petal width and length to
have categorical values: low, medium, and high
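– We get the following table - note the count
attribute
[Table: number of flowers for each combination of petal length, petal width, and species type]
• Slices of the multidimensional array are shown
by the following cross-tabulations
[Figure: cross-tabulations for Setosa, Versicolor, and Virginica]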
Creating a Multidimensional Array
• Two key steps in converting tabular data into a
multidimensional array.
1. identify which attributes are to be the dimensions
and which attribute is to be the target attribute
whose values appear as entries in the
multidimensional array.
• The attributes used as dimensions must have discrete
values
• The target value is typically a count or continuous value
2. find the value of each entry in the
multidimensional array by summing the values (of
the target attribute) or count of all objects that
have the attribute values corresponding to that
entry.
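The two steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the cut points for low/medium/high are assumed values, the function names are invented for this example, and the last two rows of the data are hypothetical additions to the five setosa samples from the lecture's sample table.

```python
from collections import Counter


def discretize(value, low_max, medium_max):
    """Map a continuous value to 'low', 'medium', or 'high'.

    The cut points are supplied by the caller; the ones used below are assumptions.
    """
    if value <= low_max:
        return "low"
    if value <= medium_max:
        return "medium"
    return "high"


def build_array(flowers, length_cuts, width_cuts):
    """Count flowers per (petal length, petal width, species) cell.

    Step 1: the three discrete attributes are the dimensions, the count is the target.
    Step 2: each entry is the number of objects with that combination of values.
    """
    counts = Counter()
    for petal_length, petal_width, species in flowers:
        cell = (
            discretize(petal_length, *length_cuts),
            discretize(petal_width, *width_cuts),
            species,
        )
        counts[cell] += 1
    return counts


flowers = [
    # (petal length, petal width, species): the five setosa rows from the sample table
    (1.4, 0.2, "setosa"), (1.4, 0.2, "setosa"), (1.3, 0.2, "setosa"),
    (1.5, 0.2, "setosa"), (1.4, 0.2, "setosa"),
    (4.5, 1.5, "versicolor"),  # hypothetical row for illustration
    (6.0, 2.5, "virginica"),   # hypothetical row for illustration
]
array = build_array(flowers, length_cuts=(2.5, 5.0), width_cuts=(0.75, 1.75))
for cell, count in sorted(array.items()):
    print(cell, count)
```

Cells that never occur are simply absent from the `Counter`, which returns 0 for them on lookup.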
OLAP Operations: Data Cube
• The key operation of an OLAP is the formation
of a data cube.
• A data cube is a multidimensional
representation of data, together with all
possible aggregates.
• By all possible aggregates,
we mean the aggregates
that result by selecting a
proper subset of the
dimensions and summing
over all remaining
dimensions.
• For example, if we choose
the species type dimension
of the Iris data and sum
over all other dimensions,
the result will be a one-dimensional array with
three entries, each of which
gives the number of flowers
of each type.
Data Cube Example
• Consider a data set that records the sales of
products at a number of company stores at
various dates.
• This data can be represented
as a 3 dimensional array
• There are 3 two-dimensional
aggregates (3 choose 2 ),
3 one-dimensional aggregates,
and 1 zero-dimensional
aggregate (the overall total)
• The following table shows one of the
two-dimensional aggregates, along with two of the
one-dimensional aggregates, and the overall
total
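The count of aggregates can be checked with a short sketch that sums a small fact table over every proper subset of its dimensions. The sales figures, the dimension names, and the function names are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "store", "date")


def aggregate(facts, keep):
    """Sum the sales over every dimension that is not listed in `keep`."""
    index = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(float)
    for *coords, sales in facts:
        totals[tuple(coords[i] for i in index)] += sales
    return dict(totals)


def all_aggregates(facts):
    """Aggregates for every proper subset of the dimensions (0, 1, or 2 kept)."""
    n = len(DIMENSIONS)
    return {
        keep: aggregate(facts, keep)
        for k in range(n)
        for keep in combinations(DIMENSIONS, k)
    }


# Hypothetical sales facts: (product, store, date, sales)
facts = [
    ("shirt", "north", "2024-01-05", 10.0),
    ("shirt", "south", "2024-01-05", 5.0),
    ("radio", "north", "2024-01-06", 20.0),
    ("radio", "north", "2024-02-01", 15.0),
]
cube = all_aggregates(facts)
for keep, table in cube.items():
    print(keep, table)
```

The empty tuple `()` is the zero-dimensional aggregate, i.e. the overall total. Together with the three one-dimensional and three two-dimensional aggregates, this gives seven aggregates for three dimensions.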
OLAP Operations
• Various operations are defined on the data
cube:
– Slicing/Dicing - selecting a group/subgroup of cells
from the entire multidimensional array by
specifying a specific value for one or more
dimensions.
– Roll-up and Drill-down - granularity
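Selecting cells can be sketched with a plain dictionary as the multidimensional array. This is a simplified illustration: the function name `select_cells`, the dimension names, and the data are assumptions. Fixing a single value on one dimension gives a slice; restricting several dimensions at once gives a smaller subgroup of cells (a dice).

```python
def select_cells(cube, **allowed):
    """Keep the cells whose coordinates match the allowed values.

    `cube` maps (product, store, date) tuples to a value. Each keyword names a
    dimension and gives the set of values to keep along that dimension.
    """
    dims = ("product", "store", "date")
    positions = {dims.index(name): values for name, values in allowed.items()}
    return {
        cell: value
        for cell, value in cube.items()
        if all(cell[i] in values for i, values in positions.items())
    }


# Hypothetical cube: (product, store, month) -> sales
cube = {
    ("shirt", "north", "2024-01"): 10,
    ("shirt", "south", "2024-01"): 5,
    ("radio", "north", "2024-01"): 20,
    ("radio", "north", "2024-02"): 15,
}

north_slice = select_cells(cube, store={"north"})
radio_dice = select_cells(cube, product={"radio"}, date={"2024-01", "2024-02"})
print(north_slice)
print(radio_dice)
```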
The End
OLAP Operations: Roll-up and Drill-down
• Attribute values often have a hierarchical
structure.
– Each date is associated with a year, month, and week.
– A location is associated with a continent, country, state
(province, etc.), and city.
– Products can be divided into various categories, such as
clothing, electronics, and furniture.
• Note that these categories often nest and form a
tree or lattice
– A year contains months, which contain days
– A country contains states, which contain cities
OLAP Operations: Roll-up and Drill-down
• This hierarchical structure gives rise to the roll-up and drill-down operations.
– For sales data, we can aggregate (roll up) the sales
across all the dates in a month.
– Conversely, given a view of the data where the
time dimension is broken into months, we could
split the monthly sales totals (drill down) into daily
sales totals.
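Both operations can be sketched on daily sales data. The figures and the function names are hypothetical. Note that drilling down is only possible if the finer-grained daily data is still available.

```python
from collections import defaultdict
from datetime import date


def roll_up_to_months(daily_sales):
    """Aggregate daily totals into monthly totals keyed by (year, month)."""
    monthly = defaultdict(float)
    for day, amount in daily_sales.items():
        monthly[(day.year, day.month)] += amount
    return dict(monthly)


def drill_down(daily_sales, year, month):
    """Split one month's total back into its daily totals."""
    return {
        day: amount
        for day, amount in daily_sales.items()
        if (day.year, day.month) == (year, month)
    }


# Hypothetical daily sales totals.
daily_sales = {
    date(2024, 1, 5): 10.0,
    date(2024, 1, 20): 5.0,
    date(2024, 2, 1): 15.0,
}

monthly = roll_up_to_months(daily_sales)
january = drill_down(daily_sales, 2024, 1)
print(monthly)
print(january)
```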