Data mining: some basic
ideas
Francisco Moreno
Excerpts from Fundamentals of DB
Systems, Elmasri & Navathe and other
sources
Data mining
• For many years, organizations have generated a
large amount of data in the form of files and
databases
• These data can be processed using database
technology with languages such as SQL
• SQL drawbacks: the user is assumed to know
the DB schema, and some queries can
become very complex, for example, those
aimed at discovering information…
Data mining
• Data mining refers to the discovery of
information in terms of patterns or rules
from vast amounts of data
• To be useful, data mining must be carried
out efficiently on large files and databases
• Data mining uses techniques from areas
such as machine learning, statistics,
neural networks, and genetic algorithms,
among others.
Data mining
• Machine learning, e.g., studies algorithms that can
learn from and make predictions on data.
• Such algorithms operate by building a model
from example inputs in order to make data-driven
predictions or decisions rather than
following strictly static program instructions.
• Machine learning is closely related to and often
overlaps with computational statistics.
Data mining
• In a genetic algorithm, a population of randomly
generated individuals (candidate solutions) to an
optimization problem is evolved toward better
solutions.
• The evolution is an iterative process, with the
population in each iteration called a generation.
• In each generation, the fitness of every
individual in the population is evaluated.
• The fitness is usually the value of the objective
function in the optimization problem being
solved.
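The loop described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the bit-string encoding, the fitness function (number of 1-bits), truncation selection, one-point crossover, elitism, and all parameter values are assumptions chosen for the example.

```python
import random

def fitness(bits):
    # Objective function assumed for this sketch: the number of 1-bits.
    return sum(bits)

def evolve(pop_size=20, length=16, generations=40, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    # Initial population of randomly generated individuals.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        # Evaluate the fitness of every individual in this generation.
        ranked = sorted(population, key=fitness, reverse=True)
        best_per_generation.append(fitness(ranked[0]))
        # Keep the fitter half as parents (truncation selection).
        parents = ranked[: pop_size // 2]
        # Elitism: the best individual survives unchanged.
        next_population = [ranked[0][:]]
        while len(next_population) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)  # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)
    return best, best_per_generation

best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the best individual is copied into the next generation, the best fitness recorded per generation never decreases in this sketch.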
Data mining
• We will highlight the nature of the
information that is discovered, the types of
problems faced in databases and potential
applications
• Data mining is related to a broader area
called knowledge discovery (see below)
Data mining
• Remember: the goal of a Data Warehouse (DW)
is to support decision making with data: Data
mining can be used in conjunction with a DW to
help with decision making processes
• It is possible to apply data mining to operational
databases (or files) with individual transactions
• However, to make data mining more efficient a
DW could be used, where we could take
advantage of the preprocessed and aggregated
collection of data
Data mining
• Data mining helps in extracting meaningful
patterns that cannot necessarily be found by
merely querying or processing data in the DW
• Data mining requirements should be considered
early, during the design of a DW
• Indeed, for very large databases, successful use
of data mining will depend first on the
construction of the DW
Data mining
• Data mining is a part of the knowledge
discovery process
• Knowledge discovery in databases (KDD)
typically encompasses more than data
mining
KDD
• KDD comprises six phases:
– Data cleansing
– Enrichment
– Data transformation and encoding
– Data selection
– Data mining
– Reporting and display of the discovered
information
• Data cleansing, enrichment, and data
transformation and encoding together make up
data integration
KDD
[Figure: the KDD process. Databases → data integration (data
cleansing, enrichment, data transformation, encoding) → Data
Warehouse → selection → data mining → pattern evaluation →
knowledge]
KDD: Data integration
• Data cleansing: in this stage, invalid data
can be fixed, for example, by fixing zip codes or
eliminating records with wrong phone prefixes
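A minimal Python sketch of such a cleansing step is shown below. The record layout, the set of valid phone prefixes, and the five-digit zip format are all assumptions made for illustration.

```python
import re

# Hypothetical set of valid phone prefixes.
VALID_PREFIXES = {"212", "305", "415"}

def clean(records):
    cleaned = []
    for rec in records:
        # Eliminate records whose phone prefix is not in the valid set.
        if rec["phone"][:3] not in VALID_PREFIXES:
            continue
        # Fix zip codes: keep digits only and left-pad to five characters.
        digits = re.sub(r"\D", "", rec["zip"])
        cleaned.append({**rec, "zip": digits.zfill(5)})
    return cleaned

records = [
    {"name": "Ana", "phone": "2125550101", "zip": "10001"},
    {"name": "Luis", "phone": "9995550102", "zip": "33101"},  # wrong prefix
    {"name": "Eva", "phone": "4155550103", "zip": " 2134"},   # malformed zip
]
result = clean(records)
print(result)
```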
KDD: Data integration
• Enrichment typically enhances the data
with additional information from other
sources. For example, given the customer
names and phone numbers, an
organization can get (perhaps buy) other
data such as age, income, and credit card
rating and then append them to each
customer record.
KDD: Data integration
• Data transformation and encoding may be
done to reduce the amount of data. For
example, product codes may be grouped
in terms of product categories. Zip codes
may be aggregated into geographic
regions, incomes may be divided into
ranges, and so on.
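As a rough Python illustration of this kind of encoding, the sketch below maps product codes to categories, zip codes to a coarse region, and incomes to ranges. The lookup tables, the range boundaries, and the use of the first zip digit as a region are hypothetical choices, not part of the original material.

```python
# Hypothetical lookup tables for the encoding step.
PRODUCT_CATEGORY = {"P100": "audio", "P101": "audio", "P200": "video"}
INCOME_RANGES = [
    (0, 20000, "low"),
    (20000, 60000, "medium"),
    (60000, float("inf"), "high"),
]

def income_range(income):
    # Map a numeric income to the label of the range that contains it.
    for low, high, label in INCOME_RANGES:
        if low <= income < high:
            return label
    raise ValueError(f"income out of range: {income}")

def encode(record):
    return {
        "category": PRODUCT_CATEGORY[record["product_code"]],
        "region": record["zip"][:1],  # first zip digit as a coarse region
        "income_range": income_range(record["income"]),
    }

records = [
    {"product_code": "P100", "zip": "10001", "income": 15000},
    {"product_code": "P101", "zip": "10458", "income": 18000},
    {"product_code": "P200", "zip": "94105", "income": 75000},
]
encoded = [encode(r) for r in records]
distinct = {tuple(sorted(e.items())) for e in encoded}
print(len(records), "records ->", len(distinct), "distinct encoded rows")
```

After encoding, the first two records become identical, which is how this step reduces the amount of data to be mined.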
Data mining
• Data selection: in this stage, data about
specific products or categories of
products, or from stores in a specific
region, may be selected
• After such preprocessing, data mining
techniques are used to discover rules and
patterns
Data mining
• For example, mining could discover:
– Association rules: whenever a customer buys video
equipment, he also buys another electronic gadget
– Sequential patterns: a customer who buys a camera
will usually buy photographic supplies within the
next three months and an accessory item within six
months. A customer who buys more than twice
in the lean periods* may be likely to buy at least once
during the Christmas period
* Lean periods: periods of scarcity
Data mining
– Classification trees: customers may be
classified by frequency of visits, by types of
financing used, by amount of purchase, or by
affinity for types of items; some revealing
statistics may be generated for such classes
Data mining
• This information can then be used
– to plan additional store locations based on
demographics
– to run store promotions
– to combine products in advertisements
– to plan seasonal marketing strategies
Goals of data mining and
knowledge discovery
• The goals of data mining fall into the
following classes:
– Prediction
– Identification
– Classification
– Optimization
Goals of data mining and
knowledge discovery
• Prediction: Data mining can show how
certain attributes within the data will
behave in the future: analysis of buying
transactions to predict what consumers
will buy under certain discounts, how
much sales volume a store would
generate in a given period, and whether
deleting a product line would yield more
profits
Goals of data mining and
knowledge discovery
• Identification: to identify the existence of
an item, an event, or an activity: intruders
may be identified by the programs
executed, files accessed, and CPU time
per session; a gene can be identified by
certain sequences of nucleotide symbols
in the DNA sequence.
Goals of data mining and
knowledge discovery
• Classification: Data mining can partition
the data so that different classes can be
identified based on combinations of
parameters: customers in a supermarket
can be classified into discount-seekers or
shoppers in a rush.
Goals of data mining and
knowledge discovery
• Optimization: to optimize the use of limited
resources such as time, space, money, or
materials and to maximize output variables
such as sales under a given set of
constraints. This goal strongly resembles
the objective function in the operations
research field (there is no sharp line
separating data mining from this and other
related disciplines)
Data mining
• Some types of knowledge discovered
during data mining:
– Association rules
– Sequential patterns
– Patterns within time series
– Categorization and segmentation
Data mining
• Association rules*: correlate the presence
of a set of items with a range of values for
another set of variables: when a female
retail shopper buys a handbag, she is
likely to buy shoes.
* Later, we will focus on this type of knowledge.
Data mining
• Sequential patterns: a sequence of actions or
events is sought: if a patient underwent cardiac
bypass surgery and later developed high blood
urea within a year of surgery, he is likely to
suffer from kidney failure within the next year.
• Note that detection of sequential patterns is
equivalent to detecting association among
events with certain temporal relationships
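A sequential pattern with a time window can be checked with a few lines of Python. The sketch below uses the camera and photographic supplies example; the purchase records, the 90-day window standing in for "three months", and the function name `followed_within` are all invented for illustration.

```python
from datetime import date, timedelta

# Hypothetical purchase log: (customer, purchase date, item).
purchases = [
    ("c1", date(2023, 1, 10), "camera"),
    ("c1", date(2023, 2, 20), "photo supplies"),
    ("c2", date(2023, 3, 5), "camera"),
    ("c2", date(2023, 9, 1), "photo supplies"),
    ("c3", date(2023, 4, 1), "camera"),
    ("c3", date(2023, 4, 15), "photo supplies"),
    ("c4", date(2023, 5, 2), "photo supplies"),
]

def followed_within(purchases, first, then, window):
    """Fraction of buyers of `first` who bought `then` within `window` afterwards."""
    by_customer = {}
    for customer, day, item in purchases:
        by_customer.setdefault(customer, []).append((day, item))
    buyers = hits = 0
    for events in by_customer.values():
        first_days = [d for d, i in events if i == first]
        if not first_days:
            continue  # this customer never bought the first item
        buyers += 1
        start = min(first_days)
        if any(i == then and start < d <= start + window for d, i in events):
            hits += 1
    return hits / buyers if buyers else 0.0

ratio = followed_within(purchases, "camera", "photo supplies", timedelta(days=90))
print(ratio)
```

The returned fraction plays the same role for the temporal pattern that confidence plays for an association rule.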
Data mining
• Patterns within time series: similarities can
be detected within positions of a time series:
stocks of a utility (service) company A and
a financial company B show the same
pattern during a year; two products show
the same selling price pattern in summer
but a different one in winter.
Data mining
• Categorization and segmentation: a given
population of events or items can be
partitioned into sets of “similar” elements:
– a population of treatment data may be divided
into groups based on similarity of side effects
– a population may be categorized into groups
from “most likely to buy” to “least likely to buy”
– web accesses made by users may be
analyzed in terms of keywords to reveal
clusters of users
Web usage mining
Association rules
• The database is regarded as a collection of
transactions (for example, purchases),
each involving a set of items
• A common example is that of market-basket data
• Consider the following example with four
transactions:
Association rules
Transaction_id   Items_bought
1                milk, bread, juice
2                milk, juice
3                milk, eggs
4                bread, cookies, coffee
Note: Some important information is not considered, for example,
the quantity of each item purchased in each transaction
Association rules
• Another example: a text document data
set, where each document is treated as a
set of keywords:
• Doc 1: {student, teach, school}
• Doc 2: {student, school}
• Doc 3: {teach, school, city, game}
• Doc 4: {baseball, basketball}
• Doc 5: {basketball, team, city, game}
Text mining, Web content mining
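Treating each document as a set of keywords makes co-occurrence counting straightforward. The Python sketch below counts how many documents contain each keyword pair; the threshold of two documents is an arbitrary choice for the example.

```python
from collections import Counter
from itertools import combinations

docs = {
    1: {"student", "teach", "school"},
    2: {"student", "school"},
    3: {"teach", "school", "city", "game"},
    4: {"baseball", "basketball"},
    5: {"basketball", "team", "city", "game"},
}

# Count, for every keyword pair, the number of documents containing both.
pair_counts = Counter()
for keywords in docs.values():
    for pair in combinations(sorted(keywords), 2):
        pair_counts[pair] += 1

# Keep the pairs that appear together in at least two documents.
frequent_pairs = {pair: c for pair, c in pair_counts.items() if c >= 2}
print(frequent_pairs)
```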
Association rules
• An association rule is of the form:
• LHS (left-hand side) ⇒ RHS (right-hand side)
X ⇒ Y
where X = {x1, x2, …, xn} and
Y = {y1, y2, …, ym} are sets of items,
xi and yj being distinct items for all i and j, and
X ∩ Y = ∅
• This association states that if a customer buys
X, he is also likely to buy Y.
Association rules
• Association rules should include both
support (prevalence) and confidence
(strength)
• The support for a rule LHS ⇒ RHS is the
percentage of transactions that hold all the
items in the set LHS ∪ RHS.
• If the support is low, it implies that there is
no overwhelming evidence that the items
in LHS ∪ RHS occur together.
Association rules:
Support examples
• Milk ⇒ Juice has 50% support.
• Bread ⇒ Juice has 25% support.
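These support values can be reproduced with a short Python sketch over the four example transactions. The function name `support` and the representation of transactions as sets are choices made for this illustration.

```python
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]

def support(lhs, rhs, transactions):
    # Fraction of transactions that hold every item of LHS ∪ RHS.
    items = set(lhs) | set(rhs)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

print(support({"milk"}, {"juice"}, transactions))   # 0.5
print(support({"bread"}, {"juice"}, transactions))  # 0.25
```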
Association rules
• To compute confidence, we consider all
transactions that include items in LHS.
The confidence for LHS ⇒ RHS is the
percentage of such transactions that also
include RHS.
Association rules:
Confidence examples
• Milk ⇒ Juice has 66.7% confidence.
• Bread ⇒ Juice has 50% confidence.
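The confidence values can be checked the same way. In this Python sketch the four example transactions are represented as sets; the function name `confidence` and the choice to return 0.0 when no transaction contains LHS are assumptions of the illustration.

```python
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]

def confidence(lhs, rhs, transactions):
    lhs, rhs = set(lhs), set(rhs)
    # Consider only the transactions that include all items in LHS.
    with_lhs = [t for t in transactions if lhs <= t]
    if not with_lhs:
        return 0.0
    # Fraction of those transactions that also include RHS.
    return sum(1 for t in with_lhs if rhs <= t) / len(with_lhs)

print(round(confidence({"milk"}, {"juice"}, transactions), 3))
print(confidence({"bread"}, {"juice"}, transactions))
```

Milk appears in three transactions and two of them also contain juice, which gives 2/3; bread appears in two transactions and one of them contains juice, which gives 1/2.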
Association rules
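• If n is the number of transactions and Z.count is the
number of transactions that contain every item in Z, then:

$$\text{Support} = \frac{(X \cup Y).\text{count}}{n}$$

$$\text{Confidence} = \frac{(X \cup Y).\text{count}}{X.\text{count}}$$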
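Written in terms of counts, both measures are easy to compute for every candidate rule. The Python sketch below enumerates all rules with a single item on each side by brute force and keeps those above two thresholds. The thresholds (50% support, 60% confidence) and the restriction to single-item rules are assumptions for the example; this is not an efficient mining algorithm.

```python
from itertools import permutations

transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]

def count(itemset, transactions):
    # Number of transactions that contain every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t)

def rules(transactions, min_support, min_confidence):
    n = len(transactions)
    items = sorted(set().union(*transactions))
    found = []
    for x, y in permutations(items, 2):
        both = count({x, y}, transactions)
        support = both / n
        # count({x}) >= 1 because x occurs in at least one transaction.
        confidence = both / count({x}, transactions)
        if support >= min_support and confidence >= min_confidence:
            found.append((x, y, support, confidence))
    return found

for x, y, s, c in rules(transactions, min_support=0.5, min_confidence=0.6):
    print(f"{x} => {y}: support={s:.2f}, confidence={c:.2f}")
```

With these thresholds only milk and juice qualify, in both directions: juice ⇒ milk has confidence 1.0 because every transaction with juice also contains milk.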