Chapter 7
Pages 304-309, 311,
Sections 7.3, 7.5, 7.6
DATA, TEXT, AND WEB MINING
Data mining
A process that uses statistical,
mathematical, artificial intelligence and
machine-learning techniques to extract and
identify new knowledge from large
databases
Recognizes the untapped value of data in
large databases
You may unexpectedly strike it rich in
understanding relationships among data
Example
Task: Find the best route to cover the territory
Challenge of finding
relationships in large databases
Connect equal elevation points
to make a contour map
The dark vertical line shows the best route to cross the
territory without falling off a cliff.
Once relationships are discovered, they can be used for prediction
Uses of Data Mining-1
• Classification
Identify the attribute of interest (e.g., you want to classify who
is likely to pay late)
Examine all other attribute values of the customer from the data
warehouse and locate the one that is most related to the
attribute of interest (e.g., monthly income level)
• Mining Algorithm
The most common algorithm used for classification is the
decision tree
Gini index: helps to determine where to place the split
between two classes (e.g., at what income level)
- used in developing decision trees
(see example on page 316)
Which product class is the best seller?
Conclusion: Clay products with a price below $25!
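A minimal sketch of how a Gini-based split could be chosen for the late-payment example. The income figures, labels, and function names (gini, best_split) are hypothetical illustrations and are not taken from the textbook example on page 316.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    best = (None, float("inf"))
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    # Candidate thresholds are the midpoints between neighboring distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: monthly income and whether the customer paid late.
income = [1500, 1800, 2200, 2600, 3000, 3400, 4100, 5000, 6200]
paid = ["late", "late", "late", "late", "on_time",
        "late", "on_time", "on_time", "on_time"]

threshold, score = best_split(income, paid)
print(f"Split at income <= {threshold} (weighted Gini {score:.3f})")
```

A decision-tree builder would repeat this search on each resulting group, and over every candidate attribute, until the groups are pure enough.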
Uses of Data Mining-2
• Segmentation
Partitioning a database into groups in which the
members of each group share similar characteristics
• Mining Algorithm
Clustering: The objective is to sort cases into groups so that
similarities are strong among members of the same cluster and weak
between members of different clusters
E.g., companies with over 100 employees may share more characteristics
with one another (e.g., revenue size) than with companies that have
fewer than 100 employees.
This knowledge can help with developing different policies for dealing
with different types of companies
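One way to picture clustering is a tiny one-dimensional k-means, sketched below. The slides do not name a specific clustering algorithm, so k-means is an assumption here, and the revenue figures and the function name kmeans_1d are hypothetical.

```python
def kmeans_1d(values, centers, iterations=20):
    """Tiny 1-D k-means: assign each value to its nearest center, then re-average."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments have stopped changing
        centers = new_centers
    return centers, clusters


# Hypothetical annual revenue (in $ millions) for eight companies.
revenue = [1.2, 0.8, 1.5, 2.0, 14.0, 16.5, 15.2, 18.1]

# Start from the smallest and largest values as the two initial centers.
centers, clusters = kmeans_1d(revenue, [min(revenue), max(revenue)])
print(centers)
print(clusters)
```

With these made-up numbers the companies fall into a low-revenue and a high-revenue group, which is the kind of segment that could then be given its own policy.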
Uses of Data Mining-3
• Association
A category of data mining algorithm that
establishes relationships among items that
occur together in a given record
E.g., you may discover from the data that senior students
take elective courses together in the final semester
This can be helpful for scheduling courses
People who buy a suit may also buy a dress shirt
People who buy swimwear may also buy fins, goggles,
a cap, etc.
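The suit and swimwear examples can be sketched with two common association measures, support and confidence. The slides do not define these measures, so their use here is an assumption, and the transactions below are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sales records: each set is the items bought together.
transactions = [
    {"suit", "dress shirt", "tie"},
    {"suit", "dress shirt"},
    {"swimwear", "goggles", "cap"},
    {"swimwear", "fins", "goggles"},
    {"suit", "tie"},
    {"swimwear", "goggles"},
]

# Count how often each item and each pair of items appears in a record.
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)


def support(a, b):
    """Share of all records that contain both items."""
    return pair_counts[tuple(sorted((a, b)))] / len(transactions)


def confidence(antecedent, consequent):
    """Share of records with the antecedent that also contain the consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]


print(support("suit", "dress shirt"))
print(confidence("suit", "dress shirt"))
print(confidence("swimwear", "goggles"))
```

In this toy data every swimwear purchase also includes goggles, so that rule has the highest possible confidence.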
Uses of Data Mining-4
• Sequence discovery
The identification of associations over time. Discovering
the order in which events occur.
The algorithm can examine data and predict what event
is most likely to occur next.
Widely used in studying how visitors navigate a Web site. Helps to
improve chances of making a sale.
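A rough sketch of predicting the next event from Web navigation data: count which page most often follows each page. The sessions and page names are hypothetical, and real sequence-discovery algorithms typically look at longer patterns than this one-step count.

```python
from collections import Counter, defaultdict

# Hypothetical visitor sessions: the ordered pages each visitor viewed.
sessions = [
    ["home", "catalog", "product", "cart", "checkout"],
    ["home", "catalog", "product", "catalog", "product", "cart"],
    ["home", "search", "product", "cart", "checkout"],
    ["home", "catalog", "product", "home"],
]

# transitions[page] counts the pages seen immediately after `page`.
transitions = defaultdict(Counter)
for pages in sessions:
    for current, following in zip(pages, pages[1:]):
        transitions[current][following] += 1


def predict_next(page):
    """Return the most frequently observed next page, or None if none was seen."""
    if page not in transitions:
        return None
    return transitions[page].most_common(1)[0][0]


print(predict_next("product"))
print(predict_next("home"))
```

A site could use a prediction like this to decide which link or offer to show most prominently on each page.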
Uses of Data Mining-5
• Regression is a statistical technique that is
used to map data to a predicted value
• Forecasting estimates future values based on patterns
within large sets of data
E.g., gasoline prices this month may predict next month’s sales of
SUVs
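The gasoline example can be sketched as a simple least-squares line fit. The price and sales figures are invented for illustration, and a straight line is only one possible form for the relationship.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: gasoline price in one month ($/gallon)
# and SUV sales in the following month (units).
gas_price = [2.0, 2.5, 3.0, 3.5, 4.0]
suv_sales = [510, 440, 400, 360, 290]

slope, intercept = fit_line(gas_price, suv_sales)

# Forecast next month's SUV sales from this month's price of $3.20.
forecast = slope * 3.2 + intercept
print(f"slope={slope:.1f}, intercept={intercept:.1f}, forecast={forecast:.1f}")
```

The negative slope in this toy data says that higher gasoline prices go with lower SUV sales in the following month.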
Data Mining Concepts and Applications
Data mining applications
– Marketing
– Banking
– Retailing and sales
– Manufacturing and production
– Brokerage and securities trading
– Insurance
– Computer hardware and software
– Government and defense
– Airlines
– Health care
– Broadcasting
– Police
– Homeland security
Text Mining
Application of data mining to text files, typically free-form
text material
Discovers new knowledge that is not obvious
Examples:
Examine all news services, cluster similar topics, create
a new summary for each topic
Find the “hidden” content of documents, including
additional useful relationships, e.g., lies, deceptions,
scams
Not the same as a search engine on the Web.
Text Mining – how is it done?
It entails the generation of meaningful numerical indices/factors
from the unstructured text and then processing these indices
using various data mining algorithms
Example:
Extract each word from the document being text mined
Eliminate commonly used words (the, and, other, etc.)
Combine synonyms and phrases
Calculate weights for each term:
tf factor (term frequency) – actual number of times a word
appears in a document
idf factor (inverse document frequency) – reflects how rare the term is
across multiple documents
A high tf value for a given term indicates that the document
topic is probably related to the meaning of that term!
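The steps above can be sketched in a few lines. The documents and the stop-word list are hypothetical, synonym combining is skipped, and the idf formula log(N / documents containing the term) is one common variant that the slides do not specify.

```python
import math
import re
from collections import Counter

# A very small, hypothetical stop-word list.
STOP_WORDS = {"the", "and", "a", "of", "to", "is", "in", "for", "on"}

# Hypothetical warranty-claim texts.
documents = {
    "claim1": "The battery of the laptop is dead and the battery charger is hot",
    "claim2": "The screen of the laptop is cracked",
    "claim3": "Refund for the cracked screen",
}


def tokenize(text):
    """Extract lowercase words and drop the commonly used ones."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


term_counts = {name: Counter(tokenize(text)) for name, text in documents.items()}


def tf(term, doc):
    """Term frequency: number of times the term appears in the document."""
    return term_counts[doc][term]


def idf(term):
    """Inverse document frequency: higher when fewer documents contain the term."""
    containing = sum(1 for counts in term_counts.values() if term in counts)
    return math.log(len(term_counts) / containing) if containing else 0.0


def tf_idf(term, doc):
    return tf(term, doc) * idf(term)


top_term = max(term_counts["claim1"], key=lambda t: tf_idf(t, "claim1"))
print(top_term, round(tf_idf(top_term, "claim1"), 3))
```

The resulting weights are the numerical indices that a clustering or classification algorithm can then process.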
Text Mining - applications
– Automatic detection of e-mail spam or phishing through analysis
of the document content
– Automatic processing of messages or e-mails to route a
message to the most appropriate party to process that message
– Analysis of warranty claims, help desk calls/reports, and so on to
identify the most common problems and relevant responses
Web Mining
The discovery and analysis of interesting and useful
information from the Web
Web content mining
The extraction of useful information from Web pages
E.g., search with the help of keywords in the meta tags of
the web page
You can analyze the document content of the first 10
links of Google in a search response
You can generate a summary of the contents
automatically in a new document!
Web structure mining
The development of useful information from the links
included in the Web documents
If a web site’s pages predominantly link to each other, you
may consider the site to be ‘independent’
If a collection of web sites link to each other heavily,
this points to a web community or clan that shares common
interests
Example application: Web structure mining can lead to
better understanding of extremist groups
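A small sketch of one structure-mining measurement: the share of a page's links that stay on the same site. The URLs are placeholders, the function name internal_link_ratio is hypothetical, and comparing host names is a simplifying assumption about what counts as the same site.

```python
from urllib.parse import urlparse


def internal_link_ratio(page_url, links):
    """Fraction of a page's links whose host matches the page's own host."""
    if not links:
        return 0.0
    host = urlparse(page_url).netloc
    internal = sum(1 for link in links if urlparse(link).netloc == host)
    return internal / len(links)


# Hypothetical page and the links found on it.
page = "https://shop.example.com/index.html"
links = [
    "https://shop.example.com/products",
    "https://shop.example.com/cart",
    "https://shop.example.com/help",
    "https://partner.example.org/offers",
]

ratio = internal_link_ratio(page, links)
print(ratio)
```

A ratio close to 1 across a site's pages suggests an ‘independent’ site, while many links shared among a group of sites hints at a community.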
Web usage mining
The extraction of useful information from the data
generated through webpage visits, transactions, etc.
Clickstream analysis
Uses cookies, number of logs, time of log, etc.
Can help profile users
Uses for Web mining
– Determine the lifetime value of clients
– Design cross-marketing strategies across
products
– Evaluate promotional campaigns
– Target electronic ads and coupons at user
groups
– Predict user behavior
– Present dynamic information to users
Data Mining Project Processes
Steps for Data Mining
• Problem definition: Decide the measure to study and
the suitable mining algorithm (see Exercise 11)
• Data preparation: Design the cube and populate it with
relevant data from the data warehouse
• Training: Run the mining algorithm on a subset of the
data warehouse data so that the system learns to find
segments, associations, etc. among the data
• Validation: Run the ‘learnt’ model from the previous step on
the remaining subset of data and try to ‘predict’. Since
you have historical data, you can verify whether the ‘learnt’
model is any good.
• Deploy: Implement the model to predict in a real environment where
you do not know the actual results.
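The training, validation, and deployment steps can be sketched with a deliberately simple model: a single income cutoff for predicting late payment. The records, the 8/4 split, and the function names are hypothetical, and a real project would use a proper mining algorithm in place of train_threshold.

```python
def train_threshold(records):
    """'Learn' the income cutoff that classifies the most training records correctly."""
    best_cutoff, best_correct = None, -1
    for cutoff, _ in records:
        correct = sum((income <= cutoff) == late for income, late in records)
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff


def accuracy(cutoff, records):
    """Share of records where 'income <= cutoff' matches the known outcome."""
    return sum((income <= cutoff) == late for income, late in records) / len(records)


# Hypothetical historical data: (monthly income, paid late?)
history = [
    (1500, True), (5200, False), (2100, True), (4300, False),
    (2600, True), (3900, False), (1800, True), (6100, False),
    (2500, False), (2400, True), (4700, False), (2000, True),
]

# Training: learn from a subset of the historical data.
train, holdout = history[:8], history[8:]
cutoff = train_threshold(train)

# Validation: predict the remaining records, whose outcomes are already known.
holdout_accuracy = accuracy(cutoff, holdout)
print(cutoff, holdout_accuracy)

# Deploy: predict for a new customer whose outcome is not yet known.
will_pay_late = 3100 <= cutoff
```

The model fits the training subset perfectly but misses one holdout record, which is the kind of gap the validation step is meant to reveal before deployment.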