Data, Text and Web Analytics Overview

Data Mining and Big Data
Ahmed K. Ezzat
Data, Text, and Web Mining for BI
Outline

Data Mining Overview

Text Mining Overview

Web Mining Overview
Data Mining Overview
Learning Objectives
– Define data mining and list its objectives and benefits
– Understand different purposes and applications of data mining
– Understand different methods of data mining, especially clustering and decision tree models
– Build expertise in use of some data mining software
– Learn the process of data mining projects
– Understand data mining pitfalls and myths
– Define text mining and its objectives and benefits
– Appreciate use of text mining in business applications
– Define Web mining and its objectives and benefits
Data Mining Concepts and Applications

Six factors behind the sudden rise in popularity of data mining:
1. General recognition of the untapped value in large databases;
2. Consolidation of database records tending toward a single customer view;
3. Consolidation of databases, including the concept of an information warehouse;
4. Reduction in the cost of data storage and processing, providing for the ability to collect and accumulate data;
5. Intense competition for a customer’s attention in an increasingly saturated marketplace; and
6. The movement toward the de-massification of business practices
Data Mining Concepts and Applications

Data mining (DM)
A process that uses statistical, mathematical, artificial
intelligence and machine-learning techniques to extract
and identify useful information and subsequent
knowledge from large databases
Data Mining Concepts and Applications

Major characteristics and objectives of data mining:
– Data are often buried deep within very large databases, which sometimes contain data from several years; sometimes the data are cleansed and consolidated in a data warehouse
– The data mining environment is usually a client/server architecture or a Web-based architecture
– Sophisticated new tools help to remove the information buried in corporate files or archival public records; finding it involves massaging and synchronizing the data to get the right results
– The miner is often an end user, empowered by data drills and other power query tools to ask ad hoc questions and obtain answers quickly, with little or no programming skill
– Striking it rich often involves finding an unexpected result and requires end users to think creatively
– Data mining tools are readily combined with spreadsheets and other software development tools; the mined data can be analyzed and processed quickly and easily
– Parallel processing is sometimes used because of the large amounts of data and massive search efforts
Data Mining Concepts and Applications

How data mining works
– Data mining tools find patterns in data and may even infer rules from them
– Three methods are used to identify patterns in data:
  1. Simple models
  2. Intermediate models
  3. Complex models
Data Mining Concepts and Applications

Classification
Supervised induction used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior
Common tools used for classification are:
– Neural networks
– Decision trees
– If-then-else rules
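As a rough illustration of supervised induction producing if-then-else rules, the sketch below uses the OneR idea: pick the single attribute whose "value → majority class" rules make the fewest errors on the historical data. The slides do not name OneR; it is swapped in here only because it is small. The records, attribute names, and function names are all hypothetical.

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """Learn if-then rules on the single attribute with the fewest training errors."""
    best = None
    for attr in rows[0]:
        # Count class labels for each value of this attribute.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[attr]][label] += 1
        # One rule per value: predict the majority class seen with that value.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rules[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    _, attr, rules = best
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    return attr, rules, default

def predict(model, row):
    attr, rules, default = model
    return rules.get(row[attr], default)

# Hypothetical historical records: did the customer respond to an offer?
history = [
    {"income": "high", "owns_home": "yes"},
    {"income": "high", "owns_home": "no"},
    {"income": "low", "owns_home": "yes"},
    {"income": "low", "owns_home": "no"},
]
responded = ["yes", "yes", "no", "no"]

model = one_r(history, responded)
print(model[0], model[1])
print(predict(model, {"income": "high", "owns_home": "no"}))
```

The learned model reads as "if income is high then yes, else no", which is the kind of rule set the slide lists as a classification tool.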
Data Mining Concepts and Applications

Clustering
Partitioning a database into segments in which the
members of a segment share similar qualities

Association
A category of data mining algorithm that establishes
relationships about items that occur together in a given
record
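A minimal sketch of how association between items in the same record is commonly quantified, using the standard support and confidence measures. The slides do not name these measures, so treating them as the intended ones is an assumption; the basket data and function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of both item sets divided by support of the antecedent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market baskets, one set of items per record.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

Here bread and milk occur together in half of the baskets, and two of the three baskets containing bread also contain milk.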
Data Mining Concepts and Applications

Sequence discovery
The identification of associations over time

Visualization can be used in conjunction with
data mining to gain a clearer understanding of
many underlying relationships
Data Mining Concepts and Applications

Regression is a well-known statistical technique
that is used to map data to a prediction value

Forecasting estimates future values based on
patterns within large sets of data
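To make "map data to a prediction value" concrete, here is a small sketch of simple linear regression by ordinary least squares, used to estimate a value outside the observed range. The choice of a one-variable linear model is an assumption for illustration; the data and names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend vs. units sold.
spend = [1.0, 2.0, 3.0, 4.0]
units = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(spend, units)
forecast = slope * 5.0 + intercept  # predicted units at a spend of 5.0
print(slope, intercept, forecast)
```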
Data Mining Concepts and Applications

Hypothesis-driven data mining
Begins with a proposition by the user, who then seeks to
validate the truthfulness of the proposition

Discovery-driven data mining
Finds patterns, associations, and relationships among the
data in order to uncover facts that were previously
unknown or not even contemplated by an organization
Data Mining Concepts and Applications
Data mining applications
– Marketing
– Banking
– Retailing and sales
– Manufacturing and production
– Brokerage and securities trading
– Insurance
– Computer hardware and software
– Government and defense
– Airlines
– Health care
– Broadcasting
– Police
– Homeland security
Data Mining Concepts and Applications

Data mining tools and techniques can be classified based on the structure of the data and the algorithms used:
– Statistical methods
– Decision trees: defined as a root followed by internal nodes. Each node (including root) is labeled with a question, and arcs associated with each node cover all possible responses
– Case-based reasoning
– Neural computing
– Intelligent agents
– Genetic algorithms
– Other tools
  – Rule induction
  – Data visualization
Data Mining Concepts and Applications

A general algorithm for building a decision tree:
1. Create a root node and select a splitting attribute.
2. Add a branch to the root node for each split candidate value and label
3. Take the following iterative steps:
   a. Classify data by applying the split value.
   b. If a stopping point is reached, then create leaf node and label it. Otherwise, build another subtree
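The steps above can be sketched as a short recursive function. This is a simplified illustration, not a production algorithm: it assumes categorical attributes, and it selects the splitting attribute by simply taking the next one in the list, where a real algorithm would rank candidates with a measure such as the Gini index or entropy. All names and data are hypothetical.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes):
    """Recursively build a tree: leaves are class labels, internal nodes are dicts."""
    # Stopping point: the node is pure or no attributes remain, so create a labeled leaf.
    if len(set(labels)) == 1 or not attributes:
        return majority(labels)
    # Select a splitting attribute (placeholder assumption: take the first one).
    attr = attributes[0]
    node = {"attribute": attr, "branches": {}, "default": majority(labels)}
    # Add one branch per split value, classify the data by that value, and recurse.
    for value in sorted(set(row[attr] for row in rows)):
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        node["branches"][value] = build_tree(sub_rows, sub_labels, attributes[1:])
    return node

def classify(tree, row):
    # Follow branches until a leaf (a plain label) is reached.
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attribute"]], tree["default"])
    return tree

rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "stay"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))
```

The "sunny" branch becomes a leaf immediately because all its cases share one class, while the "rainy" branch needs a further split on the second attribute.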
Data Mining Concepts and Applications

Gini index
Used in economics to measure the diversity of the
population. The same concept can be used to determine
the ‘purity’ of a specific class as a result of a decision to
branch along a particular attribute/variable
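In decision-tree practice the purity measure is usually computed as the Gini impurity, 1 minus the sum of squared class proportions. The sketch below assumes that form (the slides do not give a formula) and also shows the size-weighted impurity of the groups produced by a candidate split. Function names and data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k ** 2). Zero means every label is the same class."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_of_split(groups):
    """Size-weighted Gini impurity of the groups produced by branching on an attribute."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)

mixed = ["yes", "yes", "no", "no"]
print(gini(mixed))                                    # evenly mixed two-class node
print(gini_of_split([["yes", "yes"], ["no", "no"]]))  # a split into two pure groups
```

A split is preferred when its weighted impurity is lower than that of the parent node.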
Data Mining Techniques and Tools

The ID3 algorithm decision tree approach

Entropy
Measures the extent of uncertainty or randomness in a data set. If all the data in a subset belong to just one class, there is no uncertainty or randomness in that subset, so its entropy is zero
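A short sketch of the entropy calculation, together with the information gain that ID3 uses to compare candidate splits. Entropy here is the Shannon entropy in bits, the sum over classes of p * log2(1/p). The function names and data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of the class distribution in `labels`."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

parent = ["yes", "yes", "no", "no"]
print(entropy(["yes", "yes", "yes"]))  # a one-class subset has zero entropy
print(entropy(parent))                 # an even two-class mix has one bit
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))
```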
Data Mining Techniques and Tools

Cluster analysis for data mining

Cluster analysis is an exploratory data analysis tool
for solving classification problems

The object is to sort cases into groups so that the
degree of association is strong between members of
the same cluster and weak between members of
different clusters
Data Mining Techniques and Tools

Cluster analysis results may be used to:
– Help identify a classification scheme
– Suggest statistical models to describe populations
– Indicate rules for assigning new cases to classes for identification, targeting, and diagnostic purposes
– Provide measures of definition, size, and change in what were previously broad concepts
– Find typical cases to represent classes
Data Mining Techniques and Tools

Cluster analysis methods
– Statistical methods
– Optimal methods
– Neural networks
– Fuzzy logic
– Genetic algorithms
Each of these methods generally works with one of two general method classes:
– Divisive
– Agglomerative
Data Mining Techniques and Tools

Hierarchical clustering method and example
1. Decide which data to record from the items
2. Calculate the distances between all initial clusters. Store the results in a distance matrix
3. Search through the distance matrix and find the two most similar clusters
4. Fuse those two clusters together to produce a cluster that has at least two items
5. Calculate the distances between this new cluster and all the other clusters
6. Repeat steps 3 to 5 until you have reached the prespecified maximum number of clusters
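The procedure above can be sketched as a small agglomerative loop. Two simplifications are assumed: the distance between clusters is the Euclidean distance between their centroids (the slides do not fix a linkage rule), and distances are recomputed on each pass instead of being kept in a stored matrix. The points and function names are hypothetical.

```python
import math

def centroid(cluster):
    """Mean point of a cluster of equal-length coordinate tuples."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def agglomerate(points, max_clusters):
    # Every item starts as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > max_clusters:
        # Find the two most similar clusters (smallest centroid distance).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Fuse them; distances to the new cluster are recomputed on the next pass.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(1.0, 1.0), (1.5, 1.0), (8.0, 8.0), (8.5, 9.0), (1.0, 0.5)]
result = agglomerate(points, max_clusters=2)
print(result)
```

With these points the three items near the origin end up in one cluster and the two items near (8, 8) in the other.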
Data Mining Techniques and Tools

Classes of data mining tools and techniques as they relate to information and business intelligence (BI) technologies
– Mathematical and statistical analysis packages
– Personalization tools for Web-based marketing
– Analytics built into marketing platforms
– Advanced CRM tools
– Analytics added to other vertical industry-specific platforms
– Analytics added to database tools (e.g., OLAP)
– Standalone data mining tools
Data Mining Project Processes

Knowledge discovery in databases (KDD)
A comprehensive process of using data mining
methods to find useful information and patterns in
data
Data Mining Project Processes

KDD process
1. Selection
2. Preprocessing
3. Transformation
4. Data mining
5. Interpretation/evaluation
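One way to picture the five KDD stages is as a chain of small functions, each feeding the next. The sketch below is only a toy: the record fields, the threshold, and the use of a mean as the "mining" step are all hypothetical stand-ins for real selection criteria and algorithms.

```python
def select(records, region):
    """Selection: keep only the records relevant to the question."""
    return [r for r in records if r["region"] == region]

def preprocess(records):
    """Preprocessing: drop records with missing values."""
    return [r for r in records if r["amount"] is not None]

def transform(records):
    """Transformation: reduce each record to the numeric feature to be mined."""
    return [float(r["amount"]) for r in records]

def mine(values):
    """Data mining: a summary statistic standing in for a real algorithm."""
    return {"count": len(values), "mean": sum(values) / len(values)}

def interpret(pattern, threshold):
    """Interpretation/evaluation: turn the pattern into a finding."""
    return "above threshold" if pattern["mean"] > threshold else "at or below threshold"

# Hypothetical raw purchase records.
raw = [
    {"region": "west", "amount": 120.0},
    {"region": "east", "amount": 300.0},
    {"region": "west", "amount": None},
    {"region": "west", "amount": 80.0},
]

pattern = mine(transform(preprocess(select(raw, "west"))))
print(pattern, interpret(pattern, 90.0))
```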
Text Mining Overview

Classification of data mining vs. text mining

                   Finding Patterns       Finding Nuggets: Novel   Finding Nuggets: Non-Novel
Non-textual Data   Standard Data Mining   Analytics                Database Queries
Textual Data       CPL & NLP              Real Text Mining         Information Retrieval
The Intelligence Pyramid
Text Mining Overview

Text mining helps organizations:
– Find the “hidden” content of documents, including additional useful relationships
– Relate documents across previously unnoticed divisions
– Group documents by common themes
Text Mining Overview

Text mining process
1. Text Preprocessing (NLP): text cleanup, tokenization, part-of-speech tagging, word sense disambiguation, semantic structure
2. Text Transformation: text representation, document feature selection
3. Attribute Selection: reduce dimensionality, remove irrelevant attributes
4. Pattern Discovery: classic data mining techniques
5. Interpretation/Evaluation: terminate or iterate
Text Mining Overview

How to mine text
1. Eliminate commonly used words (stop-words)
2. Replace words with their stems or roots (stemming algorithms)
3. Consider synonyms and phrases
4. Calculate the weights of the remaining terms
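A minimal sketch of steps 1, 2, and 4 above. Several things are assumed for illustration: the stop-word list is a tiny sample, the stemmer is a crude suffix stripper standing in for a real stemming algorithm such as Porter's, and the term weights use TF-IDF, which the slides do not name. Step 3 (synonyms and phrases) is not covered. All names and documents are hypothetical.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}  # tiny sample list

def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def terms(text):
    """Tokenize, drop stop-words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def tf_idf(documents):
    """Weight each term by its frequency in the document times log(N / document frequency)."""
    doc_terms = [Counter(terms(d)) for d in documents]
    n = len(documents)
    df = Counter(term for counts in doc_terms for term in counts)
    return [
        {term: (count / sum(counts.values())) * math.log(n / df[term])
         for term, count in counts.items()}
        for counts in doc_terms
    ]

docs = [
    "The miner is mining data",
    "Data warehouses store the data",
    "Mining text and mining the Web",
]
weights = tf_idf(docs)
print(terms(docs[0]))
print(weights[0])
```

Terms that appear in fewer documents receive higher weights, so in the first document "miner" outweighs "data", which also occurs in the second document.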
Text Mining Overview

Text mining
– Is a special case of data mining that aims at extracting useful information from non-structured or semi-structured textual data
– Involves structuring the input text and detecting patterns and trends through data mining techniques such as statistical pattern learning
– This is typically done by using a CPL/NLP engine to read free text, identify terms, classify the terms (noun, verb, adjective), disambiguate them, perform fact extraction into an appropriate structured representation, simplify attributes, apply data mining techniques to discover patterns and trends, and evaluate the results for possible further iteration or refinement
Text Mining Overview

Text mining
– Text mining typically entails generating meaningful indices from the unstructured text and then applying various data mining algorithms that leverage these indices
– Taxonomy: even after filtering, the number of terms is still very large, and knowledge discovery typically relies on some hierarchy of terms to form results at a higher level. A term taxonomy enables the production of general association rules that capture relationships between groups of terms rather than individual terms
– Ontology: helps us reason over the generated facts by associating terms with ontology concepts, e.g., OWL with semantic web triples. In this case we would be treating the triples as a knowledge management system
Text Mining Overview

Applications of text mining
– Automatic detection of e-mail spam or phishing through analysis of the document content
– Automatic processing of messages or e-mails to route a message to the most appropriate party to process that message
– Analysis of warranty claims, help desk calls/reports, and so on to identify the most common problems and relevant responses
– Analysis of related scientific publications in journals to create an automated summary view of a particular discipline
– Creation of a “relationship view” of a document collection
– Qualitative analysis of documents to detect deception
Web Mining Overview

Web mining
The discovery and analysis of interesting and useful
information from the Web, about the Web, and usually
through Web-based tools
Web Mining Overview

Web content mining
The extraction of useful information from Web pages

Web structure mining
The development of useful information from the links
included in the Web documents

Web usage mining
The extraction of useful information from the data generated through webpage visits, transactions, etc.
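As a small illustration of Web structure mining, the sketch below extracts the links from a handful of pages with the standard-library HTML parser and counts how many other pages link to each one. Using in-link counts as the "useful information" is an assumption for illustration; the page contents and function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

def inlink_counts(pages):
    """Count, for each target URL, how many distinct other pages link to it."""
    counts = Counter()
    for url, html in pages.items():
        for target in set(extract_links(html)):
            if target != url:  # ignore self-links
                counts[target] += 1
    return counts

# Hypothetical three-page site.
pages = {
    "/home": '<a href="/products">Products</a> <a href="/about">About</a>',
    "/products": '<a href="/home">Home</a> <a href="/about">About</a>',
    "/about": '<a href="/home">Home</a>',
}
counts = inlink_counts(pages)
print(counts)
```

The same pattern applies to Web usage mining if the input is a server log instead of page markup: parse each entry, then aggregate by page or visitor.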
Web Mining Overview

Uses for Web mining:
– Determine the lifetime value of clients
– Design cross-marketing strategies across products
– Evaluate promotional campaigns
– Target electronic ads and coupons at user groups
– Predict user behavior
– Present dynamic information to users