COMP3740 CR32:
Technologies for Knowledge
Management
Introduction to Knowledge Discovery
By Eric Atwell, School of Computing,
University of Leeds
(including re-use of teaching resources from other sources,
esp. Knowledge Management by Stuart Roberts,
School of Computing, University of Leeds)
What has Machine Learning got to do
with Computing / Information Systems?
“Most international organizations produce more
information in a week than many people could
read in a lifetime”
Adriaans and Zantinge
Objectives of knowledge discovery or
data mining
• Data mining is about discovering patterns in data.
• For this we need:
– KD/DM techniques, algorithms, tools, eg BootCat,
WEKA
– A methodological framework to guide us, in collecting
data and applying the best algorithms: CRISP-DM
Data Mining, Knowledge Discovery,
Text Mining
• Data Mining was originally about “learning” patterns
from DataBases, data structured as Records, Fields
• Knowledge Discovery is an “exotic term” for DM???
• Increasingly, data is unstructured text (WWW), so
• Text Mining is a new subfield of DM, focussing on
Knowledge Discovery from unstructured text data
define: data mining
• Data mining, also known as knowledge-discovery in
databases (KDD), is the practice of automatically
searching large stores of data for patterns. To do this,
data mining uses computational techniques from
artificial intelligence, statistics and pattern
recognition.
en.wikipedia.org/wiki/Data_mining
define: text mining
• Text mining, also known as intelligent text analysis,
text data mining or knowledge-discovery in text
(KDT), refers generally to the process of extracting
interesting and non-trivial information and
knowledge from unstructured text. Text mining is a
young interdisciplinary field which draws on
information retrieval, data mining, machine learning,
statistics and computational linguistics. ...
en.wikipedia.org/wiki/Text_mining
define: knowledge discovery
• Knowledge discovery is the process of finding
novel, interesting, and useful patterns in data. Data
mining is a subset of knowledge discovery. It lets
the data suggest new hypotheses to test.
www.purpleinsight.com/downloads/docs/visualizer_
tutorial/glossary/go01.html
• Data mining, also known as knowledge-discovery in
databases (KDD), is the practice of automatically
searching large stores of data for patterns. To do
this, data mining uses computational techniques
from AI, statistics and pattern recognition.
en.wikipedia.org/wiki/Knowledge_discovery
Data Mining: Overview
[Diagram: Concepts; Instances or examples; Attributes; Data Mining; Concept Descriptions]
Each instance is an example of the concept to be learned or
described. The instance may be described by the values of
its attributes.
Instances
• Input to a data mining algorithm is in the form of
a set of examples, or instances.
• Each instance is represented as a set of features or
attributes.
• Usually in DB Data-Mining this set takes the form
of a flat file; each instance is a record in the file,
each attribute is a field in the record.
• In text-mining, an instance may be a word/term in its
context (surrounding words/document)
• The concepts to be learned are formed from
patterns discovered within the set of instances.
Concepts
The types of concepts we try to ‘learn’ include:
• Key “indicators” – features or terms specific to our domain
• Clusters or ‘Natural’ partitions;
– Eg we might cluster customers according to their shopping habits.
– Eg is this web-page British or American English?
• Rules for classifying examples into pre-defined classes.
– Eg “Mature students studying information systems with high grade
for General Studies A level are likely to get a 1st class degree”
• General Associations
– Eg “People who buy nappies are in general likely also to buy beer”
More concepts
The types of concepts we try to ‘learn’ include:
• Unexpected (suspicious?) associations or coincidences
– Eg known suspects A, B, C all phoned D last week
• Numerical prediction
– Eg look for rules to predict what salary a graduate will get,
given A level results, age, gender, programme of study and
degree result – this may give us an equation:
Salary = a*A-level + b*Age + c*Gender + d*Prog + e*Degree
(but are Gender, Programme really numbers???)
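The question about Gender and Programme can be made concrete with a small sketch. One common way to feed nominal attributes into a numeric equation is to replace each one with 0/1 indicator values ("one-hot" encoding). The record, the category lists, the use of A-level points as a number and the ordinal ranking of degree classes below are all invented assumptions for illustration, not part of the original slides.

```python
# A minimal sketch, assuming nominal attributes are turned into 0/1 indicators.
# All names and values here are hypothetical illustrations.

GENDERS = ["F", "M"]          # assumed category list
PROGS = ["CS", "IS"]          # assumed programme codes
DEGREE_RANK = {"3rd": 0, "2:2": 1, "2:1": 2, "1st": 3}  # assumed ordinal scale


def one_hot(value, categories):
    """Return one 0/1 indicator per category; exactly one is 1."""
    return [1 if value == c else 0 for c in categories]


def encode(record):
    """Turn a mixed numeric/nominal record into a purely numeric vector."""
    return (
        [record["a_level"], record["age"]]      # already numeric
        + one_hot(record["gender"], GENDERS)    # nominal -> indicators
        + one_hot(record["prog"], PROGS)        # nominal -> indicators
        + [DEGREE_RANK[record["degree"]]]       # ordinal -> rank
    )


# Hypothetical graduate record
graduate = {"a_level": 300, "age": 21, "gender": "F", "prog": "IS", "degree": "2:1"}
vector = encode(graduate)
```

With this encoding, a single coefficient c*Gender becomes one coefficient per gender indicator, so the equation no longer pretends that a nominal value has a magnitude.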
DB Example: weather to play?
/usr/local/weka-3-4-13/data/weather.nominal.arff
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
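As a rough illustration of how such a flat file becomes instances and attributes, the sketch below reads the nominal weather data and derives a very simple one-attribute rule (predict the majority class for each value of one attribute). The parser handles only the simple ARFF subset shown above, and the function names are hypothetical; it is not how WEKA itself is implemented.

```python
from collections import Counter, defaultdict

ARFF = """@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""


def parse_arff(text):
    """Parse a simple ARFF listing into (attribute names, list of instances)."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            attributes.append(line.split()[1])  # second token is the name
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            # each instance is a record; each attribute is a field
            rows.append(dict(zip(attributes, line.split(","))))
    return attributes, rows


def one_attribute_rule(rows, attribute, target):
    """For each value of `attribute`, predict the most frequent `target` class."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[target]] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    errors = sum(1 for row in rows if rule[row[attribute]] != row[target])
    return rule, errors


attributes, rows = parse_arff(ARFF)
rule, errors = one_attribute_rule(rows, "outlook", "play")
```

On the 14 instances above, the rule for outlook is sunny → no, overcast → yes, rainy → yes, and it misclassifies 4 of the 14 instances.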
/usr/local/weka-3-4-13/data/weather.arff
@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
Text mining example: Which English
dominates the WWW, UK or US?
• “First catch your rabbit” (Mrs Beeton’s cookbook):
Other tools are possible, but WWW-BootCat was easier
to use …
• First: sign up for Domain, SketchEngine account,
Google key; download seeds-en from
http://corpus.leeds.ac.uk/internet.html
• (see cw1 specifications and lecture notes …)
Example 2: Data Mining for an ontology
• Ontology: the “concepts” in a discipline, and meaning-relationships between these concepts (01.ppt)
• “concepts” roughly equates to “terminology” – specialist
words and phrases in a discipline
• WordNet is freely-available for general English
• What about other languages? – EuroWordnet, BalkaNet, …
(but not ALL languages!)
• What about specific domains? Domain-specific
ONTOLOGIES have to be devised (by experts)
• What about my own specific domain/language?
• Automatic extraction of key words / concepts from example
documents (machine learning / knowledge discovery)
Automatic terminology extraction
• Terminology extraction = thesaurus construction
• Based on documents (either the retrieved set or the whole
collection) as Corpus – training text set
• Define a ‘measure’ of how close one index term is to
another – in meaning-space, ?or literal distance?
• For each term, form a ‘neighbourhood’ comprising the
nearest ‘n’ terms
• Treat these neighbourhoods like ‘related’ thesaurus
classes
• Terms with similar neighbourhoods are treated as
synonyms.
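The neighbourhood idea can be sketched in a few lines, assuming some closeness measure has already been computed. The similarity scores below are invented for illustration, and comparing neighbourhoods by Jaccard overlap is one possible choice of "similar", not something the slides prescribe.

```python
# A minimal sketch: nearest-n neighbourhoods and neighbourhood overlap.
# The scores are hypothetical; only non-zero similarities are listed.
SIM = {
    "car":        {"automobile": 0.5, "road": 0.9, "driver": 0.8},
    "automobile": {"car": 0.5, "road": 0.8, "driver": 0.9},
    "road":       {"car": 0.9, "automobile": 0.8, "driver": 0.4},
    "driver":     {"car": 0.8, "automobile": 0.9, "road": 0.4},
    "banana":     {"fruit": 0.9},
    "fruit":      {"banana": 0.9},
}


def neighbourhood(term, similarity, n):
    """Return the set of (at most) n terms closest to `term`."""
    scores = similarity.get(term, {})
    # highest score first; ties broken alphabetically for determinism
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:n])


def overlap(a, b):
    """Jaccard overlap between two neighbourhoods (assumed similarity test)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_neighbourhood_pairs(similarity, n, threshold):
    """List term pairs whose neighbourhoods overlap at least `threshold`."""
    terms = sorted(similarity)
    hoods = {t: neighbourhood(t, similarity, n) for t in terms}
    return [
        (a, b)
        for i, a in enumerate(terms)
        for b in terms[i + 1:]
        if overlap(hoods[a], hoods[b]) >= threshold
    ]


pairs = similar_neighbourhood_pairs(SIM, n=2, threshold=1.0)
```

On this toy data the pairs found are (automobile, car) and (driver, road). The second pair is related but not synonymous, which suggests why the output of such a method usually needs a human check.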
Finding “coordinate terms”
One attempt to define how close a term is to another:
• If two terms are both used to index the same document
many times in the collection, then they are deemed to be
close.
• From document-term matrix, compute term-correlation
matrix
• The term correlation matrix can be normalised so that
terms that index a lot of documents don’t have an unfair
chance – reduce weight of common words
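A small sketch of this computation follows. It assumes binary indexing (a term either indexes a document or not) and uses one common normalisation, c_ab / (c_aa + c_bb - c_ab); both are assumptions, since the slides do not fix a formula. The document collection is invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical collection: document id -> set of index terms
DOCS = {
    "d1": {"data", "mining", "pattern"},
    "d2": {"data", "mining", "text"},
    "d3": {"data", "text", "corpus"},
    "d4": {"data", "pattern"},
}


def cooccurrence(docs):
    """Count, for each term pair, the documents indexed by both terms.

    The diagonal entry (t, t) is the number of documents indexed by t.
    """
    counts = Counter()
    for terms in docs.values():
        for t in terms:
            counts[(t, t)] += 1
        for a, b in combinations(sorted(terms), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1   # keep the matrix symmetric
    return counts


def normalised(counts, a, b):
    """Down-weight pairs involving terms that index many documents."""
    c_ab = counts[(a, b)]
    denom = counts[(a, a)] + counts[(b, b)] - c_ab
    return c_ab / denom if denom else 0.0


counts = cooccurrence(DOCS)
common = normalised(counts, "data", "corpus")   # "data" indexes every document
rare = normalised(counts, "text", "corpus")     # "text" indexes only two
```

Here "data" and "text" each co-occur with "corpus" in exactly one document, but after normalisation the pair with the very common term "data" scores 0.25 while the pair with "text" scores 0.5.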
Other ways to find specialist terms
Other ways to find domain-specific terms and relations:
• Collect a domain corpus, find terms “different” from a
generic “gold standard” corpus: British National Corpus
• Collocation-groups: For each term, collect its
collocations in the Corpus: other words it appears next
to (or near to). If two terms have similar collocation-sets, then they are deemed to be close.
• Association matrix based on proximity: compute
average distance between pairs of terms (no. of words
between them, literally), use this as closeness metric
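The collocation-set idea might look like the sketch below. The window size, the Jaccard comparison and the toy sentence list are assumptions made for illustration; a real corpus would need proper tokenisation and probably a stop-word list.

```python
from collections import defaultdict

# Hypothetical toy corpus, already tokenised by whitespace
TEXT = (
    "the doctor treats the patient . "
    "the physician treats the patient . "
    "the teacher marks the essay ."
)


def collocation_sets(tokens, window=1):
    """Map each word to the set of words seen within `window` positions of it."""
    sets = defaultdict(set)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                sets[word].add(tokens[j])
    return sets


def jaccard(a, b):
    """Share of collocates two terms have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


colls = collocation_sets(TEXT.split(), window=1)
close = jaccard(colls["doctor"], colls["physician"])
far = jaccard(colls["doctor"], colls["teacher"])
```

In this toy text "doctor" and "physician" have identical collocation sets, so they come out as close, while "doctor" and "teacher" share only "the". Filtering out such very common words before comparing would be a natural refinement.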
Why build a thesaurus?
• a thesaurus or ontology can be used to normalise a
vocabulary and queries (?or documents?)
• it can be used (with some human intervention) to
increase recall and precision
• generic thesaurus/ontology may not be effective in
specialized collections and/or queries
• Semi-automatic construction of thesaurus/ontology
based on the retrieved set of documents has produced
some promising results, e.g. Semantic Web
Knowledge Discovery: Key points
• Knowledge Discovery (Data Mining) tools semi-automate the process of discovering patterns in data.
• Tools differ in terms of what concepts they discover
(differences, key-terms, clusters, decision-trees, rules)…
• … and in terms of the output they provide (eg clustering
algorithms provide a set of subclasses)
• Selecting the right tools for the job is based on business
objectives: what is the USE for the knowledge discovered
A Data Mining consultant…
• You should be able to:
– Decide which is the appropriate data mining technique
for a given problem defined in terms of business
objectives.
– Decide which is the most appropriate form of input
(which attributes/features will be “useful” for learning)
and output (what does your client want to see?)