EECS 800 Research Seminar
Mining Biological Data
Instructor: Luke Huan
Fall, 2006
The UNIVERSITY of Kansas
Administrative
Class presentation schedule is online
First class presentation is “kernel based classification” by Han Bin on Nov 6th
Project design is due Oct 30th
10/16/2006
Text Mining
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide2
Overview
Gene ontology
Challenges
What is gene ontology
construct gene ontology
Text mining, natural language processing and
information extraction: An Introduction
Summary
Ontology
<philosophy> A systematic account of Existence.
<artificial intelligence> (From philosophy) An explicit formal
specification of how to represent the objects, concepts and other
entities that are assumed to exist in some area of interest and the
relationships that hold among them.
<information science> The hierarchical structuring of knowledge
about things by subcategorising them according to their essential (or
at least relevant and/or cognitive) qualities.
This is an extension of the previous senses of "ontology" (above)
which has become common in discussions about the difficulty of
maintaining subject indices.
The philosophy of indexing everything in existence?
Aristotle’s (384-322 BC) Ontology
Substance
plants, animals, ...
Quality
Quantity
Relation
Where
When
Position
Having
Action
Passion
Ontology and -informatics
In information sciences, ontology is better defined as: “a
domain of knowledge, represented by facts and their
logical connections, that can be understood by a
computer”.
(J. Bard, BioEssays, 2003)
“Ontologies provide controlled, consistent vocabularies to
describe concepts and relationships, thereby enabling
knowledge sharing”
(Gruber, 1993)
Information Exchange in Bio-sciences
Basic challenges:
Definition, definition, definition
What is a name?
What is a function?
Cell
Image from http://microscopy.fsu.edu
What’s in a name?
The same name can be used to describe different
concepts
What’s in a name?
Glucose synthesis
Glucose biosynthesis
Glucose formation
Glucose anabolism
Gluconeogenesis
All refer to the process of making
glucose from simpler components
What’s in a name?
The same name can be used to describe different
concepts
A concept can be described using different names
Comparison is difficult – in particular across species or across databases
What is Function?
The Hammer Example
Function (what)             Process (why)
Drive nail (into wood)      Carpentry
Drive stake (into soil)     Gardening
Smash roach                 Pest Control
Clown’s juggling object     Entertainment
Information Explosion
Entering the Genome Sequencing Era
Eukaryotic Genome Sequences     Year   Genome Size (Mb)   # Genes
Yeast (S. cerevisiae)           1996   12                 6,000
Worm (C. elegans)               1998   97                 19,100
Fly (D. melanogaster)           2000   120                13,600
Plant (A. thaliana)             2001   125                25,500
Human (H. sapiens, 1st Draft)   2001   ~3000              ~35,000
What is the Gene Ontology?
A Common Language for Annotation of
Genes from
Yeast, Flies and Mice
…and Plants and Worms
…and Humans
…and anything else!
http://www.geneontology.org/
What is the Gene Ontology?
Gene annotation system
Controlled vocabulary that can be applied to all organisms
Organism independent
Used to describe gene products
proteins and RNA - in any organism
The 3 Gene Ontologies
Molecular Function = elemental activity/task
the tasks performed by individual gene products; examples
are carbohydrate binding and ATPase activity
Biological Process = biological goal or objective
broad biological goals, such as mitosis or purine
metabolism, that are accomplished by ordered assemblies of
molecular functions
Cellular Component = location or complex
subcellular structures, locations, and macromolecular
complexes; examples include nucleus, telomere, and RNA
polymerase II holoenzyme
Cellular Component
where a gene product acts
Cellular Component
Enzyme complexes in the component ontology refer to
places, not activities.
Molecular Function
insulin binding
insulin receptor activity
Molecular Function
activities or “jobs” of a gene product
glucose-6-phosphate isomerase activity
Molecular Function
A gene product may have several functions; a function
term refers to a single reaction or activity, not a gene
product.
Sets of functions make up a biological process.
Biological Process
a commonly recognized series of events
cell division
Biological Process
transcription
Biological Process
Metabolism: degradation or synthesis of biomolecules
Biological Process
Development: how a group of cells becomes a tissue
Biological Process
courtship behavior
Ontology applications
Can be used to:
Formalise the representation of biological knowledge
Standardise database submissions
Provide unified access to information through ontology-based
querying of databases, both human and computational
Improve management and integration of data within databases.
Facilitate data mining
Gene Ontology Structure
Ontologies can be represented as directed acyclic graphs
(DAG), where the nodes are connected by edges
Nodes = terms in biology
Edges = relationships between the terms
is-a
part-of
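The DAG structure described above can be sketched in a few lines of Python. This is only an illustration: the term names are borrowed from the cell and membrane example in these slides, but the specific edges listed are assumptions and not a copy of the real GO graph, and `build_parents` and `ancestors` are hypothetical helper names.

```python
from collections import defaultdict

# (child, relationship, parent) triples. The edges are assumed for
# illustration only; they are not taken from the actual Gene Ontology.
EDGES = [
    ("mitochondrial membrane", "is-a", "membrane"),
    ("chloroplast membrane", "is-a", "membrane"),
    ("chloroplast membrane", "part-of", "chloroplast"),
    ("chloroplast", "part-of", "cell"),
    ("membrane", "part-of", "cell"),
]


def build_parents(edges):
    """Map each child term to its list of (relationship, parent) pairs."""
    parents = defaultdict(list)
    for child, relation, parent in edges:
        parents[child].append((relation, parent))
    return parents


def ancestors(term, parents):
    """Return every term reachable from `term` by following edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


PARENTS = build_parents(EDGES)
print(sorted(ancestors("chloroplast membrane", PARENTS)))
```

In this sketch "chloroplast membrane" has two parents, which is what makes the structure a DAG rather than a tree.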
Parent-Child Relationships
Chromosome
Cytoplasmic chromosome
Mitochondrial chromosome
Nuclear chromosome
Plastid chromosome
A child is a subset or instances of a parent’s elements
Parent-Child Relationships
Terms: cell, membrane, mitochondrial membrane, chloroplast, chloroplast membrane
Relationships: is-a, part-of
Annotation in GO
A gene product is usually a protein but can be a functional
RNA
An annotation is a piece of information associated with a
gene product
A GO annotation is a Gene Ontology term associated with
a gene product
Terms, Definitions, IDs
Term: MAPKKK cascade (mating sensu Saccharomyces)
Goid: GO:0007244
Definition: OBSOLETE. MAPKKK cascade involved in
transduction of mating pheromone signal, as described in
Saccharomyces.
Evidence code: how annotation is done
Definition_reference: PMID:9561267
Annotation Example
Gene Product: nek2
GO Term: centrosome (GO:0005813)
Evidence Code: IDA (Inferred from Direct Assay)
Reference: PMID: 11956323
GO Annotation
Evidence Code
Indicate the type of evidence in the cited source that
supports the association between the gene product and the
GO term
http://www.geneontology.org/GO.evidence.html
Types of evidence codes
Experimental codes - IDA, IMP, IGI, IPI, IEP
Computational codes - ISS, IEA, RCA, IGC
Author statement - TAS, NAS
Other codes - IC, ND
Two types of annotation
Manual Annotation
Electronic Annotation
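As a rough sketch of how an annotation and its evidence code might be held in code, the Python below stores the nek2 example from the Annotation Example slide and looks up the category of its evidence code. The grouping of codes follows the list above; the class name, field names, and method names are hypothetical, and treating IEA as the only electronic (not manually checked) code is an assumption made for this sketch.

```python
from dataclasses import dataclass

# Evidence code groups as listed on the slide.
EVIDENCE_CATEGORIES = {
    "Experimental": ["IDA", "IMP", "IGI", "IPI", "IEP"],
    "Computational": ["ISS", "IEA", "RCA", "IGC"],
    "Author statement": ["TAS", "NAS"],
    "Other": ["IC", "ND"],
}

# Reverse lookup: evidence code -> category name.
CODE_TO_CATEGORY = {
    code: category
    for category, codes in EVIDENCE_CATEGORIES.items()
    for code in codes
}


@dataclass(frozen=True)
class GoAnnotation:
    """Hypothetical record type: one GO term attached to one gene product."""
    gene_product: str
    go_id: str
    go_term: str
    evidence_code: str
    reference: str

    def evidence_category(self) -> str:
        return CODE_TO_CATEGORY[self.evidence_code]

    def is_electronic(self) -> bool:
        # Assumption: only IEA annotations are made without manual checking.
        return self.evidence_code == "IEA"


nek2 = GoAnnotation("nek2", "GO:0005813", "centrosome", "IDA", "PMID:11956323")
print(nek2.evidence_category(), nek2.is_electronic())
```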
IDA: Inferred from Direct Assay
• Direct assay for the function, process, or component indicated by the GO term
• Enzyme assays
• In vitro reconstitution (e.g. transcription)
• Immunofluorescence (for cellular component)
• Cell fractionation (for cellular component)
IMP: Inferred from Mutant Phenotype
• Variations or changes such as mutations or abnormal levels of a single gene product
• Gene/protein mutation
• Deletion mutant
• RNAi experiments
• Specific protein inhibitors
• Allelic variation
IGI: Inferred from Genetic Interaction
• Any combination of alterations in the sequence or expression of more than one gene or gene product
• Traditional genetic screens
- Suppressors, synthetic lethals
• Functional complementation
• Rescue experiments
• An entry in the ‘with’ column is recommended
IPI: Inferred from Physical Interaction
• Any physical interaction between a gene product and another molecule, ion, or complex
• 2-hybrid interactions
• Co-purification
• Co-immunoprecipitation
• Protein binding experiments
• An entry in the ‘with’ column is recommended
IEP: Inferred from Expression Pattern
Timing or location of expression of a gene
Transcript levels
Northerns, microarray
Exercise caution when interpreting expression results
ISS: Inferred from Sequence or Structural Similarity
Sequence alignment, structure comparison, or evaluation
of sequence features such as composition
Sequence similarity
Recognized domains/overall architecture of protein
An entry in the ‘with’ column is recommended
RCA: Inferred from Reviewed Computational Analysis
non-sequence-based computational method
large-scale experiments
genome-wide two-hybrid
genome-wide synthetic interactions
integration of large-scale datasets of several types
text-based computation (text mining)
IGC: Inferred from Genomic Context
Chromosomal position
Most often used for Bacteria - operons
Direct evidence for a gene being involved in a process is
minimal, but for surrounding genes in the operon, the evidence
is well-established
IEA: Inferred from Electronic Annotation
depend directly on computation or automated transfer of
annotations from a database
Hits from BLAST searches
InterPro2GO mappings
No manual checking
Entry in ‘with’ column is allowed (ex. sequence ID)
TAS: Traceable Author Statement
publication used to support an annotation doesn't show the
evidence
Review article
Text mining!
Would be better to track down cited reference and use an
experimental code
NAS: Non-traceable Author Statement
Statements in a paper that cannot be traced to another publication
ND: No biological Data available
Can find no information supporting an annotation to any
term
Indicate that a curator has looked for info but found
nothing
Place holder
Date
IC: Inferred by Curator
annotation is not supported by evidence, but can be
reasonably inferred from other GO annotations for which
evidence is available
ex. evidence = transcription factor (function)
IC = nucleus (component)
Choosing the correct evidence code
Ask yourself:
What is the experiment that was done?
Text Mining can help you review
papers faster!
Beyond GO – Open Biomedical Ontologies
Orthogonal to existing ontologies to facilitate
combinatorial approaches
Share unique identifier space
Include definitions
Gene Ontology and Text Mining
Derive ontology from text data
More general goal: understand text data automatically
Finding GO terms
…for B. napus PERK1 protein
(Q9ARH1)
In this study, we report the isolation and molecular characterization
of the B. napus PERK1 cDNA, that is predicted to encode a novel
receptor-like kinase. We have shown that like other plant RLKs, the
kinase domain of PERK1 has serine/threonine kinase activity. In
addition, the location of a PERK1-GTP fusion protein to the plasma
membrane supports the prediction that PERK1 is an integral
membrane protein…these kinases have been implicated in early
stages of wound response…
PubMed ID: 12374299
Function: protein serine/threonine kinase activity (GO:0004674)
Component: integral to plasma membrane (GO:0005887)
Process: response to wounding (GO:0009611)
Mining Text Data
Data Mining / Knowledge Discovery
Structured Data:
HomeLoan (
Loanee: Frank Rizzo
Lender: MWF
Agency: Lake View
Amount: $200,000
Term: 15 years
)
Multimedia:
Loans($200K,[map],...)
Free Text:
Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial.
Hypertext:
<a href>Frank Rizzo</a> Bought <a href>this home</a> from <a href>Lake View Real Estate</a> In <b>1992</b>. <p>...
(Taken from ChengXiang Zhai, CS 397cxz, UIUC, CS – Fall 2003)
Bag-of-Tokens Approaches
Documents:
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or …
Feature Extraction
Token Sets:
nation – 5
civil – 1
war – 2
men – 2
died – 4
people – 5
Liberty – 1
God – 1
…
Loses all order-specific information!
Severely limits context!
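The loss of order information can be seen in a minimal Python sketch. The tokenizer here (lowercase, letters only) is an assumption chosen for brevity, and `bag_of_tokens` is a hypothetical helper name; the two example sentences reuse the dog and boy from the NLP slides.

```python
import re
from collections import Counter


def bag_of_tokens(text):
    """Count each lowercase word; word order is discarded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


a = bag_of_tokens("The dog is chasing the boy.")
b = bag_of_tokens("The boy is chasing the dog.")

# Both sentences map to the same token counts even though
# they describe opposite events.
print(a == b)
```

Because the two bags are identical, any method built only on these counts cannot tell who is chasing whom.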
Natural Language Processing
A dog is chasing a boy on the playground
Lexical analysis (part-of-speech tagging):
A/Det dog/Noun is/Aux chasing/Verb a/Det boy/Noun on/Prep the/Det playground/Noun
Syntactic analysis (parsing):
[Sentence [Noun Phrase A dog] [Verb Phrase [Verb Phrase [Complex Verb is chasing] [Noun Phrase a boy]] [Prep Phrase on [Noun Phrase the playground]]]]
Semantic analysis:
Dog(d1).
Boy(b1).
Playground(p1).
Chasing(d1,b1,p1).
Inference:
Scared(x) if Chasing(_,x,_).
Combined with the facts above, this gives Scared(b1).
Pragmatic analysis (speech act):
A person saying this may be reminding another person to get the dog back…
General NLP—Too Difficult!
Word-level ambiguity
“design” can be a noun or a verb (Ambiguous POS)
“root” has multiple meanings (Ambiguous sense)
Syntactic ambiguity
“natural language processing” (Modification)
“A man saw a boy with a telescope.” (PP Attachment)
Anaphora resolution
“John persuaded Bill to buy a TV for himself.”
(himself = John or Bill?)
Presupposition
“He has quit smoking.” implies that he smoked before.
Humans rely on context to interpret (when possible).
This context may extend beyond a given document!
Shallow Linguistics
Progress on Useful Sub-Goals:
English Lexicon
Part-of-Speech Tagging
Word Sense Disambiguation
Phrase Detection / Parsing
WordNet
An extensive lexical network for the English language
Contains over 138,838 words.
Several graphs, one for each part-of-speech.
Synsets (synonym sets), each defining a semantic sense.
Relationship information (antonym, hyponym, meronym …)
Downloadable for free (UNIX, Windows)
Expanding to other languages (Global WordNet Association)
Over $3 million in funding, mainly from government (interest in translation), to George Miller (National Medal of Science, 1991).
Figure: synonym and antonym links among the words moist, watery, parched, wet, dry, damp, anhydrous, arid
Part-of-Speech Tagging
Training data (annotated text):
This/Det sentence/N serves/V1 as/P an/Det example/N of/P annotated/V2 text/N …
Input to the POS tagger: “This is a new sentence.”
Output: This/Det is/Aux a/Det new/Adj sentence/N
Pick the most likely tag sequence.
Independent assignment (most common tag):
$$p(w_1,\dots,w_k,t_1,\dots,t_k) = p(t_1 \mid w_1)\cdots p(t_k \mid w_k)\,p(w_1)\cdots p(w_k)$$
Partial dependency (HMM):
$$p(w_1,\dots,w_k,t_1,\dots,t_k) = \prod_{i=1}^{k} p(w_i \mid t_i)\,p(t_i \mid t_{i-1})$$
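The HMM factorization above can be tried on a toy example. The sketch below, in Python, scores a tag sequence as the product of p(w_i | t_i) p(t_i | t_{i-1}) and finds the best sequence with the Viterbi algorithm. Every probability in `TRANS` and `EMIT` is a made-up illustrative number, not estimated from any corpus, and the tag set, start symbol, and function names are assumptions.

```python
TAGS = ["Det", "N", "V"]
START = "<s>"  # assumed start-of-sentence symbol standing in for t_0

# Hypothetical transition probabilities p(t_i | t_{i-1}).
TRANS = {
    (START, "Det"): 0.6, (START, "N"): 0.3, (START, "V"): 0.1,
    ("Det", "Det"): 0.05, ("Det", "N"): 0.9, ("Det", "V"): 0.05,
    ("N", "Det"): 0.1, ("N", "N"): 0.3, ("N", "V"): 0.6,
    ("V", "Det"): 0.5, ("V", "N"): 0.4, ("V", "V"): 0.1,
}

# Hypothetical emission probabilities p(w_i | t_i); missing pairs count as 0.
EMIT = {
    ("the", "Det"): 0.7,
    ("dog", "N"): 0.4, ("dog", "V"): 0.05,
    ("barks", "V"): 0.3, ("barks", "N"): 0.05,
}


def joint(words, tags):
    """p(w_1..w_k, t_1..t_k) under the HMM factorization."""
    prob, prev = 1.0, START
    for word, tag in zip(words, tags):
        prob *= EMIT.get((word, tag), 0.0) * TRANS[(prev, tag)]
        prev = tag
    return prob


def viterbi(words):
    """Return (probability, tags) of the most likely tag sequence."""
    # best[t] holds the best (probability, path) among sequences ending in t.
    best = {START: (1.0, [])}
    for word in words:
        step = {}
        for tag in TAGS:
            emit = EMIT.get((word, tag), 0.0)
            candidates = [
                (p * TRANS[(prev, tag)] * emit, path + [tag])
                for prev, (p, path) in best.items()
            ]
            step[tag] = max(candidates, key=lambda c: c[0])
        best = step
    return max(best.values(), key=lambda c: c[0])


WORDS = ["the", "dog", "barks"]
BEST_PROB, BEST_TAGS = viterbi(WORDS)
print(BEST_TAGS, BEST_PROB)
```

Viterbi keeps only the best path per tag at each position, so the cost grows linearly with sentence length instead of exponentially.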
Word Sense Disambiguation
“The difficulties of computational linguistics are rooted in ambiguity.”
POS tags for “linguistics are rooted in ambiguity”: N Aux V P N
Which sense of “rooted” is meant?
Supervised Learning
Features:
• Neighboring POS tags (N Aux V P N)
• Neighboring words (linguistics are rooted in ambiguity)
• Stemmed form (root)
• Dictionary/Thesaurus entries of neighboring words
• High co-occurrence words (plant, tree, origin,…)
• Other senses of word within discourse
Algorithms:
• Rule-based Learning (e.g. IG guided)
• Statistical Learning (e.g. Naïve Bayes)
• Unsupervised Learning (e.g. Nearest Neighbor)
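As one possible illustration of the statistical approach, here is a small Naïve Bayes sketch in Python that uses only neighboring words as features. The four labelled contexts, the two sense labels ("plant" and "origin"), and the add-one smoothing are all assumptions made for the example; a real system would use far more data and the richer features listed above.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled contexts for the ambiguous word "root".
TRAINING = [
    ("plant", "the tree root absorbs water from the soil"),
    ("plant", "each plant root grows deep into the soil"),
    ("origin", "the root cause of the problem is ambiguity"),
    ("origin", "the difficulties are rooted in ambiguity"),
]


def train(examples):
    """Count senses and the words that appear in each sense's contexts."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, context in examples:
        sense_counts[sense] += 1
        for word in context.split():
            word_counts[sense][word] += 1
            vocab.add(word)
    return sense_counts, word_counts, vocab


def classify(context, model):
    """Pick the sense with the highest log posterior (add-one smoothing)."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    scores = {}
    for sense, count in sense_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in context.split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[sense][word] + 1) / denom)
        scores[sense] = score
    return max(scores, key=scores.get)


MODEL = train(TRAINING)
print(classify("the root of the tree in the soil", MODEL))
print(classify("the root of ambiguity", MODEL))
```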
Parsing
Choose most likely parse tree…
Grammar (Probabilistic CFG):
S → NP VP        1.0
NP → Det BNP     0.3
NP → BNP         0.4
NP → NP PP       0.3
BNP → N
VP → V
VP → Aux V NP
VP → VP PP
PP → P NP        1.0
…
Lexicon:
V → chasing      0.01
Aux → is
N → dog          0.003
N → boy
N → playground
…
Det → the
Det → a
P → on
…
Parse tree 1 (probability of this tree = 0.000015):
[S [NP [Det A] [BNP [N dog]]] [VP [VP [Aux is] [V chasing] [NP a boy]] [PP [P on] [NP the playground]]]]
Parse tree 2 (probability of this tree = 0.000011):
[S [NP [Det A] [BNP [N dog]]] [VP [Aux is] [V chasing] [NP [NP a boy] [PP [P on] [NP the playground]]]]]
Obstacles
• Ambiguity
“A man saw a boy with a telescope.”
• Computational Intensity
Imposes a context horizon.
Text Mining NLP Approach:
1. Locate promising fragments using fast IR
methods (bag-of-tokens).
2. Only apply slow NLP techniques to promising
fragments.
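The two-step approach above can be sketched as a simple filter followed by a costly step. In this Python sketch the cheap step is a bag-of-tokens overlap with a query, and `slow_nlp` is only a placeholder for whatever expensive analysis (parsing, for example) would be applied; all function names, the query, and the `min_overlap` threshold are hypothetical.

```python
import re


def tokens(text):
    """Cheap bag-of-tokens view: the set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def promising(fragments, query, min_overlap=1):
    """Step 1: keep fragments sharing at least `min_overlap` tokens with the query."""
    wanted = tokens(query)
    return [f for f in fragments if len(tokens(f) & wanted) >= min_overlap]


def slow_nlp(fragment):
    """Step 2 placeholder: stands in for an expensive analysis such as parsing."""
    return fragment.split()


def mine(fragments, query):
    """Apply the slow step only to fragments that pass the fast filter."""
    return {f: slow_nlp(f) for f in promising(fragments, query)}


FRAGMENTS = [
    "PERK1 has serine/threonine kinase activity",
    "The weather was fine",
    "PERK1 is an integral membrane protein",
]
RESULT = mine(FRAGMENTS, "PERK1 kinase")
print(list(RESULT))
```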
Summary: Shallow NLP
However, shallow NLP techniques are feasible and useful:
• Lexicon – machine understandable linguistic knowledge
• possible senses, definitions, synonyms, antonyms, typeof, etc.
• POS Tagging – limit ambiguity (word/POS), entity extraction
• “...research interests include text mining as well as bioinformatics.” (tags shown: NP, N)
• WSD – stem/synonym/hyponym matches (doc and query)
• Query: “Foreign cars”
Document: “I’m selling a 1976 Jaguar…”
• Parsing – logical view of information (inference?, translation?)
• “A man saw a boy with a telescope.”
Even without complete NLP, any additional knowledge extracted from
text data can only be beneficial.
Ingenuity will determine the applications.
Reference for GO
Gene ontology teaching resources:
http://www.geneontology.org/GO.teaching.resources.shtml
References for TM
1. C. D. Manning and H. Schütze, “Foundations of Statistical Natural Language Processing”, MIT Press, 1999.
2. S. Russell and P. Norvig, “Artificial Intelligence: A Modern Approach”, Prentice Hall, 1995.
3. S. Chakrabarti, “Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data”, Morgan Kaufmann, 2002.
4. G. Miller, R. Beckwith, C. Fellbaum, D. Gross, K. Miller, and R. Tengi, Five papers on WordNet, Princeton University, August 1993.
5. C. Zhai, Introduction to NLP, Lecture Notes for CS 397cxz, UIUC, Fall 2003.
6. M. Hearst, Untangling Text Data Mining, ACL’99, invited paper. http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
7. R. Sproat, Introduction to Computational Linguistics, LING 306, UIUC, Fall 2003.
8. A Road Map to Text Mining and Web Mining, University of Texas resource page. http://www.cs.utexas.edu/users/pebronia/text-mining/
9. Computational Linguistics and Text Mining Group, IBM Research, http://www.research.ibm.com/dssgrp/