
Finding Functional Gene
Relationships Using the
Semantic Gene Organizer (SGO)
Kevin Heinrich
Master’s Defense
July 16, 2004
Outline
• Problem / Goals
• Related Work
• Information Retrieval
– Vector Space Model
– Latent Semantic Indexing (LSI)
• Biological Databases
• SGO Use & Results
Problem
• Biological tools are creating vast amounts
of data.
• Current techniques are time-consuming
and expensive.
• Want to know phenotype (function) from
genotype (structure/sequence).
Goals
• Develop a tool to aid researchers in finding
and understanding functional gene
relationships.
• Use information that covers whole
genome, e.g. literature.
Related Work
• Jenssen et al. (2001) developed PubGene.
– Literature network
– Assigns a functional association when gene symbols co-occur
• Wilkinson and Huberman (2004) expanded this
idea to find communities of related genes.
• Yandell and Majoros (2002) used natural
language processing techniques to identify
the nature of relationships.
Related Work
• Almost all literature-based techniques rely
on term co-occurrence.
• What about gene aliases?
• Solution: Apply a more robust technique.
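The alias problem can be illustrated with a minimal co-occurrence counter. This is only a sketch in the spirit of symbol co-occurrence, not the implementation of any of the cited systems; the function name `cooccurrence_counts` and the two toy abstracts are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(abstracts, symbols):
    """Count how often each pair of gene symbols appears in the same abstract."""
    counts = Counter()
    for text in abstracts:
        tokens = set(text.lower().split())
        present = sorted(s for s in symbols if s.lower() in tokens)
        counts.update(combinations(present, 2))
    return counts

# Toy abstracts (invented). The second one uses the alias APOER2 for LRP8.
abstracts = ["reln binds vldlr", "reln binds apoer2"]
counts = cooccurrence_counts(abstracts, ["RELN", "VLDLR", "LRP8"])
print(counts)
```

Because the second abstract mentions only the alias, the RELN and LRP8 pair is never counted, which is the weakness the slide points at.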
Information Retrieval
Vector Space Model
• Documents are parsed into tokens.
• Each token is assigned a weight, $w_{ij}$, for the $i$th token in the $j$th document.
• An $m \times n$ term-by-document matrix, $A$, is created, where
  $A = [w_{ij}]$
– Documents are m-dimensional vectors.
– Tokens are n-dimensional vectors.
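A minimal sketch of building the term-by-document matrix, assuming a naive whitespace tokenizer and raw frequencies as the weights. The helper names and the three toy documents are hypothetical.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive parser."""
    return text.lower().split()

def term_document_matrix(docs):
    """Return (terms, F), where F[i][j] is the raw count of term i in document j."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    F = [[c[t] for c in counts] for t in terms]
    return terms, F

docs = [
    "reelin signaling in cerebellum",
    "reelin receptor binding",
    "tumor suppressor signaling",
]
terms, F = term_document_matrix(docs)
m, n = len(F), len(F[0])  # m terms (rows), n documents (columns)
print(m, n)
```

Each column of `F` is a document vector of length m, and each row is a token vector of length n.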
Information Retrieval
Term Weights
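• Term weights are the product of a local and a global component
  $w_{ij} = l_{ij} \, g_i \, d_j$
  where $f_{ij}$ is the frequency of term $i$ in document $j$, $\chi(f) = 1$ if $f > 0$ and $0$ otherwise, and $d_j$ is a document normalization factor.
• tf
  $l_{ij} = f_{ij}$
• idf
  $g_i = \frac{\sum_j f_{ij}}{\sum_j \chi(f_{ij})}$
• idf2
  $g_i = \log_2\!\left(\frac{n}{\sum_j \chi(f_{ij})}\right) + 1$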
Information Retrieval
Term Weights (cont’d)
• log-entropy
  $l_{ij} = \log(1 + f_{ij})$
  $g_i = 1 + \frac{\sum_j p_{ij} \log_2 p_{ij}}{\log_2 n}$, where $p_{ij} = \frac{f_{ij}}{\sum_j f_{ij}}$
• Goal is to give distinguishing terms more weight.
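The log-entropy scheme can be sketched as follows. This assumes a natural logarithm for the local component, no document normalization ($d_j = 1$), more than one document, and that every term occurs at least once; the function name `log_entropy` is hypothetical.

```python
import math

def log_entropy(F):
    """Weight raw counts F (m terms x n docs) as w_ij = l_ij * g_i."""
    n = len(F[0])
    W = []
    for row in F:
        total = sum(row)
        # Entropy sum over documents; entries with f_ij = 0 contribute nothing.
        ent = sum((f / total) * math.log2(f / total) for f in row if f > 0)
        g = 1 + ent / math.log2(n)
        W.append([math.log(1 + f) * g for f in row])
    return W

# Term 0 is spread evenly over all documents; term 1 occurs in one document only.
F = [[1, 1, 1, 1],
     [4, 0, 0, 0]]
W = log_entropy(F)
print(W)
```

The evenly spread term gets a global weight of 0, while the term concentrated in a single document keeps a global weight of 1, so the distinguishing term dominates.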
Information Retrieval
Query & Similarity
• Queries are represented by a pseudo-document vector
  $q_0 = (g_1, g_2, \ldots, g_m)$
• Similarity is the cosine of the angle between the query and document vectors.
  $\mathrm{sim}(q_0, d_j) = \cos\theta_j = \frac{q_0 \cdot d_j}{\|q_0\| \, \|d_j\|} = \frac{\sum_{k=1}^{m} g_k w_{kj}}{\sqrt{\sum_{k=1}^{m} w_{kj}^2} \, \sqrt{\sum_{k=1}^{m} g_k^2}}$
Information Retrieval
Latent Semantic Indexing (LSI)
LSI performs a truncated SVD of $A$, where
$A = U \Sigma V^T$
• $U$ is the $m \times r$ matrix of eigenvectors of $A A^T$
• $V^T$ is the $r \times n$ matrix of eigenvectors of $A^T A$
• $\Sigma$ is the $r \times r$ diagonal matrix containing the $r$ nonzero singular values of $A$
• $r$ is the rank of $A$
A rank-$k$ approximation is given by $A_k = U_k \Sigma_k V_k^T$
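The rank-k approximation can be sketched with NumPy's dense SVD. This assumes NumPy is available and that the matrix is small enough for a full decomposition; a large collection would more likely use a sparse solver. The helper name `truncated_svd` and the toy matrix are hypothetical.

```python
import numpy as np

def truncated_svd(A, k):
    """Return U_k, the k largest singular values, and V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Toy 5 x 4 term-by-document matrix (invented values).
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
Uk, sk, Vkt = truncated_svd(A, k=2)
Ak = Uk @ np.diag(sk) @ Vkt  # A_k = U_k Sigma_k V_k^T
print(Ak.round(3))
```

The approximation error in the Frobenius norm equals the root of the sum of the squared discarded singular values, which gives one way to judge a choice of k.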
Information Retrieval
LSI (cont’d)
• Document-to-document similarity is
  $A_k^T A_k = (V_k \Sigma_k)(V_k \Sigma_k)^T$
• Queries are projected into the low-rank approximation space
  $q = q_0^T U_k \Sigma_k^{-1}$
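Both formulas can be checked numerically. The sketch below assumes NumPy and a toy matrix with invented values; `lsi_fit` and `project_query` are hypothetical helper names. It checks that the scaled document vectors reproduce $A_k^T A_k$ and that projecting a document column as a query recovers that document's row of $V_k$.

```python
import numpy as np

def lsi_fit(A, k):
    """Return U_k, the k largest singular values, and V_k for an m x n matrix A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T

def project_query(q0, Uk, sk):
    """q = q0^T U_k Sigma_k^{-1}; dividing by sk applies the inverse of the diagonal Sigma_k."""
    return (q0 @ Uk) / sk

A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
Uk, sk, Vk = lsi_fit(A, k=2)

scaled_docs = Vk * sk                    # rows are documents: V_k Sigma_k
doc_sims = scaled_docs @ scaled_docs.T   # (V_k Sigma_k)(V_k Sigma_k)^T

Ak = Uk @ np.diag(sk) @ Vk.T
q = project_query(A[:, 0], Uk, sk)       # use document 0 as the query
print(np.allclose(doc_sims, Ak.T @ Ak), np.allclose(q, Vk[0]))
```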
Information Retrieval
LSI (cont’d)
• Scaled document vectors can be computed once
and stored for quick retrieval.
• The lower-dimensional space forces queries and
documents to be compared in a more
conceptual manner and saves storage.
• Choice of number of factors is an open question.
• End Effect: LSI can find similarities between
documents that have no term co-occurrence.
Information Retrieval
Evaluation Measures
• Precision – ratio of relevant returned documents to the
total number of returned documents.
• Recall – ratio of relevant returned documents to the total
number of relevant documents.
• Goal is to have high precision at all levels of recall.
• Systems are often evaluated by average precision (AP),
which is the average of 11 interpolated precision values
at recall levels of 0%, 10%, ..., 100%.
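A minimal sketch of 11-point interpolated average precision, assuming the usual convention that interpolated precision at a recall level is the highest precision observed at any recall at or above that level. The function name and the toy ranking are hypothetical.

```python
def eleven_point_average_precision(ranked, relevant):
    """Average interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    total = len(relevant)
    hits = 0
    points = []  # (relevant found so far, precision) at each relevant hit
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits, hits / rank))
    interpolated = []
    for i in range(11):
        # Integer test for recall >= i/10 avoids floating-point comparisons.
        candidates = [p for h, p in points if h * 10 >= i * total]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11

# Relevant documents "a" and "b" are returned at ranks 1 and 3.
ap = eleven_point_average_precision(["a", "x", "b", "y"], {"a", "b"})
print(ap)
```

In this toy ranking, precision is 1 up to 50% recall and 2/3 beyond it, so the AP is (6 × 1 + 5 × 2/3) / 11 = 28/33.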
Biological Databases
MEDLINE
• MEDLINE (NLM)
– Contains 14+ million references to journal
articles with a concentration in medicine
– Spans over 4,600 journals worldwide
– 1966 to present
– ~500,000 citations added annually
– Each citation is manually indexed with MeSH
terms.
Biological Databases
PubMed
• PubMed
– Retrieves articles from MEDLINE and other
journals.
– Can be queried via any combination of
attributes.
Biological Databases
LocusLink
• NCBI human-curated database
• Single query interface to a comprehensive
directory for genes and gene reference
sequences for key genomes.
• Provides links to related records in PubMed and
other citations when applicable.
• Provides RefSeq Summary of gene function and
links to key MEDLINE citations relevant to each
gene.
Biological Databases
Overview
• MEDLINE has lots of information
– Not all articles relate to genes
– Gene terminology problem
• LocusLink does not cover all relevant
citations, but a representative few.
Biological Databases
Gene Document Construction
• Concatenate titles and abstracts of MEDLINE
citations cross-referenced in Human, Rat, and
Mouse LocusLink entries.
• Sequencing abstracts are included, which adds noise
• LocusLink references are not comprehensive, so
recall of all relevant abstracts is not guaranteed.
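The construction step can be sketched as a simple concatenation. The data layout here is an assumption made for illustration: `citations` maps a citation identifier to its title and abstract, and `gene_to_pmids` stands in for the cross-references in the LocusLink entries. All names and values are hypothetical.

```python
def build_gene_documents(citations, gene_to_pmids):
    """Concatenate titles and abstracts of each gene's cross-referenced citations."""
    docs = {}
    for gene, pmids in gene_to_pmids.items():
        parts = []
        for pmid in pmids:
            citation = citations.get(pmid)
            if citation is None:
                continue  # cross-reference with no citation record available
            parts.append(citation["title"])
            parts.append(citation["abstract"])
        docs[gene] = " ".join(parts)
    return docs

# Placeholder records, not real citations.
citations = {
    101: {"title": "T1", "abstract": "A1"},
    102: {"title": "T2", "abstract": "A2"},
}
gene_to_pmids = {"geneA": [101, 102], "geneB": [102, 999]}
gene_docs = build_gene_documents(citations, gene_to_pmids)
print(gene_docs)
```

Each resulting gene document then becomes one column of the term-by-document matrix.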
SGO
• Primarily uses LSI to rank genes.
• Enables user to specify query method
– Gene query
– Keyword query
– Number of factors
– Show latent matches
• Saves previous query sessions.
SGO
Interface
SGO
Interface (cont’d)
SGO
Trees
• Unfortunately, ranked lists mean little to
biologists.
• Pairwise distances can be formed into a matrix
  $D = [d_{ij}]$, $d_{ij} = 1 - \cos\theta_{ij}$
  where $\cos\theta_{ij}$ is the similarity between documents $i$ and $j$
SGO
Trees (cont’d)
• Fitch-Margoliash (1967) method in
PHYLIP is applied to D to generate
hierarchical trees.
• Thresholds can be applied to the self-similarity
matrix to produce graphs.
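A small sketch of the two derived structures: the distance matrix that a tree-building program would take as input, and a thresholded graph. The tree construction itself (Fitch-Margoliash in PHYLIP) is not reproduced here. The gene names and similarity values are invented placeholders.

```python
def distance_matrix(S):
    """Convert a cosine self-similarity matrix S into distances d_ij = 1 - cos(theta_ij)."""
    return [[1.0 - s for s in row] for row in S]

def threshold_graph(S, names, threshold):
    """Return an edge for every pair whose similarity is at or above the threshold."""
    edges = []
    for i in range(len(S)):
        for j in range(i + 1, len(S)):
            if S[i][j] >= threshold:
                edges.append((names[i], names[j]))
    return edges

names = ["geneA", "geneB", "geneC"]  # hypothetical genes
S = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]]
D = distance_matrix(S)
edges = threshold_graph(S, names, threshold=0.5)
print(edges)
```

Raising the threshold prunes weaker links, so the graph shows only the most strongly related genes.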
SGO
Hierarchical Tree
SGO
Graph or Nodal Tree
SGO
Coding Issues
• Web interface – must be interactive
– Queries are processed on click
– Document collections are parsed offline
– Trees are constructed offline
• Storage will eventually become an issue.
Results
Test Data Set
• A 50-gene test data set was
constructed.
– Alzheimer’s Disease
– Cancer
– Development
• Reelin signaling pathway
used as basis for evaluation
– 5 primary genes (directly
associated)
– 7 secondary genes (indirectly
associated)
Results
Primary AP
• AP for 5 primary
genes
– 61% for 5 factors
– 84% for 25 factors
– 84% for 50 factors
Results
Secondary AP
• AP for 12
secondary genes
– 53% for 5 factors
– 59% for 25 factors
– 61% for 50 factors
Results
Comparison
• LSI comparable to tf-idf for 5 primary genes
• Far superior to tf-idf for 12 secondary genes
– PubMed co-citation identifies 2 of the 7 indirectly
related genes
– Abstract overlap of LocusLink citations fails to identify
any indirectly related genes
• tf-idf fails on many keyword queries
• Tested on Gene Ontology classifications (not
shown)
– Similar tendencies are observed
Results
Abstract Representation
• To simulate scaling
up, decrease
representation of
reelin-related genes
• AP of 47% on
20,856 Human
LocusLink abstracts
Results
Hierarchical Tree
Conclusions
• SGO allows genes to be compared to
each other and to keywords (functions).
• SGO identifies latent relationships with
promising accuracy.
• SGO is not meant to replace existing
technologies, but to assist researchers
– Verify current results
– Direct future exploration
Future Work
• Scale up to entire genome
• Document construction
• Incorporate structural or other information
for multi-modal similarity
• Test other models, e.g., NMF and QR
• Interactive tree building
• Keep collections current