Towards an automated procedure for annotation of gene products through SAS® Methods
Henrik Tveit
Norwegian University of Science and Technology
Overview
• Biology: the Structure of Life
• Functional genomics
• The challenge of automated gene annotation
• Sources of information
• Methods and results
• Conclusion and general business intelligence experience
The Structure of Life
• The body consists of cells
• Each cell contains a DNA molecule
• Each DNA molecule consists of protein-coding parts known as genes
• Each gene produces different proteins
Functional genomics
• Which genes are active in, e.g., cancer cells?
• How can the activity be controlled to repair/prevent illness?
• Gene analysis
• Gene annotation
The Challenge
Automatic annotation of genes using terms from the Gene Ontology (GO)
[Figure: gene symbols (PEG3, DNMT1, BRCA1, RES1, R8928, TRD@1, R37302, PCTP, GRA3, ODC1, ...) to be linked to Gene Ontology terms (death, cell death, apoptosis, development, transduction, signal transduction, imprinting, DNA replication, methyltransferase)]
The Source: Medline
[Figure: the same gene symbols and Gene Ontology terms as on the previous slide]
Methods
• Information exploration
• Data Mining
• Text Mining
• Combining the results
Using Medline entry key terms
• MeSH to GO
  – Implicit relations via similar term usage
  – Association analysis
Example Medline entry (fields: Title, Abstract, MeSH, EC, RN)
Title: Dnmt1 overexpression causes genomic hypermethylation, loss of imprinting, and embryonic lethality.
Abstract: Biallelic expression of Igf2 is frequently seen in cancers because Igf2 functions as a survival factor. In many tumors the activation of Igf2 expression has been correlated with de novo methylation of the imprinted region. We have compared the intrinsic susceptibilities of the imprinted region of Igf2 and H19, other imprinted genes, bulk genomic DNA, and repetitive retroviral sequences to Dnmt1 overexpression. At low Dnmt1 methyltransferase levels repetitive retroviral elements were methylated and silenced. The nonmethylated imprinted region of Igf2 and H19 was resistant to methylation at low Dnmt1 levels but became fully methylated when Dnmt1 was overexpressed from a bacterial artificial chromosome transgene. Methylation caused the activation of the silent Igf2 allele in wild-type and Dnmt1 knockout cells, leading to biallelic Igf2 expression. In contrast, the imprinted genes Igf2r, Peg3, Snrpn, and Grf1 were completely resistant to de novo methylation, even when Dnmt1 was overexpressed. Therefore, the intrinsic difference between the imprinted region of Igf2 and H19 and of other imprinted genes to postzygotic de novo methylation may be the molecular basis for the frequently observed de novo methylation and upregulation of Igf2 in neoplastic cells and tumors. Injection of Dnmt1-overexpressing embryonic stem cells in diploid or tetraploid blastocysts resulted in lethality of the embryo, which resembled embryonic lethality caused by Dnmt1 deficiency.
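The association analysis between MeSH key terms and GO terms can be illustrated with a small sketch. It assumes a hypothetical training list of Medline entries, each given as a pair (MeSH terms, GO terms), and scores every MeSH-to-GO rule by support and confidence. The toy data and the function name `mesh_to_go_rules` are invented for illustration and are not taken from the talk, which used SAS tools.

```python
from collections import Counter
from itertools import product

def mesh_to_go_rules(entries, min_support=2):
    """Score MeSH -> GO rules by support (co-occurrence count) and confidence."""
    mesh_counts = Counter()
    pair_counts = Counter()
    for mesh_terms, go_terms in entries:
        mesh_counts.update(set(mesh_terms))
        pair_counts.update(product(set(mesh_terms), set(go_terms)))
    rules = {}
    for (mesh, go), support in pair_counts.items():
        if support >= min_support:
            # confidence = share of entries with this MeSH term that also have the GO term
            rules[(mesh, go)] = (support, support / mesh_counts[mesh])
    return rules

# Hypothetical toy data: (MeSH terms, GO terms) per Medline entry.
entries = [
    (["DNA Methylation", "Genomic Imprinting"], ["imprinting"]),
    (["DNA Methylation"], ["imprinting", "DNA replication"]),
    (["Apoptosis"], ["cell death"]),
    (["DNA Methylation", "Apoptosis"], ["cell death"]),
]
rules = mesh_to_go_rules(entries)
```

With this toy data only two rules reach the minimum support of 2: "Apoptosis" to "cell death" and "DNA Methylation" to "imprinting".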
Term occurrence frequency
• Occurrences of GO term synonyms in Medline texts
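Counting occurrences of GO term synonyms in a Medline text could look roughly like the sketch below. The synonym lists are hypothetical stand-ins (real ones would come from the Gene Ontology), and whole-word, case-insensitive matching is an assumption rather than the method used in the talk.

```python
import re
from collections import Counter

def count_go_term_occurrences(text, synonyms_by_term):
    """Count case-insensitive whole-phrase matches of each GO term's synonyms."""
    counts = Counter()
    for go_term, synonyms in synonyms_by_term.items():
        for synonym in synonyms:
            pattern = r"\b" + re.escape(synonym) + r"\b"
            counts[go_term] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# Hypothetical synonym lists used only for illustration.
synonyms_by_term = {
    "imprinting": ["imprinting", "imprinted"],
    "DNA methylation": ["methylation", "methylated"],
    "apoptosis": ["apoptosis", "programmed cell death"],
}
text = ("Dnmt1 overexpression causes genomic hypermethylation, "
        "loss of imprinting, and embryonic lethality.")
counts = count_go_term_occurrences(text, synonyms_by_term)
```

Note that whole-word matching does not count "hypermethylation" as an occurrence of "methylation"; stemming or substring matching would change that, at the cost of more false hits.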
Compare term definitions and texts
• Compare gene texts with GO term definitions (‘dictionary’-based)
[Figure: Medline entry fields: Title, Abstract, MeSH, EC, RN]
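One simple way to compare a gene text with GO term definitions is bag-of-words cosine similarity, sketched below. The choice of cosine similarity, the tokenizer and the paraphrased definitions are all assumptions made for illustration; the slides do not say which similarity measure was used.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase word counts; a real system would also stem and drop stop words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_go_terms(gene_text, go_definitions):
    """Return GO terms sorted by similarity of their definition to the gene text."""
    gene_vec = bag_of_words(gene_text)
    scores = {term: cosine_similarity(gene_vec, bag_of_words(definition))
              for term, definition in go_definitions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical, paraphrased definitions used only for illustration.
go_definitions = {
    "apoptosis": "a form of programmed cell death",
    "DNA methylation": "the addition of a methyl group to DNA",
}
ranking = rank_go_terms("loss of methyl group transfer to DNA in tumors", go_definitions)
```

Here the gene text shares more words with the "DNA methylation" definition, so that term is ranked first.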
Locating frequent phrases
• Locating known phrases indicating gene functions
• Pattern: [symbol] [something] may be associated with [process]
  – Example: The DNMT1 gene may be associated with imprinting
[Figure: Medline entry fields: Title, Abstract, MeSH, EC, RN]
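The phrase pattern "[symbol] [something] may be associated with [process]" can be approximated with a regular expression. The sketch below is one hypothetical encoding: the rule that a gene symbol is an upper-case token of at least two characters, and the function name `extract_candidates`, are assumptions and not part of the original method.

```python
import re

# Hypothetical pattern for "[symbol] [something] may be associated with [process]".
PHRASE = re.compile(
    r"\b(?P<symbol>[A-Z][A-Z0-9@]+)\b"      # gene symbol such as DNMT1
    r"(?P<something>[^.]*?)\s+"              # any words before the cue phrase
    r"may be associated with\s+"
    r"(?P<process>[^.,;]+)"                  # process name up to punctuation
)

def extract_candidates(text):
    """Return (gene symbol, process) pairs suggested by the cue phrase."""
    return [(m.group("symbol"), m.group("process").strip())
            for m in PHRASE.finditer(text)]

pairs = extract_candidates("The DNMT1 gene may be associated with imprinting.")
```

For the example sentence from the slide this yields the pair ("DNMT1", "imprinting"); the extracted process string would still have to be matched against GO term names.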
Singular Value Decomposition
• Term-by-document matrix
• Decomposition and reduction
• Conceptual relations revealed
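For reference, the decomposition and reduction step can be written in the standard latent semantic indexing form. The symbols are introduced here and are not from the slides: A is the term-by-document matrix with m terms and n documents, k is the number of retained dimensions, and d is the term vector of a new document that is projected ("folded in") into the reduced space. Whether the talk used exactly this projection is an assumption.

```latex
A \approx A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad
A \in \mathbb{R}^{m \times n},\;
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}

\hat{d} = \Sigma_k^{-1} U_k^{\mathsf{T}} d
```

Each row of V_k gives the coordinates of one training document in the k-dimensional space, and the second formula places a new document in that same space so that distances between texts reflect shared concepts rather than exact shared words.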
Finding conceptually similar texts
• Mixed GO/Medline space
  – Compare GO definition texts and gene texts
• Medline model space
  – Measure new gene texts against a pre-made model
[Figure: Medline entry fields: Title, Abstract, MeSH, EC, RN]
Mixed GO/Medline space
• Mix GO term definitions and gene texts from Medline
• Nearest neighbours
• Clustering
[Figure: points labelled GO and Medline in the mixed space]
Mixed space with rolled up terms
• Roll up GO terms
  – Larger GO texts
  – Fewer GO points
[Figure: points labelled GO and Medline in the mixed space]
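Rolling up GO terms can be read as merging the definition texts of descendant terms into their ancestors, which gives larger texts for fewer points. The sketch below follows that reading; it treats the hierarchy as a tree given by a hypothetical `children` mapping (the real Gene Ontology is a directed acyclic graph), and the shortened definitions are placeholders.

```python
def roll_up_definitions(children, definitions, root):
    """Concatenate each GO term's definition with those of all its descendants."""
    rolled = {}

    def visit(term):
        texts = [definitions.get(term, "")]
        for child in children.get(term, []):
            texts.append(visit(child))
        rolled[term] = " ".join(t for t in texts if t)
        return rolled[term]

    visit(root)
    return rolled

# Hypothetical fragment of the hierarchy, with shortened placeholder definitions.
children = {"death": ["cell death"], "cell death": ["apoptosis"]}
definitions = {
    "death": "cessation of life",
    "cell death": "death of a cell",
    "apoptosis": "programmed cell death",
}
rolled = roll_up_definitions(children, definitions, "death")
# Keep only the upper-level terms as points in the mixed space.
go_points = {term: rolled[term] for term in ("death", "cell death")}
```

Choosing which levels to keep trades annotation detail against the amount of text available per GO point.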
The training set:
Medline entries with related GO terms
[Figure: Gene Ontology terms linked to Medline entries, each entry with Title, Abstract, MeSH, EC and RN fields]
Medline model space
• Training set → model space
• All model points have a GO term
• Project new documents into the model space
• Memory-based reasoning
• Neural networks
[Figure: a new document and its nearest neighbour in the model space]
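Memory-based reasoning in the model space amounts to a nearest-neighbour lookup: a new document is projected into the space and inherits GO terms from the closest labelled model points. The sketch below assumes cosine similarity, k = 3 and made-up two-dimensional coordinates; none of these settings are given in the slides.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def suggest_go_terms(new_point, model_points, k=3):
    """Vote among the k model points most similar to the new document."""
    nearest = sorted(model_points,
                     key=lambda p: cosine(new_point, p[0]),
                     reverse=True)[:k]
    votes = Counter(go_term for _, go_term in nearest)
    return votes.most_common()

# Hypothetical 2-D coordinates in the reduced model space, each with a GO term.
model_points = [
    ((0.9, 0.1), "imprinting"),
    ((0.8, 0.2), "imprinting"),
    ((0.1, 0.9), "apoptosis"),
    ((0.2, 0.8), "apoptosis"),
]
suggestions = suggest_go_terms((0.7, 0.3), model_points, k=3)
```

In this toy example two of the three nearest neighbours carry "imprinting", so it receives the most votes.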
Model space with binary target
• Same model space
• Top to bottom
• Put documents into 1 to n GO nodes
• Process each target GO term independently
[Figure: nodes labelled 10, 6, 5, 5, 2, 4]
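Processing each target GO term independently suggests one binary (yes/no) target per term. The sketch below builds such a target under an added assumption that is not stated in the slides: a document annotated with a term also counts as a positive example for all of that term's ancestors. The `parents` mapping and document identifiers are hypothetical.

```python
def binary_targets(doc_terms, parents, target):
    """Return {doc_id: 0 or 1}: 1 if the document is annotated with `target`
    or with any term that has `target` as an ancestor."""
    def ancestors(term):
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for parent in parents.get(current, []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    targets = {}
    for doc_id, terms in doc_terms.items():
        covered = set(terms)
        for term in terms:
            covered |= ancestors(term)
        targets[doc_id] = int(target in covered)
    return targets

# Hypothetical hierarchy fragment and document annotations.
parents = {"apoptosis": ["cell death"], "cell death": ["death"]}
doc_terms = {"doc1": ["apoptosis"], "doc2": ["imprinting"], "doc3": ["cell death"]}
cell_death = binary_targets(doc_terms, parents, "cell death")
apoptosis = binary_targets(doc_terms, parents, "apoptosis")
```

Working from the top of the hierarchy downwards, a separate classifier could then be trained on each such target, so a document can end up in one or several GO nodes.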
Combining the results
• Voting scheme
  – Each method votes
  – Recommend the top 10-15 annotations
• Introduce a ‘trust weight’
  – Emphasize suggestions from the best performing methods
• Neural networks
• Expansion of the annotations
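A trust-weighted voting scheme of the kind described above might be sketched as follows. The method names, weights and suggestions are hypothetical, and summing the weights per GO term is only one plausible way to combine the votes.

```python
from collections import defaultdict

def combine_votes(suggestions_by_method, trust_weights, top_n=10):
    """Sum trust-weighted votes per GO term and return the top_n terms."""
    scores = defaultdict(float)
    for method, terms in suggestions_by_method.items():
        weight = trust_weights.get(method, 1.0)  # unweighted vote by default
        for term in set(terms):
            scores[term] += weight
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

# Hypothetical method names, suggestions and weights.
suggestions_by_method = {
    "term_frequency": ["imprinting", "DNA replication"],
    "phrase_matching": ["imprinting"],
    "model_space": ["imprinting", "apoptosis"],
}
trust_weights = {"term_frequency": 0.5, "phrase_matching": 1.0, "model_space": 2.0}
top = combine_votes(suggestions_by_method, trust_weights, top_n=2)
```

Setting every weight to 1.0 reduces this to the plain voting scheme; the trust weights would in practice be derived from how well each method performs on held-out annotations.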
Conclusion
• Designed and implemented methods for gene annotation
• Combined text and data information
• Performs better than a widely used public tool
• Methods to be part of the annotation process at Medical Research Center
• SAS® allowed for quick exploration of ideas:
  – Easy handling of large data sets
  – Pre-made tools (Enterprise Miner™)
General experience
• We have transformed key terms (atomic data) and free text into hierarchical decisions and categories
• Experience useful in
  – Organizations with much information stored as free text
  – Situations where atomic data fields and free text fields are used together
• Error report forms → error prediction
• Customer feedback forms
Acknowledgments
• Dr. Torulf Mollestad of SAS® Norway and Norwegian University of Science and Technology
• Dr. Astrid Lægreid of Medical Research Center, Trondheim, Norway
• SAS® Norway
• SAS® Academic Initiative
  – Ms. Nina Hanke