Gene Expression and Networks
Microarray Analysis
• Unsupervised Methods
– Partition methods: k-means, SOM (Self-Organizing Maps)
– Hierarchical clustering
• Supervised Methods
– Analysis of variance
– Discriminant analysis
– Support Vector Machine (SVM)
Clustering
• Grouping genes together according to their
expression profiles.
• Hierarchical clustering: generate a tree
– Each gene is a leaf on the tree
– Distances reflect similarity of expression
– Internal nodes represent groups of co-expressed genes (candidate functional groups)
– Similar approach to phylogenetic trees
• k-means clustering: generate k groups
– Number k is chosen in advance
– Each group represents similar expression
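As a rough illustration of the k-means idea above, here is a minimal sketch in plain Python. The function name `kmeans`, the toy expression values, and the farthest-point initialization are assumptions made for this example, not something taken from the lecture; real analyses would normally use an established library.

```python
import math


def kmeans(profiles, k, iterations=100):
    """Cluster expression profiles (lists of floats) into k groups.

    Initialization is an assumption of this sketch: start from the first
    profile, then repeatedly add the profile farthest from the centroids
    chosen so far.
    """
    centroids = [list(profiles[0])]
    while len(centroids) < k:
        far = max(profiles, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(iterations):
        # Assignment step: each gene joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean profile of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical expression profiles: four genes measured in three conditions.
toy_profiles = [
    [1.0, 1.2, 0.9],
    [1.1, 1.0, 1.0],
    [5.0, 5.2, 4.9],
    [5.1, 4.8, 5.0],
]
labels, centroids = kmeans(toy_profiles, k=2)
print(labels)
```

Note that k has to be supplied by the caller, which mirrors the point that the number of groups is chosen in advance.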
Hierarchical Clustering Example
Five separate clusters are indicated
by colored bars and by identical
coloring of the corresponding
region of the dendrogram. The
sequence-verified named genes in
these clusters contain multiple
genes involved in (A) cholesterol
biosynthesis, (B) the cell cycle,
(C) the immediate-early response,
(D) signaling and angiogenesis,
and (E) wound healing and tissue
remodeling. These clusters also
contain named genes not involved
in these processes and numerous
uncharacterized genes.
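A dendrogram like the one described here comes from agglomerative clustering. The sketch below is one possible version, assuming average linkage and Euclidean distance; the source does not say which linkage or distance was used, and the function name and toy data are hypothetical.

```python
import math


def hierarchical_clustering(profiles):
    """Average-linkage agglomerative clustering.

    Leaves are numbered 0..n-1; each merge creates a new node id.
    Returns a list of (node_a, node_b, distance) merges, which is enough
    to draw a dendrogram.
    """
    clusters = {i: [i] for i in range(len(profiles))}
    merges = []
    next_id = len(profiles)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Average pairwise distance between members of the two clusters.
                total = sum(
                    math.dist(profiles[i], profiles[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


# Hypothetical profiles: genes 0 and 1 are similar, as are genes 2 and 3.
toy_profiles = [
    [1.0, 1.2, 0.9],
    [1.1, 1.0, 1.0],
    [5.0, 5.2, 4.9],
    [5.1, 4.8, 5.0],
]
for a, b, d in hierarchical_clustering(toy_profiles):
    print(f"merge {a} + {b} at distance {d:.3f}")
```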
Expression Correlation
• Similar expression between genes can mean:
– One gene controls the other in a pathway
– Both genes are controlled by another gene
– Both genes are required at the same time in the cell cycle
– Both genes have similar function
• Clusters can help identify regulatory motifs
– Search for motifs in upstream promoter regions
of all the genes in a cluster
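One common way to quantify similar expression is the Pearson correlation between two expression profiles. The sketch below assumes that measure; the gene names, values, and 0.9 threshold are made up for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def coexpressed_pairs(profiles, threshold=0.9):
    """Return gene pairs whose correlation is at least `threshold`."""
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if pearson(profiles[a], profiles[b]) >= threshold
    ]


# Hypothetical profiles over four time points.
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
print(coexpressed_pairs(toy))
```

A high correlation alone does not say which of the explanations listed above applies; it only flags a pair for follow-up.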
Support Vector Machine (SVM)
• As applied to gene expression data, an SVM
would begin with a set of genes that have a
common function, for example, genes coding for
components of the proteasome (positive set).
In addition, a separate set of genes that are known
not to be members of the functional class
(negative set) is specified.
• Using this training set, an SVM would learn to
discriminate between the members and non-members of a given functional class based on
expression data.
• Having learned the expression features of the
class, the SVM could recognize new genes as
members or as non-members of the class based on
their expression data.
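The train-then-classify workflow above can be sketched with a small linear SVM trained by subgradient descent on the regularized hinge loss. Everything here is an assumption for illustration: the function names, the hyperparameters, the omission of a bias term (which presumes expression values centered around zero, such as log-ratios), and the toy data. It is not the method used in any particular study.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_svm(examples, labels, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * |w|^2 + mean(hinge loss) by batch subgradient descent.

    `labels` are +1 (positive set) or -1 (negative set).
    No bias term is used, to keep the sketch short.
    """
    n = len(examples)
    w = [0.0] * len(examples[0])
    for _ in range(epochs):
        grad = [lam * wi for wi in w]
        for x, y in zip(examples, labels):
            if y * dot(w, x) < 1:  # example is inside the margin or misclassified
                for j, xj in enumerate(x):
                    grad[j] -= y * xj / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def predict(w, x):
    """Return +1 (member of the class) or -1 (non-member)."""
    return 1 if dot(w, x) > 0 else -1


# Hypothetical log-ratio expression values in two conditions.
positive_set = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0]]        # known members
negative_set = [[-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]  # known non-members
X = positive_set + negative_set
y = [1] * len(positive_set) + [-1] * len(negative_set)

w = train_linear_svm(X, y)
print(predict(w, [2.5, 2.5]))    # classify a new, unlabeled gene
print(predict(w, [-1.0, -2.0]))
```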
How do SVMs work?
Knowing the label of each example, the SVM looks for a hyperplane that separates
all training examples correctly and maximizes the margin, i.e. the distance
from the hyperplane to the nearest points of each class.
If this is not possible in the input space, a kernel maps the data into
a higher-dimensional space and the SVM searches for a separating hyperplane there.
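A tiny example may help with the move to a higher-dimensional space. In the sketch below, one-dimensional points cannot be split by any single threshold, but after the map x -> (x, x^2) a straight line separates them. The data, the feature map, and the hand-picked hyperplane are all assumptions chosen for illustration; a real SVM would learn the hyperplane and would use the kernel without computing the map explicitly.

```python
def separable_by_threshold(xs, ys):
    """True if some threshold on the 1-D input separates the two classes."""
    pos = [x for x, y in zip(xs, ys) if y > 0]
    neg = [x for x, y in zip(xs, ys) if y < 0]
    return max(pos) < min(neg) or max(neg) < min(pos)


def feature_map(x):
    """Map a 1-D input into 2-D: x -> (x, x^2)."""
    return (x, x * x)


def kernel(x, y):
    """Kernel matching feature_map: the dot product of the mapped points."""
    return x * y + (x * y) ** 2


def classify(z, w=(0.0, 1.0), b=-1.0):
    """Hand-picked hyperplane x^2 = 1 in the mapped space (not learned)."""
    return 1 if w[0] * z[0] + w[1] * z[1] + b > 0 else -1


# Hypothetical data: the positive class lies on both sides of the negative class.
xs = [-2.0, -1.5, 0.0, 0.5, 1.5, 2.0]
ys = [1, 1, -1, -1, 1, 1]

print(separable_by_threshold(xs, ys))
print([classify(feature_map(x)) for x in xs])
```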
Probe Selection
• Probe on DNA chip is shorter than target
– Choice of which section to hybridize
• Select a region which is unstructured
– RNA folding, DNA stem-and-loop
• Choose region which is target-specific
– Avoid cross-hybridization with other DNA
• Avoid regions containing variation
– Minimize presence of SNP sites
Probe Design
Two main factors to optimize
• Sensitivity
– Strength of interaction with target sequence
– Requires knowledge of target only
• Specificity
– Weakness of interaction with other sequences
– Requires knowledge of ‘background’
Sensitivity
• Basic measure: best gapless alignment of
entire probe against part of target sequence
[Figure: the probe CTACACGA slid along the target
AGTGCAAGTCCGATATGCCGTAATGCTATCA, each position scored as
-1 per mismatch and +1 per matching base pair; example scores:
-2+6=+4, -7+1=-6, -6+2=-4, -8]
• Better: +3 for C–G, +2 for A–T, etc…
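The sliding-probe scoring can be sketched as follows. This assumes that a match means a complementary base pair read in the same left-to-right orientation as the slide, with +1 per pair and -1 per mismatch, which is consistent with the example scores. In the weighted variant, the mismatch penalty of -1 is an assumption, since the slide only gives the weights for C–G and A–T.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Weights from the slide: +3 for C-G, +2 for A-T (keys are probe base, target base).
PAIR_WEIGHTS = {("C", "G"): 3, ("G", "C"): 3, ("A", "T"): 2, ("T", "A"): 2}


def alignment_score(probe, window):
    """Basic measure: +1 per complementary pair, -1 per mismatch."""
    return sum(1 if COMPLEMENT[p] == t else -1 for p, t in zip(probe, window))


def weighted_score(probe, window, mismatch=-1):
    """Weighted measure; the mismatch penalty is an assumption."""
    return sum(PAIR_WEIGHTS.get((p, t), mismatch) for p, t in zip(probe, window))


def best_gapless_alignment(probe, target, score=alignment_score):
    """Slide the probe along the target; return (best score, offset)."""
    best = None
    for offset in range(len(target) - len(probe) + 1):
        s = score(probe, target[offset:offset + len(probe)])
        if best is None or s > best[0]:
            best = (s, offset)
    return best


probe = "CTACACGA"
target = "AGTGCAAGTCCGATATGCCGTAATGCTATCA"
print(best_gapless_alignment(probe, target))
print(best_gapless_alignment(probe, target, score=weighted_score))
```

With the basic measure, the best position has six pairs and two mismatches, matching the +4 shown on the slide.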
Selectivity
• Measured by the E-value
• Can be calculated by running BLAST with the probe
against the genome studied in the specific
experiment.
Sources of Inaccuracy
• Some sequences bind better than others
– Cross-hybridization, A–T versus G–C
• Scanning of microarray images
– Scratches, smears, cell spillage
• Effects of experimental conditions
– Point in cell cycle, temperature, density
Gene Expression Databases
and Resources on the Web
• GEO (Gene Expression Omnibus)
– http://www.ncbi.nlm.nih.gov/geo/
• List of gene expression web resources
– http://industry.ebi.ac.uk/~alan/MicroArray/
• Another list with literature references
– http://www.gene-chips.com/
• Cancer Genome Anatomy Project
– http://cgap.nci.nih.gov/
• Stanford Microarray Database
– http://genome-www.stanford.edu/microarray/
Functional Genomics
The task is to define the function of a gene
(or its protein) in the life processes of the
organism, where function refers to the role
it plays in a larger context.
GO (Gene Ontology)
http://www.geneontology.org/
• The GO project aims to develop three
structured, controlled vocabularies (ontologies)
that describe gene products in terms of their
associated
– molecular functions (F)
– biological processes (P)
– cellular components (C)
• An ontology is a description of the concepts and relationships that can
exist for an agent or a community of agents
GO Annotations
RIM11 GO evidence and references
Molecular Function
– glycogen synthase kinase 3 activity (ISS)
– protein serine/threonine kinase activity (IDA)
Biological Process
– protein amino acid phosphorylation (IGI, ISS)
– proteolysis (IGI)
– response to stress (IGI, IMP)
– sporulation (sensu Fungi) (IMP)
Cellular Component
– cytoplasm (IDA)
Extracted from SGD Saccharomyces Genome Database
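For readers who want to work with annotations like these programmatically, one simple representation is a dictionary keyed by ontology. The layout and the helper `terms_with_evidence` are hypothetical choices for this sketch; the terms and evidence codes are the RIM11 entries listed above, and the code expansions are the standard GO meanings.

```python
EVIDENCE_CODES = {
    "IDA": "Inferred from Direct Assay",
    "IGI": "Inferred from Genetic Interaction",
    "IMP": "Inferred from Mutant Phenotype",
    "ISS": "Inferred from Sequence or structural Similarity",
}

# RIM11 annotations as listed above: (GO term, evidence codes) per ontology.
RIM11 = {
    "molecular_function": [
        ("glycogen synthase kinase 3 activity", ["ISS"]),
        ("protein serine/threonine kinase activity", ["IDA"]),
    ],
    "biological_process": [
        ("protein amino acid phosphorylation", ["IGI", "ISS"]),
        ("proteolysis", ["IGI"]),
        ("response to stress", ["IGI", "IMP"]),
        ("sporulation (sensu Fungi)", ["IMP"]),
    ],
    "cellular_component": [
        ("cytoplasm", ["IDA"]),
    ],
}


def terms_with_evidence(annotations, code):
    """List GO terms supported by the given evidence code."""
    return [
        term
        for terms in annotations.values()
        for term, codes in terms
        if code in codes
    ]


# Terms backed by a direct experimental assay.
print(terms_with_evidence(RIM11, "IDA"))
```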
Cellular Processes
• The cell is a dynamic entity
– Grows, divides, responds to environmental changes
• Cellular processes are composed of molecular interactions
[Figure: yeast cell cycle]
• Different cellular processes can be
represented as graphs
– Genetic networks
– Metabolic pathways
– Regulatory networks
– Protein-protein interaction networks
Representing Genetic Networks
Schema: Entity, Relationship, Entity, Enabler
• Entity: gene, protein, ligand
• Relationship: enhances, represses, becomes
• Enabler: energy source, catalyst
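The entity, relationship, entity, enabler scheme maps naturally onto a list of labeled edges. The sketch below is one possible encoding; the class name, field names, and the example interactions (geneA, proteinC and so on) are hypothetical and do not describe real biology.

```python
from typing import NamedTuple, Optional


class Interaction(NamedTuple):
    source: str                     # entity: gene, protein or ligand
    relationship: str               # "enhances", "represses" or "becomes"
    target: str                     # entity: gene, protein or ligand
    enabler: Optional[str] = None   # energy source or catalyst, if any


# Hypothetical network used only to illustrate the data structure.
network = [
    Interaction("geneA", "enhances", "geneB"),
    Interaction("proteinC", "represses", "geneB"),
    Interaction("proteinC", "becomes", "proteinC_P", enabler="ATP"),
]


def regulators_of(interactions, target):
    """Return (source, relationship) for every entity regulating `target`."""
    return [
        (i.source, i.relationship)
        for i in interactions
        if i.target == target and i.relationship in ("enhances", "represses")
    ]


print(regulators_of(network, "geneB"))
```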
Metabolic pathways
Regulatory Network
Analysis of transcription regulation networks
Network motifs: connected patterns of interactions
that recur in the integrated cellular network
significantly more often than expected at random
Network motifs in transcription regulation networks (Shen-Orr S. et al., 2002):
• Feed-forward loop
• Single input module
• Dense regulons
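As an illustration of motif detection, the sketch below enumerates feed-forward loops (X regulates Y, and both X and Y regulate Z) in a directed graph given as an edge list. The function name and the toy edges are assumptions for this example. It only counts occurrences; judging whether a motif is over-represented would also require comparing against randomized networks, which is not shown.

```python
def feed_forward_loops(edges):
    """Return all (x, y, z) with edges x->y, y->z and x->z, for distinct nodes."""
    targets = {}
    for src, dst in edges:
        targets.setdefault(src, set()).add(dst)

    loops = []
    for x, x_targets in targets.items():
        for y in x_targets:
            if y == x:
                continue  # ignore self-regulation
            for z in targets.get(y, ()):
                if z != x and z != y and z in x_targets:
                    loops.append((x, y, z))
    return sorted(loops)


# Hypothetical regulatory edges (regulator, target).
toy_edges = [("X", "Y"), ("X", "Z"), ("Y", "Z"), ("Z", "W")]
print(feed_forward_loops(toy_edges))
```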
A network of interactions can be built
for all proteins in an organism
P1      P2      DATA TYPE
Gal4    Gal80
Ste12   Dig2
Swi4    Swi6
…….
A large network of 8184 interactions among 4140 S. cerevisiae
proteins
High-throughput biological data is required
for generating networks
• Measure direct interactions
– DNA footprinting
– One-hybrid, two-hybrid experiments
– Accurate but low throughput
[Figure: yeast two-hybrid]
Networks generated from
microarray data are less accurate
• Expression levels with microarrays
– Examine expression correlations
– Problem: multiple interpretations
– High throughput but only suggestive
Other Resources
• BioCyc
– http://www.biocyc.org/
• Biomolecular Interaction Network Database
– http://www.bind.ca/
• ‘What is There’ Interaction Database
– http://wit.mcs.anl.gov/WIT2/
• Gene Ontology Consortium
– http://www.geneontology.org/