Introduction to Microarray Analysis (Section D1)

Functional Genomics
Introduction
Julie A Dickerson
Electrical and Computer Engineering
Iowa State University
Module Structure: Day 1




Introduction to Functional Genomics
Transcriptomics
Analysis and Experiment Design for
Microarray Data (Dr. Peng Liu)
RNA-Seq Data (Mr. Kun Liang)
LAB:

Using R for Normalizing, processing microarray
data, and clustering analysis of ‘omics data
(John Van Hemert)
Module Structure: Day 2




Metabolomics (Dr. Ann Perera)
Proteomics (Dr. Young-Jin Lee)
Pathways and data integration methods (Dr.
Julie Dickerson and Erin Boggess)
Lab:

Analyzing integrated sets of microarray, proteomics
and metabolomics data (Erin Boggess)
BBSI - 2010
June 15, 2010
F1: Outline




Module Structure
What is Functional Genomics?
Data Types Available
Transcriptomics




Basic biology behind microarrays
What can you learn from microarrays?
Types of arrays
Limitations of microarrays
Functional Genomics Definition


Functional genomics is a field of molecular
biology that attempts to make use of the data
produced by genomic projects to describe
gene (and protein) functions and interactions.
Functional genomics focuses on the dynamic
aspects such as gene transcription,
translation, and protein-protein interactions,
as opposed to the static aspects such as
DNA sequence or structures.
From Wikipedia, the free encyclopedia
Genome Wide View of Metabolism
Streptococcus
pneumoniae


Explore capabilities of global network
How do we go from a pretty picture to a
model we can manipulate?
Metabolic Pathways
[Figure: glycolysis, with the enzymes hexokinase, phosphoglucoisomerase, phosphofructokinase, aldolase, triosephosphate isomerase, G3P dehydrogenase, phosphoglycerate kinase, phosphoglycerate mutase, enolase and pyruvate kinase]
Metabolites: glucose
Enzymes: phosphofructokinase
Reactions & Stoichiometry: 1 F6P => 1 FBP
Kinetics
Regulation: gene regulation, metabolite regulation
Metabolic Modeling: The Dream
Data Types Available for Determining Function

Genomes: Sequence
Genes: Microarrays, Nextgen sequencing
Proteins: Proteomics
Metabolites: Metabolomics
Phenotypes: Phenomics
June 11, 2009
BBSI - 2009
A VERY Simplified Eukaryotic Cell
chromosome
nucleus
DNA strands
cytoplasm
DNA contains thousands of genes.
Dan Nettleton, Department of Statistics, IOWA STATE UNIVERSITY,Copyright © 2008 Dan Nettleton
Posttranscriptional Modifications to Primary Transcript
Primary transcript: 5’ UTR, coding portions of RNA sequence corresponding to exons, intervening sequences corresponding to introns that are removed through splicing, 3’ UTR
Primary transcript after modification: messenger RNA (mRNA)
mRNA: 5’ cap (G), 5’ UTR, coding portions of RNA sequence corresponding to exons, 3’ UTR, poly-A tail (AAAAAA...AAAA)
Transcription takes place inside the nucleus.
chromosome
nucleus
DNA strands
cytoplasm
Translation takes place outside the nucleus.
Translation
Ribosome
mRNA
amino acid sequence
folds to become a protein
During translation, transfer RNA (tRNA)
translates the genetic code
[Figure: mRNA codons ...UUA ACG... pair with the tRNA anticodons AAU and UGC, which carry the amino acids leu and thr]
The Genetic Code
Each entry is an mRNA codon followed by its amino acid.

First Base   Second Base U   Second Base C   Second Base A   Second Base G
U            UUU phe         UCU ser         UAU tyr         UGU cys
U            UUC phe         UCC ser         UAC tyr         UGC cys
U            UUA leu         UCA ser         UAA STOP        UGA STOP
U            UUG leu         UCG ser         UAG STOP        UGG trp
C            CUU leu         CCU pro         CAU his         CGU arg
C            CUC leu         CCC pro         CAC his         CGC arg
C            CUA leu         CCA pro         CAA gln         CGA arg
C            CUG leu         CCG pro         CAG gln         CGG arg
A            AUU ile         ACU thr         AAU asn         AGU ser
A            AUC ile         ACC thr         AAC asn         AGC ser
A            AUA ile         ACA thr         AAA lys         AGA arg
A            AUG met         ACG thr         AAG lys         AGG arg
G            GUU val         GCU ala         GAU asp         GGU gly
G            GUC val         GCC ala         GAC asp         GGC gly
G            GUA val         GCA ala         GAA glu         GGA gly
G            GUG val         GCG ala         GAG glu         GGG gly
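As a small illustration of how the genetic code is read, the sketch below builds the codon table in Python and translates an mRNA string three bases at a time. The function name translate and the example sequences are illustrative assumptions, not part of the original slides, and the sketch ignores details such as locating the start codon.

```python
from itertools import product

BASES = "UCAG"
# Amino acids listed in the order obtained by looping the first, second and
# third base over U, C, A, G (the same order as the genetic code table).
AMINO_ACIDS = (
    "phe phe leu leu ser ser ser ser tyr tyr STOP STOP cys cys STOP trp "
    "leu leu leu leu pro pro pro pro his his gln gln arg arg arg arg "
    "ile ile ile met thr thr thr thr asn asn lys lys ser ser arg arg "
    "val val val val ala ala ala ala asp asp glu glu gly gly gly gly"
).split()

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string codon by codon, stopping at the first STOP."""
    mrna = mrna.upper()
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for start in range(0, len(mrna) - len(mrna) % 3, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


print(translate("UUAACG"))        # the two codons from the tRNA slide
print(translate("AUGGCUUAAGGG"))  # translation ends at the UAA stop codon
```

Reading always starts at the first base here; a real transcript would first be scanned for its start codon and reading frame.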
Miscellaneous Comments

The biology is more complicated than I described.

Humans have somewhere around 30,000 genes.
(The exact number is a subject for debate.)
Regulation of these genes seems to be more
important than number!

Much of the variation is created by differences in
how cells use the genes they have.

Microarrays are a tool that can help us
understand how cells of various types use their
genes in response to varying conditions.
Microarrays



5/23/2017
With only a few exceptions, every
cell of the body contains a full set of
chromosomes and identical genes.
Only a fraction of these genes are
turned on, however, and it is the
subset that is "expressed" that
confers unique properties to each
cell type.
"Gene expression" is the term
used to describe the transcription of
the information contained within the
DNA, the repository of genetic
information, into messenger RNA
(mRNA) molecules that are then
translated into the proteins that
perform most of the critical
functions of cells.
BCB570 Gene Expression Data Analysis
Microarrays


Microarrays work by exploiting the ability of a given mRNA
molecule (target) to bind specifically to, or hybridize to,
the DNA template (probe) from which it originated.
Regulation of gene expression acts as both an "on/off" switch to control
which genes are expressed in a cell as well as a "volume
control" that increases or decreases the level of
expression of particular genes as necessary.
Source: The Genetic Science Learning Center, University of Utah
DNA Microarrays



Small, solid supports onto which the sequences
from thousands of different genes are immobilized,
or attached, at fixed locations.
The DNA is printed, spotted, or actually synthesized
directly onto the support.
The spots themselves can be DNA,
complementary DNA (cDNA, DNA synthesized
from an mRNA template), or oligonucleotides (oligos:
short fragments of single-stranded DNA
that are typically 5 to 50 nucleotides long).
Why do microarray experiments?

Comparing two conditions to find differentially
expressed genes
  Control/treatment
  Disease/normal
Compare more than two conditions, some of
which may interact
  Different treatments, different strains
Exploratory analysis
  What genes are expressed under drought stress?
Why use microarrays (cont)?

What happens over time?



Developmental stages
Predicting certain conditions (cancer vs.
normal)
Patterns of gene expression that characterize
a patient’s or organism’s response
Differentially Expressed Genes


Find genes that show a large difference in expression
between groups and are similar within a group
Statistical tests ask whether the groups have different
means (t-test) or variances (chi-squared, F-statistics)
Adapted from “Practical Microarray Analysis”, presentation by Benedikt Brors, German Cancer Research Center
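A minimal sketch of the idea, assuming log-scale expression values for two groups of replicate arrays: compute a t-statistic for each gene and rank genes by its absolute value. The Welch (unequal variance) form is used here as one common choice; the gene names and numbers are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch t-statistic: difference in means scaled by its standard error.

    Each group needs at least two values, and at least one group must have
    nonzero variance, otherwise the standard error is zero.
    """
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


def rank_genes(expression):
    """Return (gene, t) pairs sorted by |t|, largest first."""
    scored = [(gene, welch_t(a, b)) for gene, (a, b) in expression.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical log expression values: (control replicates, treatment replicates).
expression = {
    "geneA": ([8.1, 8.3, 7.9], [8.0, 8.2, 8.1]),
    "geneB": ([5.0, 5.2, 5.1], [7.9, 8.1, 8.0]),
}

for gene, t in rank_genes(expression):
    print(gene, round(t, 2))
```

With thousands of genes tested at once, the resulting p-values would also need a multiple-testing adjustment, which this sketch leaves out.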
Multiple Conditions
Mutant 1: Inoculated, Control
Mutant 2: Inoculated, Control
Are there differences in expression level between
the k conditions?
Analysis of Variance (ANOVA)
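For one gene measured under k conditions, the one-way ANOVA F-statistic compares variation between condition means with variation within conditions. The sketch below is a plain one-way version that treats the four mutant-by-treatment combinations as k = 4 separate conditions; it does not model the interaction structure, and the numbers are invented for illustration.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups of measurements."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k = len(groups)
    n_total = len(all_values)

    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical log expression of one gene in four conditions.
conditions = [
    [7.1, 7.3, 7.2],  # mutant 1, inoculated
    [6.0, 6.2, 6.1],  # mutant 1, control
    [7.0, 7.4, 7.2],  # mutant 2, inoculated
    [6.1, 6.0, 6.3],  # mutant 2, control
]
print(round(anova_f(conditions), 1))
```

A large F suggests that at least one condition mean differs; a two-way model would be needed to separate mutant, treatment and interaction effects.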
Some Example Microarray Experiments
from Iowa State University
Jim Reecy from Animal Science: muscle undergoing
hypertrophy vs. normal muscle
David Putthoff, Steve Rodermel, Thomas Baum from
Plant Pathology: roots infected with soybean cyst
nematodes vs. uninfected roots
Anne Bronikowski in Genetics: wheel-running mice vs.
non-runners
Roger Wise, Rico Caldo in Plant Pathology: interaction
between multiple isolates of powdery mildew and
multiple genotypes of barley.
Wild-type vs. Myostatin Knockout Mice
Belgian Blue cattle have a mutation in the myostatin gene.
Identifying Genes Involved in Pathways That Distinguish
Compatible from Incompatible Interactions

Bgh Isolate   Barley Genotype Mla6   Barley Genotype Mla13   Barley Genotype Mla1
5874          Incompatible           Compatible              Incompatible
K1            Compatible             Incompatible            Incompatible

Caldo, Nettleton, Wise (2004). The Plant Cell. 16, 2514-2528.
An Example Gene of Interest
[Figure: log expression versus hours after inoculation, for incompatible and compatible interactions]
Exploratory Analysis



Find patterns in data to see what genes are
expressed under different conditions
Analysis includes clustering methods
Used when little or no prior knowledge exists about
the problem
[Fig. 5: see Supplemental data at www.pnas.org for the full cluster diagram with all gene names]
Perou, Charles M. et al. (1999) Proc. Natl. Acad. Sci. USA 96, 9212-9217
Copyright ©1999 by the National Academy of Sciences
Time Series
Time points: 0 hours, 4 hours, 12 hours, 24 hours
Goal: find patterns of co-expressed genes over time or partial
time
Typical length is 3-10 time points
Cluster to find similar patterns (k-means, self-organizing
maps)
Correlations to find genes that behave like a given gene of
interest.
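The correlation approach can be sketched as follows: compute the Pearson correlation between the time profile of a gene of interest and every other gene, then sort. The profiles below are invented values at the four time points (0, 4, 12 and 24 hours), and the function names are illustrative.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def most_similar(profiles, gene_of_interest):
    """Rank all other genes by correlation with the gene of interest."""
    target = profiles[gene_of_interest]
    scored = [
        (gene, pearson(target, profile))
        for gene, profile in profiles.items()
        if gene != gene_of_interest
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical expression at 0, 4, 12 and 24 hours.
profiles = {
    "g1": [1, 2, 4, 8],
    "g2": [2, 4, 8, 16],  # same shape as g1, larger scale
    "g3": [8, 4, 2, 1],   # opposite trend
    "g4": [3, 3, 4, 3],   # nearly flat
}

for gene, r in most_similar(profiles, "g1"):
    print(gene, round(r, 2))
```

Because correlation ignores scale, g2 matches g1 closely even though its values are twice as large. A flat profile with zero spread would make the denominator zero and needs to be filtered out first.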
Classification



Learn characteristic patterns from a training
set and evaluate with a test set.
Classify tumor types based on expression
patterns
Predict disease susceptibility, stages, etc.
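To make the training set and test set idea concrete, here is a minimal sketch using a nearest-centroid classifier. The slides do not name a particular classifier; nearest-centroid is swapped in only because it is short. The samples, labels and two-gene profiles are hypothetical.

```python
def fit_centroids(samples, labels):
    """Average the training profiles of each class into one centroid."""
    sums, counts = {}, {}
    for sample, label in zip(samples, labels):
        if label not in sums:
            sums[label] = [0.0] * len(sample)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], sample)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, sample):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def squared_distance(centroid):
        return sum((x - c) ** 2 for x, c in zip(sample, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids, samples, labels):
    """Fraction of samples whose predicted class matches the true label."""
    hits = sum(predict(centroids, s) == l for s, l in zip(samples, labels))
    return hits / len(samples)


# Hypothetical two-gene expression profiles.
train_x = [[9, 1], [8, 2], [1, 9], [2, 8]]
train_y = ["tumor", "tumor", "normal", "normal"]
test_x = [[7, 3], [3, 7]]
test_y = ["tumor", "normal"]

centroids = fit_centroids(train_x, train_y)
print(accuracy(centroids, test_x, test_y))
```

The test samples are never used to build the centroids, which is what keeps the accuracy estimate honest.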
Source: “Practical Microarray Analysis”, presentation by Benedikt Brors, German Cancer Research Center
Some Commonly Used Tools
for Microarray Analysis

Oligonucleotide arrays

Affymetrix GeneChips

Nimblegen

Agilent
Oligonucleotides

An oligonucleotide is a short sequence of nucleotides.
(oligonucleotide=oligo for short)

An oligonucleotide microarray is a microarray whose
probes consist of synthetically created DNA
oligonucleotides.

Probe sequences are chosen to have good and
relatively uniform hybridization characteristics.

A probe is chosen to match a portion of its target mRNA
transcript that is unique to that sequence.

Oligo probes can distinguish among multiple mRNA
transcripts with similar sequences.
Simplified Example
[Figure: transcripts of gene 1 and gene 2; shared green regions indicate a high degree of sequence similarity throughout much of the transcript]
oligo probe for gene 1: ATTACTAAGCATAGATTGCCGTATA
oligo probe for gene 2: GCGTATGGCATGCCCGGTAAACTGG
Source: Dan Nettleton Course Notes Statistics 416/516X
Oligo Microarray Fabrication

Oligos can be synthesized and stored in
solution.

Oligo sequences can be synthesized on a slide
or chip using various commercial technologies.

The company Affymetrix uses a
photolithographic approach which we will
describe briefly.
Affymetrix GeneChips

Affymetrix (www.affymetrix.com) manufactures
GeneChips.

GeneChips are oligonucleotide arrays.

Each gene (more accurately sequence of interest or
feature) is represented by multiple short (25-nucleotide)
oligo probes.

Some GeneChips include probes for around 120,000
genes and gene variants.

mRNA that has been extracted from a biological sample
can be labeled (dyed) and hybridized to a GeneChip.

Only one sample is hybridized to each GeneChip.
Different Probe Pairs Represent Different Parts of the
Same Gene
gene sequence
Probes are selected to be specific to the target gene
and have good hybridization characteristics.
Source: Dan Nettleton Course Notes Statistics 416/516X
Affymetrix Probe Sets

A probe set is used to measure mRNA levels of a single
gene.

Each probe set consists of multiple probe cells.

Each probe cell contains millions of copies of one oligo.

Each oligo is intended to be 25 nucleotides in length.

Probe cells in a probe set are arranged in probe pairs.

Each probe pair contains a perfect match (PM) probe
cell and a mismatch (MM) probe cell.

A PM oligo perfectly matches part of a gene sequence.

A MM oligo is identical to a PM oligo except that the
middle nucleotide (13th of 25) is replaced by its
complementary nucleotide.
A Probe Set for Measuring Expression Level of a
Particular Gene
gene sequence: ...TGCAATGGGTCAGAAGGACTCCTATGTGCCT...
perfect match sequence: AATGGGTCAGAAGGACTCCTATGTG
mismatch sequence: AATGGGTCAGAACGACTCCTATGTG
[Figure labels: probe cell, probe pair, probe set]
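The rule for deriving a mismatch oligo from a perfect match oligo can be checked with a few lines of Python. The sequences are the ones shown on the slide; the function name mismatch_probe is an illustrative assumption.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(perfect_match):
    """Replace the middle nucleotide (13th of 25) by its complement."""
    middle = len(perfect_match) // 2  # index 12 for a 25-mer
    return (
        perfect_match[:middle]
        + COMPLEMENT[perfect_match[middle]]
        + perfect_match[middle + 1:]
    )


gene_fragment = "TGCAATGGGTCAGAAGGACTCCTATGTGCCT"
pm = "AATGGGTCAGAAGGACTCCTATGTG"

print(len(pm))              # length of the oligo
print(pm in gene_fragment)  # the PM oligo matches part of the gene sequence
print(mismatch_probe(pm))   # the MM oligo differs only at position 13
```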
Affymetrix’s Photolithographic Approach
[Figure: a sequence of masks and the nucleotides (A, T, G, C) added step by step to build the oligos on a GeneChip]
Image from Hybridized GeneChip
Source: www.affymetrix.com
Image Processing for Affymetrix GeneChips




Image processing for Affymetrix GeneChips is
typically done using proprietary Affymetrix
software.
The entire surface of a GeneChip is covered
with square-shaped cells containing probes.
Probes are synthesized on the chip in precise
locations.
Thus spot finding and image segmentation are
not major issues.
Probe Cell
Each probe cell is 8 x 8 = 64 pixels; border pixels are excluded.
The 75th percentile of the 36 pixel intensities corresponding
to the center 36 pixels is used to quantify fluorescence
intensity for each probe cell.
These values are called PM values for perfect-match probe cells
and MM values for mismatch probe cells.
The PM and MM values are used to compute
expression measures for each probe set.
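The probe cell calculation can be sketched as below. The exact percentile rule used by the Affymetrix software is not given in the slides, so the nearest-rank definition is assumed here; the pixel values are invented.

```python
import math


def probe_cell_intensity(pixels, percentile=75):
    """Percentile of the interior pixels of a square probe cell.

    `pixels` is a list of rows; the outermost row and column on every
    side are treated as border pixels and excluded.
    """
    interior = sorted(v for row in pixels[1:-1] for v in row[1:-1])
    # Nearest-rank percentile (an assumption): the smallest value with at
    # least `percentile` percent of the interior pixels at or below it.
    rank = math.ceil(percentile / 100 * len(interior))
    return interior[rank - 1]


# Hypothetical 8 x 8 cell: border pixels set to 1000, interior set to 1..36.
cell = [[1000] * 8 for _ in range(8)]
value = 1
for r in range(1, 7):
    for c in range(1, 7):
        cell[r][c] = value
        value += 1

print(probe_cell_intensity(cell))  # prints 27
```

The border values of 1000 have no effect on the result, which is the point of excluding them.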
Normalization



Outputs from each individual probe pair are
statistically combined to give an expression
level for the gene represented by the probe
set.
Normalization accounts for background noise
on the chip, levels of control probes, etc.
Key methods are MAS 5.0, RMA, and GCRMA
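As one concrete example, RMA includes a quantile normalization step that forces every array to share the same distribution of intensities. The sketch below shows only that step, not background correction or probe set summarization, and it breaks ties by position, which is a simplification. The intensities are invented.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length intensity lists.

    Each value is replaced by the mean, across arrays, of the values
    that share its rank.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean of the i-th smallest value across all arrays.
    rank_means = [
        sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)
    ]

    normalized = []
    for a in arrays:
        # Indices of this array ordered from smallest to largest value.
        order = sorted(range(n), key=lambda i: a[i])
        result = [0.0] * n
        for rank, index in enumerate(order):
            result[index] = rank_means[rank]
        normalized.append(result)
    return normalized


# Hypothetical intensities for three probes on two arrays.
chips = [[5, 2, 3], [4, 1, 6]]
print(quantile_normalize(chips))
```

After normalization both arrays contain the same set of values; only the assignment of values to probes differs.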
Summary of Microarrays


Positives: commercial chips are accurate
and repeatable in experienced hands and the
statistics and modeling have been well explored
Negatives: cost, can only see what is on the
chip and difficult to update to new knowledge.
June 11, 2007
BBSI - 2007
Short Read Sequencing



Sequencing technology has evolved in the
last 15 years
Eventual goal is to be able to sequence a
genome for $1000 (NIH).
Why not just sequence the transcriptome
directly and see what is there?
Sequencing by synthesis (454)



Takes a single strand of DNA and
synthesizes its complementary strand
enzymatically one base pair at a
time, detecting which base was actually added
at each step.
Pyrosequencing detects the activity of DNA
polymerase with a chemiluminescent
enzyme.
Reads are about 400-500 bp
Other Technologies


Illumina Solexa: 40-100 bp, tag DNA or RNA
at both ends
ABI SOLiD: around 50 bp
Digital Gene Expression
Sequence census methods for functional genomics
Barbara Wold & Richard M Myers