Next generation sequencing
www.gatc.co.uk/cgi-bin/wPrintpreview.cgi?sour...
Depiction of a new type of highly
miniaturized microarray that incorporates
randomness in its design.
Each array contains ~50,000 beads carrying
oligonucleotide probes. The beads are lodged
in wells on the surface of a hexagonally
packed optical fiber bundle.
The location and identity of the randomly
arrayed beads are determined using a
hybridization-based decoding process.
Among other applications, the decoded arrays
have been used to develop a microarray
based gene expression profiling assay that
makes use of PCR, and to carry out
genotyping from small amounts of human
genomic DNA using whole-genome
amplification. (Artwork provided by Andrew
Roberts at Studio 209, Portland, OR.)
Next-generation DNA and RNA sequencing methods
- Bead-amplification sequencing (Roche/454 FLX)
- Sequencing by synthesis (Illumina/Solexa Genome Analyzer)
- Sequencing by ligation (Applied Biosystems SOLiD System)
- Helicos HeliScope (2008)
- Pacific Biosciences SMRT (2010)
Common features:
- A complex interplay of enzymology, chemistry, software, hardware, and optics engineering
- Streamlined sample preparation prior to sequencing (time saving)
- Preparation of fragment libraries of the DNA of interest by ligation of platform-specific linkers
and amplification
- Amplification of the single-stranded fragment library and sequencing of the amplified
fragments
- Single-molecule sequencing has just arrived or is under development
Comparison between capillary sequencers and next-generation sequencers
- A huge difference in the throughput of a single run: 96 capillaries of about 750 bp each, compared
with several thousand (Roche) to tens of millions (Illumina, ABI) of shorter reads.
- Run times are longer on next-generation sequencers (8 h to 10 days)
- The longer run times are required to image the massively parallel sequencing reactions
- Because of the streamlined preparation and long run times, a single human operator can
run several next-generation machines at full capacity.
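The throughput gap can be made concrete with a little arithmetic. The sketch below multiplies read count by read length, using the capillary figures quoted above and the GS FLX Titanium figures from the 454 timeline later in these slides; the function name is only illustrative.

```python
# Bases per run = number of reads x read length (figures taken from the slides).
def bases_per_run(reads, read_length_bp):
    return reads * read_length_bp

capillary = bases_per_run(96, 750)          # 96 capillaries of about 750 bp
titanium = bases_per_run(1_000_000, 400)    # GS FLX Titanium: 1 million reads of 400 bp

print(capillary)              # 72000
print(titanium // capillary)  # 5555 (fold difference, rounded down)
```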
ABI
Solexa
New (next) generation of sequencing
- Sequencing of the human genome took several years, using approx. 20 kb BAC clones that
contained target fragments approx. 100 kb long, and 8-fold coverage of each part of the target. Analysis by
capillary electrophoresis.
- Further development of sequencing was based on simultaneous sequencing of the whole genome (WGS,
whole genome sequencing), which was inserted into vectors. The method is faster but leaves large gaps
in highly polymorphic or repetitive genomes. Analysis by capillary electrophoresis.
- Next-generation sequencing (2004): high-throughput parallel reading of DNA segments at the
whole-genome level via PCR amplification of single-stranded fragments of a genomic library.
Classical Sanger dideoxy sequencing method
Dideoxynucleotide sequencing represents only one method of
sequencing DNA. It is commonly called Sanger sequencing since
Sanger devised the method. This technique utilizes 2',3'-dideoxynucleotide triphosphates (ddNTPs), molecules that differ
from deoxynucleotides by having a hydrogen atom attached
to the 3' carbon rather than an OH group. These molecules
terminate DNA chain elongation because they cannot form a
phosphodiester bond with the next deoxynucleotide.
campus.queens.edu/faculty/jannr/molecular/
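A minimal sketch of the chain-termination idea: every position on the growing strand can end in a ddNTP, so the reaction yields one fragment per length, and sorting the fragments by size reads out the sequence. The template is assumed to be written 3'→5' so that the new strand comes out 5'→3'; the function names and the example template are illustrative only.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def synthesized_strand(template_3to5):
    """Strand made by the polymerase, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)

def terminated_fragments(template_3to5):
    """One (length, terminating ddNTP base) pair per possible stop position."""
    strand = synthesized_strand(template_3to5)
    return [(i + 1, base) for i, base in enumerate(strand)]

def read_by_size(fragments):
    """Electrophoresis orders fragments by length; the terminal bases spell the read."""
    return "".join(base for _, base in sorted(fragments))

fragments = terminated_fragments("TACGGT")
fragments.reverse()              # fragments arrive in no particular order
print(read_by_size(fragments))   # ATGCCA
```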
General concepts for clonal-array generation and sequencing
a | Bead-chips. Genomic DNA is fragmented and adaptors are ligated to create an insert library that is
flanked by two universal priming sites. Because of the random fragmentation, the complexity of this
signature sequence library is equivalent to the genome. This library is cloned on beads using emulsion
PCR technology. A water-in-oil emulsion is created from a PCR mix that contains a limiting dilution of
DNA and beads. The emulsion creates micro-compartments with, on average, a single bead and single
DNA template each. After PCR, beads with clones are affinity selected and assembled onto a planar
substrate. A subsequent cycle-sequencing reaction is used to read out the sequence on the clones.
b | Sequencing by synthesis (SBS). A common anchor primer is annealed to a constant sequence
(universal priming site) that is contained within the library clones that are located on the polony (clonal
bead) array (the orientation of the immobilized target might vary depending on the platform that is used).
The sequence is read out by polymerase extension in a base-by-base fashion using either reversible
terminators or sequential nucleotide addition (pyrosequencing). After incorporation of a single base or
base type, the incorporated base is identified by fluorescence (laser) or chemiluminescence (no laser
required).
c | Sequencing by ligation. The polony array set-up is similar to SBS in which a common primer is
annealed to an arrayed polony library and used to read out the sequence through a stepwise ligation of
random oligomers. The labelled oligomers are designed to have random bases inserted at every site
except the query site. The query site has one of four base substitutions, each matched to a particular
fluorescent label on the oligonucleotide. After read-out of each ligation event, the primer and the ligated
oligomer are stripped, a new primer reannealed and the process repeated with an oligomer that contains
a query base at a different position.
Fan et al. Nature Reviews Genetics 7, 632–644 (August 2006) | doi:10.1038/nrg1901
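The limiting dilution in panel a can be illustrated with a short calculation. The sketch assumes that templates fall into emulsion compartments independently, so that the number per compartment follows a Poisson distribution; that model and the mean of 1.0 are assumptions for illustration, not figures from the source.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k templates in a compartment (Poisson assumption)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

mean_templates = 1.0                  # assumed average templates per compartment
p_empty = poisson_pmf(0, mean_templates)
p_single = poisson_pmf(1, mean_templates)
p_mixed = 1.0 - p_empty - p_single    # two or more templates: mixed signal

print(round(p_empty, 3), round(p_single, 3), round(p_mixed, 3))
```

Under this assumption only a minority of compartments hold exactly one template, which fits the caption's note that beads carrying clones are affinity selected after PCR.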
Sequencing with novel generation bead-chips
FIGURE 1. A single well within a picotiter plate includes DNA copies bound to the bead as sequencing reagents are
added. (Courtesy of Roche)
The Roche system uses native, unmodified DNA bases in its process. In the DNA preparation step, the DNA sample is
sheared into small fragments that are then attached to 26 µm beads, one fragment to one bead. Then, in a process called
emulsion PCR, the DNA is amplified so that each bead carries 100,000 copies of the original DNA fragment. The DNA-coated beads are then loaded into the wells of a 1.6 million-well picotiter plate so that, on average, there is one bead per
well. The wells of the picotiter plate are made of fiber-optic material so that they can transmit (via a coupled CCD imager) the
light signals that are used to indicate the DNA sequence. The sequencing reagents and one of the four DNA bases are
added to start the pyrosequencing process. The camera records all the wells in which that base was incorporated, and the intensity
of the signal is used to infer the number of times that base was added to the growing strand. Then that base is washed away
and the second base is added, and so on, until the sequence of the fragment is established.
Principles of Pyrosequencing
Pyrosequencing I
Step 1
A sequencing primer is hybridized to a
single-stranded PCR amplicon that
serves as a template.
The mixture is incubated with the enzymes
DNA polymerase, ATP sulfurylase,
luciferase, and apyrase, as well as the
substrates adenosine 5'
phosphosulfate (APS) and luciferin.
Step 2
The first deoxyribonucleotide triphosphate (dNTP)
is added to the reaction. DNA polymerase
catalyzes the incorporation of the deoxyribonucleotide triphosphate into the DNA strand, if it
is complementary to the base in the template
strand.
Each incorporation event is accompanied by
release of pyrophosphate (PPi) in a quantity
equimolar to the amount of incorporated
nucleotide.
www1.qiagen.com/Products/PyroMarkQ96ID.aspx
Pyrosequencing II
Step 3
ATP sulfurylase converts PPi to ATP in the
presence of adenosine 5' phosphosulfate
(APS).
ATP drives the luciferase-mediated
conversion of luciferin to oxyluciferin
that generates visible light in amounts that
are proportional to the amount of ATP.
The light produced in the luciferase-catalyzed reaction is detected by a charge-coupled
device (CCD) chip and seen as a
peak in the raw data output (Pyrogram).
The height of each peak (light signal) is
proportional to the number of nucleotides
incorporated.
Step 4
Apyrase, a nucleotide-degrading enzyme,
continuously degrades unincorporated
nucleotides and ATP. When degradation is
complete, another nucleotide is added.
www1.qiagen.com/Products/PyroMarkQ96ID.aspx
Pyrosequencing III
Step 5
Addition of dNTPs is performed sequentially. Note that deoxyadenosine alpha-thio
triphosphate (dATPαS) is used as a substitute for the natural deoxyadenosine triphosphate (dATP),
since it is efficiently used by the DNA polymerase but not recognized by the luciferase. As the
process continues, the complementary DNA strand is built up and the nucleotide sequence is
determined from the signal peaks in the Pyrogram trace.
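Steps 1 to 5 can be sketched as a toy simulation: nucleotides are dispensed one at a time, and each flow produces a peak whose height equals the number of bases incorporated (zero if the nucleotide does not match). This is an idealized sketch with noise-free signal; the cyclic dispensation order "TACG", the template, and the function names are assumptions for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def flowgram(template_3to5, flow_order="TACG", n_flows=16):
    """Return (dispensed nucleotide, peak height) for each flow."""
    position = 0
    peaks = []
    for flow in range(n_flows):
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        # The polymerase keeps extending while the dispensed dNTP matches,
        # so a homopolymer run gives a proportionally higher peak.
        while (position < len(template_3to5)
               and COMPLEMENT[template_3to5[position]] == nucleotide):
            incorporated += 1
            position += 1
        peaks.append((nucleotide, incorporated))
    return peaks

def call_sequence(peaks):
    """Translate peak heights back into the synthesized sequence."""
    return "".join(nucleotide * height for nucleotide, height in peaks)

peaks = flowgram("TTACCGA")
print(peaks[:5])             # [('T', 0), ('A', 2), ('C', 0), ('G', 0), ('T', 1)]
print(call_sequence(peaks))  # AATGGCT
```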
Roche/454 FLX Pyrosequencer
Library fragments are mixed with agarose
beads with oligos complementary to
adapter sequences on the library.
Each bead is associated with a single
fragment.
Each fragment-bead complex is isolated
into individual oil:water micelles with PCR
mixture.
Thermal cycling of this emulsion PCR of
the micelles produces amplified unique
sequences on the bead surface.
“En masse” sequencing of PCR products
on picotiter plates (PTP) with single beads
in each picowell.
Enzyme/substrate-containing beads for
the pyrosequencing reaction are added to
wells that act as flow cells for addition of
individual pure nucleotide solutions. The
CCD camera records the light emitted at
each bead.
Mardis E.R. Annual Review of Genomics and Human Genetics 9: 387-403 (2008).
Watson and Crick's pyrosequencing readout
Timeline of the pyrosequencing development
October 2005 Release of the Genome Sequencer 20, the first next-generation sequencing
system on the market
October 2005 Collaboration agreement signed with Roche Diagnostics
January 2007 Release of the Genome Sequencer FLX System
March 2007 Roche Diagnostics completes integration with 454 Life Sciences
May 2007 Complete sequence of Jim Watson published in Nature. First genome to be
sequenced for less than $1 million.
November 2007 Announcement of the 100th peer-reviewed publication enabled by 454
Sequencing
June 2008 454 joins the 1000 Genomes Project, an international effort to build the most detailed
map to date of human genetic variation as a tool for medical research
September 2008 Announcement of the 250th peer-reviewed publication enabled by 454
Sequencing
October 2008 Release of Genome Sequencer FLX Titanium Series reagents, featuring 1
million reads at 400 base pairs in length
Illumina sequencing technology
is based on arrays of randomly assembled glass (silica) beads;
the beads have oligonucleotides covalently attached to the surface;
each bead has about one million oligos on its surface;
all oligos on each bead have the same sequence
Attached DNA fragments are extended and bridge amplified to create an ultra-high density
sequencing flow cell with 80-100 million clusters, each containing ~1,000 copies of the same template.
These templates are sequenced using a robust four-color DNA sequencing-by-synthesis technology
that employs reversible terminators with removable fluorescent dyes. This novel approach ensures
high accuracy and true base-by-base sequencing, eliminating sequence-context specific errors and
enabling sequencing through homopolymers and repetitive sequences.
The beads are randomly assembled on the arrays, and the location of a particular probe is initially
unknown; a process called decoding is used to find the location of each bead.
Illumina beads and scanner
Bridge amplification – sequencing by synthesis
An isolated pair of 5' immobilized primers
(positive and negative) and a specific
target DNA strand. The solution above
the array contains amplification buffer, target
DNA, polymerase, and labeled dNTPs.
Bridge amplification is a technology that uses primers
bound to a solid phase for the extension and
amplification of solution phase target nucleic acid
sequences.
The name refers to the fact that during the annealing
step, the extension product from one bound primer
forms a bridge to the other bound primer.
All amplified products are covalently bound to the
surface, and can be detected and quantified without
electrophoresis.
An array of 100 pixels on a flat surface (bead, chip or any other suitable solid phase format). Each pixel
contains primer pairs (negative and positive) and is specific for one target DNA sequence.
All amplified DNA remains covalently attached to a specific pixel on the array. Detection of incorporated label
in a pixel indicates the presence of a specific target DNA sequence in the sample
www.promega.com/
Illumina sequencing by synthesis
Decoding process
After the arrays are assembled, the beads in each position are identified by decoding (16 bead types in this example).
The array is hybridized to 16 'decoder oligonucleotides', each of which matches one of the
oligonucleotide sequences (bead types) on the beads in the array.
The decoder oligos are labelled with 4 different fluorescent dyes.
The array is then imaged and stripped.
Illumina2008ProductGuide.pdf
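One way to picture the decoding step is as a combinatorial colour code. The sketch below assumes that the hybridize, image and strip cycle is run twice with different dye assignments, so that each of the 16 bead types shows a unique sequence of two colours (4 × 4 = 16). The two-stage scheme, the dye names and the probe names are hypothetical and are not taken from the product guide.

```python
from itertools import product

DYES = ("blue", "green", "yellow", "red")   # hypothetical dye names
N_STAGES = 2                                # assumed number of hybridization rounds
BEAD_TYPES = ["probe_%02d" % i for i in range(16)]   # hypothetical probe names

# Each bead type is assigned a unique sequence of dye colours across the stages.
CODEBOOK = dict(zip(product(DYES, repeat=N_STAGES), BEAD_TYPES))

def decode(observed_colours):
    """Identify the bead type at one array position from its colour in each stage."""
    return CODEBOOK[tuple(observed_colours)]

print(len(CODEBOOK))              # 16
print(decode(["green", "blue"]))  # probe_04
```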
Illumina experimental protocol
Total mature RNA is isolated from the cells or tissue being studied. This RNA has already been “processed”:
the noncoding introns have been removed, the coding exons spliced together, and a
poly-A tail added.
The RNA is converted by reverse transcription into a double-stranded DNA copy known as cDNA.
This is done because RNA itself is not a very stable molecule, and cDNA allows the information to be
stored for a much longer period of time.
When it is time to run the array, the cDNA is transcribed in vitro back to RNA
(now known as cRNA), and this RNA is labeled with biotin by incorporating biotin-tagged
uracil bases.
The biotin-labeled cRNA is then added to the array.
Anywhere on the array where an RNA fragment and an oligonucleotide on a bead are complementary, the RNA
sticks to the probe on the bead.
The array is then washed to remove any RNA that is not bound to a probe (i.e., no match was made) and then
stained with a fluorescent molecule that binds biotin.
Lastly, the entire array is scanned with a laser and the data are stored on a computer for quantitative
analysis of which genes were expressed and at what approximate level.
Ligation mediated sequencing
Mardis E.R. Annual Review of Genomics and Human Genetics 9: 387-403 (2008).
Structure of detector oligonucleotides
First two nucleotides determine the colour of the
fluorophore. The colour table shows the relationship
between dinucleotides and fluorophores.
Four different dinucleotides (256 different
oligonucleotides) correspond to each fluorophore.
If first or second nucleotide (in dinucleotide) is known,
colour is unambiguously related with the other
nucleotide.
Three next positions — degenerate nucleotides: 64
different versions for each particular dinucleotide.
When ligated to the sequencing primer, only one of
these 64 versions matches the template at that position.
Detector oligonucleotides (DO) are 8-mers
fluorescently labeled on 3' end. DO's can't be too
short, otherwise T4 ligase would not recognize
them as a substrate.
Altogether, there are 1024 different detection
oligos: 4^(2+3) = 4^5 = 1024 (dinucleotide + 3 degenerate positions).
Three last positions: universal bases, they are the
same for all detector oligonucleotides.
Dark oligonucleotides have the same internal
structure, but have no fluorophores.
seq.molbiol.ru/sch_seq_ligase.html
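The counts quoted above can be checked by enumerating the five variable positions of a detector oligonucleotide (the dinucleotide plus three degenerate bases); the three universal bases are the same in every oligo and are left out. The variable names are illustrative.

```python
from itertools import product

BASES = "ACGT"

# Variable part of each detector oligo: 2 query bases + 3 degenerate bases.
variable_parts = ["".join(p) for p in product(BASES, repeat=5)]

total = len(variable_parts)
per_dinucleotide = sum(1 for v in variable_parts if v.startswith("AC"))
per_fluorophore = 4 * per_dinucleotide   # four dinucleotides share each colour

print(total, per_dinucleotide, per_fluorophore)  # 1024 64 256
```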
Sequencing: ligation step
Three main operations during ligation-based sequencing
are:
ligation of detector oligonucleotides: only one of the
1024 possible types of oligonucleotide is suitable for
ligation. Both the "XY-dinucleotide" and the degenerate part
must be complementary to the template for
successful ligation.
scanning: unincorporated oligonucleotides are washed out and
bead fluorescence is registered in four spectral
intervals.
digestion of the ligated DO:
remove the fluorophore,
expose the phosphate on the 5'-end,
shift the sequencing primer to a new position.
Ligation accuracy
Two factors provide specificity of ligation:
hybridization stability: an 8-mer oligonucleotide should be
very sensitive to any mismatch;
T4 ligase accuracy: the enzyme is particularly sensitive to
mismatches on the 3'-side of the gap (sensitivity drops
quickly with increasing distance from the gap).
seq.molbiol.ru/sch_seq_ligase.html
Sequencing: example of 35-base sequencing
Five primers & seven ligations for each primer: 35
reactions altogether.
Each ligation reaction provides information about
colour of particular dinucleotide.
According to colour code table (bottom-right) four
different dinucleotides may correspond to the same
colour.
To resolve this ambiguity, one ligation reaction
analyses a dinucleotide with one known nucleotide:
in the first ligation with primer "B", the dinucleotide overlaps
the known adaptor sequence.
Starting from the first known nucleotide it is possible to
determine the whole sequence.
The principles of 2-base encoding/decoding
Mardis E.R. Annual Review of Genomics and Human Genetics 9: 387-403 (2008).
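A minimal sketch of 2-base encoding and decoding. Since the colour table itself is not reproduced in the text, the sketch assumes the commonly described scheme in which bases are numbered A=0, C=1, G=2, T=3 and the colour of a dinucleotide is the XOR of the two numbers; this gives four dinucleotides per colour and makes one base unambiguous once the other is known, as stated above. The colour numbers 0 to 3 and the function names are illustrative.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed numbering of the bases
BASE = {value: base for base, value in CODE.items()}

def colour(first, second):
    """Colour (0-3) of a dinucleotide under the assumed XOR colour table."""
    return CODE[first] ^ CODE[second]

def encode(sequence):
    """Colour of every overlapping dinucleotide along the sequence."""
    return [colour(a, b) for a, b in zip(sequence, sequence[1:])]

def decode(first_base, colours):
    """Rebuild the sequence from one known base (e.g. the last adaptor base)."""
    bases = [first_base]
    for c in colours:
        bases.append(BASE[CODE[bases[-1]] ^ c])
    return "".join(bases)

colours = encode("TACGGT")    # leading T stands for the known adaptor base
print(colours)                # [3, 1, 3, 0, 1]
print(decode("T", colours))   # TACGGT
```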
Applications of next generation sequencing
Creating highly parallel genotyping assays
Fan et al. Nature Reviews Genetics 7, 632–644 (August 2006) | doi:10.1038/nrg1901
a | Molecular-inversion probe (MIP) genotyping uses circularizable probes with 5' and 3' ends that anneal upstream and
downstream of the SNP site leaving a 1 bp gap (genomic DNA is shown in blue). Polymerase extension with dNTPs and a
non-strand-displacing polymerase is used to fill in the gap. Ligation seals the nick, and exonuclease I (which has 3'
exonuclease activity) is used to remove excess unannealed and unligated circular probes. Finally, the circularized probe is
released through restriction digestion at a consensus sequence, and the resultant product is PCR-amplified using common
primers to 'built-in' sites on the circular probe. The orientation of the primers ensures that only circularized probes will be
amplified. The resultant product is hybridized and read out on an array of universal-capture probes.
b | GoldenGate genotyping uses extension ligation between annealed locus-specific oligos (LSOs) and allele-specific oligos
(ASOs). An allele-specific primer-extension (ASPE) step is used to preferentially extend the correctly matched ASO (at the 3'
end) up to the 5' end of the LSO primer. Ligation then closes the nick. A subsequent PCR amplification step is used to
amplify the appropriate product using common primers to 'built-in' universal PCR sites in the ASO and LSO sequences. As in
MIP, the resultant products are hybridized and read out on an array of universal-capture probes (complementary to
IllumiCodes).
c | Reduced-complexity PCR representation using restriction enzyme (RE) digestion of genomic DNA (gDNA), common
primer adaptor ligation and single-primer PCR. The single-primer PCR reaction effectively selects for restriction digestion
products of 200–2,000 nucleotides. The reduced-complexity representation is read out on an array of locus-specific probes.
The decrease in complexity improves the signal-to-noise ratio by increasing the partial concentration of any given locus and
decreasing cross-hybridization.
d | Whole-genome genotyping on bead arrays. gDNA is whole-genome amplified (WGA), fragmented, denatured and
hybridized to an array of locus-specific capture probes (shown is an allele-specific primer extension assay using two bead
types, A and B, per locus). SNPs are scored directly on the array surface by primer extension. The separation of the capture
step from the SNP-scoring step allows efficient target capture and facilitates good discrimination between alleles. After
extension, the array is stained and read out using standard immunohistochemical detection methods.
SNP genotyping (Illumina)
The third approach to SNP genotyping
is primarily a high-throughput,
low-multiplex assay (1536-plex), more
usually used for follow-on or focused
custom genotyping. This approach
adopts Illumina's GoldenGate® assay
(figure 4), a modified form of which is
used for the DASL assay we have seen
already. The assay, unlike DASL, is an
allele-specific PCR-based amplification
and ligation assay that relies upon a
pool of locus-specific (LSO) and allele-specific
(ASO) primers, each containing one of
three universal tag sequences (P1, P2
& P3), and an array-specific address
sequence (which forms the duplex with
the oligo attached to the bead,
specifying the location within the array).
It is the combination of these that allows
extension, ligation and PCR-based
amplification, conferring detection of the
specific alleles. Because of this design,
GoldenGate® genotyping is a two-colour system.
CNV genotyping (Illumina)
Fan et al. Nature Reviews Genetics 7, 632–644 (August 2006) | doi:10.1038/nrg1901
SNP/CNV genotyping
For genome wide SNP/CNV interrogation12 Illumina
Serial analysis of gene expression (SAGE)
Fan et al. Nature Reviews Genetics 7, 632–644 (August 2006) | doi:10.1038/nrg1901
Summary
• The new generation of high-throughput sequencing makes it possible to determine DNA
sequences at the whole-genome level, with single-base-pair resolution.
• An adaptor-ligated library is prepared from each sample, containing all DNA or RNA (cDNA)
fragments present in the sample.
• All platforms are based on adaptor ligation and amplification, but use different sequencing
approaches:
-Pyrosequencing (Roche-Nimblegen)
-Sequencing by synthesis (Illumina-Solexa)
-Sequencing by ligation (ABI)
• Methods that do not require amplification before sequencing are also being developed.
• The applications are the same as for classical microarrays (expression profiling or SAGE,
SNP and CNV genotyping, chromatin immunoprecipitation, chromatin methylation, etc.).
• The advantage over classical microarrays lies in simple sample preparation and the ability to
process a large number of samples in a short time.
• One person can handle the processing of a large number of samples on one or more instruments.
Design of biological experiments and standardization
- The role of bioinformatics in designing and tracking experiments
- Standards for carrying out biological experiments with microarrays
- Normalization
• Biological replicates
• Technical replicates
• Design of biological experiments using concrete examples
Bioinformatics
www.gwumc.edu
http://bioinformatics.ubc.ca/about/what_is_bioinformatics/images/computer.gif
Bioinformatics
Wikipedia
Making sense of the huge amounts of DNA data produced by gene sequencing
projects.
Bioinformatics and computational biology involve the use of techniques from
applied mathematics, informatics, statistics, and computer science to solve biological
problems.
Research in computational biology often overlaps with systems biology.
Major research efforts in the field include sequence alignment, gene finding, genome
assembly, protein structure alignment, protein structure prediction, prediction of gene
expression and protein-protein interactions, and the modeling.
The terms bioinformatics and computational biology are often used interchangeably,
although the former typically focuses on algorithm development and specific
computational methods, while the latter focuses more on hypothesis testing and
discovery in the biological domain.
Bioinformatics
More hypothesis-driven research in computational biology.
More technique-driven research in bioinformatics.
A common thread in projects in bioinformatics and computational biology is the
use of mathematical tools to extract useful information from noisy data
produced by high-throughput biological techniques.
A representative problem in bioinformatics is the assembly of high-quality DNA
sequences from fragmentary "shotgun" DNA sequencing.
In computational biology, a representative problem might be statistical testing of
a hypothesis of common gene regulation using data from mRNA microarrays or
mass spectrometry.
Microarrays and bioinformatics
Standardization
The lack of standardization in arrays presents an interoperability problem in
bioinformatics, which hinders the exchange of array data.
Various projects are attempting to facilitate the exchange and analysis of data
produced with non-proprietary chips.
The "Minimum Information About a Microarray Experiment" (MIAME) XML-based
standard for describing a microarray experiment is being adopted by
many journals as a requirement for the submission of papers incorporating
microarray results.
http://www.mged.org/Workgroups/MIAME/miame.html
MIAME – I.
Experiment Design:
The goal of the experiment – one line maximum (e.g., the title from the related publication)
A brief description of the experiment (e.g., the abstract from the related publication)
Keywords, for example, time course, cell type comparison, array CGH (the use of MGED
ontology terms is recommended).
Experimental factors - the parameters or conditions tested, such as time, dose, or genetic
variation (the use of MGED ontology terms is recommended).
Experimental design - relationships between samples, treatments, extracts, labeling, and
arrays (e.g., a diagram or table).
Quality control steps taken (e.g., replicates or dye swaps).
Links to the publication, any supplemental websites or database accession numbers.
MIAME – II.
Samples used, extract preparation and labeling:
The origin of each biological sample (e.g., name of the organism, the
provider of the sample) and its characteristics (e.g., gender, age,
developmental stage, strain, or disease state).
Manipulation of biological samples and protocols used (e.g., growth
conditions, treatments, separation techniques).
Experimental factor value for each experimental factor, for each sample
(e.g., ‘time = 30 min' for a sample in a time course experiment).
Technical protocols for preparing the hybridization extract (e.g., the
RNA or DNA extraction and purification protocol), and labeling.
External controls (spikes), if used.
MIAME – III.
Hybridization procedures and parameters:
The protocol and conditions used for hybridization, blocking and washing, including any
post-processing steps such as staining.
Measurement data and specifications:
The raw data, i.e. scanner or imager and feature extraction output (providing the
images is optional). The data should be related to the respective array designs
(typically each row of the imager output should be related to a feature on the array
– see Array Designs).
The normalized and summarized data, i.e., set of quantifications from several
arrays upon which the authors base their conclusions (for gene expression
experiments also known as gene expression data matrix and may consist of
averaged normalized log ratios). The data should be related to the respective array
designs (typically each row of the summarized data will be related to one biological
annotation, such as a gene name).
Data extraction and processing protocols.
Image scanning hardware and software, and processing procedures and
parameters.
Normalization, transformation and data selection procedures and parameters.
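The MIAME items listed on these slides lend themselves to a simple completeness check before submission. The sketch below is illustrative only: the section keys are hypothetical names for the four slide headings, and this is not the MIAME schema or an official validator.

```python
# Hypothetical keys, one per MIAME heading shown on the slides.
REQUIRED_SECTIONS = (
    "experiment_design",
    "samples_extract_labeling",
    "hybridization_procedures",
    "measurement_data",
)

def missing_sections(submission):
    """Return the required sections that are absent or left empty."""
    return [name for name in REQUIRED_SECTIONS if not submission.get(name)]

draft = {
    "experiment_design": "Time course, two treatments, dye swap replicates",
    "measurement_data": "",   # raw and normalized data not yet attached
}
print(missing_sections(draft))
# ['samples_extract_labeling', 'hybridization_procedures', 'measurement_data']
```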
Statistical analysis
The analysis of DNA microarrays poses a large number of statistical problems,
including the normalisation of the data.
From a hypothesis-testing perspective, the large number of genes present on a
single array means that the experimenter must take into account a multiple
testing problem: even if each gene is extremely unlikely to randomly yield a
result of interest, the combination of all the genes is likely to show at least one
or a few occurrences of this result which are false positives.
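The multiple testing problem can be quantified with a short calculation. The sketch assumes independent tests, and it names the Bonferroni correction as one simple remedy; that correction is not mentioned on the slide and is included here only as an illustration. The gene count and significance levels are example values.

```python
def prob_any_false_positive(n_tests, alpha):
    """Chance of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def expected_false_positives(n_tests, alpha):
    """Average number of false positives when no gene truly changes."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

n_genes = 10_000   # example array size
print(expected_false_positives(n_genes, 0.001))       # about 10 genes by chance alone
print(prob_any_false_positive(n_genes, 0.001) > 0.9999)  # True
print(bonferroni_threshold(n_genes, 0.05))            # about 5e-06 per gene
```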
Gene Ontology Viewer for Microarray Data Interpretation
swift.cmbi.kun.nl/.../ report/materials/
Definitions of gene ontology on the Web:
a controlled vocabulary used to describe the biology of a gene product in any
organism. There are 3 independent sets of vocabularies, or ontologies, that
describe the molecular function of a gene product, the biological process in
which the gene product participates, and the cellular component where the
gene product can be found.
www.madison.k12.wi.us/west/science/biotech/vocabulary.htm
The Gene Ontology, or GO, is a trio of controlled vocabularies that are being
developed to aid the description of the molecular functions of gene products,
their placement in and as cellular components, and their participation in
biological processes. Terms in each of the vocabularies are related to one
another within a vocabulary in a polyhierarchical (or directed acyclic graph)
manner; terms are mutually exclusive across the three vocabularies. ...
en.wikipedia.org/wiki/Gene_Ontology
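The "directed acyclic graph" structure means that a term may have more than one parent, unlike in a simple tree. A toy sketch with made-up term names (not real GO identifiers) shows how all ancestors of a term can be collected, which is what a viewer needs when it rolls gene annotations up to broader categories.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "term_root": [],
    "term_a": ["term_root"],
    "term_b": ["term_root"],
    "term_c": ["term_a", "term_b"],   # two parents: a DAG, not a tree
}

def ancestors(term):
    """All terms reachable by following parent links upward."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen

print(sorted(ancestors("term_c")))  # ['term_a', 'term_b', 'term_root']
```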
Gene Ontology
The adoption of common standards and ontologies for the management
and sharing of microarray and/or mass spectrometry data is essential.
The Global Open Biological Ontologies (GOBO) effort, which has grown from
work by the Gene Ontology Consortium, is seeking to collect ontologies for
the domains of genomics and proteomics. Together with Spotfire the
ErasmusMC bioinformatics group works on the improvement and further
development of a GO tool that runs in the portal environment of the Spotfire
decision site product for functional genomics and proteomics applications.
The wealth of biological data that will be generated using high-throughput
technologies from different modalities in the next decade has yet to be
realized, as has the enormous potential for discoveries.
www.erasmusmc.nl/.../ research/gocp.shtml
http://cardioserve.nantes.inserm.fr/ptf-puce/images/camembert_go.gif
Steps in microarray technology
irfgc.irri.org/cropbioportal/index.php?option...
Experimental design
A proper experimental design is crucial for obtaining useful conclusions from a project. The choice of
design ideally includes an assessment of the biological variation, the technical variation, the cost and
duration of the experiment, and the availability of biological material. The experimental design can
also depend on the methods that will be used to analyse the data afterwards. In certain cases, the
parameters needed to find the optimal design must be obtained by a pilot experiment.
A related problem is the comparison of different competing experimental methods or devices. Here, a
proper test design is crucial as well to be able to make a firm conclusion in favor of one or the other
method.
Dere E. et al., BMC Genomics 2006, 7:80
lunabiosciences.com/experimentaldesign.html
Experimental design
Different experimental designs with two-colour DNA
microarrays that involve two treatments (A and
B).
(a) and (b) use two and four technical replicates,
respectively, with dye swap.
(c) and (d) are based on two independent biological
replicates of the treatments (indicated by the index
on the treatment label).
(c) is a biological replicate of design (a).
(d) shows a simple loop design.
Experimental design with two-colour DNA
microarrays in which samples (A,...Z) are compared
through a common reference
Juvan P., Rozman, D. Informatica Medica Slovenica, 11: 2-15 (2006)
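The designs described above (dye swap, loop, common reference) can be written down as lists of hybridizations. The sketch below is only an illustration: the function names, the `(Cy5, Cy3)` channel order, and the ordering of samples in the loop are assumptions, not details taken from the cited figure.

```python
def dye_swap_pair(a, b, repeats=1):
    """Technical replicates of A vs B, each repeated with the dyes swapped.

    Each array is a (Cy5 sample, Cy3 sample) tuple (assumed convention).
    """
    arrays = []
    for _ in range(repeats):
        arrays += [(a, b), (b, a)]
    return arrays


def loop_design(samples):
    """Hybridize each sample against the next one and close the loop."""
    n = len(samples)
    return [(samples[i], samples[(i + 1) % n]) for i in range(n)]


def reference_design(samples, reference="Ref"):
    """Hybridize every sample against one common reference."""
    return [(s, reference) for s in samples]


if __name__ == "__main__":
    print("two arrays: ", dye_swap_pair("A", "B"))
    print("four arrays:", dye_swap_pair("A", "B", repeats=2))
    print("loop:       ", loop_design(["A1", "B1", "A2", "B2"]))
    print("reference:  ", reference_design(["A", "B", "C"]))
```

Writing the design out this way makes the array count explicit before any material is used: a loop over n samples needs n arrays, and so does a reference design, but in the reference design half of every array is spent on the reference sample.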
Technical Issues Involved in Obtaining Reliable Data from Microarray Experiments
– Standardization and beyond
Primary data analysis – experimental example
Figure 1 – Drug metabolism and cholesterol homeostasis
Principal groups of genes involved in cholesterol homeostasis and drug metabolism present on the
Sterolgene v0 cDNA microarray prototype.
T. Rezen et al., BMC Genomics 2008, 9:76
Primary data analysis – experimental example
Images of the Sterolgene v0 microarrays were analyzed by Array-Pro Analyzer 4.5 (Media Cybernetics, Bethesda, MD, USA).
The median feature and local background intensities were extracted together with the estimates of their standard deviation.
Only features with foreground to background ratio higher than 1.5 and coefficient of variation (CV, ratio between standard
deviation of the background and the median feature intensity) lower than 0.5 in both channels were used for further analysis.
Log2 ratios were normalized using LOWESS fit to spike-in control RNAs according to their average intensity. Two types of
spike-in controls were used: custom-made (Firefly luciferase) and commercial ArrayControl Spikes (Ambion, Austin, TX, USA).
In phenobarbital and cholesterol-feeding experiments data were additionally standardized (median-centered and scaled by
median absolute deviation) in order to reduce inter-array variability. All data analyses were done in Orange software [37].
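The feature filter and the standardization step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the dictionary keys (`fg_median`, `bg_median`, `bg_sd`, `cy5`, `cy3`) and function names are hypothetical, the MAD is used unscaled, and the LOWESS normalization itself is not shown.

```python
from statistics import median


def passes_filter(channel, min_ratio=1.5, max_cv=0.5):
    """Apply the two quality criteria to one channel of one feature.

    CV follows the definition quoted in the text: standard deviation of
    the background divided by the median feature intensity.
    Intensities are assumed to be positive.
    """
    ratio = channel["fg_median"] / channel["bg_median"]
    cv = channel["bg_sd"] / channel["fg_median"]
    return ratio > min_ratio and cv < max_cv


def keep_feature(feature):
    """A feature is kept only if both channels pass the filter."""
    return passes_filter(feature["cy5"]) and passes_filter(feature["cy3"])


def standardize(log_ratios):
    """Median-center the values and scale by the median absolute deviation."""
    center = median(log_ratios)
    mad = median(abs(x - center) for x in log_ratios)
    # A MAD of zero (more than half of the values identical) is not handled.
    return [(x - center) / mad for x in log_ratios]


if __name__ == "__main__":
    good = {"fg_median": 300.0, "bg_median": 100.0, "bg_sd": 30.0}
    weak = {"fg_median": 120.0, "bg_median": 100.0, "bg_sd": 30.0}
    print(keep_feature({"cy5": good, "cy3": good}))
    print(keep_feature({"cy5": good, "cy3": weak}))
    print(standardize([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Median and MAD are used instead of mean and standard deviation because they are much less sensitive to the few strongly regulated genes that an experiment is designed to find.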
Images of Agilent microarrays were analyzed using Array-Pro Analyzer 4.5 (Media Cybernetics, Bethesda, MD, USA).
Features with CV > 0.39 were filtered out and data were normalized using LOWESS fit to all genes according to their average
intensity. Filtration and normalization were done in BASE software.
Affymetrix data were normalized by the Robust Multichip Average (RMA) algorithm. After transformation to non-logarithmic
data, the expression estimates were scaled to the average expression levels in the control group analyzed using GeneSpring
software (Silicon Genetics, Redwood, USA). Classification of the differentially expressed genes was done in Orange using
single-factor ANOVA or two-tailed Student’s t-test. For Sterolgene microarrays, a probability of type I error αS=0.05 or αS=0.1
was used. For Agilent microarrays complementary probability of type I error was calculated according to Bonferroni correction for
multiple testing (αA=αS*nS/nA), but final comparisons were made using more relaxed criteria αA=0.001 and αA=0.01. For
Affymetrix microarrays complementary probability αAf=0.00043 was calculated, but also a more relaxed criterion αAf=0.001 was
used.
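The relation αA=αS*nS/nA scales the significance threshold by the ratio of the numbers of tests on the two platforms. A short worked sketch follows; the gene counts are invented for illustration only and are not the actual sizes of the Sterolgene or Agilent arrays.

```python
def scaled_alpha(alpha_s, n_s, n_a):
    """Threshold for a platform with n_a tests that matches alpha_s on n_s tests.

    Implements alpha_A = alpha_S * n_S / n_A from the text.
    """
    return alpha_s * n_s / n_a


if __name__ == "__main__":
    # Hypothetical sizes: 300 genes on the small array, 10,000 on the large one.
    alpha_a = scaled_alpha(alpha_s=0.05, n_s=300, n_a=10_000)
    print(f"alpha_A = {alpha_a:.4f}")
```

The larger the array, the stricter the per-gene threshold becomes, which helps explain why the text also reports results under more relaxed criteria.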
Additional data analyses were done using only the genes common to both platforms, which were matched by UniGene, RefSeq
or gene symbol. On this gene list another classification of differentially expressed genes was done in Orange using ANOVA for
Affymetrix and Agilent platforms. A probability for type I error was selected as in a complementary Sterolgene experiment (α=0.1
in Affymetrix analyses and α=0.05 in Agilent analyses). Pearson’s product moment correlation coefficient and a scatterplot
between log2 ratios from common genes were calculated in SPSS 14.0. All data have been submitted to GEO (Gene Expression
Omnibus) under accession codes: GSE6271 (Affymetrix data), GSE6317 (Agilent data), GSE6447 (Sterolgene phenobarbital and
high cholesterol diet data), and GSE6423 (Sterolgene fasting and inflammation data).
T. Rezen et al., BMC Genomics 2008, 9:76
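The cross-platform comparison described above (match genes by a shared identifier, then correlate their log2 ratios) can be sketched in a few lines. This is an illustration only: the study used SPSS, the function names are hypothetical, and the log2 ratios below are invented, with gene symbols serving purely as example keys.

```python
from math import sqrt


def common_genes(platform_a, platform_b):
    """Identifiers present on both platforms, in sorted order."""
    return sorted(set(platform_a) & set(platform_b))


def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient of two sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Invented log2 ratios keyed by gene symbol.
    array_1 = {"Cyp3a11": 1.0, "Alas1": 2.0, "Hmgcr": -1.0, "Sqle": -2.0}
    array_2 = {"Cyp3a11": 0.5, "Alas1": 1.0, "Hmgcr": -0.5, "Abca1": 0.3}

    shared = common_genes(array_1, array_2)
    r = pearson([array_1[g] for g in shared], [array_2[g] for g in shared])
    print(shared, round(r, 3))
```

In this toy data the second platform reports exactly half the log2 ratio of the first, so the correlation is perfect even though the magnitudes differ; correlation measures agreement in direction and rank, not in scale.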
Problems in comparing different data formats
[Venn diagrams, panels A to D: overlap of differentially expressed genes between Sterolgene and the Agilent or Affymetrix platform]
Figure 3 - Agreement between Sterolgene v0, Agilent and Affymetrix platforms
Venn diagram illustrating agreement between differentially expressed (DE) gene lists from Sterolgene v0 cDNA, Agilent
10K cDNA microarrays and Affymetrix MOE430A GeneChip. DE genes were determined using a single factor ANOVA,
and a probability of type I error α=0.05 for Sterolgene and Agilent platform comparisons and α=0.1 for Sterolgene and
Affymetrix platform comparison. Only genes present on both microarrays were used in these analyses and are shown on
the diagrams. A. In starvation experiment only one gene was common to both platforms. B. In TNF-α experiment only
one gene was common to both platforms. C. In phenobarbital experiment four genes were common to both platforms. D.
In cholesterol diet experiment six genes were common to both platforms.
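The counts in such a Venn diagram come from restricting each differentially expressed (DE) list to the genes present on both arrays and then intersecting the lists. A minimal sketch with placeholder gene identifiers follows; the function name and all inputs are hypothetical.

```python
def venn_counts(de_a, de_b, on_a, on_b):
    """Counts for a two-set Venn diagram of DE genes.

    Only genes present on both arrays (on_a and on_b) are considered,
    as stated in the figure caption.
    """
    shared_probes = set(on_a) & set(on_b)
    a = set(de_a) & shared_probes
    b = set(de_b) & shared_probes
    return {"only_a": len(a - b), "both": len(a & b), "only_b": len(b - a)}


if __name__ == "__main__":
    on_a = {"g1", "g2", "g3", "g4", "g5"}   # genes on array A
    on_b = {"g2", "g3", "g4", "g5", "g6"}   # genes on array B
    de_a = {"g1", "g2", "g3"}               # DE on array A
    de_b = {"g3", "g4", "g6"}               # DE on array B
    print(venn_counts(de_a, de_b, on_a, on_b))
```

Here g1 and g6 drop out because each is present on only one array, which shows how the restriction to shared genes changes the counts before any biological disagreement is considered.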
Changes in cholesterol homeostasis and drug metabolism caused by different factors in mouse liver
The Sterolgene v0 cDNA microarray successfully detected all changes in cholesterol homeostasis and drug metabolism
caused by high-cholesterol diet, fasting, TNF-α, and phenobarbital (PB) treatment (solid arrows). The Agilent 10 K cDNA
microarray (G4104A) detected none of the changes caused by the inflammatory cytokine TNF-α and fasting (crosses). The
Affymetrix MOE430A GeneChip detected only down-regulation of the cholesterol biosynthesis by the high-cholesterol diet and
induction of the Cyp2b family by the phenobarbital treatment (dashed arrows), but not the up-regulation of Cyp3a11 by high-cholesterol diet and up-regulation of Cyp3a family and Alas1 by phenobarbital treatment. For all microarrays the same
statistical method for determination of differentially expressed genes was used (single-factor ANOVA).
T. Rezen et al., BMC Genomics 2008, 9:76
Summary
-Planning biological experiments requires collaboration between experimenters and bioinformaticians from
the very start.
-Designing an experiment requires defining the biological question, choosing the analysis platform,
determining the number of biological and technical replicates, and planning the series of experiments
(hybridizations or sequencing runs).
-The technical execution of the experiment is followed by statistical and informatic processing and data
mining.
Statistical and informatic processing comprises extraction of signal intensities, data normalization,
and various statistical tests that yield a list of differentially expressed genes or
DNA sequences.
Secondary informatic analysis and data mining comprise clustering and
pattern recognition, analysis of gene lists using gene ontologies, searching for
regulatory motifs, and planning of validation experiments.