Next Generation Sequencing
Data Analysis
Nadia Pisanti, University of Pisa
Why sequencing?
The knowledge of DNA and RNA sequences has become
a crucial tool for:
• Basic research in biology, pharmacology and medicine.
• Many applied fields: diagnostics (detection of genetic
diseases), pharmacogenomics (influence of genetic
variation on drug response) and personalized medicine,
forensic biology, gene therapies, biological systematics
(the study of the diversification of living forms)…
Sequencing: some history
• "Rapid DNA sequencing", developed by Frederick Sanger (UK) in the 1970s, became
the method of choice for DNA sequencing and earned him his second
Nobel Prize in Chemistry in 1980.
• The Sanger method for sequencing DNA was used in the Human Genome
Project (HGP), which produced the first reference sequence of the human
genome.
• The HGP started in 1990 and was expected to take 15 years.
• A first "rough draft" was finished in 2000 and announced in a press
conference by… Bill Clinton and Tony Blair!
• The complete genome was announced in 2003.
– Why announce the rough draft in 2000?
– Why did the HGP take less time than expected?
Celera Genomics
- cut and paste from Wikipedia and my memory -
• In 1998, the American researcher Craig Venter, formerly of the NIH, announced that his private company Celera
Genomics would sequence the human genome at a fraction of the cost of the public project.
• A significant portion of the human genome had already been sequenced when Celera entered the field,
and was freely available to the public from GenBank.
• Celera used a technique called whole genome shotgun sequencing. This novelty spurred the HGP to
change its own strategy, leading to a rapid acceleration of the public effort.
• Celera filed preliminary ("place-holder") patent applications on 6,500 whole or partial genes. Celera
also promised to publish its findings in accordance with the terms of the 1996 "Bermuda Statement"
by releasing new data annually (the HGP released its new data daily), although, unlike the publicly
funded project, it would not permit free redistribution or scientific use of the data. For this reason,
the public competitor was compelled to publish the first draft of the human genome before Celera.
• In 2000, the HGP released a first working draft on the web. The scientific community downloaded
one-half trillion bytes of information from the UCSC genome server in the first 24 hours of free and
unrestricted access to the first ever assembled blueprint of our human species.
• Also in 2000, President Clinton announced that the genome sequence could not be patented and
should be made freely available to all researchers. The statement sent Celera's stock plummeting and
dragged down the biotechnology-heavy Nasdaq. The biotechnology sector lost about $50 billion in
market capitalization in two days. But the public release of the data ensured its fair use and availability.
shotgun sequencing
• The Sanger sequencing technology could only be used for short DNA
fragments (from 100 to 1000 bases): DNA must thus be divided into small
pieces and then reassembled.
• This can be done in two ways:
– Chromosome walking: sequencing consecutive fragments piece by piece.
– Shotgun sequencing: breaking several copies of the DNA strand into random
overlapping fragments, sequencing them, and then reassembling them in
silico by exploiting the overlaps.
• Since Celera applied it to the human genome, shotgun sequencing has been
the method of choice for large-scale sequencing.
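The fragmentation step of shotgun sequencing can be imitated in a few lines. The following is only a toy sketch, assuming error-free reads of a fixed length drawn uniformly at random; the function name `shotgun_reads`, the toy genome and the parameters are made up for illustration and do not model any real sequencer.

```python
import random

def shotgun_reads(genome, read_len, coverage, seed=0):
    """Sample random fixed-length substrings (reads) of `genome`.

    `coverage` is the average number of reads covering each base,
    so the number of reads is coverage * genome length / read length.
    """
    rng = random.Random(seed)          # fixed seed: reproducible toy data
    n_reads = coverage * len(genome) // read_len
    last_start = len(genome) - read_len
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]

genome = "ACGTTGCAAGCTTAGGCTAACGTTAGC"   # hypothetical toy genome
reads = shotgun_reads(genome, read_len=8, coverage=5)
print(len(reads), "reads, e.g.", reads[:3])
```

Because the start positions are random, nothing guarantees that every base is covered: this is why several copies of the DNA (a coverage well above 1) are needed.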
Shotgun sequencing & assembly
• Wikipedia, about shotgun sequencing: "faster
but more complex".
• The "complexity" of the approach comes from
algorithmic issues…
• (Eu)gene Myers, a string algorithms expert, led
the computer scientists at Celera: he
made the difference…
• Challenges in the assembly phase: finding
prefix/suffix overlaps, data structures for storing
fragments and the "overlap graph", assembly
algorithms that handle duplications.
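To make the prefix/suffix overlap and the merging step concrete, here is a minimal sketch of a greedy assembler. It is a textbook-style heuristic, not the algorithm used at Celera; it assumes error-free reads from a single strand, and the names `overlap` and `greedy_assemble` as well as the toy reads are illustrative.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))   # drop exact duplicates
    while len(reads) > 1:
        best_k, best_a, best_b = 0, None, None
        # Each ordered pair is an edge of the overlap graph.
        for a, b in permutations(reads, 2):
            k = overlap(a, b, min_len)
            if k > best_k:
                best_k, best_a, best_b = k, a, b
        if best_k == 0:
            break                        # no overlaps left: several contigs
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_k:])
    return reads

reads = ["TCAGCTT", "ACGGATT", "GCTTGCA", "GATTCAG"]
print(greedy_assemble(reads))
```

On this toy input every 3-mer of the underlying sequence is unique, so the greedy choice is safe. With repeated regions, two different pairs can show the same overlap and the greedy merge may join the wrong ones, which is exactly the duplication problem listed above.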
Fragment Assembly
The problem of sequence assembly can be compared to taking many
copies of a book, passing them all through a shredder, and piecing the
text of the book back together just by looking at the shredded pieces.
Besides the obvious difficulty of this task, there are some extra practical
issues: the original may have many repeated paragraphs, and some
shreds may be modified during shredding to have typos. Excerpts from
another book may also be added in, and some shreds may be
completely unrecognizable.
What is NGS?
• Next/New Generation Sequencing
• Massively Parallel Sequencing
• Third Generation Sequencing
• High Throughput Sequencing
millions of fragments (reads) in a single run !!
by means of new technologies developed mainly by:
• Lynx Therapeutics, which merged with Solexa; they were then bought by Illumina.
• ABI SOLiD
• Ion Torrent Systems
• 454 Life Sciences, acquired by Roche Diagnostics
They actually differ quite a lot in performance and characteristics.
What's new with NGS?
• Sequencing the whole human genome took the HGP:
– 3,000,000,000 dollars
– 13 years
• Sequencing a whole human genome now with NGS techniques takes:
– about 1,000 dollars
– 4-5 days
Sequencing is much faster and (thus) cheaper !!
What is NGS great for
• re-sequencing: no assembly, just mapping on a known
reference genome.
• Metagenomics
• Transcriptome Sequencing: RNA-Seq
• Chromatin immunoprecipitation combined with DNA
sequencing: ChIP-Seq
re-sequencing
• Sequencing a new individual of a
species for which the reference genome
is known (and well annotated).
• Important applications:
– Medicine
– Building datasets of several strains of the
same organism to investigate intra-species
evolution.
re-sequencing: medical applications
[we will get back to this later]
• Genotyping: testing for known mutations (sequencing can
be possibly targeted to specific regions).
• Variation analysis: scanning for any mutation such as
Single Nucleotide Polymorphisms (SNPs), Copy Number
Variations (CNVs) or other Structural Variants (SVs) that
can be associated with congenital diseases, predisposition for
certain pathologies, or drug response.
• Most NGS platforms come with their own software to detect
mutations.
• With NGS these tests can be made on a large scale… and
back in time: Neanderthal DNA was sequenced with Roche 454
technology starting in 2006!
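The variation analysis mentioned above can be illustrated on its simplest case, SNPs. The sketch below is a toy "pileup" caller, assuming reads that are already mapped without gaps to the forward strand of the reference; the function `call_snps`, its thresholds `min_depth` and `min_fraction`, and the data are hypothetical and far simpler than what vendor or third-party callers do.

```python
from collections import Counter

def call_snps(reference, alignments, min_depth=3, min_fraction=0.8):
    """Report positions where most reads disagree with the reference.

    alignments: list of (start, read) pairs, ungapped.
    Returns a list of (position, reference_base, observed_base).
    """
    pileup = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue                     # too few reads to decide
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            snps.append((pos, reference[pos], base))
    return snps

reference = "ACGTACGTAC"
alignments = [(0, "ACGTAGGT"), (2, "GTAGGTAC"), (3, "TAGGTA"), (1, "CGTAGG")]
print(call_snps(reference, alignments))
```

The depth and fraction thresholds are what separate a real variant from a sequencing error in this sketch: a single read with a different base is ignored.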
re-sequencing
• Challenges for computer science:
– Indexing data and (quickly) mapping on
reference genome
– SNPs and SVs calling.
– Mind the repeats up there!
• Challenges for informatics:
– Building tools for geneticists.
– Interpreting SNPs and SVs by cross-referencing them with
database information.
– Database management…
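The indexing and mapping challenge can be sketched with the simplest possible index: a table from every k-mer of the reference to its positions, used to seed candidate locations that are then verified. This is only an illustration under strong assumptions (no indels, the first k bases of the read are error-free); production mappers use compressed indexes instead. The names `build_kmer_index` and `map_read` are made up.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=1):
    """Seed with the read's first k-mer, then verify by counting mismatches.

    Returns a list of (position, mismatches); more than one entry means
    the read falls in a repeat and its placement is ambiguous.
    """
    hits = []
    for start in index.get(read[:k], []):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue                     # read would run off the end
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

reference = "ACGGATTCAGCTTGCA"
index = build_kmer_index(reference, k=4)
print(map_read("ATTCAGGTT", reference, index, k=4))
```

The index is built once and reused for millions of reads, which is where the memory/speed trade-off of the data structure starts to matter.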
metagenomics
Metagenomics essentially entails brute force sequencing of DNA fragments obtained from an
uncultured, unpurified, microbial and/or viral population, followed by bioinformatics-based
analyses that attempt to answer the question "Who's there?" [E.R. Mardis, Trends in Genetics,
2008]
• Characterizing the human microbiome: we live in symbiosis with
millions of microbial species. There is a theory saying that these
symbiotic microbes provide an extension of the human genome and
hence contribute to its genetic potentials in terms of protective
immunity, added enzymatic capability…
• Metagenomics is applied not only to the human body, but also to important
ecosystems such as oceans, soil and deep mines.
• Metagenomics costs have become affordable only now with NGS (mostly Roche
454, whose longer reads better allow de novo sequencing).
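One naive way to approach the "Who's there?" question, when reference genomes are available, is to assign each read to the reference that shares the most k-mers with it. The sketch below assumes exactly that; the species names, sequences and the function `classify_read` are invented for illustration, and reads from unknown organisms would still need de novo analysis.

```python
def kmers(seq, k):
    """Set of all substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify_read(read, references, k=4):
    """Return the name of the reference sharing the most k-mers, or None."""
    read_kmers = kmers(read, k)
    scores = {name: len(read_kmers & kmers(seq, k))
              for name, seq in references.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Hypothetical reference sequences for two organisms.
references = {
    "species_A": "ACGGATTCAGCTTGCA",
    "species_B": "TTTTGGGGCCCCAAAATTTT",
}
for read in ["GATTCAGC", "GGGCCCCA", "GTGTGTGT"]:
    print(read, "->", classify_read(read, references))
```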
What is RNA-Seq
NGS opened a new phase in transcriptomics (aka expression profiling)
thanks to its low requirements of nucleotide sequence product
and its deep coverage.
Why RNA-Seq
• Among the goals of the HGP was mapping the genotypes
associated with (the predisposition for) diseases.
• It is now very clear (and it was not then) that reading
the genome is not enough…
• Same genome, different phenotypes and different
diseases: how come?
• Environmental effects (food, pollution, lifestyle) act
on gene transcription.
• We ought to investigate the transcriptome!
• The transcriptome is the set of genes that are being
actively expressed at a given time.
• The role of miRNA in gene regulation.
RNA-Seq
Sequencing the transcriptome to investigate
differentially expressed genes:
- under different conditions, or
- in different tissues
- in different alleles
The different expression can be in
quantitative terms or in alternative
splicing terms (eukaryotes only).
de novo transcriptome assembly
transcriptome re-sequencing
RNA-Seq quantification
RNA-Seq (Quantification) is used to analyze gene expression of certain
biological objects under specific conditions.
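As a worked example of quantification, read counts per gene are usually normalized before comparing genes or samples. The sketch below uses TPM (transcripts per million), a common normalization that the slides do not name, so treat it as an assumption; the gene names, counts and lengths are invented.

```python
def tpm(read_counts, gene_lengths):
    """Transcripts per million.

    1. Divide each gene's read count by its length (longer genes
       collect more reads at equal expression).
    2. Rescale so that the values of all genes sum to one million.
    """
    rate = {g: read_counts[g] / gene_lengths[g] for g in read_counts}
    total = sum(rate.values())
    return {g: 1e6 * r / total for g, r in rate.items()}

counts = {"geneA": 100, "geneB": 300, "geneC": 100}      # mapped reads
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 500}   # bases
for gene, value in tpm(counts, lengths).items():
    print(gene, round(value))
```

Here geneB has three times the reads of geneA but is also three times longer, so both get the same value, while the short geneC comes out as the most expressed.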
Alternative Splicing
[we will get back to this later]
• AS is when several mRNAs can be produced from a
single pre-mRNA.
• E.g. in humans there are approximately 30,000 genes,
and it is estimated that 70% of human protein-coding
genes undergo alternative splicing to generate up to
150,000-200,000 mRNAs and proteins through alternative
splice site usage.
• In 2008, an experiment revealed that 34% of human
transcripts were not from known genes [Science 321].
non coding RNA
• ncRNA includes a wide class of regulatory RNA molecules whose function is as
crucial as it is poorly understood.
• Discovering their sequences and (hence) genomic locations is hard because
they are (mostly) small and poorly conserved over evolutionary time.
• In silico prediction methods are of high importance and very promising, but so
far of little use.
• Currently, ncRNAs are mostly discovered by sequencing small RNA fragments,
a task for which NGS tools are ideal!
• In silico analysis of such data will be crucial for understanding it (secondary
structure prediction, putative function prediction based on learning methods).
• A new class of miRNA (or small RNA) is being discovered every day…
ChIP-Seq
• ChIP-seq combines chromatin immunoprecipitation
(ChIP) with massively parallel DNA sequencing to
identify the binding sites of DNA-associated proteins.
• The goal is to analyze protein interactions with DNA
(e.g. how transcription factors, that are proteins,
regulate gene expression).
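After mapping, candidate binding sites show up as regions where reads pile up. A minimal sketch of that idea, assuming read start positions on a single chromosome and a fixed threshold: the function `call_peaks` and its parameters are hypothetical, and real peak callers additionally compare against a control sample and model the background.

```python
from collections import Counter

def call_peaks(read_starts, window=100, min_reads=5):
    """Return (window_start, window_end, n_reads) for read-rich windows."""
    # Bin every read by the fixed-size window its start falls into.
    counts = Counter(pos // window for pos in read_starts)
    return [(w * window, (w + 1) * window, n)
            for w, n in sorted(counts.items()) if n >= min_reads]

read_starts = [12, 250, 260, 265, 270, 281, 299, 640, 910]
print(call_peaks(read_starts))
```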
The bad side of NGS
• Even shorter fragments: from about 1000 bases with
Sanger technology down to 25, then 50, then
75, now 100 bases.
• Even more errors (each time a new read length is
released).
Fragment assembly is even harder !!
Which platform is best depends on what you need it for
and how much money you have.
From: M.L. Metzker "Sequencing technologies — the next generation",
Nature Reviews Genetics 11, 31-46, 2010
Roche 454 Genome Sequencer
• It was the first NGS platform on the market, introduced in 2005.
• Its technology produces relatively long reads
(400-700 bases).
• Its base calling cannot handle long (>6) stretches of
the same nucleotide, resulting in insertion and
deletion errors there…
• On the other hand, its substitution error rate is very low.
• Overall error rate of about 1%.
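Since homopolymer stretches longer than 6 bases are where indel errors concentrate, it can be useful to locate them in a sequence before trusting a call there. A small sketch, with the threshold of 6 taken from the slide and the function name `long_homopolymers` made up:

```python
from itertools import groupby

def long_homopolymers(seq, max_reliable=6):
    """Return (start, base, length) for runs longer than `max_reliable`."""
    runs, pos = [], 0
    for base, group in groupby(seq):     # groups consecutive equal bases
        length = len(list(group))
        if length > max_reliable:
            runs.append((pos, base, length))
        pos += length
    return runs

seq = "ACG" + "T" * 8 + "GC" + "A" * 6 + "G"
print(long_homopolymers(seq))
```

Only the run of eight T is reported; the run of six A is still within the length the slide describes as manageable.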
Illumina Genome Analyzer
(aka Solexa sequencer)
• The most widely available NGS
technology.
• Reads up to 100 bases long.
• Error rate of 1-1.5%, mostly substitutions
(indels are much less common).
ABI's SOLiD
• Probably the second most widely used.
• The workflow is similar to Solexa/Illumina's.
• An interesting difference: SOLiD uses a di-base
sequencing technique in which two nucleotides are read
simultaneously. The 16 possible di-bases are represented by only 4 "colors",
but since consecutive di-bases overlap by one base, the ambiguity is resolved.
• As a consequence:
– A sequencing error may propagate.
– Read alignment can be sped up.
• Error rate around 2-4%
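The di-base encoding and its error propagation can be reproduced with a few lines. The sketch assumes the commonly described color table, in which the color of a di-base is the XOR of the two bases coded as A=0, C=1, G=2, T=3 (so AA, CC, GG, TT all get color 0); the function names are illustrative.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def encode_colors(seq):
    """Return the first base and one color (0-3) per overlapping di-base."""
    colors = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors

def decode_colors(first_base, colors):
    """Rebuild the bases: each color plus the previous base gives the next."""
    bases = [first_base]
    for c in colors:
        bases.append(BASE[CODE[bases[-1]] ^ c])
    return "".join(bases)

first, colors = encode_colors("ATGGC")
print(first, colors)
print(decode_colors(first, colors))          # round trip

# Flip a single color: every base after the error changes when decoding.
wrong = colors.copy()
wrong[1] = 2
print(decode_colors(first, wrong))
```

A color alone is ambiguous (4 colors for 16 di-bases); it is the known previous base that resolves it, and that same dependency is why one miscalled color corrupts the rest of the decoded read.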
Paired-end and Mate-pairs
• Two very different objects from the point of view of the
technology, as they are obtained with very different
procedures.
• Available from all NGS platforms.
• From the computational point of view, they are the same:
two sequences at an approximately known distance from
each other in the genome (insert size).
• They are crucial to:
– Correctly map/assemble repeated fragments
– Detect Structural Variants and Copy Number Variations.
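The use of the insert size for detecting structural variants can be sketched as a simple comparison between the expected distance and the distance observed after mapping both mates to the reference. This is an illustration only: the function `classify_pair`, the tolerance and the labels are assumptions, and read orientation is ignored.

```python
def classify_pair(pos1, pos2, read_len, expected_insert, tolerance):
    """Compare the span of a mapped pair with the expected insert size.

    pos1, pos2: mapping positions of the two mates on the reference.
    """
    observed = abs(pos2 - pos1) + read_len
    if observed > expected_insert + tolerance:
        # Mates are farther apart on the reference than in the sample:
        # the sample may lack a piece of the reference.
        return "possible deletion"
    if observed < expected_insert - tolerance:
        # Mates are closer on the reference than in the sample:
        # the sample may carry extra sequence.
        return "possible insertion"
    return "concordant"

for p1, p2 in [(1000, 1300), (1000, 2300), (1000, 1100)]:
    print(p1, p2, classify_pair(p1, p2, read_len=100,
                                expected_insert=400, tolerance=50))
```

A single discordant pair proves little; in practice many pairs pointing to the same region are required before a variant is reported.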
Fragment Assembly
with NGS data
It is like a diabolical sudoku:
- with very few initial numbers
- many solutions satisfy the constraints:
the choice among them is arbitrary
- only one of the many solutions is the
right one, and there is no clue as to
which…
NGS and Informatics
the challenges 1
• Massive Image processing and basecalling
within sequencing technology.
• Growing need of managing big data:
– Indexing issues.
– Efficient mapping and alignments.
– Parallel and High Performance computing.
– New emphasis on efficient data structures and
algorithms with special care on memory usage.
NGS and Informatics
the challenges 2
• Designing and producing tools for data analysis
integrating information from different sources (e.g.
genome browsers).
• Designing and producing tools for assembly.
• Designing and producing tools for genotyping: a
new one every day, hard to compare...
• Customized analysis: informatics is needed for any
project and in any lab.
• "Curiously": back to old style stuff such as
command line, machine language programming…