From genetics to the next
generation sequencing
McGill University and Génome Québec Innovation Centre
Mathieu Bourgey, Ph.D
Senior Bioinformatician
June 2nd 2013
Introduction to
Genetics
Genetics is the Key to
Biology
• Genetics
– The scientific study of heredity
– Geneticists study how traits and diseases
are passed from one generation to the
next
– Understanding what genes are, how they
are passed from one generation to the
next, and how they work is essential to
understanding life
What Are Genes
and How Do They Work?
• Gene
– The fundamental unit of heredity, made of DNA.
– DNA is a polymer (linked string) of
chemical subunits called nucleotides.
• Genetic code
– There are four different nucleotides in DNA
• Adenine = A
• Thymine = T
• Guanine = G
• Cytosine = C
– Combinations of these four nucleotides define which
amino acids will be used to make specific proteins in
the cell
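As a minimal sketch of how combinations of the four nucleotides specify amino acids, the snippet below reads a DNA coding sequence three letters (one codon) at a time. The table is a small subset of the standard genetic code, and the names `CODON_TABLE` and `translate` are illustrative, not part of the original slides.

```python
# Illustrative subset of the standard genetic code (DNA codons, coding strand).
# Only 9 of the 64 codons are listed; the names here are hypothetical.
CODON_TABLE = {
    "ATG": "Met", "TTT": "Phe", "GGC": "Gly", "GCA": "Ala",
    "AAA": "Lys", "TGG": "Trp", "TAA": "Stop", "TAG": "Stop", "TGA": "Stop",
}

def translate(dna):
    """Translate a DNA coding sequence codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3].upper()
        amino_acid = CODON_TABLE.get(codon, "???")  # codon missing from this subset
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein

print(translate("ATGTTTGGCGCATAA"))  # ['Met', 'Phe', 'Gly', 'Ala']
```

A complete table would list all 64 codons; the subset is only meant to show the lookup principle.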
DNA (DeoxyriboNucleic
Acid)
• Genes are sequences of nucleotides
contained on a double-stranded
helical DNA molecule
Three-Dimensional
Structure of a Protein
http://mach7.bluehill.com/proteinc/graphics/alphpro.jpg
Traits
• Any observable property of an
organism is a trait
– Actions of gene products (proteins)
produce visible traits such as eye color
and hair color
How Are Traits Transmitted
from Parents to Offspring?
• Gregor Mendel’s experiments showed that
genes are passed from parents to offspring
– Each parent carries two copies of each gene that controls a trait
– Each parent contributes one copy from each pair
– Pairs of genes separate from each other during
the formation of egg and sperm (meiosis)
– When egg and sperm fuse during fertilization,
genes from mother and father become a new
gene pair
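The segregation and fusion steps above can be sketched as a small enumeration: each parent passes on one of its two copies, and every combination forms a possible offspring gene pair. This is a simplified single-gene model; the function name `cross` and the allele letters are illustrative assumptions.

```python
from collections import Counter
from itertools import product

def cross(parent1, parent2):
    """Count offspring genotypes for one gene, given two copies per parent."""
    offspring = Counter()
    # Each parent contributes one of its two copies (segregation in meiosis).
    for allele1, allele2 in product(parent1, parent2):
        # Sort so that "aA" and "Aa" are counted as the same gene pair.
        genotype = "".join(sorted(allele1 + allele2))
        offspring[genotype] += 1
    return offspring

print(cross("Aa", "Aa"))
```

For two `Aa` parents the four equally likely combinations give one `AA`, two `Aa`, and one `aa`, the classic 1:2:1 ratio.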
Chromosomes
• Genes are contained on chromosomes
– Chromosomes are found in the nucleus of the
cells of humans and other higher organisms
– Meiosis separates chromosome pairs during
formation of egg and sperm
Image: dream designs / FreeDigitalPhotos.net
How Do Scientists Study
Genes? (1)
• Many different model organisms have been
used ranging from bacteria to plants to
insects to humans.
• Transmission genetics
– Study inheritance patterns and how traits are
passed from generation to generation
• Pedigree analysis
– Construction of family trees used to follow
transmission of genetic traits in families
(inheritance)
Pedigree Analysis
• A pedigree represents the inheritance
of a trait through several generations of
a family.
How Do Scientists Study
Genes? (2)
• Cytogenetics
– Study of the organization and arrangement of
genes on a chromosome
– Study of chromosome number and structure
• Karyotype
– A complete set of chromosomes from a cell that
has been photographed during cell division and
arranged by size and shape in a standard order
Karyotype
• A karyotype arranges the
chromosomes in a standard format so
they can be evaluated for abnormalities
How Do Scientists Study
Genes? (3)
• Molecular genetics
– The study of genetic events at the molecular level
– Identification, isolation, and analysis of specific
genes
• Population genetics
– The study of inherited variation in populations of
individuals
– Forces, such as environment, that result in
changing gene frequencies over generations
Genetics in Basic and
Applied Research
• Recombinant DNA technology
– Techniques whereby DNA fragments are linked
to self-replicating vectors, which are replicated in
a host cell, often bacteria
• Genetically modified organisms
– Carry and express genes from another species
• Clone
– Genetically identical molecules, cells, or
organisms, all derived from a single source or
parent
• Gene therapy
– Normal genes are transplanted into humans with
defective copies to treat genetic diseases
Applied Biotechnologies
• Medicine
– Vaccines
– Customized proteins for treating disease
• Agriculture
– Increased crop yields
– Lower fat content
– Disease-resistant crops
Genetic Testing
• Genes associated with hundreds of
genetic diseases have been cloned and
are used to develop genetic tests
– Cystic fibrosis
– Sickle cell anemia
– Muscular dystrophy
– Phenylketonuria (PKU)
From Genetics to
Genomics
Origin of the terms Genome and
Genomics
• The term genome was used by German
botanist Hans Winkler in 1920
• Collection of genes in a haploid set of
chromosomes
• Now it encompasses all DNA in a cell
• In 1986 mouse geneticist Thomas
Roderick used Genomics for “mapping,
sequencing and characterizing genomes”
Why should we study
genomes?
• Each and every one of us is a unique creation!
• Life’s little book of instructions
• DNA: the blueprint of life!
• The human body has 10^13 cells and each cell has
6 billion base pairs (A, C, G, T)
• A hidden language/code determines which
proteins should be made and when
• This language is common to all organisms
Genome sequence can
tell us…
• Everything about the organism's life
• Its developmental program
• Disease resistance or susceptibility
• How do we struggle, survive and die?
• Where are we going and where did we
come from?
• How similar are we to apes, trees, and
yeast?
Genomics is the study of all
genes present in an organism
Science of Genomics?
• A marriage of molecular biology, robotics,
and computing
• Tools and techniques of recombinant
DNA technology
– e.g., DNA sequencing, making libraries and
PCRs
• High-throughput technology
– e.g., robotics for sequencing
• Computers are essential for processing
and analyzing the large quantities of data
generated
Genomics relies on
high-throughput
technologies
• Automated sequencers
• Fluorescent dyes
• Robotics
– Microarray spotters
– Colony pickers
• High-throughput genetics
Technology revolution
Sequencing genomes in
Months and Years
Projects cost:
Billions $
Sequencing genomes in
HOURS/Minutes !!
Thousands $
Bioinformatics:
computational analysis of
genomics data
• Uses computational
approaches to solve
genomics problems
– Sequence analysis
– Gene prediction
– Modeling of biological
processes and networks
Introduction to the next
generation sequencing
Four Major Players
Roche: 454
Life Technologies: SOLiD / Ion Torrent
Illumina: Genome Analyzer / HiSeq /
MiSeq
Pacific Biosciences: PacBio
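Technology comparison
• Method
– PacBio: single-molecule real-time sequencing
– Ion Torrent: ion semiconductor
– 454: pyrosequencing
– Illumina: sequencing by synthesis
– SOLiD: sequencing by ligation
• Read length
– PacBio: 3 kb average
– Ion Torrent: 200 bp
– 454: 700 bp
– Illumina: 50 to 250 bp
– SOLiD: 50+35 or 50+50 bp
• Error type
– PacBio: indel
– Ion Torrent: indel
– 454: indel
– Illumina: substitution
– SOLiD: A-T bias
• Error rate (%)
– PacBio: 13 (single-pass)
– Ion Torrent: ~1
– 454: ~0.1
– Illumina: ~0.1
– SOLiD: ~0.1
• Reads per run
– PacBio: 35,000 to 75,000
– Ion Torrent: up to 4M
– 454: 1M
– Illumina: up to 3.2G
– SOLiD: 1.2 to 1.4G
• Time per run
– PacBio: 30 minutes to 2 hours
– Ion Torrent: 2 hours
– 454: 24 hours
– Illumina: 1 to 10 days
– SOLiD: 1 to 2 weeks
• Cost per 1 million bases (in US$)
– PacBio: $2
– Ion Torrent: $1
– 454: $10
– Illumina: $0.05 to $0.15
– SOLiD: $0.13
• Advantages
– PacBio: Longest read length. Fast.
– Ion Torrent: Less expensive equipment. Fast.
– 454: Long read size. Fast.
– Illumina: High sequence yield, cost, accuracy
– SOLiD: Low cost per base.
• Disadvantages
– PacBio: Low yield at high accuracy. Equipment can be very expensive.
– Ion Torrent: Homopolymer errors.
– 454: Runs are expensive. Homopolymer errors.
– Illumina: Equipment can be very expensive.
– SOLiD: Slower than other methods, read length, longevity of the platform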
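To put the per-base costs in the comparison into perspective, here is a rough back-of-the-envelope sketch of what sequencing one human-sized genome might cost on each instrument. The 3,000,000,000-base genome size and the 40x depth are assumptions chosen for illustration, ranges are represented by their low end, and the estimate ignores equipment, library preparation, labor, and analysis. The function name is hypothetical.

```python
def sequencing_cost_usd(genome_size_bases, depth, cost_per_million_bases):
    """Rough estimate: total bases sequenced times the price per million bases."""
    total_bases = genome_size_bases * depth
    return total_bases / 1_000_000 * cost_per_million_bases

GENOME_SIZE = 3_000_000_000  # assumed human-sized genome
DEPTH = 40                   # assumed target depth of coverage

# Cost per 1 million bases (US$) from the comparison; low end used for ranges.
cost_per_million = {
    "PacBio": 2.0,
    "Ion Torrent": 1.0,
    "454": 10.0,
    "Illumina": 0.05,
    "SOLiD": 0.13,
}

for instrument, price in cost_per_million.items():
    estimate = sequencing_cost_usd(GENOME_SIZE, DEPTH, price)
    print(f"{instrument}: about ${estimate:,.0f}")
```

Under these assumptions the lowest per-base prices land in the thousands of dollars per genome, while the highest reach into the hundreds of thousands or more.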
Applications
• 454
– Genome Quebec number: 3 (1)
– Applications: Small de novo genome sequencing, Amplicon sequencing, Metagenomics
• Ion Torrent
– Genome Quebec number: 1
– Applications: Small de novo genome sequencing, Amplicon sequencing, Metagenomics
• SOLiD
– Genome Quebec number: 0
– Applications: Transcriptome sequencing (RNA-Seq), Whole Exome Sequencing, Whole Genome Sequencing
• Illumina MiSeq
– Genome Quebec number: 1
– Applications: Small de novo genome sequencing, Amplicon sequencing, Metagenomics, Validation
• Illumina HiSeq 2000/2500
– Genome Quebec number: 12
– Applications: Transcriptome sequencing (RNA-Seq), Whole Exome Sequencing, Whole Genome Sequencing
• Pacific Biosciences
– Genome Quebec number: 1
– Applications: Small genomes, Long haplotype sequencing, Epigenomics
Current Applications
Different types of
sequencing libraries
From Glenn TC, Mol Ecol Resour. 2011, adapted for 2013
What is the NGS problem
about?
• Strings (reads) of 100 to ≈1,000 letters
• Puzzle of 3,000,000,000 letters
• You usually have 120,000,000,000 letters
to fit
• Many pieces don’t fit:
– Sequencing error/SNP/Structural variant
• Many pieces fit in many places:
– Low complexity region/microsatellite/repeat
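The numbers above imply an average depth of coverage: 120,000,000,000 sequenced letters spread over a 3,000,000,000-letter puzzle. The short sketch below works through that arithmetic and the matching read count for 100-letter reads; the function names are illustrative.

```python
def depth_of_coverage(total_sequenced_bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return total_sequenced_bases / genome_size

def reads_needed(target_depth, genome_size, read_length):
    """Number of reads required to reach a target average depth (rounded up)."""
    total_bases = target_depth * genome_size
    return (total_bases + read_length - 1) // read_length  # integer ceiling

genome_size = 3_000_000_000        # letters in the puzzle
sequenced = 120_000_000_000        # letters to fit

print(depth_of_coverage(sequenced, genome_size))   # 40.0
print(reads_needed(40, genome_size, 100))          # 1200000000
```

This is an average only; repeats and low-complexity regions can end up with far more or far fewer reads than the mean.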
Basecalling
• How do we translate the machine readouts
to base calls?
• How do we estimate and represent
sequencing errors?
From MICHAEL STRÖMBERG
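One widely used way to represent sequencing errors is the Phred quality score, Q = -10 · log10(P), where P is the estimated probability that a base call is wrong. The sketch below converts between Q and P and decodes a FASTQ-style quality string. The ASCII offset of 33 is an assumption about the file encoding, and the function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_qualities(quality_string, offset=33):
    """Decode an ASCII quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]

print(decode_qualities("II5#"))   # [40, 40, 20, 2]
print(phred_to_error_prob(20))    # about 0.01, i.e. 1 error in 100 calls
```

A score of 20 corresponds to roughly a 1% chance of error, and a score of 30 to roughly 0.1%.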
Trimming based on
qualities
Will generate input sequences of various lengths!
Low-quality bases can bias subsequent analysis
(e.g., SNP and SV calling, …)
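A minimal sketch of quality trimming, assuming per-base Phred scores are already decoded: bases are removed from the 3' end of a read until one meets the threshold. The threshold of 20 and the function name are assumptions for illustration; real trimmers typically use more refined rules such as sliding windows.

```python
def trim_3prime(sequence, qualities, min_quality=20):
    """Remove bases from the 3' end while their quality is below min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]

reads = [
    ("ACGTACGT", [38, 37, 36, 35, 30, 12, 8, 2]),     # poor 3' end
    ("GGCATTAC", [40, 40, 39, 38, 38, 37, 30, 25]),   # good throughout
]

for seq, quals in reads:
    trimmed, _ = trim_3prime(seq, quals)
    print(trimmed, len(trimmed))
```

The two example reads start at the same length and come out at 5 and 8 bases, which is why downstream tools have to accept reads of various lengths after trimming.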
Assembly vs mapping
[Figure: assembly compares reads all vs all to build contigs (contig1, contig2); mapping compares all reads vs a reference]
Assembly vs mapping
• Mapping:
– useful for interrogating the “known” genome
• Assembly:
– Essential if no reference genome sequence is available
– Unbiased ascertainment of variation in a known genome
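The contrast between the two approaches can be sketched with toy functions: mapping looks each read up against a reference (all vs reference), while assembly looks for overlaps between the reads themselves (all vs all). This is a simplified exact-match illustration with hypothetical names; real mappers and assemblers tolerate mismatches and use far more efficient data structures.

```python
from collections import defaultdict

def build_index(reference, k):
    """Index every k-letter substring of the reference by its positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read, reference, index, k):
    """Mapping: return every reference position where the read matches exactly."""
    hits = []
    for pos in index.get(read[:k], []):        # seed lookup with the first k letters
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits

def overlap(left, right, min_length=3):
    """Assembly building block: longest suffix of left that is a prefix of right."""
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0

reference = "ACGTTGCATGCCGTAAGT"
index = build_index(reference, k=4)
print(map_read("GCATGCC", reference, index, k=4))   # [5]
print(overlap("ACGTTGCA", "TGCATGCC"))              # 4
```

A read from a repeat would return several positions from `map_read`, which is the "many pieces fit in many places" problem in miniature.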
SNP Discovery: Goal
[Figure: sequencing errors vs. SNP]
Accurate SNP discovery depends on good base
quality and sufficient depth of coverage
Modified from Bioinformatics.ca
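To illustrate why base quality and depth both matter, here is a deliberately naive sketch of a SNP call at a single reference position. It discards low-quality bases, refuses to call when too few reliable bases remain, and reports a variant only if the alternative base is frequent enough. All thresholds and names are assumptions for illustration; production callers rely on statistical models rather than fixed cutoffs.

```python
from collections import Counter

def call_snp(ref_base, pileup, min_base_quality=20, min_depth=8,
             min_alt_fraction=0.2):
    """Naive SNP call from a pileup of (base, phred_quality) pairs.

    Returns the alternative base if supported, otherwise None.
    """
    # Keep only bases whose quality meets the threshold.
    usable = [base for base, qual in pileup if qual >= min_base_quality]
    if len(usable) < min_depth:
        return None  # insufficient reliable depth to decide
    counts = Counter(usable)
    alt_counts = {b: n for b, n in counts.items() if b != ref_base}
    if not alt_counts:
        return None  # every reliable base agrees with the reference
    alt_base, alt_n = max(alt_counts.items(), key=lambda item: item[1])
    if alt_n / len(usable) >= min_alt_fraction:
        return alt_base
    return None  # rare mismatch, more likely a sequencing error

likely_snp = [("A", 30)] * 5 + [("G", 30)] * 5
likely_error = [("A", 30)] * 9 + [("G", 30)]
low_quality = [("A", 30)] * 5 + [("G", 5)] * 5

print(call_snp("A", likely_snp))     # G
print(call_snp("A", likely_error))   # None
print(call_snp("A", low_quality))    # None
```

The third case shows the interaction: once the low-quality bases are filtered out, the remaining depth is too low to make a call at all.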
Structural variation
• Indel:
– Short insertion or deletion events < 50 bp
• Structural variations:
– Large insertion
– Large deletion
– TE insertion
– Inversion
– Interspersed duplication
– Tandem duplication
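The size rule above can be written as a small classifier: short insertions and deletions under 50 bp count as indels, and everything else in the list counts as structural variation. The function and constant names are hypothetical, and the event-type strings simply mirror the list above.

```python
INDEL_MAX_LENGTH = 50  # bp; shorter insertions/deletions are treated as indels

EVENT_TYPES = {
    "insertion", "deletion", "TE insertion", "inversion",
    "interspersed duplication", "tandem duplication",
}

def classify_event(event_type, length):
    """Label an event as 'indel' or 'structural variation' by type and size."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    if event_type in ("insertion", "deletion") and length < INDEL_MAX_LENGTH:
        return "indel"
    return "structural variation"

print(classify_event("deletion", 12))      # indel
print(classify_event("deletion", 5000))    # structural variation
print(classify_event("inversion", 30))     # structural variation
```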
Structural variant detection
From Alkan et al. 2011
Conclusion
• NGS offers a variety of technologies
and methods
• NGS is still an open field where many
areas are under construction
• NGS analysis requires both
mathematics and informatics skills
• The major challenge is actually linked to
compute and storage capacities
Lincoln Stein
(http://goo.gl/TD4tE)
Acknowledgment
• Guillaume Bourque
• Louis Létourneau
" The $1,000 genome, the $100,000 analysis?"
Elaine R. Mardis