Download Chalmers_Bioinformatics

Bioinformatics
Lecture 1
DNA - the basics
Drew Berry – DNA animations
http://www.youtube.com/watch?v=WFCvkkDSfIU&index=4&list=PL9CBBEA5A85DBCDEF
Organisation of DNA
• DNA is packed in Chromosomes
• Karyotype: chromosome set of a species
• Chromosomes are dynamic structures
The Human karyotype:
• 23 pairs of chromosomes
• 46 DNA molecules
DNA replication
• The ability of DNA to replicate itself is a fundamental driver of life
• DNA copying is catalysed by enzymes (DNA polymerases)
• The complementary strand is synthesised from a template strand, using
deoxynucleotides and a primer
• Synthesis is directional (5’→3’)
[Figure: DNA polymerase extending a primer along the template DNA strand using deoxyribonucleotides (dNTPs: A, C, G, T); strand ends labelled 5’ and 3’; the new strand is a reverse complement copy of the template]
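The "reverse complement copy" produced during replication can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original slides; the function name `reverse_complement` is hypothetical.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(template: str) -> str:
    """Return the complementary strand of `template`, written 5'->3'."""
    # Reading the template backwards gives the new strand in its own
    # 5'->3' direction; each base is then swapped for its partner.
    return "".join(COMPLEMENT[base] for base in reversed(template.upper()))


print(reverse_complement("TCAG"))  # CTGA
```

Applying the function twice returns the original sequence, which is a quick sanity check on any implementation.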
The polymerase chain reaction
• Replication requires a DNA polymerase
• Thermostable DNA polymerase (e.g. Taq polymerase)
• Efficient DNA amplification
• No error correction (Taq polymerase has no proofreading activity)
PCR cycle:
• Melt DNA (94-98 °C)
• Anneal primers (50-65 °C)
• Elongation (72 °C)
• Exponential replication
Kary Mullis, Nobel Prize in Chemistry, 1993
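The "exponential replication" of PCR can be made concrete with a small calculation. The sketch below assumes an idealised reaction; the `efficiency` parameter is a hypothetical knob added for illustration (1.0 means every strand is copied in every cycle).

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Estimate the number of DNA copies after a number of PCR cycles.

    Each cycle multiplies the copy number by (1 + efficiency), so
    efficiency=1.0 corresponds to perfect doubling.
    """
    return initial_copies * (1.0 + efficiency) ** cycles


print(pcr_copies(1, 30))  # 1073741824.0
```

Under perfect doubling, a single template molecule yields about a billion copies after 30 cycles, which is why PCR is such an efficient amplification method.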
DNA Sequencing (Sanger)
• The polymerase reaction is terminated by randomly incorporated
dideoxynucleotides (ddNTPs)
• Older methods use radiolabelled phosphate
• Newer methods use dye-labelled ddNTPs
• Truncated DNA strands are separated by size on a gel or by capillary
electrophoresis
Next Generation Sequencing
• Next generation sequencing refers to methods newer than the
Sanger approach
• A variety of techniques developed by different companies
• DNA is generally immobilized on a solid support
• Very large numbers of small reads
• Multiple reads of each section of genomic DNA (e.g. 30x coverage)
• Assembling the genome becomes a significant computational
problem
• Some ‘single molecule’ methods do not require PCR (reduces
errors)
• Cost has reduced substantially: the $1000 genome!
Refs: Metzker, M. L. Sequencing Technologies — the Next Generation. Nat. Rev. Genet. 2009,
11, 31–46.
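The idea of reading each section of the genome multiple times (e.g. 30x) is usually expressed as mean coverage: total sequenced bases divided by genome size. The sketch below uses illustrative numbers that are assumptions, not values from the slides: a 3.2 billion base haploid human genome and 150-base reads.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base of the genome is sequenced."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


GENOME_SIZE = 3_200_000_000  # assumed haploid human genome size (bases)
READ_LENGTH = 150            # assumed short-read length (bases)

print(reads_needed(30, READ_LENGTH, GENOME_SIZE))  # 640000000
```

Hundreds of millions of short reads have to be placed correctly, which is why assembly becomes a significant computational problem.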
The Human Genome Project
• Funded by US government
• A draft of the human genome was published in February
2001
• The project was completed in 2003
• Cost $US 2.7 billion in 1991 dollars
• Hierarchical shotgun sequencing (genome is
broken down into many smaller fragments)
• Automated Sanger type sequencing
Ref: http://www.nature.com/scitable/topicpage/dnasequencing-technologies-key-to-the-human-828
Human genome by function
• The human genome contains about 21,000 genes (about 100,000 were
expected!)
• 98% of the human genome is noncoding DNA
• Noncoding DNA can code for regulatory RNAs or otherwise regulate
transcription
Ref: Häggström, Wikiversity Journal of Medicine 1 (2). DOI:10.15347/wjm/2014.008. ISSN 20018762
The druggable genome – Current drug targets
Ref: Hopkins, A. L.; Groom, C. R. The Druggable Genome. Nat Rev Drug Discov 2002, 1, 727–730.
The druggable genome – Human genes
Ref: Hopkins, A. L.; Groom, C. R. The Druggable Genome. Nat Rev Drug Discov 2002, 1, 727–730.
Human genome resources
• Three useful sites providing a huge number of resources
such as genome browsers
• NCBI: National Center for Biotechnology Information
– http://www.ncbi.nlm.nih.gov/
– http://www.ncbi.nlm.nih.gov/genome/guide/human/
• UCSC genome browser
– http://genome.ucsc.edu/
• Ensembl: European site at the Sanger centre
– http://www.ensembl.org
Next-gen Sequencing Overview
Ref: http://res.illumina.com/documents/products/illumina_sequencing_introduction.pdf
Multiple Genomes
Ref: McVean et al. An Integrated Map of Genetic Variation From 1,092 Human Genomes. Nature 2012, 491, 56-65.
Bioinformatics
• Sequencing technologies produce enormous
amounts of sequence data. What do we want to do
with this?
– Identify genes
– Identify functions of gene products (proteins)
– Compare genes between species
– Identify relationships (similarities) between
species
The Genetic Code
In general:
• Amino acids that share the same biosynthetic pathway tend to have the
same first base in their codons
• Amino acids with similar physical properties have similar codons causing
conservative substitutions in the case of mutations or mistranslation
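The observation that similar amino acids have similar codons can be explored with a small translation sketch. The table below is the standard genetic code written in the compact T, C, A, G ordering; the names `CODON_TABLE` and `translate` are hypothetical helpers introduced for illustration.

```python
from itertools import product

# Standard genetic code: codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
# with the third base varying fastest; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence codon by codon (one-letter codes)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Asp (GAT) and Glu (GAA) are both acidic and differ only in the third base.
print(translate("GATGAA"))  # DE
```

A single-base change from GAT to GAA therefore swaps one acidic residue for another, which is the kind of conservative substitution the slide describes.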
Genetic mutation
The DNA sequence can be changed by a variety of
processes
Small scale:
• Damage to DNA (radiation or chemical damage)
• Replication errors
Large scale:
• Duplication of sections of DNA
• Deletion of sections of DNA
• Transposition of sections of DNA
The rate of genetic mutation
• The mutation rate (per year or per generation) differs
between species and even between different sections of
the genome
• Different types of mutations occur with different
frequencies
• The average mutation rate is estimated to be ~2.5 × 10^-8
mutations per nucleotide site, or 175 mutations per
diploid genome per generation
Ref: Nachman, M. W.; Crowell, S. L. Estimate of the Mutation Rate Per Nucleotide in Humans.
Genetics, 156, 297 (2000).
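The two figures quoted above are linked by simple arithmetic: the per-site rate multiplied by the number of sites in a diploid genome. The sketch below assumes a diploid genome of 7 × 10^9 nucleotides, a value inferred here from the two quoted numbers rather than stated on the slide.

```python
def mutations_per_generation(rate_per_site: float, diploid_genome_size: float) -> float:
    """Expected number of new mutations per diploid genome per generation."""
    return rate_per_site * diploid_genome_size


RATE_PER_SITE = 2.5e-8        # mutations per nucleotide site per generation
DIPLOID_GENOME_SIZE = 7e9     # assumed number of nucleotide sites (diploid)

print(round(mutations_per_generation(RATE_PER_SITE, DIPLOID_GENOME_SIZE)))  # 175
```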
Amino acid substitution matrices
• Substitution matrices describe the probability that
one AA is converted to another and ‘accepted’
• Matrix is a ‘log odds’ matrix, i.e. each score is the logarithm of the
ratio between the observed frequency of a substitution (e.g. Ala to Arg)
and the frequency expected by chance
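A "log odds" score compares how often a pair of amino acids is observed aligned with how often it would occur by chance. The sketch below assumes scores in half-bit units (scale factor 2 with log base 2), a common convention; the frequencies in the example are made-up illustrative values, not real data.

```python
import math


def log_odds_score(pair_freq: float, freq_a: float, freq_b: float,
                   scale: float = 2.0) -> int:
    """Rounded log-odds score: scale * log2(observed / expected).

    pair_freq      observed frequency of the aligned pair (a, b)
    freq_a, freq_b background frequencies of the two amino acids
    """
    expected = freq_a * freq_b  # chance of the pair if residues were independent
    return round(scale * math.log2(pair_freq / expected))


# Hypothetical numbers: the pair is seen half as often as chance predicts.
print(log_odds_score(0.002, 0.08, 0.05))  # -2
```

Negative scores mark substitutions seen less often than chance (disfavoured), positive scores mark substitutions seen more often than chance (accepted).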
PAM and BLOSUM matrices
• Scoring matrices are used to:
– produce sequence alignments and score similarity
between two or more proteins
– search a database to find sequences similar to a test
sequence
• Commonly used families of matrices:
– PAM (Point Accepted Mutation) matrices (Dayhoff)
• Derived from global alignments of entire proteins
• Better for closely related proteins
– BLOSUM (BLOcks SUbstitution Matrix) matrices (Steven
and Jorja Henikoff)
• Derived from local alignments of blocks of sequences
• Better for evolutionarily divergent sequences
BLAST - Searching genomes
• BLAST is a rapid method for searching protein or DNA
sequences in large databases
• Sequences are divided into overlapping words of k AAs or bases
PGFHJIQMQVVS → PGF, GFH, FHJ, HJI, etc. (k=3)
• Common or repeated sequences are discarded
• Sections of exact sequence match are searched for
• The sequence alignment is expanded from sections
that are exact matches
• BLAST is a heuristic and can miss difficult (distant) matches
http://blast.ncbi.nlm.nih.gov/
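The word-splitting and exact-match steps described above can be sketched as follows. This is a simplified illustration of seed finding only, not the actual BLAST implementation (which, for proteins, also accepts similar words scoring above a threshold); the function names `kmers` and `find_seeds` are hypothetical.

```python
from collections import defaultdict


def kmers(sequence: str, k: int = 3) -> list[str]:
    """All overlapping words of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def find_seeds(query: str, subject: str, k: int = 3) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    # Index the subject once so each query word is a dictionary lookup.
    index = defaultdict(list)
    for pos, word in enumerate(kmers(subject, k)):
        index[word].append(pos)
    return [
        (qpos, spos)
        for qpos, word in enumerate(kmers(query, k))
        for spos in index.get(word, [])
    ]


print(kmers("PGFHJIQMQVVS")[:4])      # ['PGF', 'GFH', 'FHJ', 'HJI']
print(find_seeds("PGFHJ", "AAPGFAA"))  # [(0, 2)]
```

Each seed is a starting point from which an alignment would then be extended in both directions.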
Sequence alignment
• Protein or DNA sequences can be aligned
• Differences between sequences are interpreted as
mutations, insertions or deletions
• Substitution matrices are used to score the likelihood
of a match
• Alignment scores are calculated between pairs of
sequences
• Multiple alignments can be performed
• Many alignment programs, e.g. Clustal, T-Coffee
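How an alignment score between a pair of sequences is calculated can be sketched with a basic global alignment (Needleman-Wunsch style dynamic programming). The simple match, mismatch and gap values below are assumptions chosen for illustration; in practice a substitution matrix such as PAM or BLOSUM would supply the match and mismatch scores.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b under a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b (deletion)
                score[i][j - 1] + gap,       # gap in a (insertion)
            )
    return score[-1][-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # 7
print(global_alignment_score("ACGT", "ACT"))         # 1
```

In the second example the best alignment matches three bases and opens one gap (3 × 1 − 2 = 1), showing how a difference between sequences is interpreted as an insertion or deletion.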
Sequence alignments and protein structural
similarity
• Sequence alignments are based on protein/DNA sequence similarity and not
on structural similarity
• High sequence similarity implies (but does not guarantee) structural
similarity
• High sequence similarity implies (but does not guarantee) similar protein
function
Comparison of RMSD when pairs of
similar proteins are superimposed using
the sequence alignment (X axis) and the
protein 3D structures (Y axis)
Ref: Kosloff, M.; Kolodny, R. Sequence-Similar, Structure-Dissimilar Protein Pairs in the PDB. Proteins 2008, 71, 891
Differences between sequence and structural
alignment
Chain A versus chain D from PDB ID
1vr4. The two chains are 100%
identical in sequence
A: Alignment by sequence
B: Alignment by structure
C: Overlaid structures
Ref: Kosloff, M.; Kolodny, R. Sequence-Similar, Structure-Dissimilar Protein Pairs in the PDB. Proteins 2008, 71, 891
Improving sequence alignments
• Adding structural information to sequence
alignments can improve their quality
Summary
• This lecture should provide an overview of:
– DNA sequencing and the Polymerase Chain Reaction
– Genome sequencing
– BLAST searching
– Sequence alignments and their limitations