PHYLOGENETIC ANALYSIS
By Irda Safni
TAXONOMY
Taxonomy of Bacteria and Archaea
• Modern taxonomy comprises the following features:
– Nomenclature: giving names of appropriate taxonomic rank to
the classified organisms.
– Classification: the theory and process of ordering the organisms,
on the basis of shared properties, into groups.
– Identification: obtaining data on the properties of the organism
(characterization) and determination which species it belongs to.
This is based on direct comparison to known taxonomic groups.
Nomenclature of Bacteria and Archaea
• There is a rather complicated set of rules for naming Bacteria
and Archaea. Each organism must have two names: the first refers to the
genus and the second refers to the species.
• The names can be derived from any language but they must be
Latinized. Take for example Staphylococcus aureus. The genus
name is capitalized and the species name is lower case. The name
is italicized to indicate that it is Latinized. Staphylo- is derived from the
Greek staphyle meaning "a bunch of grapes" and coccus from the
Greek meaning "a berry". Aureus is from Latin and means "golden":
a golden-yellow bunch of berries.
• The higher taxonomic ranks are family, order, class, phylum and
domain, but except for domain these are rarely used.
Species concept
• The species concept applied to eukaryotes cannot be applied to
bacteria and archaea. In fact it is quite difficult to define prokaryote
species.
• In order to be of the same species prokaryotes must share many
more properties with each other than with other prokaryotes.
• They must have similar mol % G+C. Note that two organisms having
the same mol % G+C are not necessarily of the same species.
• The DNA from organisms of the same species must show a
minimum of 70% reassociation.
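As a minimal sketch of how mol % G+C can be computed once a DNA sequence is available: the function name gc_content is illustrative, and skipping ambiguous symbols such as N is an assumption, not something the slides specify.

```python
def gc_content(seq: str) -> float:
    """Return the mol % G+C of a DNA sequence.

    Symbols other than A, C, G, T (for example N) are ignored;
    that handling is an assumption made for this sketch.
    """
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G, or T")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


if __name__ == "__main__":
    print(gc_content("ATGCGCGTTA"))
```

Two organisms can return the same value here and still belong to different species, so the number is only useful as a criterion for exclusion.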
Numerical Taxonomy
• Numerical taxonomy is a method used to differentiate a
large number of similar bacteria, i.e. species.
• A large number of tests (~100) are carried out and the results are
scored as positive or negative. Several control species are included
in the analysis.
• All characteristics are given equal weight and a computer based
analysis is carried out to group the bacteria according to shared
properties.
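The equal-weight scoring and grouping described above can be sketched as follows. The simple matching coefficient and the single-linkage grouping at a fixed threshold are assumptions chosen for illustration, and the strain names and test results are made up.

```python
from itertools import combinations


def simple_matching(a, b):
    """Fraction of tests on which two strains give the same result."""
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def group_strains(profiles, threshold):
    """Join strains whose similarity is >= threshold (single linkage)."""
    parent = {name: name for name in profiles}

    def find(name):
        # Follow parent links up to the representative of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    for a, b in combinations(profiles, 2):
        if simple_matching(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in profiles:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    # Hypothetical results of 8 tests (1 = positive, 0 = negative).
    profiles = {
        "strain_A": [1, 1, 0, 1, 0, 1, 1, 0],
        "strain_B": [1, 1, 0, 1, 0, 1, 0, 0],
        "strain_C": [0, 0, 1, 0, 1, 0, 0, 1],
        "control": [0, 0, 1, 0, 1, 1, 0, 1],
    }
    print(group_strains(profiles, threshold=0.8))
```

A real analysis would use on the order of 100 tests, as the slide notes, and usually a hierarchical clustering rather than a single cut-off.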
Homologous genes are used in the construction of phylogenetic trees
• Homologous means that genes have a common ancestor
• Orthologs are homologous genes that belong to different species but
still retain their original function
• Paralogs are homologous genes that have arisen by gene
duplication and are found in the same organism
• Only orthologs can be used in the construction of phylogenetic
trees. The classical example is the 16S ribosomal RNA gene.
Ribosomal Database Project
• The database contains over 78,000 bacterial 16S rDNA sequences
• Approximately 7,000 type strains (the bacteria are in pure culture)
• Approximately 70,000 environmental samples (bacterial and archaeal
samples collected from the environment and
characterized by molecular methods)
• http://rdp.cme.msu.edu/html/index.html
SEQUENCE ANALYSIS
What is a Sequence?
• A sequence is an ordered list of objects (or events).
• A biological sequence is a single, continuous molecule
of nucleic acid or protein.
• Sequence analysis in bioinformatics is an automated,
computer-based examination of characteristic
fragments, e.g. of a DNA strand.
• The term "sequence analysis" in biology implies
subjecting a DNA or peptide sequence to sequence
alignment, sequence databases, repeated sequence
searches, or other bioinformatics methods on a
computer.
Nucleotide Sequence Databases
• NCBI (National Center for Biotechnology
Information)
• EMBL (European Molecular Biology Laboratory)
• DDBJ (DNA DataBank of Japan)
Sequence Alignment
• The identification of residue-residue correspondences
• The basic tool in bioinformatics
Why Sequence Alignment?
• For discovering functional, structural and
evolutionary information in biological sequences
• Eases further tasks like:
– Annotation of new sequences
– Modeling of protein structures
– Design and analysis of gene expression experiments
Basic Steps in Sequence Alignment
• Comparison of sequences to find similarity and
dissimilarity in compared sequences
• Identification of gene-structures, reading frames,
distributions of introns and exons and regulatory
elements
• Finding and comparing point mutations to identify
genetic markers
• Revealing evolutionary and genetic diversity
• Functional annotation of genes
The Concept
• An alignment is a mutual arrangement of two
sequences
• Exhibits where two sequences are similar, and where
they differ
• An "optimal" alignment has the most correspondences and
the fewest differences
• Sequences that are similar probably have the same
function
Sequence alignment involves the identification of the
correct location of deletions and insertions that have
occurred in either of the two lineages since the
divergence from a common ancestor.
Terms of sequence comparison
Sequence identity
• Exactly the same nucleotide or amino acid at the same position
Sequence similarity
• Substitutions with similar chemical properties
Sequence homology
• General term that indicates evolutionary relatedness
among sequences
• Sequences are homologous if they are derived from a
common ancestral sequence.
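To make the difference between identity and similarity concrete, here is a small sketch for two already-aligned protein sequences. The amino acid groups are an illustrative assumption (real tools use substitution matrices instead), and dividing by the full alignment length is only one of several conventions.

```python
# Illustrative grouping by side-chain chemistry (an assumption for this
# sketch; alignment programs use substitution matrices instead).
SIMILAR_GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH", "C", "G", "P"]
GROUP_OF = {aa: idx for idx, group in enumerate(SIMILAR_GROUPS) for aa in group}


def identity_and_similarity(a: str, b: str):
    """Return (% identity, % similarity) for two aligned sequences.

    Both values are percentages of the alignment length, gaps included.
    """
    if len(a) != len(b) or not a:
        raise ValueError("aligned sequences must be non-empty, equal length")
    identical = similar = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # a gap column is neither identical nor similar
        if x == y:
            identical += 1
            similar += 1
        elif x in GROUP_OF and GROUP_OF[x] == GROUP_OF.get(y):
            similar += 1
    return 100.0 * identical / len(a), 100.0 * similar / len(a)


if __name__ == "__main__":
    print(identity_and_similarity("KLDE", "RLEE"))
```

Neither number says anything about homology by itself: homology is a statement about common ancestry, which high similarity can support but not prove.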
Things to consider
• To find the best alignment one needs to examine all
possible alignments
• To reflect the quality of the possible alignments one
needs to score them
• There can be different alignments with the same
highest score
• Variations in the scoring scheme may change the
ranking of alignments
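The last point can be illustrated with a small sketch that scores two candidate alignments of the same pair of sequences under two scoring schemes. The sequences and the penalty values are made up for illustration.

```python
def score_alignment(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Sum column scores for two aligned sequences ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


if __name__ == "__main__":
    # Two candidate alignments of ACGTAC and AGTACC (hypothetical example).
    ungapped = ("ACGTAC", "AGTACC")    # 2 matches, 4 mismatches, 0 gaps
    gapped = ("ACGTAC-", "A-GTACC")    # 5 matches, 0 mismatches, 2 gaps

    for name, scheme in [
        ("cheap gaps", dict(match=1, mismatch=-1, gap=-1)),
        ("costly gaps", dict(match=1, mismatch=0, gap=-3)),
    ]:
        print(name,
              score_alignment(*ungapped, **scheme),
              score_alignment(*gapped, **scheme))
```

With cheap gaps the gapped alignment scores higher; with costly gaps and free mismatches the ungapped one does, so the ranking depends on the scheme.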
Manual alignment
• When there are few gaps and the two sequences are
not too different from each other, a reasonable
alignment can be obtained by visual inspection.
• Advantages:
(1) use of a powerful and trainable tool (the brain,
well… some brains).
(2) ability to integrate additional data
• Disadvantage: the method is subjective and does not scale.
Types of Alignment
- Pairwise Alignment
• Dot Matrix Method
• Dynamic Programming
• Word Method
- Multiple Alignment
• Dynamic Programming
• Progressive Methods
• Iterative Methods
• Motif Finding
Pairwise Sequence Alignment
• One pair of elements at a time
• Challenge: find the optimum alignment of two
sequences with some degree of similarity
• Optimality is based on a score
• The score reflects the number of paired characters
in the two sequences and the number and length of gaps
introduced to adjust the sequences so that the maximum
number of characters are in alignment
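One way to find a score-optimal pairwise alignment is dynamic programming. The sketch below is a Needleman-Wunsch style global alignment with a linear gap penalty. The slides do not name a specific algorithm or scoring values, so both are assumptions here.

```python
def needleman_wunsch(s: str, t: str, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming (linear gap penalty).

    Returns (score, aligned_s, aligned_t); ties are broken arbitrarily.
    """
    n, m = len(s), len(t)

    def sub(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    # F[i][j] = best score for aligning s[:i] with t[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub(i, j),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(i, j):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F[n][m], "".join(reversed(a)), "".join(reversed(b))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("GATTACA", "GATACA")
    print(score)
    print(top)
    print(bottom)
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two sequence lengths.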
A pairwise alignment consists of a series of paired bases,
one base from each sequence. There are three types of
pairs:
(1) matches = the same nucleotide appears in both
sequences.
(2) mismatches = different nucleotides are found in the
two sequences.
(3) gaps = a base in one sequence and a null base in the
other.
Example alignment showing matches, mismatches, and gaps:
GCGGCCCATCAGGTACTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
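The three pair types can be counted column by column. The sketch below uses the two example rows written as 24-column strings with no internal spaces, which is an assumption about the original slide layout. The function name is illustrative.

```python
def classify_columns(a: str, b: str) -> dict:
    """Count matches, mismatches, and gaps in a pairwise alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            counts["gap"] += 1       # a base opposite a null base
        elif x == y:
            counts["match"] += 1     # same nucleotide in both sequences
        else:
            counts["mismatch"] += 1  # different nucleotides
    return counts


if __name__ == "__main__":
    top = "GCGGCCCATCAGGTACTTGGTG-G"
    bottom = "GCGTTCCATC--CTGGTTGGTGTG"
    print(classify_columns(top, bottom))
```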
FASTA
1) Derived from logic of the dot plot
– compute best diagonals from all frames of alignment
2) Word method looks for exact matches between words in query and test
sequence
– hash tables (fast computer technique)
– DNA words are usually 6 bases
– protein words are 1 or 2 amino acids
– only searches for diagonals in region of word
matches = faster searching
FastA searches can be done on the WWW FastA server at EBI:
http://www2.ebi.ac.uk/fasta3/
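The word-lookup idea can be sketched as follows: index every word of the query in a hash table, scan the test sequence, and tally which diagonals collect the most word hits. This illustrates only the first stage of the idea and is not the FASTA program's actual implementation. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def index_words(seq: str, k: int) -> dict:
    """Hash table mapping each k-letter word to its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query: str, target: str, k: int = 6) -> Counter:
    """Count exact word matches per diagonal (target pos - query pos)."""
    index = index_words(query, k)
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals


if __name__ == "__main__":
    query = "AAACCCGGGTTT"
    target = "TTAAACCCGGGTTTAA"
    # The diagonal with the most hits marks the region worth aligning.
    print(diagonal_hits(query, target, k=6).most_common(1))
```

Restricting the later, more expensive alignment to the best diagonals is what makes the search faster than a full comparison.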
FASTA Format
• simple format used by almost all programs
• >header line with a [return] at end
• Sequence (no specific requirements for line
length, characters, etc)
>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
GAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTC
CATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC
CCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT
..
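Because the format is this simple, a reader for it fits in a few lines. The following is a minimal sketch (the function name parse_fasta is illustrative) that accepts any iterable of text lines and joins wrapped sequence lines.

```python
def parse_fasta(lines):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Hypothetical two-record input used only for demonstration.
    text = ">seq1 example\nACGTAC\nGGTT\n>seq2\nTTGACA\n"
    for name, seq in parse_fasta(text.splitlines()):
        print(name, len(seq))
```

The same function works on an open file object, since iterating over a file yields its lines.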
BLAST Searches GenBank
[BLAST= Basic Local Alignment Search Tool]
The NCBI BLAST web server lets you compare
your query sequence to various sections of
GenBank:
– nr = non-redundant (main sections)
– month = new sequences from the past few weeks
– ESTs
– human, Drosophila, yeast, or E. coli genomes
– proteins (by automatic translation)
• This is a VERY fast and powerful computer.
BLAST
• Uses word matching like FASTA
• Similarity matching of words (3 aa’s, 11 bases)
– does not require identical words.
• If no words are similar, then no alignment
– won’t find matches for very short sequences
• Does not handle gaps well
Phylogeny
Phylogenetics: the study of ancestor-descendant
relationships. The objective of phylogeneticists is to
construct phylogenies.
Phylogeny: a hypothesis of ancestor-descendant
relationships.
Phylogenetic tree: a graphical summary of a
phylogeny
Phylogeny
All life forms are related by common ancestry
and descent. The construction of phylogenies
provides explanations of the diversity seen in
the natural world.
Phylogenies can be based on morphological
data, physiological data, molecular data or all
three. Today, phylogenies are usually
constructed using DNA sequence data
Phylogenetic trees
Two different formats of phylogenetic trees used to show
relatedness among species.
What is phylogenetic analysis and why should we
perform it?
Phylogenetic analysis has two major components:
1. Phylogeny inference or "tree building":
the inference of the branching orders, and
ultimately the evolutionary relationships,
between "taxa" (entities such as genes,
populations, species, etc.)
2. Character and rate analysis:
using phylogenies as analytical frameworks
for rigorous understanding of the evolution of
various traits or conditions of interest
Unrooted and rooted trees
Representations of the possible relatedness between three species, A, B,
and C. (A) A single unrooted tree (shown in both formats; see Figure 17.4).
(B) Three possible rooted trees (in one format).
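The counts in this figure (one unrooted tree and three rooted trees for three species) follow from the standard formulas for strictly bifurcating trees: (2n-5)!! unrooted and (2n-3)!! rooted topologies for n taxa. A short sketch with illustrative function names:

```python
def _odd_double_factorial(k: int) -> int:
    """Product 1 * 3 * 5 * ... * k for odd k (returns 1 when k <= 1)."""
    result = 1
    for factor in range(3, k + 1, 2):
        result *= factor
    return result


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating topologies for n >= 3 taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return _odd_double_factorial(2 * n - 5)


def num_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating topologies for n >= 2 taxa."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return _odd_double_factorial(2 * n - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

The numbers grow very quickly with n, which is why tree-building programs search heuristically instead of enumerating every topology.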
• Automatic and manual sequence alignment
• Inferring phylogenetic trees
• Mining web-based databases
• Estimating rates of molecular evolution
• Testing evolutionary hypotheses
• MEGA works on Windows, Mac OS, and Linux
Get the mRNA sequence of chicken LDHA (accession X53828) from the database
• Choose "Query Databanks"
• Search for the sequence
• Add to Alignment
Now, get only the CDS:
• Scroll down and follow the link to the CDS
• Get the FASTA sequence
• Add to Alignment
Alignment Explorer
• Close the MEGA Web-Browser and examine the mRNA and CDS sequences
• Edit the names of the sequences
• Align the DNA sequences
• At the DNA level, cut the UTR region from the mRNA
• Align the DNA sequences again and translate to proteins
• Create a new alignment from the FASTA file ldh_a-c.fas
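Outside MEGA, the "cut the UTR and translate" steps can be approximated in a few lines. This sketch assumes the coding region starts at the first ATG and ends at the first in-frame stop codon, which is a simplification (in the walkthrough the CDS boundaries come from the database record). It uses the standard genetic code, and the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def extract_cds(mrna: str) -> str:
    """Return the region from the first ATG through the first in-frame
    stop codon, or '' if no such region exists (simplifying assumption)."""
    mrna = mrna.upper().replace("U", "T")
    start = mrna.find("ATG")
    if start == -1:
        return ""
    for i in range(start, len(mrna) - 2, 3):
        if CODON_TABLE.get(mrna[i:i + 3]) == "*":
            return mrna[start:i + 3]
    return ""


def translate(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon.
    Codons with unknown symbols are rendered as 'X'."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3].upper(), "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    mrna = "GGCATGGCTAAATAGCC"  # made-up sequence with short UTRs
    cds = extract_cds(mrna)
    print(cds, translate(cds))
```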
Further analysis
• Export alignment to MEGA format
• Save the data to a MEGA file
• Give it an appropriate title
• Specify whether it is a protein-coding sequence
• Open the data file in the Sequence Data Explorer