Phylogenetic Analysis
By Irda Safni

TAXONOMY

Taxonomy of Bacteria and Archaea
• Modern taxonomy comprises the following features:
– Nomenclature: giving names of appropriate taxonomic rank to the classified organisms.
– Classification: the theory and process of ordering organisms into groups on the basis of shared properties.
– Identification: obtaining data on the properties of an organism (characterization) and determining which species it belongs to, based on direct comparison with known taxonomic groups.

Nomenclature of Bacteria and Archaea
• There is a quite complicated set of rules for naming Bacteria and Archaea. Each organism must have two names: the first refers to the genus and the second to the species.
• The names can be derived from any language, but they must be Latinized. Take for example Staphylococcus aureus. The genus name is capitalized and the species name is lower case. The name is italicized to indicate that it is Latinized. Staphylo- is derived from the Greek staphyle, meaning "a bunch of grapes", and coccus from the Greek for "a berry". Aureus is Latin and means "golden": a golden bunch of berries.
• The higher taxonomic ranks are family, order, class, phylum and domain, but except for domain these are rarely used.

Species concept
• The species concept applied to eukaryotes cannot be applied to bacteria and archaea. In fact, it is quite difficult to define a prokaryotic species.
• To be of the same species, prokaryotes must share many more properties with each other than with other prokaryotes.
• They must have similar mol % G+C. Note that two organisms with the same mol % G+C are not necessarily of the same species.
• The DNA from organisms of the same species must show a minimum of 70% reassociation.

Numerical Taxonomy
• Numerical taxonomy is a method used to differentiate a large number of similar bacteria, i.e. species.
• A large number of tests (~100) are carried out and the results are scored as positive or negative. Several control species are included in the analysis.
• All characteristics are given equal weight, and a computer-based analysis groups the bacteria according to shared properties.

Homologous genes are used in the construction of phylogenetic trees
• Homologous means that the genes have a common ancestor.
• Orthologs are homologous genes that belong to different species but still retain their original function.
• Paralogs are homologous genes that have arisen by gene duplication and are found in the same organism.
• Only orthologs can be used in the construction of phylogenetic trees. The classical example is the 16S ribosomal RNA gene.

Ribosomal Database Project
• The database contains over 78,000 bacterial 16S rDNA sequences.
• Approximately 7,000 type strains (the bacteria are in pure culture).
• Approximately 70,000 environmental samples (bacterial and archaeal samples collected from the environment and characterized by molecular methods).
• http://rdp.cme.msu.edu/html/index.html

SEQUENCE ANALYSIS

What is a Sequence?
• A sequence is an ordered list of objects (or events).
• A biological sequence is a single, continuous molecule of nucleic acid or protein.
• Sequence analysis in bioinformatics is an automated, computer-based examination of characteristic fragments, e.g. of a DNA strand.
• The term "sequence analysis" in biology implies subjecting a DNA or peptide sequence to sequence alignment, sequence databases, repeated sequence searches, or other bioinformatics methods on a computer.

Nucleotide Sequence Databases
• NCBI (National Center for Biotechnology Information)
• EMBL (European Molecular Biology Laboratory)
• DDBJ (DNA DataBank of Japan)

Sequence Alignment
• The identification of residue-residue correspondences
• The basic tool in bioinformatics

Why Sequence Alignment?
• For discovering functional, structural and evolutionary information in biological sequences
• Eases further tasks such as:
– Annotation of new sequences
– Modeling of protein structures
– Design and analysis of gene expression experiments

Basic Steps in Sequence Alignment
• Comparing sequences to find their similarities and differences
• Identifying gene structures, reading frames, distributions of introns and exons, and regulatory elements
• Finding and comparing point mutations to obtain genetic markers
• Revealing evolutionary and genetic diversity
• Functional annotation of genes

The Concept
• An alignment is a mutual arrangement of two sequences
• It shows where two sequences are similar and where they differ
• An "optimal" alignment has the most correspondences and the fewest differences
• Sequences that are similar probably have the same function

Sequence alignment involves identifying the correct location of deletions and insertions that have occurred in either of the two lineages since their divergence from a common ancestor.

Terms of sequence comparison
• Sequence identity: exactly the same nucleotide or amino acid in the same position.
• Sequence similarity: substitutions with similar chemical properties.
• Sequence homology: a general term that indicates evolutionary relatedness among sequences. Sequences are homologous if they are derived from a common ancestral sequence.

Things to consider
• To find the best alignment, one needs to examine all possible alignments
• To reflect the quality of the possible alignments, one needs to score them
• There can be different alignments with the same highest score
• Variations in the scoring scheme may change the ranking of alignments

Manual alignment
• When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection.
• Advantages: (1) use of a powerful and trainable tool (the brain, well… some brains).
(2) ability to integrate additional data.
• Disadvantage: the method is subjective and unscalable.

Types of Alignment
• Pairwise alignment
– Dot matrix method
– Dynamic programming
– Word method
• Multiple alignment
– Dynamic programming
– Progressive methods
– Iterative methods
– Motif finding

Pairwise Sequence Alignment
• One pair of elements at a time
• Challenge: find the optimum alignment of two sequences with some degree of similarity
• Optimality is based on a score
• The score reflects the number of paired characters in the two sequences and the number and length of gaps introduced to adjust the sequences so that the maximum number of characters are in alignment

A pairwise alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs:
(1) matches: the same nucleotide appears in both sequences.
(2) mismatches: different nucleotides are found in the two sequences.
(3) gaps: a base in one sequence and a null base in the other.

Example (the original figure labels a match, a gap and a mismatch):
GCGGCCCATCAGGTACTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG

FASTA
1) Derived from the logic of the dot plot: computes the best diagonals from all frames of alignment
2) The word method looks for exact matches between words in the query and test sequences
– hash tables (a fast computing technique)
– DNA words are usually 6 bases
– protein words are 1 or 2 amino acids
– only searches for diagonals in the region of word matches, which makes searching faster

FastA searches can be done on the WWW FastA server at EBI: http://www2.ebi.ac.uk/fasta3/

FASTA Format
• A simple format used by almost all programs
• A header line beginning with ">" and ending with a [return]
• Sequence (no specific requirements for line length, characters, etc.)

>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
GAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTC
CATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC
CCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT
..

BLAST Searches GenBank
[BLAST = Basic Local Alignment Search Tool]
The NCBI BLAST web server lets you compare your query sequence to various sections of GenBank:
– nr = non-redundant (main sections)
– month = new sequences from the past few weeks
– ESTs
– human, Drosophila, yeast, or E. coli genomes
– proteins (by automatic translation)
• This is a VERY fast and powerful computer.

BLAST
• Uses word matching like FASTA
• Similarity matching of words (3 amino acids, 11 bases); does not require identical words
• If no words are similar, there is no alignment, so it won't find matches for very short sequences
• Does not handle gaps well

Phylogeny
• Phylogenetics: the study of ancestor-descendant relationships. The objective of phylogeneticists is to construct phylogenies.
• Phylogeny: a hypothesis of ancestor-descendant relationships.
• Phylogenetic tree: a graphical summary of a phylogeny.
• All life forms are related by common ancestry and descent. The construction of phylogenies provides explanations of the diversity seen in the natural world.
• Phylogenies can be based on morphological data, physiological data, molecular data or all three. Today, phylogenies are usually constructed using DNA sequence data.

Phylogenetic trees
Two different formats of phylogenetic trees are used to show relatedness among species.

What is phylogenetic analysis and why should we perform it?
Phylogenetic analysis has two major components:
1. Phylogeny inference or "tree building": the inference of the branching orders, and ultimately the evolutionary relationships, between "taxa" (entities such as genes, populations, species, etc.)
2. Character and rate analysis: using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest

Unrooted and rooted trees
Representations of the possible relatedness between three species, A, B, and C. (A) A single unrooted tree (shown in both formats; see Figure 17.4). (B) Three possible rooted trees (in one format).

MEGA
• Automatic and manual sequence alignment
• Inferring phylogenetic trees
• Mining web-based databases
• Estimating rates of molecular evolution
• Testing evolutionary hypotheses
• MEGA works on Windows, Mac OS, and Linux

Get the mRNA sequence of chicken LDHA (accession X53828) from the database
• Choose "Query Databanks"
• Search for the sequence
• Add to Alignment

Now, get only the CDS:
• Scroll down and follow the link to the CDS
• Get the FASTA sequence
• Add to Alignment

Alignment Explorer
• Close the MEGA Web-Browser and examine the mRNA and CDS sequences
• Edit the names of the sequences
• Align the DNA sequences
• At the DNA level, cut the UTR region from the mRNA
• Align the DNA sequences again and translate to proteins
• Create a new alignment from the FASTA file ldh_a-c.fas

Further analysis
• Export the alignment to MEGA format
• Save the data to a MEGA file
• Give it an appropriate title
• Specify whether it is a protein-coding sequence
• Open the data file in the Sequence Data Explorer
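
Worked sketch: scoring a pairwise alignment

The pairwise alignment slides describe matches, mismatches and gaps, and say that optimality is based on a score found by dynamic programming. The sketch below may help make that concrete. It is an illustration under stated assumptions, not the method used by MEGA, FASTA or BLAST: the scoring values (match +1, mismatch -1, gap -2) and the function names are hypothetical, and the two example rows are the pairwise alignment example from the slides with the figure spacing removed. The dynamic programming routine is a plain Needleman-Wunsch global alignment that returns only the best score, not the alignment itself.

```python
def classify_columns(row_a, row_b):
    """Count matches, mismatches and gaps in two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            counts["gap"] += 1        # a base paired with a null base
        elif a == b:
            counts["match"] += 1      # same nucleotide in both sequences
        else:
            counts["mismatch"] += 1   # different nucleotides
    return counts


def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score an existing alignment. The default values are assumptions."""
    c = classify_columns(row_a, row_b)
    return c["match"] * match + c["mismatch"] * mismatch + c["gap"] * gap


def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score by dynamic programming (Needleman-Wunsch).

    Only two rows of the score matrix are kept in memory at a time.
    """
    # Row 0: aligning the empty prefix of seq_a against prefixes of seq_b.
    prev = [j * gap for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        curr = [i * gap]  # column 0: prefix of seq_a against the empty prefix
        for j, b in enumerate(seq_b, start=1):
            diagonal = prev[j - 1] + (match if a == b else mismatch)
            up = prev[j] + gap        # gap inserted in seq_b
            left = curr[j - 1] + gap  # gap inserted in seq_a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


# Example rows taken from the pairwise alignment slide (spacing removed).
row_a = "GCGGCCCATCAGGTACTTGGTG-G"
row_b = "GCGTTCCATC--CTGGTTGGTGTG"
print(classify_columns(row_a, row_b))
print(alignment_score(row_a, row_b))
print(global_alignment_score(row_a.replace("-", ""), row_b.replace("-", "")))
```

Because the slide alignment is one of the many possible alignments of the two ungapped sequences, the dynamic programming score can never be lower than the score of the slide alignment under the same scoring scheme. Changing the assumed penalties may change which alignment ranks highest, as noted under "Things to consider".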