TDR-HAT Bioinformatics Course
Etienne de Villiers, Sonal Patel
ILRI - Kenya

Outline
1. Introduction
2. Nucleic acid sequence analysis
3. Protein sequence analysis
4. Accessing Completed Genomes
5. Homology Searching
6. Multiple Sequence Alignments
7. Comparative Genomics

A gene codes for a protein
Gene/DNA: CCTGAGCCAACTATTGATGAA
   (transcription)
mRNA:     CCUGAGCCAACUAUUGAUGAA
   (translation)
Protein:  PEPTIDE
Eukaryotes have spliced genes…

Promises of genomics and bioinformatics
• Medicine
– Knowledge of protein structure facilitates drug design
– Understanding of genomic variation allows the tailoring of medical treatment to the individual’s genetic make-up
– Genome analysis allows the targeting of genetic diseases
– The effect of a disease or of a therapeutic on RNA and protein levels can be elucidated
• The same techniques can be applied to biotechnology, crop and livestock improvement, etc.

What is bioinformatics?
• Application of information technology to the storage, management and analysis of biological information
• Facilitated by the use of computers

What is bioinformatics?
• Sequence analysis
– Geneticists/molecular biologists analyse genome sequence information to understand disease processes
• Molecular modeling
– Crystallographers/biochemists design drugs using computer-aided tools
• Phylogeny/evolution
– Geneticists obtain information about the evolution of organisms by looking for similarities in gene sequences
• Ecology and population studies
– Bioinformatics is used to handle large amounts of data obtained in population studies

Sequence analysis: overview
[Flow chart. Box labels, in slide order: Sequencing project management; Nucleotide sequence analysis; Sequence entry; Sequence database browsing; Manual sequence entry; Nucleotide sequence file; Search for protein coding regions; Search databases for similar sequences; Design further experiments; Restriction mapping; PCR planning; coding; non-coding; Protein sequence analysis; Translate into protein; Search databases for similar sequences; Sequence comparison; Search for known motifs; RNA structure prediction; Create a multiple sequence alignment; Edit the alignment; Molecular phylogeny; Search for known motifs; Predict secondary structure; Sequence comparison; Multiple sequence analysis; Format the alignment for publication; Protein sequence file; Protein family analysis; Predict tertiary structure]

Gene Sequencing
Automated chemical sequencing methods allow rapid generation of large data banks of gene sequences.

Database similarity searching
The BLAST program was written to allow rapid comparison of a new gene sequence with the hundreds of thousands of gene sequences in databases.

                                                                    Score     E
Sequences producing significant alignments:                         (bits)  Value
gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]       112   7e-26
gi|603258 (U18795) Prb1p: vacuolar protease B [Saccharomyces ce...    106   5e-24
gnl|PID|e264388 (X59720) YCR045c, len:491 [Saccharomyces cerevi...     69   7e-13
gnl|PID|e239708 (Z71514) ORF YNL238w [Saccharomyces cerevisiae]        30   0.66
gnl|PID|e239572 (Z71603) ORF YNL327w [Saccharomyces cerevisiae]        29   1.1
gnl|PID|e239737 (Z71554) ORF YNL278w [Saccharomyces cerevisiae]        29   1.5

gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]
Length = 478
Score = 112 bits (278), Expect = 7e-26
Identities = 85/259 (32%), Positives = 117/259 (44%), Gaps = 32/259 (12%)

Query:   2 QSVPWGISRVQAPAAHNRG---------LTGSGVKVAVLDTGIST-HPDLNIRGG-ASFV 50
           +  PWG+ RV        G           G GV   VLDTGI T H D   R    + +
Sbjct: 174 EEAPWGLHRVSHREKPKYGQDLEYLYEDAAGKGVTSYVLDTGIDTEHEDFEGRAEWGAVI 233

Query:  51 PGEPSTQDGNGHGTHVAGTIAALNNSIGVLGVAPSAELYXXXXXXXXXXXXXXXXXQGLE 110
           P      D NGHGTH AG I + +      GVA + ++                  +G+E
Sbjct: 234 PANDEASDLNGHGTHCAGIIGSKH-----FGVAKNTKIVAVKVLRSNGEGTVSDVIKGIE 288

Sequence comparison
Gene sequences can be aligned to see similarities between genes from different sources.

768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813
    ||     ||   ||  |  | ||| |  |||| |||||    |||  |||
 87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135

814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863
    |  | |              |  |||||| |   ||||  | || |   |
136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG 172

864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913
    ||| | ||| || || |||   |   |||||||||  ||   |||||| |
173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216

Restriction mapping
Genes can be analyzed to detect sequences that can be cleaved by restriction enzymes.
[Restriction map of a sequence of about 250 bp; scale marks at 50, 100, 150, 200, 250]

Enzyme     Cuts  Recognition site
AceIII     1     CAGCTCnnnnnnn'nnn...
AluI       2     AG'CT
AlwI       1     GGATCnnnn'n_
ApoI       2     r'AATT_y
BanII      1     G_rGCy'C
BfaI       2     C'TA_G
BfiI       1     ACTGGG
BsaXI      1     ACnnnnnCTCC
BsgI       1     GTGCAGnnnnnnnnnnn...
BsiHKAI    1     G_wGCw'C
Bsp1286I   1     G_dGCh'C
BsrI       2     ACTG_Gn'
BsrFI      1     r'CCGG_y
CjeI       2     CCAnnnnnnGTnnnnnn...
CviJI      4     rG'Cy
CviRI      1     TG'CA
DdeI       2     C'TnA_G
DpnI       2     GA'TC
EcoRI      1     G'AATT_C
HinfI      2     G'AnT_C
MaeIII     1     'GTnAC_
MnlI       1     CCTCnnnnnn_n'
MseI       2     T'TA_A
MspI       1     C'CG_G
NdeI       1     CA'TA_TG
Sau3AI     2     'GATC_
SstI       1     G_AGCT'C
TfiI       2     G'AwT_C
Tsp45I     1     'GTsAC_
Tsp509I    3     'AATT_
TspRI      1     CAGTGnn'

PCR Primer Design
Oligonucleotides for use in the polymerase chain reaction can be designed using computer-based programs.

OPTIMAL primer length                 --> 20
MINIMUM primer length                 --> 18
MAXIMUM primer length                 --> 22
OPTIMAL primer melting temperature    --> 60.000
MINIMUM acceptable melting temp       --> 57.000
MAXIMUM acceptable melting temp       --> 63.000
MINIMUM acceptable primer GC%         --> 20.000
MAXIMUM acceptable primer GC%         --> 80.000
Salt concentration (mM)               --> 50.000
DNA concentration (nM)                --> 50.000
MAX no. unknown bases (Ns) allowed    --> 0
MAX acceptable self-complementarity   --> 12
MAXIMUM 3' end self-complementarity   --> 8
GC clamp how many 3' bases            --> 0

Gene discovery
Computer programs can be used to recognize the protein coding regions in DNA.
[Plot created using codon preference (GCG): three panels, x axis 0 to 4,000, y axis 0.0 to 2.0]

Protein structure prediction
Particular structural features can be recognized in protein sequences.
[Plot along the sequence (positions 50, 100) with tracks: KD Hydrophobicity (-5.0 to 5.0); Surface Prob. (0.0 to 10); Flexibility (0.8 to 1.2); Antigenic Index (-1.7 to 1.7); CF Turns; CF Alpha Helices; CF Beta Sheets; GOR Turns; GOR Alpha Helices; GOR Beta Sheets; Glycosylation Sites]

Protein Structure
The 3-D structure of proteins is used to understand protein function and design new drugs.

Multiple sequence alignment
Sequences of proteins from different organisms can be aligned to see similarities and differences.
[Alignment formatted using Jalview]

Phylogeny inference
Analysis of sequences allows evolutionary relationships to be determined.
[Phylogenetic tree constructed using the Phylip package; leaves: E.coli, C.botulinum, C.cadavers, C.butyricum, B.subtilis, B.cereus]

DNA sequence analysis

Inferring function by homology
• The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology searching, the equivalent genes in one species to those known to be important in other model species.
• Logic: if the linear sequences of a pair of proteins align well, we can infer that their 3-dimensional structures are similar; if the 3-D structures are similar, there is a good chance that the functions are similar.

Basic Local Alignment Search Tools (BLAST)
• BLAST programs (there are several) compare a query sequence to all the sequences in a database in a pairwise manner.
• BLAST breaks the query and database sequences into fragments known as "words", and seeks matches between them.
• It attempts to align query words of length "W" to words in the database such that the alignment scores at least a threshold value, "T"; these matches are known as High-Scoring Segment Pairs (HSPs).
• HSPs are then extended in either direction in an attempt to generate an alignment with a score exceeding another threshold, "S", known as a Maximal-Scoring Segment Pair (MSP).

2 sequence alignment
To align GARFIELDTHECAT with GARFIELDTHERAT is easy:

GARFIELDTHECAT
||||||||||| ||
GARFIELDTHERAT

Gaps
Sometimes you can get a better overall alignment if you insert gaps.

GARFIELDTHECAT
||||||||   |||
GARFIELDA--CAT

is better (scores higher) than

GARFIELDTHECAT
||||||||
GARFIELDACAT

No gap penalty
But there has to be some sort of gap penalty, otherwise you can align ANY two sequences:

G-R--E------AT
| |  |      ||
GARFIELDTHECAT

Affine gap penalty
• Could set a score for each indel
• Usually use an affine gap penalty
– open + (extend × gap length)
• Open –10, extend -0.05

2+ similar sequences
• When doing a similarity search against a database you are trying to decide which of many sequences is the CLOSEST match to your search sequence.
• Which of the following alignment pairs is better?

Scoring Alignments

GARFIELDTHECAT
||||   |||||||
GARFRIEDTHECAT

GARFIELDTHECAT
||| |||  |||||
GARWIELESHECAT

GARFIELDTHECAT
||  ||||||| ||
GAVGIELDTHEMAT

?

Low Complexity Masking
• Some sequences are similar even if they have no recent common ancestor.
• Huntington's disease is caused by poly-CAG tracts in the DNA, which result in polyglutamine (Gln, Q) tracts in the protein.
• If you do a homology search with QQQQQQQQQQ you get hits to other proteins that have a lot of glutamines but have totally different functions.
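The gap-penalty slides above can be made concrete with a small scoring function. This is a minimal sketch, not the scoring used by BLAST or any particular aligner: the match and mismatch values (+5, -4) are assumptions chosen for illustration, the gap values are the ones quoted on the slide, and the function name `score_alignment` is hypothetical. Each run of gap characters is charged one opening cost plus an extension cost per gap position.

```python
def score_alignment(a, b, match=5.0, mismatch=-4.0,
                    gap_open=-10.0, gap_extend=-0.05):
    """Score two already-aligned strings of equal length ('-' marks a gap).

    match/mismatch are illustrative assumptions; gap_open and gap_extend
    are the values quoted on the slide.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0.0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            if not in_gap:
                # First position of a gap run: pay the opening cost once.
                score += gap_open
                in_gap = True
            # Every gap position, including the first, pays the extension cost.
            score += gap_extend
        else:
            in_gap = False
            score += match if x == y else mismatch
    return score


gapped = score_alignment("GARFIELDTHECAT", "GARFIELDA--CAT")
# The "align anything" example: free gaps versus penalised gaps.
free = score_alignment("G-R--E------AT", "GARFIELDTHECAT",
                       gap_open=0.0, gap_extend=0.0)
penalised = score_alignment("G-R--E------AT", "GARFIELDTHECAT")
print(gapped, free, penalised)
```

With free gaps the scattered alignment still earns a positive score from its five matches; once gaps are penalised it drops below zero, which is the point of the "No gap penalty" slide.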
2 sequence alignment
Huntingtin:
MATLEKLMKA FESLKSFQQQ QQQQQQQQQQ QQQQQQQQQQ PPPPPPPPPP PQLPQPPPQA

hits

>MM16_MOUSE MATRIX METALLOPROTEINASE-16
Score = 34.4 bits (78), Expect = 0.18
Identities = 21/65 (32%), Positives = 25/65 (38%), Gaps = 2/65 (3%):

FQQQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQ--AQPLLPQPQPPPPPP
F Q  +    +      Q  Q+   PP   PPP   LP  PP     P  P+    P PP
FYQYMETDNFKLPNDDLQGIQKIYGPPDKIPPPTRPLPTVPPHRSVPPADPRRHDRPKPP

But not because it is involved in microtubule mediated transport!

E - values
• An E-value estimates how many hits with a score at least this good would be expected to occur by chance; for small values it approximates a probability.
• It depends on the size of the query sequence and of the database.
• The lower the E-value, the more confidence you can have that a hit is a true homologue (a sequence related by common descent).

Protein sequence analysis

Protein Sequence Analysis
1. Physico-chemical properties.
2. Cellular localization.
3. Signal peptides.
4. Transmembrane domains.
5. Post-translational modifications.
6. Motifs & domains.
7. Secondary structure.
8. Other resources.

ExPASy (Expert Protein Analysis System)
• Swiss Institute of Bioinformatics (SIB).
• Dedicated to the analysis of protein sequences and structures.
• Many of the programs for protein sequence analysis can be accessed via ExPASy.

1) Physico-chemical properties:
• ProtParam tool
o molecular weight
o theoretical pI (the pH at which the protein carries no net electrical charge)
o amino acid composition
o atomic composition
o extinction coefficient
o estimated half-life
o instability index
o aliphatic index
o grand average of hydropathicity (GRAVY)

2) Cellular localization:
• Proteins destined for particular subcellular localizations have distinct amino acid properties, particularly in their N-terminal regions.
• These properties are used to predict whether a protein is localized in the cytoplasm, nucleus or mitochondria, is retained in the ER, or is destined for the lysosome (vacuole) or the peroxisome.
• PSORT
• The end of the output gives the percentage likelihood of each subcellular localization.
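Two of the physico-chemical properties listed under 1) are simple enough to compute by hand: amino acid composition and GRAVY, which is the mean Kyte-Doolittle hydropathy value over all residues. The sketch below is an illustration only and does not call ExPASy or reproduce ProtParam's code; the function names `composition` and `gravy` are hypothetical, and non-standard residue letters are not handled.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Return the percentage of each residue present in the sequence."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}


def gravy(seq):
    """Grand average of hydropathicity: mean Kyte-Doolittle value per residue."""
    seq = seq.upper()
    return sum(KD[aa] for aa in seq) / len(seq)


# A polyglutamine stretch is strongly hydrophilic on this scale.
print(composition("QQPP"))
print(gravy("QQQQ"), gravy("PEPTIDE"))
```

A negative GRAVY indicates a hydrophilic sequence on average and a positive one a hydrophobic sequence.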
3) Signal peptides:
• Proteins destined for secretion, for operation within the endoplasmic reticulum or lysosomes, and many transmembrane proteins are synthesized with leading (N-terminal) signal peptides of 13 – 36 residues.
• The SignalP WWW server can be used to predict the presence and location of signal peptide cleavage sites in your proteins.
• It is useful to know whether your protein has a signal peptide, as it indicates that the protein may be secreted from the cell.
• Proteins in their active form will have their signal peptides removed.

4) Transmembrane domains:
• The TMpred program makes a prediction of membrane-spanning regions and their orientation.
• The algorithm is based on the statistical analysis of TMbase, a database of naturally occurring transmembrane proteins.
• The presence of transmembrane domains is an indication that the protein is located on the cell surface.

5) Post-translational modifications:
• After translation has occurred, proteins may undergo a number of post-translational modifications.
• These can include cleavage of the pro-region to release the active protein, removal of the signal peptide, and numerous covalent modifications such as acetylations, glycosylations, hydroxylations, methylations and phosphorylations.
• Post-translational modifications may alter the molecular weight of your protein and thus its position on a gel.
• Many programs are available for predicting the presence of post-translational modifications; we will take a look at one for the prediction of O-glycosylation sites in mammalian proteins.
• These programs work by looking for consensus sites; finding a site does not mean that a modification definitely occurs.

6) Motifs and Domains:
• Motifs and domains give you information on the function of your protein.
• Search the protein against one of the motif or profile databases.
• ProfileScan, which allows you to search both the Prosite and Pfam databases simultaneously

7) Secondary Structure Prediction:
• WHY:
– If protein structure, even secondary structure, can be accurately predicted from the now abundantly available gene and protein sequences, such sequences become immensely more valuable for the understanding of drug design, the genetic basis of disease, the role of protein structure in its enzymatic, structural, and signal transduction functions, and basic physiology from molecular to cellular, to fully systemic levels.
• JPRED - works by combining a number of modern, high quality prediction methods to form a consensus.

Secondary Structure Prediction
• Essentially, protein secondary structure consists of 3 major conformations:
– α helix.
– β pleated sheet.
– coil conformation.

Accessing Completed Genomes

Accessing Completed Genomes
1. GeneDB
2. TigrDB
3. TIGR Gene Indices
4. Ensembl
5. NCBI Genomic Biology
6. Accessing other genomes

GeneDB
http://www.genedb.org
• Multi-organism BLAST
• Datasets from Fungi, Bacteria, Protozoa, Parasite Vectors
• Curated Genome Database for three major organisms:
– Schizosaccharomyces pombe,
– Leishmania major and
– Trypanosoma brucei

TigrDB
http://www.tigr.org/tdb/parasites/
• Access to parasites sequenced at TIGR
• Genome Annotation Database
• Relevant to WHO/TDR Pathogens:
– Trypanosomes, Schistosoma mansoni, Brugia malayi

TIGR Gene Indices
• http://www.tigr.org/tdb/tgi/
• Organism specific databases providing EST and gene sequence transcripts.
• Gene indexes available for:
– Animals
– Plants
– Protists and
– Fungi.

Ensembl
• Ensembl is a joint project between EMBL EBI and the Sanger Institute to develop a software system which produces and maintains automatic annotation on eukaryotic genomes.

NCBI Genomic Biology
http://www.ncbi.nlm.nih.gov
• Literature
• Nucleotide Sequence
• Protein Sequence
• Complete Genomes
• Genome Maps
• … etc

Homology Searching

What is BLAST?
BLAST® (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. "Local" means that it searches and aligns sequence segments rather than aligning entire sequences. It is able to detect relationships among sequences which share only isolated regions of similarity. Currently, it is the most popular and most accepted sequence analysis tool.

Why BLAST?
• Identify unknown sequences - The best way to identify an unknown sequence is to see if that sequence already exists in a public database. If the database sequence is a well-characterized sequence, then you may have access to a wealth of biological information.
• Help gene/protein function and structure prediction – genes with similar sequences tend to share similar functions or structure.
• Identify protein family – group related (paralog or ortholog) genes and their proteins into a family.
• Prepare sequences for multiple alignments
• And more …

Blast 1

Blast 2

Low Complexity masking

>GDB1_WHEAT
MKTFLVFALIAVVATSAIAQMETSCISGLERPWQQQPLPPQQSFSQQPPFSQQQQQPLPQ
QPSFSQQQPPFSQQQPILSQQPPFSQQQQPVLPQQSPFSQQQQLVLPPQQQQQQLVQQQI
PIVQPSVLQQLNPCKVFLQQQCSPVAMPQRLARSQMWQQSSCHVMQQQCCQQLQQIPEQS
RYEAIRAIIYSIILQEQQQGFVQPQQQQPQQSGQGVSQSQQQSQQQLGQCSFQQPQQQLG
QQPQQQQQQQVLQGTFLQPHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTG
VGAY

>GDB1_WHEAT SEG filtered
MKTFLVFALIAVVATSAIAQMETSCISGLERPWXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXLNPCKVFLQQQCSPVAMPQRLARSQMWXXXXXXXXXXXXXXXXXXXXXXX
RYEAIRAIIYSIIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTG
VGAY

Blast limit by taxon

Blast results

Interpret BLAST results - Distribution
[Figure labels: Query sequence; BLAST hits (click to access the pairwise alignment)]
This image shows the distribution of BLAST hits on the query sequence. Each line represents a hit. The span of a line represents the region where similarity is detected. Different colors represent different ranges of scores.

Interpret BLAST results - Description
The description (also called definition) lines are listed under the heading "Sequences producing significant alignments". The term "significant" simply refers to all those hits whose E value was less than the threshold. It does not imply biological significance.
[Figure labels:]
• ID (GI #, refseq #, DB-specific ID #): click to access the record in GenBank
• Gene/sequence Definition
• Bit score: higher, better. Click to access the pairwise alignment
• Expect value: lower, better. It tells the possibility that this is a random hit
• Links

Interpret BLAST results – pairwise alignments
• Query line: the segment from the query sequence.
• Subj line: the segment from the hit (subject) sequence.
• Middle line: the consensus bases.

Summary - If your sequence is NUCLEOTIDE
Length | DB | Purpose | Program
20 bp or longer | Nucl | Identify the query sequence | MegaBlast, blastn
20 bp or longer | Nucl | Find sequences similar to query sequence | blastn
20 bp or longer | Nucl | Find similar proteins to translated query in a translated database | tblastx
20 bp or longer | Prot | Find similar proteins to translated query in a protein database | blastx
7-20 bp | Nucl | Find primer binding sites or map short contiguous motifs | Search for short, nearly exact matches

Summary - If your sequence is PROTEIN
Length | DB | Purpose | Program
15 residue or longer | Prot | Identify the query sequence or find protein sequences similar to query | blastp
15 residue or longer | Prot | Find members of a protein family or build a custom position-specific score matrix | PSI-blast
15 residue or longer | Prot | Find proteins similar to the query around a given pattern | PHI-blast
15 residue or longer | Nucl | Find similar proteins in a translated nucleotide database | tblastn
5-15 residue | Prot | Search for peptide motifs | Search for short, nearly exact matches

Multiple Sequence Alignments

Why Do MSAs?
• Although BLAST may give you a good E-value, an MSA is more convincing evidence that proteins are related and can be aligned over their entire length.
• Identification of conserved regions or domains in proteins.
– Regions that are evolutionarily conserved are likely to be important for structure/function.
– Mutations in these areas are more likely to affect function.
• Identification of conserved residues in proteins.
• Prerequisite for building phylogenetic trees.

Identification of conserved domains

How MSAs are computed

T-Coffee Vs Clustal
• ClustalW is the standard program for MSAs.
• However, the newer program T-Coffee often does a better job, particularly with more distantly related proteins.

Comparative Genomics

The Big Picture: Why compare?
• Conservation over long evolutionary distances suggests functional constraints;
– useful for discovery of genes and other functional elements
– lack of conservation over short distances may be indicative of adaptive evolution
• Characterizing the differences between organisms reveals insights into the mechanisms of change.
• Leveraging knowledge between species, e.g. from well-characterized model systems to species of strategic or economic interest.
• Correlating intraspecies genotypic and phenotypic variation.

Matching Apples and Oranges: Similarity/Homology
• Regardless of what the unit is, all comparisons require some objective metric for defining how to match; e.g. in silico hybridization protocol, substitution matrix, similarity scores
• According to the selected definition, similarity is an observed/computed fact
• Homology is an inference about common ancestry, usually based on similarity and some underlying model of evolution
• Convergent evolution can result in similarity without homology
• Mutational saturation obscures homology over time, especially in “neutral” areas

Matching Lemons and Limes: Shades of homology
• Orthology: denotes descent from common precursor via “speciation” event; the basic “copy” operation with divergence
• Paralogy: denotes descent from common precursor via intraspecies duplication event; single element, segmental, whole genome
• Horizontal transfer: denotes descent from common precursor via interspecies transfer
• Gene fusion, Gene loss, Exon skipping/shuffling…

Artemis
• Artemis is a DNA viewer program.
• View EMBL and GenBank style files.
• View Prokaryotic and Eukaryotic annotations.
• Display genome features on a six-frame translation.
• Is the main annotation tool used for analysis of microbial genomes at the Sanger Institute.

ARTEMIS - example

ACT (Artemis Comparison Tool)
• A DNA sequence comparison viewer.
• Based on Artemis.
• Visualise multiple genome comparisons.
• The comparison shown in ACT is usually the result of running a blastn or tblastx search.
• Retains all functionality of Artemis

The ACT Display
[Screenshot labels: genome1; Zoom scroll bar; Filter scroll bar; genome2; Genome2; Blast HSPs; genome3]

ACT
• Designed for looking at complete bacterial genomes.
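The Artemis slide above mentions displaying features on a six-frame translation. As a rough illustration of what that means (this is not Artemis code), the sketch below translates a DNA string in the three forward frames and the three frames of its reverse complement using the standard genetic code. The function names are hypothetical, and unknown or ambiguous codons are rendered as "X" by assumption.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # "X" for codons with non-ACGT bases
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


# The example gene from the introductory slide.
for name, protein in six_frame("CCTGAGCCAACTATTGATGAA").items():
    print(name, protein)
```

Only one of the six frames of the introductory example spells PEPTIDE, which is why a viewer shows all six side by side when the coding frame is not yet known.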
ACT - example
• Trypanosoma brucei chromosome 1 versus Trypanosoma cruzi chromosome 3

Running ACT
[Flow chart labels, in slide order: Sequence 1, Sequence 2 -> BLASTn / tBLASTx -> MSPcrunch -> Reformat]

MSPcrunch output
62 49.00 22 168 py.synt.contigs.00000001 3928 4074 chr_cm none
140 64.00 232 495 py.synt.contigs.00000001 4141 4404 chr_cm none
73 62.00 793 936 py.synt.contigs.00000001 4642 4785 chr_cm none
92 58.00 3498 3752 py.synt.contigs.00000001 7271 7525 chr_cm none
59 58.00 3724 3873 py.synt.contigs.00000001 7497 7646 chr_cm none
56 57.00 4333 4458 py.synt.contigs.00000001 8074 8199 chr_cm none
165 62.00 4825 5199 py.synt.contigs.00000001 8728 9102 chr_cm none
79 79.00 5239 5367 py.synt.contigs.00000001 9142 9270 chr_cm none
95 67.00 5103 5354 py.synt.contigs.00000001 9006 9257 chr_cm none
135 55.00 7486 7770 py.synt.contigs.00000001 12766 13050 chr_cm none
87 65.00 8014 8175 py.synt.contigs.00000001 13345 13506 chr_cm none
56 46.00 8698 8835 py.synt.contigs.00000001 14149 14286 chr_cm none
89 51.00 8812 9012 py.synt.contigs.00000001 14266 14466 chr_cm none
52 67.00 11117 11215 py.synt.contigs.00000001 16541 16639 chr_cm none
51 73.00 12611 12709 py.synt.contigs.00000001 18019 18117 chr_cm none
54 53.00 12622 12750 py.synt.contigs.00000001 18030 18158 chr_cm none
90 72.00 14131 14304 py.synt.contigs.00000001 19094 19267 chr_cm none
231 72.00 14374 14829 py.synt.contigs.00000001 19337 19792 chr_cm none
93 67.00 14803 14994 py.synt.contigs.00000001 19769 19960 chr_cm none
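A comparison listing like the one above is easy to read into a program. The sketch below assumes, from the layout of the numbers on the slide, that each row holds: score, percent identity, start and end on the first sequence, first sequence name, start and end on the second sequence, second sequence name, and a trailing description. That column interpretation, the `Hit` class and the `parse_hits` function are all illustrative assumptions rather than a documented API.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    score: int
    percent_id: float
    start1: int
    end1: int
    name1: str
    start2: int
    end2: int
    name2: str


def parse_hits(text):
    """Parse whitespace-separated comparison rows; rows with too few fields are skipped."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # blank or malformed row
        hits.append(Hit(
            int(fields[0]), float(fields[1]),
            int(fields[2]), int(fields[3]), fields[4],
            int(fields[5]), int(fields[6]), fields[7],
        ))
    return hits


# Two rows copied from the slide.
EXAMPLE = """\
62 49.00 22 168 py.synt.contigs.00000001 3928 4074 chr_cm none
231 72.00 14374 14829 py.synt.contigs.00000001 19337 19792 chr_cm none
"""

hits = parse_hits(EXAMPLE)
best = max(hits, key=lambda h: h.score)
print(best.name1, best.start1, best.end1, "matches", best.name2, best.start2, best.end2)
```

Once parsed, the rows can be filtered by score or percent identity, which mirrors what the filter scroll bar does in the ACT display.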
