Investigating Sequences - BioQUEST Curriculum Consortium
Investigating Sequences
Stephen Everse
University of Vermont

Biochemical similarities among all organisms
GENOTYPE (e.g. Aa)
• Genetic information encoded in nucleic acids
• Protein synthesis by ribosomes using common genetic code*
• Many common families of genes and proteins (rRNA, enzymes, proteins for transport, replication and expression of DNA)
• All modern day cells descended from a common ancestor
• Evolutionary relationships revealed by gene sequences
PHENOTYPE (pink flower)

The tree of life
Phylogenetic relationships among organisms determined by ribosomal RNA sequences
Red lines indicate pathogens
First cells are thought to have existed as early as 3.8 billion years ago. They were probably prokaryotes. The oldest eukaryotic cell fossils are about 1.8 billion years old.
J. Burke 2005

What is Bioinformatics?
• (Molecular) Bio - informatics
• One idea for a definition? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules on a large scale.
• Bioinformatics is “MIS” for Molecular Biology Information. It is a practical discipline with many applications.
[Diagram labels: Math & Stats, Bio, Comp Sci, Bioinformatics]
Alexandrov and Gerstein © 2000

Areas of current and future development of bioinformatics
• Molecular biology and genetics
• Phylogenetic and evolutionary sciences
• Different aspects of biotechnology including pharmaceutical and microbiological industries
• Medicine
• Agriculture
• Eco-management

Bioinformatics key areas
• e.g. homology searches
• organisation of knowledge (sequences, structures, functional data)
M. Nilges © 2003

Why do we want to compare sequences?
• Relationships
  – Phylogenetic trees can be constructed based on comparison of the sequences of a molecule (example: 16S rRNA) taken from different species
  – Residues conserved during evolution play an important role
• Prediction of protein structure and function
  – Proteins which are very similar in sequence generally have similar 3D structure and function as well
  – By searching a sequence of unknown structure against a database of known proteins, the structure and/or function can in many cases be predicted
Center for Biological Sequence Analysis © 2001

Aligning Text Strings
Marc Gerstein © 1999

Mol Bio Information - Protein
Marc Gerstein © 1999

Summary
• Central dogma of biology generates material appropriate for bioinformatical study (DNA, RNA, proteins, phenotype, etc.)
• One form of bioinformatics is the comparison of sequences
• BUT, how do we bring this to the classroom?

Creating Inquiry Opportunities
[Diagram labels: Domain Principles, Analysis Tools, Data Sets]

Establishing a Problem Space
[Diagram labels: Domain Principles, Problem Space, Analysis Tools, Data Sets]
Creating problem spaces that provide a rich context for using bioinformatics data and tools allows students to focus on using their understanding of biology to investigate meaningful questions.

Problem Spaces
• Foundation
  – Introduction
  – Background
  – Data
  – Tools
  – Bibliography
  – Curricular Resources
• Starting Points

Malaria
Malaria is caused by one of four species of Plasmodium (falciparum, vivax, malariae and ovale). Of these, P. falciparum is the most lethal: it is estimated to cause 200 million clinical cases and 1-3 million deaths (including many children) every year worldwide.
Lifecycle

The Plasmodium falciparum Genome - A Consortium Project
(chromosomes 1, 3-9, 13)
(chromosomes 2, 10, 11 and 14)
(chromosome 12)
Plasmodium genomics special issue, Nature, 3rd October 2002

Plasmodium falciparum Genome Project Curation
To maximise the benefits to the scientific community of Plasmodium genome sequencing, the Pathogen Genomics group is committed to the curation of Plasmodium spp. This will ensure that annotation is updated and maintained, and will form a framework that underpins global efforts to understand the parasite and the disease it causes. If you would like to contribute to the curation of any gene(s) please contact the curator [email protected] and visit GeneDB.
http://www.sanger.ac.uk/Projects/P_falciparum/
See how Brad Goodner at Hiram College involves his students in curation: http://www.hiram.edu/biology/faculty/goodner.html

Drug search …
International computing grid searches for malaria drugs
2/2/07. Using an international computing grid spanning 27 countries, scientists on the WISDOM project analysed an average of 80 000 possible drug compounds against malaria every hour. In total, the challenge processed over 140 million compounds, with a UK physics grid providing nearly half of the computing hours used.
http://malaria.wellcome.ac.uk/doc_WTX037265.html
Enabling Grids for E-sciencE (EGEE) is the largest multidisciplinary grid infrastructure in the world, which brings together more than 120 organizations to produce a reliable and scalable computing resource available to the global research community. At present, it consists of 250 sites in 48 countries and more than 68,000 CPUs available to some 8,000 users 24 hours a day, 7 days a week.
http://www.eu-egee.org/

The data
Genetic structure of Plasmodium falciparum field isolates in eastern and north-eastern India
H. Joshi, N. Valecha, A. Verma, A. Kaul, P.K. Mallick, S. Shalini, S.K. Prajapati, S.K. Sharma, V. Dev, S. Biswas, N. Nanda, M.S. Malhotra, S.K. Subbarao & A.P. Dash.
Malaria Journal 2007 Vol 6 Page 60
http://www.malariajournal.com/content/6/1/60

The study…
• Isolates were collected from microscopically diagnosed P. falciparum positive subjects in three Indian states with varied malaria epidemiology;
• Merozoite surface protein-1 (MSP-1, 17 kDa) & protein-2 (MSP-2, 46-53 kDa) of P. falciparum are targets of the host's humoral immunity and malaria vaccine candidates; and
• 131 P. falciparum isolates of msp-1 (block 2) and msp-2 (central repeat region, block 3) were obtained as well as others from GenBank.

msp-1 blocks
Hoffmann et al. (2003) Malaria J. 2:24

Block 2 of msp-1
Nucleotide Sequence
aatgaagaag gtggtgcaag tgcaagtgct agtgctcaaa gtacaagtcc aaatacttca aaattactac tgctcaaagt caaagtggtg gtggtacaag atcatctcgt tctggtgcaa aaaaggtgca ggtgcaagtg caagtgctca tggtccaagt tcaaacactt gccctccagc agtgctcaaa ctcaaagtgg aagtggtgca ggtccaagtg tacctcgttc tgatgcaagc
Amino Acid Sequence
NEEEITTKGASAQSGASAQSGASAQSGASAQSGASAQ
SGASAQSGTSGPSGPSGTSPSSRSNTLPRSNTSSGASPP
ADAS

Our Notation …
Subset of protein & nucleotide sequences available (n):
  – Consortium (1)
  – Indian (9)
  – Community (12)
  – Sudan (1)

Our workspace …
• National Center for Biotechnology Information (http://ncbi.nlm.nih.gov)
• Biology Workbench (http://workbench.sdsc.edu)

Malaria Triad: Genetics & Genomics
This web resource provides data and information relevant to malaria genetics and genomics. These resources include organism-specific sequence BLAST databases (Plasmodium falciparum only, all Plasmodium), genome maps, linkage markers, and information about genetic studies. Links are provided for other malaria web sites and genetic data on related apicomplexan parasites.
http://www.ncbi.nlm.nih.gov/projects/Malaria/

The tools …
• Session Tools ~ file folders
• Protein Tools/Nucleic Tools
  – Find sequences (Ndjinn)
  – Upload sequences (Add)
  – Align sequences (CLUSTALW)
• Alignment Tools
  – Display options (BOXSHADE, DRAWGRAM)

Let’s explore …
> Sequence 1
GAGGTAGTAATTAGATCCGAAA…
> Sequence 2
GAGGTAGTAATTAGATCTGAAA…
> Sequence 3
GAGGTAGTAATTAGATCTGTCA…
• Form groups of ~3/computer
• Look at the data (http://bioquest.org/oakwood_2008/malaria-problem-space)
• Choose a problem/question to explore

Favorite movie of the week …
Inside the Cell
Harvard BioVisions Video
What you are seeing is discussed here

Homology
• Homologous sequences can be divided into two groups
  – orthologous sequences: sequences that differ because they are found in different species (e.g. human α-globin and mouse α-globin)
  – paralogous sequences: sequences that differ because of a gene duplication event (e.g. human α-globin and human β-globin, various versions of both)
M. Craven © 2002

So this means …
Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html

Search algorithms
• Smith-Waterman (1981)
  – Demanding of time and memory resources
• FASTA (Pearson 1995)
  – Speeds up searches by an order of magnitude compared to Smith-Waterman
  – Good statistics
• BLAST (Altschul 1990, 1997)
  – Extremely fast
    • One order of magnitude faster than FASTA
    • Two orders of magnitude faster than Smith-Waterman
  – Almost as sensitive as FASTA

Things to keep in mind when working with alignments
• Pairwise alignment programs always find the optimal alignment of two sequences
  – They do so even if it does not make any sense at all to align the two sequences
  – “Optimal” means optimal according to the substitution matrix and gap penalties you choose, even if you choose the wrong ones
• Generally the underlying assumptions are wrong
  – The frequency of substitution is not the same at all positions
  – Nor are the frequencies of insertions and deletions the same
  – Affine gap penalties do
    not properly model indel events
Center for Biological Sequence Analysis © 2001

How to score the exchange of two amino acids in an alignment?
• Simplest way: the identity matrix
• A very crude model: to use the genetic code matrix, the number of point mutations necessary to transform one codon into the other.
• Other similarity scoring matrices might be constructed from any property of amino acids that can be quantified: partition coefficients between hydrophobic and hydrophilic phases, charge, molecular volume, etc.
• Unfortunately, all these biophysical quantities suffer from the fact that they provide only a partial view of the picture: there is no guarantee that any particular property is a good predictor for conservation of amino acids between related proteins.
Marc Gerstein © 1999

Pairwise alignment of hemoglobin alpha chain and myoglobin
24.7% identity; Global alignment score: 130

HBA_HU  VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKG---
MYG_PH  VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED

HBA_HU  ---HGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF-KLLSHCLLVTLAAHL
MYG_PH  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI-PIKYLEFISEAIIHVLHSRH

HBA_HU  PAEFTPAVHASLDKFLASVSTVLTSKYR------
MYG_PH  PGDFGADAQGAMNKALELFRKDIAAKYKELGYQG

Center for Biological Sequence Analysis © 2001

Important things to remember when using alignment to search databases
• When searching in databases, size does matter!
  – Searching large databases takes a very long time
  – The significance of matches drops when the database is expanded
• Doing things differently can lead to different conclusions
  – Nucleotide comparison vs.
    protein comparison
• Think before and after you search
  – The obvious thing to do is not always the right thing to do
  – Conclusions based on matches should be drawn with greater care
Marc Gerstein © 1999

Why multiple alignment is better
• More sequences contain more information
• Multiple sequence alignment allows us to compare all related proteins simultaneously
• It allows us to identify features that are conserved among the sequences
• Using a multiple sequence alignment (a profile) one can find more related sequences than by simple pairwise comparison
Center for Biological Sequence Analysis © 2001

A multiple sequence alignment of globins

HBB_HUMAN   --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
HBB_HORSE   --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
HBA_HUMAN   ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
HBA_HORSE   ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
MYG_PHYCA   ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
GLB5_PETMA  PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
LGB2_LUPLU  --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

HBB_HUMAN   PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
HBB_HORSE   PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
HBA_HUMAN   ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
HBA_HORSE   ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
MYG_PHYCA   EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
GLB5_PETMA  ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
LGB2_LUPLU  VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

Center for Biological Sequence Analysis © 2001
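
The point that a multiple alignment lets us identify conserved features can be tried directly on the first block of the globin alignment above. The following is a minimal Python sketch, not part of the original slides. It assumes each row is 60 columns wide (the two alpha-globin rows are padded with one trailing gap to reach that width), treats "-" as a gap, and uses hypothetical helper names (column_profile, conserved_columns) chosen only for illustration.

```python
from collections import Counter

# First block of the globin alignment, transcribed from the slide.
# Assumption: every row is 60 columns wide and '-' marks a gap.
ALIGNMENT = {
    "HBB_HUMAN":  "--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST",
    "HBB_HORSE":  "--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN",
    "HBA_HUMAN":  "---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-",
    "HBA_HORSE":  "---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-",
    "MYG_PHYCA":  "---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT",
    "GLB5_PETMA": "PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT",
    "LGB2_LUPLU": "--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE",
}


def column_profile(rows):
    """Return (column index, most common symbol, fraction of rows) per column."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")
    profile = []
    for i in range(width):
        column = [row[i] for row in rows]
        symbol, count = Counter(column).most_common(1)[0]
        profile.append((i, symbol, count / len(rows)))
    return profile


def conserved_columns(rows):
    """Indices of columns where every row carries the same residue (gaps excluded)."""
    return [i for i, symbol, fraction in column_profile(rows)
            if fraction == 1.0 and symbol != "-"]


rows = list(ALIGNMENT.values())
conserved = conserved_columns(rows)
for i in conserved:
    # Report 1-based column numbers, as alignment viewers usually do.
    print(f"column {i + 1}: {rows[0][i]}")
```

With the rows as transcribed, only four columns are identical in all seven globins: columns 11 (L), 23 (W), 46 (P) and 52 (F). Strict identity is the crudest possible measure of conservation; relaxing the `fraction == 1.0` threshold, or scoring columns with a substitution matrix, would be closer to what alignment programs actually report.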