Phylogeny IV
BIO2093 – Sequence Similarity
Darren Soanes

Central Dogma

Open reading frame

Sequence similarity
• Protein sequence determines function.
• Proteins with similar sequences usually have similar functions.
• Sequence similarity may also suggest an evolutionary relationship.
• The function of an unknown protein can be inferred from the similarity of its sequence to those of known proteins.

Protein sequence determines function

Protein databases
• Protein Information Resource (PIR) was the first protein sequence database.
• Proteins organised into families based on degree of sequence similarity.
• PIR-International Protein Sequence database.
• Swiss-Prot: manually annotated protein database, cross-referenced, with literature citations.
• TrEMBL (Translated EMBL Nucleotide Sequence Data Library): automated annotations for those proteins not in Swiss-Prot.
• UniProt: combination of PIR + Swiss-Prot + TrEMBL.
• Most sequences in protein databases are translated from DNA sequences.

DNA sequence databases
• GenBank (1982), European Molecular Biology Laboratory (EMBL) Data Library (1980), DNA Data Bank of Japan (DDBJ) (1984).
• GenBank, EMBL and DDBJ formed the International Nucleotide Sequence Database Collaboration; data are exchanged daily.

Sequence alignment (1)
• Sequence alignment is a way of arranging the primary sequences of DNA, RNA or protein to identify regions of similarity. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that residues with identical or similar characters are aligned in successive columns.

Sequence alignment (2)
• Pairwise alignment – comparing two sequences. Generally a query sequence is compared to every sequence in a database to find the best match.
* = identical amino acid
: = conserved substitution (same chemical property)
. = semi-conserved substitution (similar shape)

Families of amino acids

Sequence alignment (3)
• Global alignment – attempts to match every residue in two sequences; most useful when the sequences are similar and of roughly equal length.
• Local alignment – more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context.

BLAST
• BLAST (Basic Local Alignment Search Tool): a local alignment algorithm designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships.
• Useful in large-scale database searches where most of the candidate sequences will have no significant match with the query sequence.
– Exact matches to short sections of the query sequence are found in the database (3 amino acids in a protein alignment).
– Each match is extended in both directions (ungapped).
– Matches with a score over a certain threshold are subjected to a more sensitive gapped alignment algorithm.

BLAST programs (1)
• blastn – nucleotide query v nucleotide database
• blastp – protein query v protein database
• blastx – nucleotide query (translated) v protein database
• tblastn – protein query v nucleotide database (translated)
• tblastx – nucleotide query (translated) v nucleotide database (translated)
• Low-complexity sequences can be filtered out, which reduces the likelihood of false positives in some situations.

BLAST programs (2)
• Best to compare protein sequences between species – they evolve more slowly than nucleotide sequences.
• Many changes in nucleotide sequence don't change the protein sequence (silent mutations).
• Use blastn when mapping mRNA or gene sequences to genomic DNA from the same organism, or when comparing RNA sequences.

Genetic Code

BLASTP 2.2.13 [Nov-27-2005]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= YJL052W Chr 10
         (332 letters)

Database: magnaporthe_grisea_2.3_proteins_nt.fas
           11,109 sequences; 5,221,248 total letters

Searching..................................................done

                                                                      Score     E
Sequences producing significant alignments:                           (bits)  Value

MG01084.4 hypothetical protein similar to (AL670003) glyceraldeh...    449    e-127

>MG01084.4 hypothetical protein similar to (AL670003) glyceraldehyde 3-phosphate dehydrogenase (ccg-7) [Neurospora crassa] 33022 34253
          Length = 336

 Score = 449 bits (1156), Expect = e-127
 Identities = 215/331 (64%), Positives = 262/331 (79%)

Query: 1   MIRIAINGFGRIGRLVLRLALQRKDIEVVAVNDPFISNDYAAYMVKYDSTHGRYKGTVSH 60
           M++  INGFGRIGR+V R A++  D E+VAVNDPFI   YA YM++YDSTHGR+KGTV
Sbjct: 1   MVKCGINGFGRIGRIVFRNAIEHPDCEIVAVNDPFIEPKYAKYMLEYDSTHGRFKGTVEV 60

Query: 61  DDKHIIIDGVKIATYQERDPANLPWGSLKIDVAVDSTGVFKELDTAQKHIDAGAKKVVIT 120
               ++++G K+  Y ERDPAN+PW     +  V+STGVF   D A  H+  GAKKV+I+
Sbjct: 61  SGSDLVVNGKKVKFYTERDPANIPWSETGAEYVVESTGVFTTTDKASAHLKGGAKKVIIS 120

Query: 121 APSSSAPMFVVGVNHTKYTPDKKIVSNASCTTNCLAPLAKVINDAFGIEEGLMTTVHSMT 180
           APS+ APM+V+GVN   Y     ++SNASCTTNCLAPLAKVIND FGI EGLMTTVHS T
Sbjct: 121 APSADAPMYVMGVNEKSYDGSASVISNASCTTNCLAPLAKVINDKFGIVEGLMTTVHSYT 180

Query: 181 ATQKTVDGPSHKDWRGGRTASGNIIPSSTGAAKAVGKVLPELQGKLTGMAFRVPTVDVSV 240
           ATQKTVDGPS KDWRGGR A+ NIIPSSTGAAKAVGKV+P L GKLTGM+ RVPT +VSV
Sbjct: 181 ATQKTVDGPSAKDWRGGRGAAQNIIPSSTGAAKAVGKVIPALNGKLTGMSMRVPTANVSV 240

Query: 241 VDLTVKLEKEATYDQIKKAVKAAAEGPMKGVLGYTEDAVVSSDFLGDTHASIFDASAGIQ 300
           VDLT +LEK A+Y++IK A+K AA+GP+KG+L YTED VVSSD +G+  +SIFDA AGI
Sbjct: 241 VDLTCRLEKGASYEEIKAAIKEAADGPLKGILEYTEDDVVSSDMIGNNASSIFDAQAGIA 300

Query: 301 LSPKFVKLISWYDNEYGYSARVVDLIEYVAK 331
           L+ KFVKL+SWYDNE+GYS RV+DL+ Y++K
Sbjct: 301 LNDKFVKLVSWYDNEWGYSRRVIDLVTYISK 331

Output values
• Score – value calculated from the matching and similar amino acids in the alignment; the higher the score, the better the match.
• Expect (E-value) – number of alignments with a score at least this high that would be expected to occur by chance in a database of this size; the closer to zero, the more significant the match.
• Identities – number of identical amino acids in alignment.
• Positives – number of identical or similar amino acids in alignment.

Families of amino acids

Protein family
• A protein family is a group of evolutionarily related proteins.
• Members of a protein family have similar three-dimensional structures, functions and sequences.
• Families can include proteins with the same function in different organisms (orthologues).
• Can also include members of multigene families derived from gene duplication and rearrangements (paralogues).

Gene duplication
• Gene duplication due to unequal crossing over during meiosis can create gene families.
• Sequence and function of different members of a gene family can diverge.

Gene duplication

Cytochrome P450s
• A group of enzymes involved in the oxidative metabolism of a large number of natural compounds, as well as drugs, carcinogens and mutagens.
• Contain a haem group.
• Found in animals, plants, fungi and bacteria.

Cytochrome P450

Functions of cytochrome P450
• Detoxification of drugs, carcinogens and toxins.
• Biosynthesis of steroids, fatty acids and bile acids.
• Biosynthesis of toxins.
• Bioconversion of polyaromatic hydrocarbons.
• Alkane assimilation.

Two fungi
Magnaporthe oryzae – rice blast fungus – pathogen (invades living plant)
Neurospora crassa – red bread mould – saprophyte (lives on dead organic matter)

Number of cytochrome P450s
• M. oryzae – 122
• N. crassa – 37
• Cytochrome P450s are important for pathogens.
• Needed to detoxify anti-fungal chemicals produced by the plant and to synthesise toxins that help M. oryzae invade the host plant.

Cytochrome P450s
• Cytochrome P450s are classified into families based on sequence homology.
• Amino acid sequence is not well conserved between cytochrome P450 families.
• The 3D structures of members of different cytochrome P450 families are similar.

Cytochrome P450 structure

Pfam
• Pfam is a protein family database based on hidden Markov models (HMMs).
• http://pfam.xfam.org/
• An HMM is a statistical model that considers all possible combinations of matches, mismatches and gaps to generate an alignment of a set of sequences.
• Used to represent protein families at Pfam.

Domains
• A segment of a polypeptide chain that can fold into a three-dimensional structure irrespective of the presence of other segments of the chain.
• Different domains in the same protein may have specific functions.
• Example – myosin family, a family of ATP-dependent motor proteins involved in muscle contraction and motility.

Myosin V (involved in actin-dependent transport of vesicles)
Head domain (motor) – binds actin, nucleotide-binding
IQ – calmodulin-binding motif (calcium sensing)
Coiled-coil – dimerisation
Globular domain – binding of myosin to vesicles

Summary
• Protein sequence determines function.
• BLAST can be used to search for protein / DNA sequences that are similar.
• Proteins can be grouped into families based on sequence / phylogeny.
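
Worked example: local alignment
The local alignment described in these slides can be illustrated with a short Python sketch of the Smith-Waterman dynamic-programming method, which finds the best-scoring region shared by two sequences. This is only an illustration under simplifying assumptions, not how BLAST itself is implemented: BLAST uses a seed-and-extend heuristic and a substitution matrix, whereas the flat scores here (match +2, mismatch -1, gap -2) and the function name `smith_waterman` are made up for this example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment.

    The scoring values are illustrative assumptions; real protein searches
    normally use a substitution matrix instead of flat scores.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")  # gap in b
            i -= 1
        else:
            out_a.append("-")  # gap in a
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Two made-up sequences that share only the motif NGFGRIGR
print(smith_waterman("AAAANGFGRIGRCCCC", "TTNGFGRIGRWW"))
```

Because cell scores are never allowed to fall below zero, the unrelated flanking residues are ignored and only the shared motif is reported, which is the behaviour that makes local alignment suitable for dissimilar sequences containing a conserved region.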