Day 6 Carlow Bioinformatics
School B&I TCD Bioinformatics
Proteins: structure, function, databases, formats

Wot’s a protein, then? Hierarchical
• A collection of amino acids (0-D)
  – AACompIdent can identify a protein from AA%s
• A sequence (string) of AAs (1-D)
• Secondary structural elements: α-helix etc. (2-D)
• Domains – (independent) functional units
• Whole protein (from single CDS) (3-D)
• Quaternary structure: dimers, ribosomes
• Interactome, pathways

Protein functions

Amino acid properties again … and again and again

Amino acid groups
• KR (Lys Arg) NH3+ basic
• DE (Asp Glu) COO- acidic
• WYF (Trp Tyr Phe) large aromatic
• GP (Gly, Pro) α-helix-breaking
• C (Cys) disulphide –S–S– bridges
  – C is not always in a disulphide bridge
• etc.

Secondary structure
• α-helix (no Pro, Gly)
  – Easy, like exon prediction
  – 3.6 residues per turn
  – Leucine zipper …LXXXXXXLXXXXXXL…
  – Amphipathic helix (charged on one side)
  – Transmembrane (α-helix, hydrophobic, ~21 AA long)
• β-sheet – 2-dimensional zigzag
• Coil, random
• Turn (kink)

Patterns to recognise (more reliable in MSA than in single seq)
MSA improves secondary structure (α-helix, β-sheet) prediction by >6%
• Alternate hydrophobic residues
  – Surface β-sheet (zig-zag-zig-zag)
• Runs of hydrophobic residues
  – Interior/buried β-sheet
• Residues with about 3.5 AA spacing (amphipathic)
  – α-helix WNNWFNNFNNWNNNF
• Gaps/indels
  – Probably surface, not core

Conserved residues
• W, F, Y large hydrophobic, internal/core
  – conserved WFY is the best signal for domains
• G, P turns, can mark the end of an α-helix or β-sheet
• C conserved with reliable spacing indicates C–C disulphide bridges, e.g. defensins
• H, S often catalytic sites in proteases (and other enzymes)
• K, R, D, E charged: ligand binding or salt bridge
• L very common AA but not conserved
  – except in leucine zipper L234567L234567L234567L

Basic information
How big is my protein? Where are the β-sheets? Is there a signal peptide? Is there a trypsin cleavage site?
• ProtParam tool (MWt etc.)
• TMpred, TMHMM: transmembrane helix, inside/outside, external loops
• JPred for 2-D structure
• See practical manual for examples

Tertiary structure
• The holy grail of bioinformatics
• Difficult, like gene prediction
• 3-D orientation of known α, β elements
• Proteins made of functional units, “domains”
  – Tried and tested module
  – Domain shuffling and exon boundaries
• Bioinf tries to make predictive calls on aspects of the 3-D structure
• Q. Why is 3-D important? A. Structure = function

What binf can do about 3-D
• Expressed/exported proteins have a signal peptide
• Hydropathy plot, antigenicity index, amphipathicity give a handle on surface probability
• But homology to a known 3-D structure (X-ray, NMR) is the best predictor – threading
• Plan to X-ray all “folds” in human genome

recA SwissProt/UniProt
Some of the 194 lines of info in a SwissProt entry

ID   RECA_ECOLI STANDARD; PRT; 352 AA.
AC   P0A7G6; P03017; P26347; P78213;
RX   MEDLINE=92114994; PubMed=1731246;
RA   Story R.M., Weber I.T., Steitz T.A.;
RT   "The structure of the E. coli recA protein";
RL   Nature 355:318-325(1992).
DR   EMBL; V00328; CAA23618.1; -; Genomic_DNA.
DR   PDB; 2REB; X-ray; @=-.
DR   PRINTS; PR00142; RECA.
DR   ProDom; PD000229; RecA; 1.
DR   SMART; SM00382; AAA; 1.
DR   TIGRFAMs; TIGR02012; tigrfam_recA; 1.
DR   PROSITE; PS00321; RECA_1; 1.
FT   HELIX    72   85
FT   TURN     86   87
FT   STRAND   90   94
FT   HELIX   101  106

UniProt is the key hub of bioinformatics databases

Homology?

LVMFWSIVGE  Known1
L   W   GE
LIVYWTVIGE  Unknown  40% ID

ILVFYTVVGD  Known2
  V  TV G
LIVYWTVIGE  Unknown  40% ID

Is Unknown part of the same family? Or is this just a 4/10 coincidence?

RegEx

LVMFWSIVGE  Known1
ILVFYTVVGD  Known2
RegEx [MILV](3)-[FYW](2)-[STA]-[MILV]-V-G-[DE]
LIVYWTVIGE  Unknown
* ***** **

More convincing that it is the same family? How would you modify the RegEx to include the 3rd sequence?

Family Databases
Three methods

Prosite
• Groups families by conserved motif.
Which is
• Present in all family members
• Absent in all other proteins
• No/few false positives (selectivity)
• All true positives (sensitivity)
• Motif defined with a regular expression

What Prosite looks like (cf. SwissProt)

ID   RECA_1; PATTERN.
AC   PS00321;
DT   APR-1990 (CREATED); NOV-1997
DE   recA signature.
PA   A-L-[KR]-[IF]-[FY]-[STA]-[STAD]-[LIVMQ]-R.
NR   /RELEASE=49.0,207132;
NR   /TOTAL=281(281); /POSITIVE=279(279); /UNKNOWN=0(0); /FALSE_POS=2(2);
NR   /FALSE_NEG=11; /PARTIAL=10;
DR   Q01840,RECA1_LACLA,T; P48291,RECA1_MYXXA,T;
DR   P48292,RECA2_MYXXA,T; Q9ZUP2,RECA3_ARATH,T;
     Etc for 70 lines
DR   Q7UJJ0,RECA_RHOBA ,N; Q9EVV7,RECA_STRTR ,N;     (False negatives)
DR   Q4X0X6,EXO70_ASPFU,F; Q5AZS0,EXO70_EMENI,F;     (False positives)
3D   2REB; 2REC;                                     (PDB structures)
DO   PDOC00131;                                      (Documentation)

Prosite problems
• RegEx now breaking down as the number of known recAs increases, so it no longer defines the protein
• Database now huge, so the probability of finding any short motif by chance is high
  – Many copies of ELVIS hiding in UniProt
• May be more than 1 motif defining a family
• A great first attempt and still useful, but too crude

Prints
• A database of multiple domains/motifs
• Multiple motifs abstracted to database
• Stored as probability matrix
• If two proteins have the same motifs in the same order they are likely to be homologous
• More biological/real/sensitive than Prosite

ProDom
• A French DB
• All-against-all search of the nr protein DB
• Includes domains with no known function
  – cf. synteny of non-coding regions
• Great for determining the domain structure of a particular protein

Pfam
• Moves up from the short, highly conserved, easily aligned bits of a protein family
• Uses PSSM (position-specific scoring matrix)
• … on complete aligned family members

PSSM
• Multiple sequence alignment:

1234567890
NSGTIVFLWP
DSGTAIFLKP
ESGTIIFLHN
DSDTVRSLKP

Posn1   50%  D,N,E
Posn2  100%  S
Posn3   75%  G,D
Posn4  100%  T
Posn5   50%  I,A,V
Posn6   50%  I,V,R
Posn7   75%  F,S
Posn8  100%  L
Posn9   50%  K,H,W
Posn0   75%  P,N

Domain take home
• Run your protein against
  – InterProScan
  – CD server at NCBI
  – Pfscan
• Likely that the crucial bit of info is only in one of the above
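
The per-position percentages on the PSSM slide can be reproduced with a few lines of Python. This is only a minimal sketch of the counting step, using the four aligned sequences from the slide. The function name column_profile is made up for illustration, and a real PSSM (as used by Pfam-style tools) would normally go further and turn such frequencies into log-odds scores with pseudocounts, which is not attempted here.

```python
from collections import Counter

# The four aligned sequences from the PSSM slide.
alignment = ["NSGTIVFLWP", "DSGTAIFLKP", "ESGTIIFLHN", "DSDTVRSLKP"]


def column_profile(seqs):
    """Hypothetical helper: for each alignment column, return a list of
    (residue, frequency) pairs, most frequent first (ties in A-Z order)."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must be aligned to the same length")
    profile = []
    for column in zip(*seqs):  # walk the alignment column by column
        counts = Counter(column)
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        profile.append([(res, n / len(column)) for res, n in ranked])
    return profile


profile = column_profile(alignment)
for i, col in enumerate(profile, start=1):
    top_freq = col[0][1]
    residues = ",".join(res for res, _ in col)
    # i % 10 labels the tenth column "Posn0", as on the slide.
    print(f"Posn{i % 10} {top_freq:.0%} {residues}")
```

The percentage printed for each position is the frequency of the commonest residue, followed by every residue seen in that column. Tied residues are listed alphabetically, so their order may differ from the slide.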