Functional Annotation
Episode 2: Preliminary Results

The Group
27th Feb 2012
Lavanya Rishishwar, Artika Nath, Lu Wang, Haozheng Tian, Shengyun Peng, Ashwath Kumar, Hamidreza Hassanzadeh

Recap
• What is Functional Annotation
• The Importance of Functional Annotation
• The Biology of H. haemolyticus
• Background for Functional Annotation
• Pros/Cons of Available Approaches
• Planned Approach
  – Breadth
  – Depth

Flowchart
[Figure: flowchart (two slides)]

PRELIMINARY RESULTS

Subject Organisms

Strain   Species          Disease State  State Isolated  Hemolysis  Hpd  Fuculose kinase
M19107   H. haemolyticus  Asymptomatic   Minnesota       Y          -    -
M19501   H. haemolyticus  Asymptomatic   Minnesota       N          +    -
M21127   H. haemolyticus  Pathogenic     Georgia         Y          -    -
M21621   H. haemolyticus  Pathogenic     Texas           Y          -    -
M21639   H. haemolyticus  Pathogenic     Illinois        N          -    -
M21709   H. influenzae    Pathogenic     NY              N          -    +

fucK: encoding fuculose kinase. fucK deletion has been observed in some Hi isolates.
Hpd: encoding a lipoprotein, protein D.

BLAST: Output and Parsing
• Once results are received from the gene prediction tools, they are BLASTed against different databases.
• The selected threshold: 0.005
• This is done automatically by ad hoc scripts using the BioPerl library, for all 6 strains.
• The results are then processed and certain metrics are extracted for further analysis.

BLAST v/s UniProt: Coverage

Organism  # of unique organisms in the hits
M19107    2338
M19501    2332
M21127    2360
M21621    2364
M21639    2433
M21709    2154

BLAST v/s UniProt: M19107
[Figure: chart of BLAST hits by genus for M19107. Labels: Pasteurella, Ralstonia, Lactobacillus, Rickettsia, Brucella, Mus, Coxiella, Legionella, Homo, Arabidopsis, Klebsiella, Actinobacillus, Xylella, Erwinia, Rhizobium, Acinetobacter, Bordetella, Francisella, Clostridium, Mycobacterium, Buchnera, Neisseria, Xanthomonas, Others, Streptococcus, Shigella, Haemophilus, Bacillus, Vibrio, Staphylococcus, Escherichia, Burkholderia, Salmonella, Yersinia, Shewanella, Pseudomonas]
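
The threshold filtering and the "unique organisms in the hits" count from the BLAST slides can be illustrated with a minimal Python sketch. It is not the group's BioPerl script. It assumes the 0.005 threshold is an E-value cutoff, and the record layout (`query`, `organism`, `evalue`), the gene IDs and the example hits are all hypothetical.

```python
from collections import Counter

# Assumption: the slide's "selected threshold: 0.005" is an E-value cutoff.
E_VALUE_THRESHOLD = 0.005


def filter_hits(hits, threshold=E_VALUE_THRESHOLD):
    """Keep only hits whose E-value is at or below the threshold."""
    return [h for h in hits if h["evalue"] <= threshold]


def unique_organisms(hits):
    """Set of distinct organism names among the hits."""
    return {h["organism"] for h in hits}


def genus_counts(hits):
    """Count hits per genus, taking the first word of the organism name."""
    return Counter(h["organism"].split()[0] for h in hits)


# Hypothetical, already-parsed hits (not real results from the project).
example_hits = [
    {"query": "M19107_final_0001", "organism": "Haemophilus influenzae", "evalue": 1e-50},
    {"query": "M19107_final_0001", "organism": "Pasteurella multocida", "evalue": 3e-20},
    {"query": "M19107_final_0002", "organism": "Haemophilus influenzae", "evalue": 2e-4},
    {"query": "M19107_final_0002", "organism": "Escherichia coli", "evalue": 0.2},
]

kept = filter_hits(example_hits)
print(len(kept), "hits kept;", len(unique_organisms(kept)), "unique organisms")
print(genus_counts(kept).most_common())
```

Counting by genus, as in `genus_counts`, is one plausible way to obtain the per-genus breakdown shown in the charts; the slides do not say how the charts were produced.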
BLAST v/s UniProt: M21709
[Figure: chart of BLAST hits by genus for M21709. Labels: Listeria, Homo, Coxiella, Legionella, Xylella, Arabidopsis, Rickettsia, Erwinia, Klebsiella, Brucella, Rhizobium, Acinetobacter, Bordetella, Actinobacillus, Francisella, Clostridium, Mycobacterium, Buchnera, Neisseria, Xanthomonas, Streptococcus, Others, Shigella, Vibrio, Burkholderia, Staphylococcus, Haemophilus, Bacillus, Escherichia, Yersinia, Salmonella, Shewanella, Pseudomonas]

CONSERVED DOMAIN DATABASE (CDD)

Introduction
• CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins.
• These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST.
• The PSSMs are meant to be used for compiling RPS-BLAST search databases only.

RPS-BLAST
• Reversed Position Specific BLAST
• It searches a query sequence against a database of profiles (the opposite of PSI-BLAST).
• It uses a pre-computed lookup table for the profiles to allow the search to proceed faster (architecture dependent).
• The CD-Search databases for RPS-BLAST: ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/

Strategy

FORMATRPSDB
• Formatrpsdb is a utility that converts a collection of input sequences into a database suitable for use with RPS-BLAST.
• Formatrpsdb is designed to perform the work of formatdb, makemat and copymat simultaneously, without generating the large number of intermediate files these utilities would need to create an RPS-BLAST database.
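
To make the PSSM idea from the CDD introduction concrete, here is a toy Python sketch of scoring a query against a single profile. It is a deliberate simplification and not how RPS-BLAST is implemented (no lookup table, no gapped extension, no statistics). The three-column matrix, its scores and the default score for unlisted residues are invented for illustration.

```python
# Toy PSSM: one dict (residue -> score) per profile column.
# All scores are invented for illustration only.
TOY_PSSM = [
    {"M": 3, "L": 1},
    {"K": 2, "R": 2},
    {"T": 1, "S": 1},
]
DEFAULT_SCORE = -1  # assumed score for residues not listed in a column


def window_score(pssm, window):
    """Sum the column-specific scores of each residue in an ungapped window."""
    return sum(col.get(res, DEFAULT_SCORE) for col, res in zip(pssm, window))


def best_match(pssm, sequence):
    """Return (score, start) of the best-scoring ungapped window in the sequence."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    return max(
        (window_score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


score, start = best_match(TOY_PSSM, "AAMKTAA")
print(score, start)
```

The point of the sketch is only that a profile scores the same residue differently at different positions, which is what distinguishes a PSSM search from a plain sequence-to-sequence comparison.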
Build Database
[Figure: formatrpsdb command line with its options annotated as follows]
• Title for database file
• Input file containing list of ASN.1 Scoremat filenames
• Create index files for database
• For scoremats that contain only residue frequencies, the scaling factor to apply when creating PSSMs
• Threshold for extending hits for RPS database
• Base name of output database

RUN RPS-BLAST

Results for CDD: COGs
Organism: M19107
[Figure: COG results for M19107; label shown: >10]

Results for CDD: COGs
Organism: M21709
[Figure: COG results for M21709; label shown: >10]

LipoP

LipoP
• LipoP classifies genes into 4 classes:
  – SpI: Signal peptide I
  – SpII: Lipoprotein signal peptide
  – TMH: N-terminal transmembrane helix (not very reliable; it is used to avoid TMHs being falsely predicted as signal peptides)
  – CYT: Cytoplasmic (all the rest)
• The classification system in LipoP uses an HMM with four branches, one each for SpI, SpII, TMH and CYT.
• Protein sets for training and testing were extracted from SWISS-PROT.
• They consisted of lipoproteins, SPaseI-cleaved proteins and cytoplasmic proteins from the two Gram-negative phyla Proteobacteria and Spirochetes.
• Transmembrane proteins were taken from the phyla Proteobacteria and Gracilicutes.

Output Example

  # M19107_final_1488 SpI score=11.1193 margin=11.320213 cleavage=31-32
  # Cut-off=-3
  M19107_final_1488  LipoP1.0:Best    SpI      1   1   11.1193
  M19107_final_1488  LipoP1.0:Margin  SpI      1   1   11.320213
  M19107_final_1488  LipoP1.0:Class   CYT      1   1   -0.200913
  M19107_final_1488  LipoP1.0:Class   SpII     1   1   -1.80091
  M19107_final_1488  LipoP1.0:Signal  CleavI   31  32  11.119    # PISHA|SDLNQ
  M19107_final_1488  LipoP1.0:Signal  CleavI   30  31  -2.18348  # SPISH|ASDLN
  M19107_final_1488  LipoP1.0:Signal  CleavII  19  20  -1.80091  # TALFS|CGLLI Pos+2=G

Columns:
1. Sequence ID.
2. Type of prediction. Best means the highest scoring class, Margin gives the difference between the best score and the second best score, Class gives the score of other classes and Signal lines contain predicted cleavage sites.
3. Feature type.
4. Location in the sequence. For lines with a class prediction it is always 1. For cleavage sites it is the last amino acid of the signal peptide relative to the predicted cleavage site.
5. Location, same as above except that for cleavage sites it is the first amino acid after the cleavage site.
6. Score. For the "Margin" type it is the difference between the best and the second best class score.
7. For the cleavage sites the ±5 context is shown after the #, and for lipoprotein cleavage sites the amino acid in position +2 is shown (which may determine whether the lipoprotein is attached to the inner or outer membrane).
- An aspartic acid (D) in position +2 after the cleavage site of a lipoprotein means that it is attached to the inner membrane, and most other lipoproteins are attached to the outer membrane (“Testing the '+2 rule' for lipoprotein sorting in the Escherichia coli cell envelope with a new genetic selection”, Seydel et al (1999) Molecular Microbiology 34: 810-821)

Results

Species  Strain  SpI  SpII  Inner Membrane Lipoproteins  TMH  CYT   Total
Hh       M19107  164  54    2                            241  1470  1929
Hh       M19501  176  60    3                            228  1293  1757
Hh       M21127  174  67    3                            244  1564  2049
Hh       M21621  178  64    2                            244  1413  1899
Hh       M21639  194  82    4                            267  2072  2615
Hi       M21709  144  53    2                            225  1383  1805

SignalP

Biological background
• Many different types of secretory signals are found. SignalP focuses on the prediction of classical signal peptides, those cleaved by signal peptidase I (SPase I), which are by far the most common type of signal peptide.
• In bacteria, proteins carrying a signal peptide are targeted directly to the cell membrane.

SignalP
• In a comparison with PrediSi, SPEPlip, Signal-CF, Signal-3L and Signal-BLAST, SignalP 3.0 was the best method. (Choo, K., Tan, T. & Ranganathan, S. BMC Bioinformatics 10, S2 (2009).)
• SignalP 4.0 is reported to perform better still, and hence was included in our method. (SignalP 4.0: discriminating signal peptides from transmembrane regions. Thomas Nordahl Petersen, et al. Nature Methods, 8:785-786, 2011)

SignalP
• SignalP 4.0 is a purely neural network–based method.
• Two types of networks in SignalP 4.0:
  – SignalP-TM networks
  – SignalP-noTM networks
• The decision to select a network: if SignalP-TM predicts four or more positions as being transmembrane positions, SignalP-TM is used for the final prediction; otherwise SignalP-noTM is used.

Results from SignalP

Organism  No. Signal Pep  Total Genes  Percentage
M19107    144             1929         7.47%
M19501    150             1757         8.54%
M21127    152             2049         7.42%
M21621    151             1899         7.95%
M21639    178             2615         6.81%
M21709    122             1805         6.76%

Comparison between LipoP and SignalP
• The results obtained from LipoP and SignalP were compared with the help of a script.
• Both SpI and SpII were taken from LipoP and all the positive outputs were taken from SignalP.
• They were also analyzed for similar cleavage sites.

Comparison table
(LipoP Unique, SignalP Unique and Common: no. of genes predicted to have signal peptides. Consistent Sites and Conflicting Sites: no. of cleavage sites detected.)

Organism  LipoP Unique  SignalP Unique  Common  # of Negatives  Total Genes  Consistent Sites  Conflicting Sites
M19107    75            1               143     1710            1929         112               31
M19501    86            0               150     1521            1757         115               35
M21127    89            0               152     1808            2049         114               38
M21621    91            0               151     1657            1899         117               34
M21639    100           2               176     2337            2615         126               50
M21709    75            0               122     1608            1805         93                29

[Figure: Venn diagrams of LipoP vs SignalP predictions for each of the six strains, showing the LipoP-unique, common and SignalP-unique counts listed in the comparison table]

Comparison between LipoP and SignalP
• Bottom line: as the Venn diagrams show, SignalP did not provide much new information compared to LipoP.

Prediction of transmembrane helices in proteins: TMHMM

TMHMM

Organism  No. Transmembrane Helices  Total Genes  Percentage
M19107    392                        1929         20.32%
M19501    385                        1757         21.91%
M21127    417                        2049         20.35%
M21621    413                        1899         21.75%
M21639    464                        2615         17.74%
M21709    361                        1805         20.00%

Member signature databases
Similar coverage in size; different content

Member Database  Focus/Features
PFAM             divergent domains
PROSITE          functional sites
PRINTS           hierarchical definitions from superfamily to subfamily levels
TIGRFAMs         building HMMs for functionally equivalent proteins
PIRSF            produces HMMs over the full length of a protein and has protein length restrictions to gather family members
HAMAP            profiles manually created by expert curators; they identify proteins that are part of well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies
PANTHER          builds HMMs based on the divergence of function within families
SUPERFAMILY      structure: uses SCOP as a basis for building HMMs
GENE3D           structure: uses the CATH superfamilies as a basis for building HMMs

Querying with InterProScan
About
• A wrapper of sequence analysis applications
• Database and output files scanning
• Bulk data processing
• Efficient (parallel) internal architecture
[Figure: query sequence passed to InterProScan]

Querying with InterProScan
• Input
  – Nucleotide* or protein sequences
  – Recognized sequence formats: raw, FASTA or EMBL
  – Reformat and translate (if necessary)
*Nucleotide sequences will be translated and scanned in all 6 frames without any further assumption

Querying with InterProScan
• Running InterProScan
[Figure: screenshot of a running InterProScan job at <60 s]

Querying with InterProScan
• Output
  – InterProScan makes results available in several formats: raw, ebixml, xml, txt, html
• Parse InterProScan Output (BioPerl)
  – Bio::SeqIO::interpro
• Interpretation of Output Data (example)

Querying with InterProScan

Key               Interpretation
10683_1_ORF1      the id of the input sequence.
024307F93E501F2C  the crc64 (checksum) of the protein sequence (supposed to be unique).
404               the length of the sequence (in AA).
HMMPfam           the analysis method launched.
PF03453           the database member's entry for this match.
MoeA_N            the database member description for the entry.
1                 the start of the domain match.
163               the end of the domain match.
1.49999999999999999E-56  the e-value of the match (reported by the member database method).
T                 the status of the match (T: true, ?: unknown).
26-Feb-12         the date of the run.
IPR005110         the corresponding InterPro entry (if iprlookup requested by the user).
MoeA, N-terminal and linker domain  the description of the InterPro entry.
Biological Process: molybdopterin cofactor biosynthetic process (GO:0032324)  the GO (gene ontology) description for the InterPro entry.

Preliminary Results
M19107
[Figure: InterProScan match summary for M19107. Values shown, in extracted order: 1,391; 325; Total Searched Protein 1,769; Match 1,716; Unmatch 378; Total Hits: 12,393; 53]

Next Up
• Major challenge: funneling all the annotation information into a consolidated GenBank/GFF3 entry.
• Level 2!

Level 2
Operons, Virulence Factors and Metabolic Pathways

VIRULENCE
Likelihood of a pathogen causing disease

H. haemolyticus
• As the name of the species implies, it is generally hemolytic on blood agar plates
• The beta-hemolytic phenotype is routinely used in the clinical setting to distinguish H. haemolyticus from NTHi
• Nonhemolytic H. haemolyticus strains are being isolated and so are misidentified as NTHi
• Gene(s) encoding hemolysin: unknown (Xin Wang, Meningitis Laboratory, CDC)
Photograph from MicrobeLibrary.org

Subject Organisms

Strain   Species          Disease State  State Isolated  Hemolysis  Hpd  Fuculose kinase
M19107   H. haemolyticus  Asymptomatic   Minnesota       Y          -    -
M19501   H. haemolyticus  Asymptomatic   Minnesota       N          +    -
M21127   H. haemolyticus  Pathogenic     Georgia         Y          -    -
M21621   H. haemolyticus  Pathogenic     Texas           Y          -    -
M21639   H. haemolyticus  Pathogenic     Illinois        N          -    -
M21709   H. influenzae    Pathogenic     NY              N          -    +

fucK: encoding fuculose kinase.
fucK deletion has been observed in some Hi isolates.
Hpd: encoding a lipoprotein, protein D.

Virulence factors
• Refer to the traits, encoded by virulence genes, that equip pathogenic microbes to cause infection. How?
  – Attach selectively to host tissues
  – Colonize parts of the host body
  – Gain access to nutrients by invading or destroying host tissues
  – Avoid host defenses
• Virulence factors include:
  – Bacterial toxins
  – Cell surface proteins that mediate bacterial attachment
  – Cell surface carbohydrates and proteins that protect a bacterium
  – Hydrolytic enzymes that may contribute to the pathogenicity of the bacterium

VFDB: Virulence Factor Database
• Set up in 2004
• Up-to-date information regarding validated VFs from 24 genera of medically important bacterial pathogens
• Detailed tabular comparison of virulence composition in terms of virulence genes and their composition
• Multiple alignment and statistical analysis of homologous VFs
• Graphical comparison of virulence genes
• VFs
  – Adhesion & invasion
  – Bacterial secretion systems & effectors
  – Toxins
  – Iron-acquisition system
• Pathogenicity island

Operon and Pathway Analysis
• As was pointed out by Alejandro Caro, a missing gene in an otherwise complete pathway usually reflects a hole in the annotation process.
• This path serves to fill such holes in the annotation process.

DOOR (Database of prOkaryotic OpeRons)
• DOOR is an operon database developed by the Computational Systems Biology Lab (CSBL) at UGA. The operons in the database are based on prediction.
• As of 2009, DOOR was the biggest operon database available.
• Its prediction algorithm is reported to be consistently the best in all aspects, including sensitivity and specificity (for both true positives and true negatives), with an overall accuracy of ~90%.
• Currently DOOR has operons for 971 prokaryotic genomes.
• Although most of the operons in DOOR are not verified by experiments, its developers also try to provide some limited literature information, which is extracted from ODB.

FOUR STRAINS IN DOOR

Strategy

THE PATHWAY TOOLS
A Glance at the End of Annotation
Enable
• Browsing of Annotated Genes
• Analysis of pathways

Database
"Do not use a DBMS when the initial investment in hardware, software, and training is too high." - Shamkant Navathe, Georgia Institute of Technology

The Pathway Tools
"Pathway Tools is a production-quality software environment for creating a type of model-organism database called a Pathway/Genome Database (PGDB)"

The Pathway Tools
• Prediction
  – Metabolic pathways
  – Metabolic pathway hole filler
  – Operons
• Curating
• PGDB web service
  – Publish PGDB
  – Query
  – Visualization
• Metabolic Network Analysis

WHY "The Pathway Tools"?
• Pros
  – BioCyc Tier 1 and Tier 2 databases are highly curated
  – Enables editing (curation) and querying of PGDB
• Cons
  – BioCyc has fewer genomes than other databases
  – Some tools are only available in the local version (e.g. PathoLogic)

PathoLogic
The Pathway Tools: Local Version (GUI)

PathoLogic
[Figure: Inputs and outputs of the computational inference modules within PathoLogic]
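
The observation under Operon and Pathway Analysis, that a missing gene in an otherwise complete pathway usually reflects a hole in the annotation, can be sketched in a few lines of Python. This is only an illustration of the idea and is not the algorithm used by the Pathway Tools hole filler. The pathway names, reaction labels and the 0.75 completeness cutoff are all hypothetical.

```python
def find_pathway_holes(pathways, annotated, min_fraction=0.75):
    """Report pathways that are mostly, but not fully, covered by the annotation.

    pathways:     dict mapping pathway name -> set of required reaction labels
    annotated:    set of reaction labels supported by annotated genes
    min_fraction: assumed cutoff for calling a pathway "otherwise complete"
    """
    holes = {}
    for name, required in pathways.items():
        missing = required - annotated
        covered = len(required) - len(missing)
        # A hole candidate: something is missing, yet most of the pathway is present.
        if missing and covered / len(required) >= min_fraction:
            holes[name] = sorted(missing)
    return holes


# Hypothetical pathway definitions and annotation (placeholders, not real data).
example_pathways = {
    "pathway_1": {"rxn_a", "rxn_b", "rxn_c", "rxn_d"},
    "pathway_2": {"rxn_e", "rxn_f"},
    "pathway_3": {"rxn_g", "rxn_h"},
}
example_annotated = {"rxn_a", "rxn_b", "rxn_d", "rxn_e", "rxn_g", "rxn_h"}

print(find_pathway_holes(example_pathways, example_annotated))
```

In this toy input, only pathway_1 is flagged: it has three of its four reactions, so its missing reaction is a candidate annotation hole, whereas pathway_2 is too incomplete to count and pathway_3 has nothing missing.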