Exercise 1: Mining transcriptomics data

In PATRIC (https://www.patricbrc.org/database), search for VBIEscCol129921_2790, find the corresponding recA entry, and go to the gene page. Click on the Transcriptomics tab and find the conditions (at least 3 or 4) where this gene is the least and the most expressed. What do these conditions have in common?

Find the genes that are positively correlated with recA, meaning that they are co-expressed with it. Download the list of the corresponding protein sequences in FASTA format and keep that window open (you will need it later). What do all these genes have in common? To find this out if you know nothing about gene names, you can click on the pathway link on that page in PATRIC. You can also download the list of correlated genes as an Excel sheet, copy the RefSeq locus tag column, and paste it into the DAVID functional annotation tools (https://david.ncifcrf.gov/summary.jsp). Choose Locus Tag as the identifier and submit your list. What process are the genes co-expressed with recA involved in? Why did the PATRIC pathway tool give you poor results?

In the MicrobesOnline database (http://www.microbesonline.org), find the recA gene from Escherichia coli and click on the “E” link. Click on the + for correlated genes (let it run while we continue; we will get back to it later). Do you find the same result as with PATRIC?

Exercise 2: Identifying regulators

Paste the list of FASTA sequences of the recA co-regulated proteins you found in PATRIC into the GenBrowser P2RP (http://www.p2rp.org), a tool to identify prokaryotic regulators. Which gene in the list is a regulator? Note its name and RefSeq ID for future use.

Go to RegPrecise (http://regprecise.lbl.gov/RegPrecise/) and find the E. coli page for the regulator you identified in the previous step. Get the list of genes that are predicted to be regulated by this regulator, based on the presence of a specific DNA binding site upstream of the transcription unit start. Download the list of RefSeq IDs.
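The identifier lists handled in the steps above (the locus tag column pasted into DAVID, the RefSeq IDs downloaded from RegPrecise) can also be prepared with a few lines of Python instead of copying cells by hand. The sketch below is only an illustration: it assumes the table has been saved as tab-delimited text, and the column names and locus tags are made up rather than copied from PATRIC or RegPrecise.

```python
import csv
import io

# Toy stand-in for a tab-delimited export. The column names and the locus
# tags are hypothetical placeholders, not real PATRIC or RegPrecise output.
EXPORT = """Gene\tRefSeq Locus Tag\tCorrelation
geneA\ttag_0001\t0.91
geneB\ttag_0002\t0.87
geneC\t\t0.80
geneD\ttag_0002\t0.78
"""

def locus_tags(table_text, column="RefSeq Locus Tag"):
    """Return the unique, non-empty values of one column, in file order."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    seen = []
    for row in reader:
        tag = (row.get(column) or "").strip()
        if tag and tag not in seen:
            seen.append(tag)
    return seen

tags = locus_tags(EXPORT)
# One identifier per line, ready to paste into a web form.
print("\n".join(tags))
```

With a real export, replace the toy string with the contents of the downloaded file and set `column` to whatever header that file actually uses.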
To find out whether this is a positive or a negative regulator, find the regulon entry for this regulator in RegulonDB (http://regulondb.ccg.unam.mx/). If this gene is deleted, do you predict that the regulated genes will be up-regulated or down-regulated? You can also find this regulator in Prodoric (http://prodoric.tu-bs.de/) and follow the Prodonet link to see the network (Java has to be working).

From the PATRIC database home page, go to the Organisms tab and select Escherichia. Then go to the Transcriptomics tab. Filter the experiments with the gene name of the regulator you identified above; two experiments should come up. Select the one with PubMed ID 11333217 and view the gene list. In the “Filter by one or more keywords or locus tags” box, paste all the RefSeq locus tags of the genes that are regulated by this regulator (the list you got from RegPrecise) and look at the data using the heatmap view. Do you find the expected results? You can check the paper where this study was published to help you.

From PATRIC you can access the GEO datasets. You can use the curated ones, whose accessions start with GDS, or do the analysis in GEO2R yourself. Your goal is to visualize how the expression of recA changes with UV in the WT and in the mutant. Those who are fast can paste the same list of genes into http://colombos.net/ and play with the analysis tools.

Exercise 3: Mining phenotype data

For E. coli, the Keio collection is available. It is a set of individual deletions of every non-essential gene. This collection was profiled under a set of growth conditions and chemical stresses. Go to EcoliWiki (http://ecoliwiki.net/tools/chemgen/) to access the data. Choose growth data, type recA in the strain box, and click SUBMIT. What do the chemicals in which the growth of the recA strain is most affected have in common? Choose correlation among strains and leave strain 2 empty. Look at the list of genes with correlation coefficient >0.4 and click SUBMIT.
What does this suggest about the function of the protein?

Only a few organisms have mutant collections available (see http://ogee.medgenius.info/browse/), but TnSeq technology now allows us to obtain fitness data in a wide range of organisms and conditions (see PMID 26336012 for a review). MicrobesOnline has fitness data for four organisms. Find the recA gene in Zymomonas mobilis subsp. mobilis ZM4. Click on the fitness (F) data and look at the (+) fitness profile. Looking at the genes that have the highest correlation, what does this tell you about TnSeq experiments compared to chemical genomics?

A more comprehensive (~20 organisms in 300 conditions) and very user-friendly TnSeq analysis platform can be found at http://fit.genomics.lbl.gov/cgi-bin/myFrontPage.cgi. Compare the cofitness data for recA from E. coli and Shewanella oneidensis MR-1. Why do you think the data from MR-1 seem so much more informative? You can read the original paper (http://mbio.asm.org/content/6/3/e00306-15.full) or the help page (http://fit.genomics.lbl.gov/cgi-bin/help.cgi) to help you answer.

Exercise 4: Data integration and enrichment analyses, when it works

We have analyzed the expression and phenotype fitness data linked to recA; now we are going to explore platforms that integrate different types of omics data. Go to inetbio.org. Find what this platform predicts as the function of recA. Look at the top-ranking predicted GO_biological_process_term and its score, and explain what type of evidence is used to make this prediction. Go to GeneMANIA (http://genemania.org/), look at the E. coli recA networks, and find out how to color the top 5 enrichment terms. You can do the same with the human one and see how conserved the RecA function is between E. coli and man!
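To make sense of the enrichment terms reported in Exercise 4, it helps to see what an over-representation test computes. The sketch below assumes a one-sided hypergeometric test, which is a common choice for this kind of analysis; it is not necessarily the exact statistic that DAVID or GeneMANIA use, and the gene counts are invented for illustration.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes when `sample`
    genes are drawn without replacement from `population` genes, of which
    `annotated` carry the term of interest."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Invented numbers: a genome of 4000 genes, 40 of them annotated with a
# given term, and a list of 20 co-expressed genes containing 5 of those 40.
p = hypergeom_enrichment_p(population=4000, annotated=40, sample=20, hits=5)
print(f"p = {p:.3g}")
```

A small p-value means the term appears in the gene list more often than expected by chance. Real tools additionally correct for testing many terms at once, which this sketch leaves out.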
Exercise 5: Data integration and enrichment analyses II, when it does not work

Using only comparative genomics tools and different types of omics data, we were able to predict that the yeast YLR143W protein was the missing last enzyme of diphthamide synthesis, DPH6 (https://biologydirect.biomedcentral.com/articles/10.1186/1745-6150-7-32). Here we are going to visit the different sites to reproduce how some of these predictions were made.

Chemical genomics data are available for yeast in the Yeast Fitness database (http://fitdb.stanford.edu). Look at the data for YLR143W. What does this tell you about the function of the protein?

Now let's see if the integrative databases can predict the function. Go first to the STRING database (to go back in time, you should turn off text mining there). Then try the STITCH database (http://stitch.embl.de), GeneMANIA (http://genemania.org), bioPIXIE (http://imp.princeton.edu), and ConsensusPathDB-yeast (http://cpdb.molgen.mpg.de/YCPDB). For all these databases, look at the network analysis and at what the enrichment analysis gives you. How good are these databases at predicting the function of YLR143W? Why do you think they are not as successful as a human for this protein, while they were for RecA?