Download HW3 - solutions

Introduction to Bioinformatics (236523)
HW 3 – Winter 2017

General Instructions:
• Deadline: 3/1/17 23:55.
• Submission according to published pairs only.
• The submission is electronic only, through the course website.

Part 1 – Resequencing pipeline

In this exercise you will use Galaxy to map reads to a reference genome. To keep the run time of the pipeline short, we limit our analysis to only a part of the sample. We will first detail all the instructions of the pipeline, and then present all the questions regarding the different steps.

a. Visit the Galaxy website at https://usegalaxy.org/, go to the User tab in the upper panel and register to the site.
b. Upload the 2 data files, NC_000913.fa and samp_12.fq, to Galaxy using Get Data -> Upload Data in the left panel. For the fastq file, pick "fastqsanger" in the "Type" drop-down list, and for the fasta file, pick "fasta" in the same list. Then press "Start" and wait for the files to load.
c. Next, you will map the reads to the given reference genome. Go to NGS: Mapping -> Bowtie2. Leave all parameters at their defaults apart from the following two:
   a. Will you select a reference genome from your history or use a builtin index? Pick the reference genome from the history; the file you uploaded will be picked automatically.
   b. Save the bowtie2 mapping statistics to the history: select yes.
   Then click "Execute".
d. Go to the history panel on the right and look for the BAM file created in the mapping step. Open this object by clicking its name and then download it by clicking on the "disc" icon at the bottom of the object. You will be given 2 options, "Download dataset" and "Download bam_index"; download both of them.
e. Open IGV by clicking one of the "Launch" buttons at the bottom of this page: http://software.broadinstitute.org/software/igv/download. First load the reference genome by clicking on Genomes -> Load genome from file.
   Note that the *.fai file must be in the same directory as the fasta file for the reference genome to load properly. Then load the alignments file by clicking on File -> Load from file, and pick the BAM file. Note that the *.bai file is also needed for the alignments file to load properly.
f. Now focus on this location: gi|49175990|ref|NC_000913.2|:890,503-891,198

1. How many reads are in the fastq file?
1,371 reads

2. We would first like to check the quality of the reads. This can be done by examining the quality scores of the reads. To summarize the quality scores of all the reads at once we will use the tool FastQC (learn more about this tool here: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Go to "NGS: QC and manipulation" -> FastQC. Pick the fastq file you have uploaded and click "Execute". When finished, look at the "webpage" results.

a. According to the "per base sequence quality" plot, are the reads of good quality? Explain and give screenshots.
Yes. All bases in the reads have quality scores above 28, the lower bound of the high-quality range in the FastQC plot.

b. What would you expect the "per base sequence content" plot to look like for a normal, good quality sample? How can you explain the plot for our fastq file (remember, the file contains only a small part of the original sample)?
A normal sample would show all 4 bases at roughly constant frequencies throughout the read. Certain bacterial strains may show a different GC content. In our sample the frequency of G differs from the other bases, because we are looking at a very small locus in the genome that contains a higher proportion of Gs.

3. In IGV, after completing step f, give 2 examples of variants and 2 examples of sequencing errors. Specify locations and how each differs from the reference genome, show screenshots, and explain how you reached your conclusions.
• 890,914 – variant, T->G in 92% of the reads
• 890,942 – sequencing error, A->C in 1 read
• 890,756 – variant, T->A in 71% of the reads
• 890,831 – sequencing error, T->C in 1 read

4. Understanding the variants:

a. From what organism did this reference genome and these reads originate? Explain how you reached this conclusion.
BLASTing one of the reads or a part of the reference genome shows that they are taken from E. coli.

b. For each of the 2 variants you found in question 3, detail the percentage of reads showing the variant. Given the organism you found in the last step, what biological explanation may account for the variations?
The variants appear in 92% and 71% of the reads, respectively. E. coli is haploid, so a single clonal population should show one base at each position. Because each variant position shows 2 different bases among the reads, the sample appears to contain 2 different populations of E. coli.

5. In the original experiment there were 6,400,000 reads. Find the size of the reference genome in Galaxy (click on the object, then click on the "I" icon and find the file size. The file size is approximately the genome size: megabyte = MB ~ megabase = million bases). Given this information and the fact that 95% of the reads were mapped to the genome, calculate the average coverage.
$\text{coverage} = \frac{6.4\text{M} \cdot 51 \cdot 0.95}{4.5\text{M}} \approx 68.9$
(number of reads × read length × fraction of reads mapped, divided by the genome size)

6. Let's assume that for the mapping used in question 5 we allowed 3 mismatches per read and got 95% of the reads mapped. How will the “coverage” be affected if we lower the number of allowed mismatches per read to 1? Explain your answer.
The coverage will decrease, because fewer reads are expected to map to the genome under the stricter mismatch limit.

Part 2 – Gene expression and RNA-Seq

7. "Coverage" in RNA-seq:

a. How does lowering the number of reads affect the results of an RNA-seq experiment compared with a resequencing experiment? Does it affect these two applications in the same way? Explain.
In resequencing we expect the coverage to be reduced to the same extent across the whole genome, whereas in RNA-seq the reduction first affects the lowly expressed genes, because they are less abundant in the sample.
Highly expressed genes will be less affected by lowering the number of reads, because they are more abundant in the sample.

b. A researcher is about to sequence 50 tomato mRNA samples in order to understand the developmental process of the young plant. The researcher wants a good estimate of how many reads she needs to sequence in each sample in order to get a good measure of the gene expression in each sample. The researcher has a limited budget, so she wants to sequence the minimal number of reads that will give her meaningful results. She has a set of 10 genes she expects to see expressed in all samples at a moderate expression level. Suggest an experiment or experiments that will help the researcher decide how many reads are enough to sequence per sample. You can assume you have unlimited amounts of each sample.
The researcher may conduct a series of sequencing experiments on one given sample, each with a higher number of reads than the last. Each run is put through the RNA-seq analysis pipeline to obtain the expression level of every gene. The researcher can then check the expression levels of the 10 genes of interest and see how many reads are sufficient to detect them in her sample. This minimal but sufficient number of reads can be used in the rest of the experiments.

8. Explain why replicates are important in gene expression studies. What might happen if you don't have replicates at all? Why do you think triplicates (three replicates) are used more often in biological experiments than duplicates (two replicates)?
Replicates help us determine the true measurement and eliminate noise. Measuring a single sample may be misleading. Replicates also give us a measure of reliability, meaning how much we can trust a measurement: if the variance between replicates is high, the measurement may be less reliable.
Triplicates are used to eliminate the need to choose between two very different measurements: if one replicate showed a high value and the second a very low one, we may not know which one to "believe".

Part 3 – Enrichment Analysis

In this part you will use the tool GOrilla (http://cbl-gorilla.cs.technion.ac.il/) to find an enrichment of GO terms in a given list of genes. In the tool pick the following parameters:
• Humans as the organism.
• Two unranked lists of genes.
• Use the lists gene_list1.txt and background1.txt as the inputs.

9. Run the tool 3 times, once for each ontology (Process / Function / Component).

a. Explain in one or two sentences what each ontology refers to and give one example for each.
Cellular component: annotation of genes by the location in the cell of the protein they encode. For example, cell membrane.
Biological process: annotation of genes by the pathway or process in which the proteins they encode take part. For example, biological adhesion.
Molecular function: annotation of genes by the molecular function of the proteins they encode. For example, DNA binding.

b. Describe the result you get for each run in terms of FDR q-values and the spread of the results over the ontology tree. Provide screenshots and summarize the results in your own words.
The molecular function run gave the most significant result, for the GO term "rRNA binding" (FDR=10^-16). The cellular component run also gave a significant result, for location in the ribosomal subunit (FDR=10^-2).

c. Did you get the same results for the three different runs? Explain.
The three ontologies are not necessarily expected to correlate. Each annotates a different aspect of the genes. It seems that the gene list consists of genes that are all rRNA binding, so they share the molecular function, but that does not mean they have to function in the same location in the cell or in the same processes.
Because they all bind ribosomal RNA, and therefore the same molecule in the cell, the location was also significantly enriched. But as can be seen from the biological process run, these genes are active in many different processes.

10. Choosing the background list:

a. Run gene_list_2.txt with the background background_2.txt and look for enrichment in Cell Component. Describe your results.
b. Run gene_list_2.txt with the background background_3.txt and look for enrichment in Cell Component. Describe your results.
c. Can you explain the difference in the results obtained in the last 2 steps? Suggest an in-silico experiment (meaning a computational experiment) that could validate your answer (you are not requested to actually perform it but you are welcome to do so and add it to your answer).

In step a no enrichment is found at all. In step b there are enrichments for a few organelle membranes: Golgi membranes, ER membranes, etc. The only difference between the runs is the background gene list, so that is what causes the change. The first background list is composed of genes that are themselves located in various cell membranes, so no enrichment is found, whereas the second list is more diverse in terms of cellular location, which is why an enrichment is found. To test this we can run each of the two background lists in GOrilla as the target list against a general background list that contains all human genes, and see which cellular components are enriched in each of them.
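
The dependence on the background list described in question 10 can be illustrated with a minimal sketch. It assumes that enrichment of a single GO term in a target list versus a background list is scored with a standard one-sided hypergeometric test; this is an assumption about the general technique, not a reproduction of the tool's exact computation. All gene counts below are hypothetical and chosen only for illustration, and the function name is made up for this example.

```python
from math import comb


def hypergeom_enrichment_p(N, B, n, b):
    """One-sided hypergeometric p-value P(X >= b).

    N: number of genes in the background list
    B: number of background genes annotated with the GO term
    n: number of genes in the target list
    b: number of target genes annotated with the GO term
    """
    total = comb(N, n)
    upper = min(n, B)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible draws contribute nothing to the sum.
    tail = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return tail / total


# Hypothetical target list: 50 genes, 40 of them annotated to a membrane term.
n_target, b_target = 50, 40

# Hypothetical membrane-rich background: 800 of 1,000 genes carry the term.
p_membrane_bg = hypergeom_enrichment_p(1000, 800, n_target, b_target)

# Hypothetical diverse background: 2,000 of 20,000 genes carry the term.
p_diverse_bg = hypergeom_enrichment_p(20000, 2000, n_target, b_target)

print(f"membrane-rich background: p = {p_membrane_bg:.3g}")
print(f"diverse background:       p = {p_diverse_bg:.3g}")
```

With the membrane-rich background, 40 annotated genes out of 50 is simply what chance predicts, so the p-value is large and no enrichment is reported. With the diverse background the same target list is far above the expected 5 annotated genes, so the p-value is very small. A real analysis would also correct for testing many GO terms, which is what the FDR q-values account for.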
