Introduction to Bioinformatics (236523)
HW 3 – Winter 2017
General Instructions:
- Deadline: 3/1/17 23:55.
- Submission according to published pairs only.
- The submission is electronic only in the course website.
Part 1 – Resequencing pipeline
In this exercise you will use Galaxy in order to map reads to a reference genome. Note that
in order to make the run time of the pipeline fast we limit our analysis to only a part of the
sample. We will first detail all the instructions of the pipeline, and then present all the
questions regarding the different steps.
a. Visit the Galaxy website at https://usegalaxy.org/, go to the user tab in the upper
panel and register to the site.
b. Upload the 2 data files: NC_000913.fa and samp_12.fq to Galaxy by using the tools
Get Data -> Upload Data in the left panel. For the fastq file, pick "fastqsanger"
from the "Type" drop-down list, and for the fasta file, pick "fasta" from the same
list. Then press "Start" and wait for the files to load.
c. Next, you will map the reads to the given reference genome. Go to NGS: Mapping ->
Bowtie2. Leave all the default parameters apart from the following two:
a. Will you select a reference genome from your history or use a builtin index? - Pick the reference genome from the history – the file you
uploaded will be picked automatically.
b. Save the bowtie2 mapping statistics to the history – select yes
Run "Execute".
d. Go to the history panel on the right and look for the BAM file created in the mapping
step. Open this object by clicking its name and then download it by clicking on the
"disc" icon at the bottom of the object. You will be given 2 options "Download
dataset" or "Download bam_index" – download both of them.
e. Open IGV by clicking on one of the "Launch" buttons at the bottom of the page here:
http://software.broadinstitute.org/software/igv/download. First upload the
reference genome by clicking on Genomes -> Load genome from file. Note that you
need to have the *.fai file in the same directory as the fasta file in order for the
reference genome to properly load. Then upload the alignments file by clicking on
File -> Load from file, and pick the BAM file. Note that you need to have the *.bai file
in order for the alignments file to properly load.
f. Now focus on this location: gi|49175990|ref|NC_000913.2|:890,503-891,198
1. How many reads are in the fastq file?
1,371 reads
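As a rough cross-check of the count above, the number of reads can also be computed directly from the file. The sketch below assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines); the function name `count_fastq_reads` and the tiny in-memory example are illustrative only.

```python
import io


def count_fastq_reads(handle):
    """Count FASTQ records, assuming exactly four lines per record."""
    n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; "
                         "the file may be truncated or use wrapped lines")
    return n_lines // 4


# Two made-up records, only to show the call.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
n_reads = count_fastq_reads(example)
```

On the real data the same function would be called as `count_fastq_reads(open("samp_12.fq"))`.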
2. We would first want to check the quality of the reads. This can be done by
examining the quality scores of the reads. In order to sum-up quality scores of all the
reads at once we will use the tool FastQC (learn about this tool more here:
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Go to "NGS: QC and
manipulation" -> FastQC. Pick the fastq file you have uploaded and run "Execute".
When finished look at the "webpage" results.
a. According to the "per base sequence quality" plot, are the reads of good quality?
Explain and give screen shots.
Yes. The quality scores at every position in the reads are above 28, the level
that FastQC marks as high quality.
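The check behind this answer can be sketched in a few lines. The "fastqsanger" type means quality characters encode Phred scores with an ASCII offset of 33. The sketch uses the mean per position, which is a simplification: FastQC itself draws box plots (median and quartiles) per position. The function name and the two example quality strings are made up for illustration.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each read position (Sanger / Phred+33 encoding)."""
    totals, counts = [], []
    for quals in quality_strings:
        for i, ch in enumerate(quals):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# 'I' encodes Phred 40 and '5' encodes Phred 20 under the +33 offset.
means = mean_quality_per_position(["IIII", "II5I"])
all_high = all(m >= 28 for m in means)
```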
b. What would you expect the "per base sequence content" plot to look like for a
normal, good quality sample? How can you explain the plot for our fastq
file (remember, the file contains only a small part of the original sample)?
In a normal sample each of the 4 bases keeps the same frequency at every
position along the read. Certain bacterial strains may show a different GC
frequency. In our sample we see a different frequency of G, because we are
looking at a very small locus in the genome that contains a higher proportion of Gs.
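The quantity behind the "per base sequence content" plot is the fraction of each base at each read position. A minimal sketch, with made-up reads and a hypothetical function name:

```python
from collections import Counter


def base_fractions_per_position(reads):
    """Fraction of A, C, G and T at each position across a set of reads."""
    length = max(len(r) for r in reads)
    result = []
    for i in range(length):
        column = Counter(r[i] for r in reads if i < len(r))
        total = sum(column.values())
        result.append({base: column.get(base, 0) / total for base in "ACGT"})
    return result


# Made-up reads with an excess of G, as in a small G-rich locus.
fractions = base_fractions_per_position(["ACGG", "AGGG", "TCGG", "ACGT"])
```

Flat, roughly parallel lines across positions indicate an unbiased sample, while a region-specific excess such as the G enrichment described above shifts one line relative to the others.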
3. In IGV, after completing step f, give 2 examples of variants and 2 examples of
sequencing errors. Specify locations and how each differs from the reference
genome, show screen shots, and explain how you reached your conclusions.
890,914: variant, T->G in 92% of the reads
890,942: sequencing error, A->C in 1 read
890,756: variant, T->A in 71% of the reads
890,831: sequencing error, T->C in 1 read
A mismatch that appears in a large fraction of the reads covering a position is
a variant, while a mismatch that appears in a single read is most likely a
sequencing error.
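The reasoning used here (a mismatch seen in many reads is a variant, a mismatch seen in a single read is probably an error) can be written as a simple rule. The thresholds `min_fraction` and `min_reads` and the depth of 50 reads below are hypothetical values chosen for illustration; they are not taken from the exercise.

```python
def classify_mismatch(alt_reads, depth, min_fraction=0.2, min_reads=3):
    """Label a mismatch from the number of reads supporting it.

    The thresholds are illustrative assumptions, not values from the exercise.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    fraction = alt_reads / depth
    if alt_reads >= min_reads and fraction >= min_fraction:
        return "variant"
    return "sequencing error"


# Hypothetical depth of 50 reads: 46 supporting reads is 92%, 1 read is 2%.
calls = [classify_mismatch(46, 50), classify_mismatch(1, 50)]
```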
4. Understanding the variants:
a. Find out which organism this reference genome and these reads originated
from. Explain how you got to this conclusion.
BLASTing one of the reads or a part of the reference genome shows they are
taken from E. coli.
b. Detail for each of the 2 variants you found in question 3: what is the percent
of reads showing this variant? Given the organism you found in the last step,
what biological explanation may explain the variations?
The variants at 890,914 and 890,756 appear in 92% and 71% of the reads,
respectively. E. coli is a haploid organism, so a single clonal population
should show one base at each position. Each of these positions shows 2
different bases, so the sample seems to contain 2 different populations of E. coli.
5. In the original experiment there were 6,400,000 reads. Find the size of
the reference genome in Galaxy (click on the object, then click on the "I" icon
and find the file size. The file size approximately equals the genome size:
1 megabyte (MB) ~ 1 megabase = 1 million bases). Given this info and the fact
that 95% of the reads were mapped to the genome, calculate the average coverage.
$$\text{coverage} = \frac{6.4\text{M} \cdot 51 \cdot 0.95}{4.5\text{M}} \approx 68.9$$
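The same arithmetic as a small function. The factor 51 is taken here to be the read length in bases, which is an assumption: the answer uses the number without labelling it. The function and parameter names are illustrative.

```python
def average_coverage(n_reads, read_length, mapped_fraction, genome_size):
    """Average coverage = mapped bases / genome size."""
    return n_reads * read_length * mapped_fraction / genome_size


# 6.4M reads, 51 (assumed read length), 95% mapped, 4.5 Mb genome.
coverage = average_coverage(6_400_000, 51, 0.95, 4_500_000)
print(round(coverage, 1))  # 68.9
```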
6. Let's assume that for the mapping used in question 5 we allowed 3 mismatches
per read and got 95% of the reads mapped. How will the “coverage” be affected if
we lower the number of allowed mismatches per read to 1? Explain your answer.
The coverage will be lower: with a stricter mismatch limit we expect fewer reads
to be mapped to the genome.
Part 2 – Gene expression and RNA-Seq
7. "Coverage" in RNA-seq:
a. How does lowering the number of reads affect the results of an RNA-seq
experiment compared with a resequencing experiment? Does it affect these two
applications in the same way? Explain.
In resequencing we expect coverage to be reduced to the same extent across
the whole genome, whereas in RNA-seq the reduction first affects the
lowly expressed genes, because they are less abundant in the sample.
Highly expressed genes are less affected by lowering the number of
reads, because they are more abundant in the sample.
b. A researcher is about to sequence 50 tomato mRNA samples in order to
understand the developmental process of the young plant. The researcher
wants a good estimate of how many reads she needs to sequence in
each sample in order to get a good measure of the gene expression in each
sample. The researcher has a limited budget, so she wants to sequence the
minimal number of reads that will give her meaningful results. She has a set
of 10 genes she expects to see expressed in all samples at a moderate
expression level.
Suggest an experiment or experiments that will help the researcher decide how
many reads are enough to sequence per sample. You can assume you have
unlimited amounts of each sample.
The researcher may conduct a series of sequencing runs on one given sample,
each with a higher number of reads. For each run the RNA-seq analysis
pipeline is performed and the expression of each gene is measured.
The researcher can then check the expression levels of the 10 genes of interest
and see how many reads are sufficient to detect them in her sample. This
minimal but sufficient number of reads can be used in the rest of the
experiments.
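The decision step of this experiment can be sketched as follows: given, for each sequencing depth tried, how many of the 10 expected genes were detected, pick the smallest depth at which all of them are seen. The depths and detection counts are made-up numbers and the function name is hypothetical.

```python
def minimal_sufficient_depth(detected_by_depth, n_expected):
    """Smallest read depth at which all expected genes are detected, else None."""
    for depth in sorted(detected_by_depth):
        if detected_by_depth[depth] >= n_expected:
            return depth
    return None


# Hypothetical pilot runs on one sample: reads sequenced -> genes detected.
pilot = {1_000_000: 4, 2_000_000: 7, 5_000_000: 10, 10_000_000: 10}
chosen = minimal_sufficient_depth(pilot, n_expected=10)
```

In practice "detected" would need its own definition (for example a minimum read count per gene), which this sketch leaves to the analyst.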
8. Explain why replicates are important in gene expression studies. What might
happen if you don't have replicates at all? Why do you think triplicates (three
replicates) are more often used in biological experiments, rather than duplicates
(two replicates)?
Replicates help us determine the true measurement and eliminate noise.
Measuring a single sample may be misleading. Replicates also give us
a measure of reliability, that is, how much we can trust a measurement: if
the variance between replicates is high, the measurement may be less reliable.
Triplicates are used to avoid having to choose between two very different
measurements: if one replicate showed a high value and the second a very low
one, we may not know which one to "believe".
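One simple way to quantify the reliability described here is the coefficient of variation (standard deviation divided by mean) across replicates. The expression values below are invented for illustration.

```python
from statistics import mean, stdev


def summarize_replicates(values):
    """Return (mean, coefficient of variation); needs at least 2 replicates."""
    m = mean(values)
    sd = stdev(values)  # raises StatisticsError for a single value
    cv = sd / m if m else float("inf")
    return m, cv


# Invented expression values for one gene.
mean_tight, cv_tight = summarize_replicates([100, 102, 98])  # consistent triplicate
mean_noisy, cv_noisy = summarize_replicates([100, 10])       # conflicting duplicate
```

With a single sample no spread can be computed at all, and with the conflicting duplicate the large coefficient of variation signals a problem without indicating which value is the outlier.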
Part 3 - Enrichment Analysis
In this part you will use the tool GOrilla (http://cbl-gorilla.cs.technion.ac.il/) in order to find
an enrichment of GO terms in a given list of genes.
In the tool pick the following parameters:
- Humans as the organism.
- Two unranked lists of genes.
- Use the lists gene_list1.txt and background1.txt as the inputs.
9. Run the tool 3 times – one for each ontology (Process/ Function/ Component).
a. Explain in one or two sentences what each ontology refers to and give one
example for each.
Cellular component: annotation of genes by the location in the cell of the
proteins they code for. For example: cell membrane.
Biological process: annotation of genes by the pathway or process in which
the proteins they code for take part. For example: biological adhesion.
Molecular function: annotation by the molecular function of the proteins
coded by the genes. For example: DNA binding.
b. Describe the result you get for each run in terms of FDR q-values and the
spread of the results over the ontology tree. Provide screenshots and
summarize the results in your own words.
The results are shown in the screenshots. The molecular function run gave the
most significant result, for the GO term "rRNA binding" (FDR=10^-16). The
cellular component run also gave a significant result, for location in the
ribosomal subunit (FDR=10^-2).
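FDR q-values like those reported above are commonly obtained from raw p-values with the Benjamini-Hochberg procedure. The sketch below shows that procedure in its standard form; that the tool computes its q-values in exactly this way is an assumption here, and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values


# Made-up p-values for four GO terms.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```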
c. Did you get the same results for the three different runs? Explain.
No. The three ontologies are not necessarily expected to agree, because they
each annotate a different aspect of the genes. It seems that the gene list consists of
genes that are all rRNA binding, so they share the molecular function, but that
does not mean they have to function in the same location in the cell or in the
same processes. Since they all bind ribosomal RNA, they all bind the same
molecule in the cell, and so the location was also significantly enriched.
But as can be seen from the biological process run, these genes are
active in many different processes.
10. Choosing the background list:
a. Run gene_list_2.txt with the background background_2.txt and look for
enrichment in Cell Component. Describe your results.
b. Run gene_list_2.txt with the background background_3.txt and look for
enrichment in Cell Component. Describe your results.
c. Can you explain the difference in the results obtained in the last 2 steps?
Suggest an in-silico experiment (meaning a computational experiment) that
could validate your answer (you are not requested to actually perform it but
you are welcome to do so and add it to your answer).
In step a no enrichment is found at all.
In step b there is enrichment for a few organelle membranes: Golgi membranes, ER
membranes etc. The only difference between the runs is the background gene list, so
that is what causes the change. The first background list is composed of genes that are
also located in various cell membranes, so no enrichment is found, whereas the
second list is more diverse in terms of cellular location, and that is why an enrichment is
found.
To test this we can run each of the two background lists in GOrilla against a general
background list that contains all the human genes, to see which cellular components are
enriched in each of them.
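The effect of the background on enrichment can be illustrated with a hypergeometric test, a standard test for a target list drawn from a background list. All counts below are hypothetical and only mimic the situation described: the same target list is tested once against a background with a similar share of membrane genes and once against a more diverse background.

```python
from math import comb


def hypergeom_enrichment_p(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(N, B, n).

    N: genes in the background, B: background genes with the GO term,
    n: genes in the target list, b: target genes with the GO term.
    """
    upper = min(n, B)
    tail = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return tail / comb(N, n)


# Hypothetical target list: 20 genes, 15 of them membrane genes.
p_similar = hypergeom_enrichment_p(N=100, B=75, n=20, b=15)  # membrane-rich background
p_diverse = hypergeom_enrichment_p(N=100, B=30, n=20, b=15)  # diverse background
```

With the membrane-rich background, 15 of 20 is exactly what chance predicts and the p-value is large. With the diverse background, the same 15 of 20 is far above expectation and the p-value becomes very small.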