Homework 1 / Introduction
PS! Your programming tasks for this homework must be done in R, Perl, or Python. Other
programming languages are not accepted, and homework written in them will not be graded. Each
script must be runnable as a system command on Linux (e.g. python myscript.py
input.txt or Rscript myscript.R input.txt). Each submitted script needs to display your
name at the beginning as the author of the script. Make a clear distinction between the exercises and
format the output in a clear and understandable way. If for some reason an exercise is
not completed, write to the output: "5. Task not completed due to ..." and, if you like, describe why
it wasn't completed: too difficult, not understandable, too little time, etc. (you won't get any points,
but it will be helpful for the future). You can also print any comments and remarks
about the task to the screen, such as observations that you made. For example: "Only two out of the three
polypeptide sequences would provide a complete protein, because the third one contains stop codons
within its sequence". Such remarks will help you get full points and show how well you understand
the problems and solutions. Provide the answers to the general questions in
general_questions.pdf. Be sure to include your name in the pdf.
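The requirements above (run from the command line with the input file as an argument, show the author's name first, separate the exercises clearly, report unfinished tasks) could be met with a skeleton like the following. This is only a minimal sketch: the names AUTHOR, task_header and not_completed are placeholders chosen for illustration and are not prescribed by the assignment.

```python
import sys

AUTHOR = "Firstname Lastname"  # placeholder: replace with your own name


def task_header(number, title):
    """Return a header line that visually separates one exercise from the next."""
    return f"\n=== Task {number}: {title} ==="


def not_completed(number, reason):
    """Return the message the assignment asks for when a task is skipped."""
    return f"{number}. Task not completed due to {reason}"


def main(argv):
    # argv[0] is the script name; argv[1] should be the input file.
    if len(argv) != 2:
        print(f"Usage: python {argv[0]} input.txt")
        return 1
    print(f"Author: {AUTHOR}")
    print(f"Input file: {argv[1]}")
    print(task_header(2, "nucleotide composition"))
    print("(results for task 2 go here)")
    # Example of reporting an unfinished exercise:
    print(not_completed(5, "too little time"))
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Saved as myscript.py, the sketch would be run as python myscript.py input.txt; it does not open the file yet, it only shows the output layout.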
General questions
1. What is the Central Dogma of Molecular Biology?
2. What is the difference between DNA and RNA?
3. In which direction is the RNA strand synthesized during transcription, from 5' to 3' or from
3' to 5'? Explain the coding strand and the template strand.
4. How many chromosomes are in a single human cell?
5. How many nucleotides are in the human genome?
6. Is the human genome made of DNA or RNA?
7. What is the difference between eukaryotic and prokaryotic cells?
Programming tasks
1. Save the following DNA sequence to your computer and name it input.txt.
>randomsequence
AGATGTTAAAAAGTACAATTATGAATGGAATATGAATGGAGCTTTTGTAAAAATTTATCATGAAAAATAATGAGTCTTTGAGTTTT
AAATATAGAAAGCAATTAGTGGCACAAATAAAAACAACAAATATTTTGTTATAAAAAATGTTTACGCCAAACCAACATTCTACTGC
AACAGAGAATTCTTCCGAGTATTCGAAGCTGTAAACTTAACTTCTTATCACCATTTTTAAGGGCAAATAATTTCTATAGAAAGTAA
GTTCAAATTAAACTTTTAGCTCTCGCCACAATTTTTCCCAGGAAATGACCTCCTTCTAAATTATATACTCATGCACTTGAAAGTGA
AAACACAGCTTGGAAAGTCAGAGTAAAGAAAGAATGACAAGATCAGGAAATTGGGAAGTTGCTTTAGCACTTTTTTCTTTGTAAAG
AGAGAATATTATCAAGTGTCTAGAGCATATAATAGCTTCAAAGCATAGTTGAGAAAAGATCACTATGGAGCTGAAGGTAGAGAACT
GGGTAGTGTGACCACTTGATCCAGTCAGTGCTTAGGATAAAATGAACTGTCATGATCTAGAAAGCTGTTTTTGTTGTAAATCCCAA
GCCCCTCACTATGCTCAATATTCTCTCCTTGAAGCATTAAGTAAAACTAATCAAGGAAAATGGAAGGCTTGGCATTTTAGCTGATG
AGAATTCACTAGCTGGATTACTGTGTGGTAGAGGGAGGTGATTAGCACCTGTGAGAACAGAACGCAGTGTCATACTGGTGGAGGGA
AAGCAATAGTAATATGTTCCCTTCCTTTCTCATTTTAAGTGGAGTGGCCTGCTATCAGCTACCTATCCAAGGTTAAGCAAAAGAGA
GGGGAAAAAAAGGGAGTGGGGTAATGTAAGACTGATAATTTGGTATACTGGCCAAATCATAAAACAATCATGGGGAACAACACAGG
GTGGAGAGGTTTTAATTATAGACAAGTGTAGTGA
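The saved file is in FASTA format: one header line starting with ">" followed by the sequence split over several lines. One possible way to load it into a single string is sketched below; the function names parse_fasta and read_fasta are illustrative, and the sketch assumes the file holds exactly one sequence record.

```python
def parse_fasta(lines):
    """Return the sequence from FASTA-formatted lines as one uppercase string.

    Header lines (starting with '>') and blank lines are skipped, and the
    remaining lines are joined so that line breaks do not end up in the sequence.
    """
    parts = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        parts.append(line.upper())
    return "".join(parts)


def read_fasta(path):
    """Open a FASTA file and return its sequence (assumes a single record)."""
    with open(path) as handle:
        return parse_fasta(handle)
```

Calling read_fasta("input.txt") would then give the whole DNA sequence without the ">randomsequence" header or newline characters, which matters because stray newlines would otherwise distort the nucleotide percentages.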
2. Write a script (name it general_script.*) that calculates the percentage of each
nucleotide in input.txt and generates the corresponding RNA sequence of the
DNA in input.txt (PS! Do not output the RNA sequence on the screen or to a file, just
store it in a variable; treat the DNA in input.txt as the coding strand, written in the
5' to 3' direction). Determine the most overrepresented nucleotide of each
sequence. As output, display the percentages of all nucleotides of both sequences and
the most overrepresented nucleotide on the screen, in the form shown below or in
any convenient but understandable form of your choosing.
Example output (the values are illustrative):
DNA -> A = 25%, T = 17%, G = 33%, C = 25%, most overrepresented is nucleotide G
RNA -> A = 25%, U = 17%, G = 33%, C = 25%, most overrepresented is nucleotide G
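Because the DNA is treated as the coding strand, the RNA has the same sequence with every T replaced by U, so no reverse complement is needed. A minimal sketch of task 2 under that assumption follows; the helper names are illustrative, and the ten-base string is just the start of the sequence from task 1, used to keep the example short.

```python
from collections import Counter


def transcribe(dna):
    """Return the RNA matching a DNA coding strand: same bases, T replaced by U."""
    return dna.upper().replace("T", "U")


def composition(seq, alphabet):
    """Return {base: percentage} for every base in alphabet."""
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {base: 100 * counts[base] / len(seq) for base in alphabet}


def most_overrepresented(percentages):
    """Return all bases that share the highest percentage (ties are kept)."""
    top = max(percentages.values())
    return [base for base, pct in percentages.items() if pct == top]


def format_line(label, percentages):
    """Format one output line in the style of the example output."""
    parts = ", ".join(f"{base} = {pct:.0f}%" for base, pct in percentages.items())
    winners = "/".join(most_overrepresented(percentages))
    return f"{label} -> {parts}, most overrepresented is nucleotide {winners}"


dna = "AGATGTTAAA"  # first ten bases of the sequence in task 1
rna = transcribe(dna)  # kept in a variable only, never printed
print(format_line("DNA", composition(dna, "ATGC")))
print(format_line("RNA", composition(rna, "AUGC")))
```

For this short input the first line printed is "DNA -> A = 50%, T = 30%, G = 20%, C = 0%, most overrepresented is nucleotide A". Note that the RNA percentages necessarily equal the DNA ones, with U taking the place of T.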
3. Update your script so that it calculates the GC% of the DNA in input.txt.
Display the GC%.
Hint: GC% = ( ( No. of G's in the sequence + No. of C's in the sequence ) /
sequence length ) * 100
Example output: DNA sequence GC% -> 58%
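The G and C counts have to be added together before dividing by the sequence length; dividing only the C count would give a wrong result. A small sketch of the calculation, with gc_percent as an illustrative name:

```python
def gc_percent(dna):
    """Return the percentage of G and C bases in a DNA sequence."""
    dna = dna.upper()
    if not dna:
        raise ValueError("empty sequence")
    # Sum G and C first, then divide by the length, then scale to percent.
    return 100 * (dna.count("G") + dna.count("C")) / len(dna)


print(f"DNA sequence GC% -> {gc_percent('ATTTCGGG'):.0f}%")
```

For the eight-base string ATTTCGGG (three G's and one C) this prints "DNA sequence GC% -> 50%".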
4. Update your script so that it generates a protein sequence from the RNA sequence
generated in task 2. Display both the RNA sequence and the resulting protein sequence. Then
generate protein sequences from the same RNA sequence with a frameshift of 1 and
a frameshift of 2, meaning that codon matching starts from the second or the third nucleotide. Your
output can look something like the example below, or you can use your own formatting. PS! Display
only the first 21 nucleotides and the corresponding amino acids.
No frameshift
CUC CAA AGC GUC AGC UUU AAC
"L" "Q" "S" "V" "S" "F" "N"
Frameshift by 1 nucleotide
UCC AAA GCG UCA GCU UUA ACG
"S" "K" "A" "S" "A" "L" "T"
Frameshift by 2 nucleotides
CCA AAG CGU CAG CUU UAA CGG
"P" "K" "R" "Q" "L" "*" "R"
5. Update your script to calculate the percentage of each amino acid in the polypeptide
sequences from task 4. Display the most abundant amino acid for each frameshift on the
screen. For example:
No frameshift: The most abundant amino acid was "L" with 15%.
Frameshift by 1 nucleotide: The most abundant amino acid was "Q" with 31%.
Frameshift by 2 nucleotides: The most abundant amino acid was "R" with 8%.
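Counting amino acids works the same way as counting nucleotides. The sketch below assumes that the stop symbol "*" should not count as an amino acid and that ties should all be reported; both are choices made for this illustration, not requirements stated in the task, and the three short polypeptides are the ones from the example output of task 4.

```python
from collections import Counter


def most_abundant(protein):
    """Return (list of most abundant amino acids, their percentage)."""
    residues = protein.replace("*", "")  # assumption: stop symbols are ignored
    if not residues:
        raise ValueError("no amino acids to count")
    counts = Counter(residues)
    top = max(counts.values())
    winners = sorted(aa for aa, n in counts.items() if n == top)
    return winners, 100 * top / len(residues)


proteins = {
    "No frameshift": "LQSVSFN",
    "Frameshift by 1 nucleotide": "SKASALT",
    "Frameshift by 2 nucleotides": "PKRQL*R",
}
for label, protein in proteins.items():
    winners, pct = most_abundant(protein)
    names = "/".join(winners)
    print(f'{label}: The most abundant amino acid was "{names}" with {pct:.0f}%.')
```

In the second polypeptide, A and S both occur twice, so the sketch reports them together as "A/S" rather than picking one arbitrarily.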
6. Write a new script (name it sequence_identifier.*) that takes an arbitrary sequence as
input (like Rscript myscript.R "ATTTCGGG") and identifies whether the sequence represents
DNA, RNA or a protein. If the sequence is DNA or RNA,
calculate the nucleotide composition (percentage of the different nucleotides in the sequence)
and the GC%, as in programming tasks 2 and 3, and determine which is/are the most
overrepresented nucleotide(s). If the sequence is a protein, determine the most
overrepresented amino acid in the polypeptide chain. Display the results on the screen.
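A simple heuristic for the identification step is to compare the set of letters in the input with the DNA, RNA and protein alphabets. The sketch below covers only that step and rests on stated assumptions: the protein alphabet is the 20 standard one-letter codes plus "*", and an ambiguous input such as "ACG" (valid in all three alphabets) is reported as DNA because DNA is checked first. The function name identify is illustrative.

```python
import sys

DNA_LETTERS = set("ACGT")
RNA_LETTERS = set("ACGU")
# Assumption: 20 standard amino acids plus "*" for a stop codon.
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY*")


def identify(seq):
    """Guess whether seq is DNA, RNA or protein from the letters it contains."""
    letters = set(seq.upper())
    if not letters:
        return "unknown"
    # Order matters: DNA and RNA letters are also valid amino acid codes
    # (except U), so the narrower alphabets are tested first.
    if letters <= DNA_LETTERS:
        return "DNA"
    if letters <= RNA_LETTERS:
        return "RNA"
    if letters <= PROTEIN_LETTERS:
        return "protein"
    return "unknown"


def main(argv):
    if len(argv) != 2:
        print(f'Usage: python {argv[0]} "ATTTCGGG"')
        return 1
    print(f"{argv[1]} looks like: {identify(argv[1])}")
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Once the type is known, the composition, GC% and most overrepresented letter can be computed with the same counting approach as in tasks 2, 3 and 5. Mentioning the ambiguity of short inputs in the script's output is the kind of remark the introduction asks for.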
Bonus question
1. Search the internet for information about the NANOG gene. Try to find out
the genome coordinates of the NANOG gene (chromosome and coordinates on the
chromosome) and how many different transcripts the NANOG gene has. Also try to briefly
describe the function of the NANOG gene. Output the protein sequence of the shortest
transcript of the NANOG gene. Provide the answers in bonus.pdf and be sure to include your
name in the pdf.