Homework 1 / Introduction

PS! Your programming task for the homework must be done in R, Perl, or Python. Other programming languages are not accepted, and homework written in them will not be graded. The script must be runnable with a system command in Linux (e.g. python myscript.py input.txt or Rscript myscript.R input.txt). Each submitted script must display your name at the beginning as the author of the script. Make a clear distinction between the exercises and format the output in a clear and understandable way.

If for some reason an exercise is not completed, write to the output: "5. Task not completed due to ..." and, if you like, describe why it wasn't completed: too difficult, not understandable, too little time, etc. (you won't get any points, but it will be helpful for the future).

You can also print comments and remarks about the task, such as observations that you made. For example: "Only two out of the three polypeptide sequences would provide a complete protein, for the third one contains stop codons within its sequence". These remarks will help you get full points and show how well you understand the problems and solutions.

Provide the answers to the general questions in general_questions.pdf. Be sure to include your name in the pdf.

General questions

1. What is the Central Dogma of Molecular Biology?
2. What is the difference between DNA and RNA?
3. In which direction does transcription proceed on the RNA strand, from 5' to 3' or from 3' to 5'? Explain the coding strand and the template strand.
4. How many chromosomes are there in a single human cell?
5. How many nucleotides are there in the human genome?
6. Is the human genome made of DNA or RNA?
7. What is the difference between eukaryotic and prokaryotic cells?

Programming tasks

1. Save the following DNA sequence to your computer and name it input.txt.
>randomsequence
AGATGTTAAAAAGTACAATTATGAATGGAATATGAATGGAGCTTTTGTAAAAATTTATCATGAAAAATAATGAGTCTTTGAGTTTT
AAATATAGAAAGCAATTAGTGGCACAAATAAAAACAACAAATATTTTGTTATAAAAAATGTTTACGCCAAACCAACATTCTACTGC
AACAGAGAATTCTTCCGAGTATTCGAAGCTGTAAACTTAACTTCTTATCACCATTTTTAAGGGCAAATAATTTCTATAGAAAGTAA
GTTCAAATTAAACTTTTAGCTCTCGCCACAATTTTTCCCAGGAAATGACCTCCTTCTAAATTATATACTCATGCACTTGAAAGTGA
AAACACAGCTTGGAAAGTCAGAGTAAAGAAAGAATGACAAGATCAGGAAATTGGGAAGTTGCTTTAGCACTTTTTTCTTTGTAAAG
AGAGAATATTATCAAGTGTCTAGAGCATATAATAGCTTCAAAGCATAGTTGAGAAAAGATCACTATGGAGCTGAAGGTAGAGAACT
GGGTAGTGTGACCACTTGATCCAGTCAGTGCTTAGGATAAAATGAACTGTCATGATCTAGAAAGCTGTTTTTGTTGTAAATCCCAA
GCCCCTCACTATGCTCAATATTCTCTCCTTGAAGCATTAAGTAAAACTAATCAAGGAAAATGGAAGGCTTGGCATTTTAGCTGATG
AGAATTCACTAGCTGGATTACTGTGTGGTAGAGGGAGGTGATTAGCACCTGTGAGAACAGAACGCAGTGTCATACTGGTGGAGGGA
AAGCAATAGTAATATGTTCCCTTCCTTTCTCATTTTAAGTGGAGTGGCCTGCTATCAGCTACCTATCCAAGGTTAAGCAAAAGAGA
GGGGAAAAAAAGGGAGTGGGGTAATGTAAGACTGATAATTTGGTATACTGGCCAAATCATAAAACAATCATGGGGAACAACACAGG
GTGGAGAGGTTTTAATTATAGACAAGTGTAGTGA

2. Write a script (name it general_script.*) that calculates the percentage of each nucleotide in input.txt and generates the RNA sequence corresponding to the DNA in input.txt. (PS! Do not output the RNA sequence on the screen or to a file, just generate it in a variable. Treat the DNA in input.txt as the coding strand, which means it is given in the 5' to 3' direction.) Determine the most overrepresented nucleotide of both sequences. As output, display the percentage of all nucleotides of both sequences and show the most overrepresented nucleotide on the screen in the form shown below or in any convenient but understandable form of your choosing.

Example output:
DNA -> A = 25%, T = 15%, G = 33%, C = 25%, most overrepresented is nucleotide G
RNA -> A = 25%, U = 17%, G = 33%, C = 23%, most overrepresented is nucleotide G

3. Update your script so that it calculates the GC% of the DNA in input.txt. Display the GC%.

Hint: GC% = ( (No. of G's in the sequence + No. of C's in the sequence) / sequence length ) * 100

Example output:
DNA sequence GC% -> 58%

4. Update your script so that it generates a protein sequence from the RNA sequence generated in task 2. Display both the RNA sequence and the resulting protein sequence. Now generate protein sequences from the same RNA sequence but with frameshift 1 and frameshift 2, meaning that you start matching up codons from the second or third nucleotide. Your output can look something like the example below, or you can use your own formatting. PS! Display only the first 21 nucleotides and the corresponding amino acids.

No frameshift
CUC CAA AGC GUC AGC UUU AAC
"L" "Q" "S" "V" "S" "F" "N"
Frameshift by 1 nucleotide
UCC AAA GCG UCA GCU UUA ACG
"S" "K" "A" "S" "A" "L" "T"
Frameshift by 2 nucleotides
CCA AAG CGU CAG CUU UAA CGG
"P" "K" "R" "Q" "L" "*" "R"

5. Update your script to calculate the percentage of each amino acid in the polypeptide sequences from task 4. Display the most abundant amino acid for each frameshift on the screen. For example:

No frameshift: The most abundant amino acid was "L" with 15%.
Frameshift by 1 nucleotide: The most abundant amino acid was "Q" with 31%.
Frameshift by 2 nucleotides: The most abundant amino acid was "R" with 8%.

6. Write a new script (name it sequence_identifier.*) that takes a random sequence as input (like Rscript myscript.R "ATTTCGGG") and identifies whether the sequence represents DNA, RNA or a protein. If the sequence is DNA or RNA, calculate the nucleotide composition (the percentage of each nucleotide in the sequence) and the GC% as in programming tasks 2 and 3. Also determine which nucleotide or nucleotides are the most overrepresented. If the sequence is a protein, determine the most overrepresented amino acid in the polypeptide chain. Display the results on the screen.

Bonus question

1. Search the internet and try to find information about the NANOG gene.
Try to find out the genome coordinates of the NANOG gene (the chromosome and the coordinates on the chromosome) and how many different transcripts the NANOG gene has. Also try to describe briefly the function of the NANOG gene. Output the protein sequence of the shortest transcript of the NANOG gene. Provide the answers in bonus.pdf and be sure to include your name in the pdf.
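
To illustrate the GC% hint in task 3 and the reading frames in task 4, here is a minimal Python sketch. It is not a reference solution. The function names (transcribe, percentages, gc_percent, translate) are hypothetical and chosen only for this illustration. The toy sequence is reconstructed from the example codons in task 4 and is not the sequence in input.txt. The sketch assumes uppercase, non-empty input and uses the standard genetic code.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional U, C, A, G codon ordering.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the RNA for a coding-strand DNA sequence (T is replaced by U)."""
    return dna.upper().replace("T", "U")


def percentages(seq):
    """Return {symbol: percentage of the sequence} for every symbol present."""
    counts = Counter(seq)
    return {symbol: 100 * n / len(seq) for symbol, n in counts.items()}


def gc_percent(seq):
    """GC% = (number of G + number of C) / sequence length * 100."""
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


def translate(rna, frame=0):
    """Translate the complete codons starting at offset `frame` (0, 1 or 2)."""
    # Stop at len(rna) - 2 so that a trailing partial codon is ignored.
    codons = (rna[i:i + 3] for i in range(frame, len(rna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


if __name__ == "__main__":
    dna = "CTCCAAAGCGTCAGCTTTAACGG"  # toy sequence, not the homework input
    rna = transcribe(dna)
    print("GC% ->", round(gc_percent(dna)), "%")
    for frame in range(3):
        print("Frameshift by", frame, "->", translate(rna, frame))
```

Shifting the start offset by one or two nucleotides changes every codon that follows, which is why the three frames of the same RNA give different amino acid sequences. A stop codon is printed as "*", matching the example in task 4.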