Practical 1

If you haven't already downloaded and installed R on your computer, you should do it now. The easiest way is to go to http://www.rstudio.com/ and install the latest version of RStudio. R is a language for statistical computing and graphics that provides a variety of statistical and graphical techniques for data analysis. The Bioconductor project (http://www.bioconductor.org/) provides extensions to R in the form of packages that can be used as tools for the analysis and comprehension of high-throughput genomic data. Throughout this course we encourage you to use R as the primary environment for data analysis.

1. Generate a random DNA sequence of length 100 nucleotides.

2. Calculate the count and percentage of every nucleotide in the generated sequence (similar to the table below). Which nucleotide(s) are the most overrepresented?

     Nucleotide Count Percentage
   1          A    21       0.21
   2          C    23       0.23
   3          G    28       0.28
   4          T    28       0.28

3. Calculate the GC and AT content of the generated sequence. Consider: is your sequence likely to be more or less stable, and why?
   Hint: GC% = (No. of G's in the sequence + No. of C's in the sequence) / sequence length * 100

4. Repeat steps 1 and 2 for an RNA sequence.

5. Repeat steps 1 and 2 for a protein sequence by generating a polypeptide of length 100 amino acids and retrieving the most overrepresented amino acid in the sequence.

6. From the DNA sequence generated in step 1, generate the corresponding RNA sequence. Treat the DNA sequence as given in the 3' to 5' direction, which means that it acts as the template strand.

7. From the RNA sequence generated in step 6, generate the corresponding polypeptide sequence. Now introduce a frameshift by generating new polypeptide chains that start from the second and from the third nucleotide. Ignore any trailing nucleotides that do not make up a complete codon.
   The result can look something like this:

   no frameshift
   CUC CAA AGC GUC AGC UUU AAC GGG UAU GAU AUA
   "L" "Q" "S" "V" "S" "F" "N" "G" "Y" "D" "I"
   CCG GUA AGG AUG AAA UAU CUC UCA GCC CUG CAU
   "P" "V" "R" "M" "K" "Y" "L" "S" "A" "L" "H"
   ACU UAC CUU UGA AAA AAC UGU CUU GUU UCU UUA
   "T" "Y" "L" "*" "K" "N" "C" "L" "V" "S" "L"

   frameshift by 1 nucleotide
   UCC AAA GCG UCA GCU UUA ACG GGU AUG AUA UAC
   "S" "K" "A" "S" "A" "L" "T" "G" "M" "I" "Y"
   CGG UAA GGA UGA AAU AUC UCU CAG CCC UGC AUA
   "R" "*" "G" "*" "N" "I" "S" "Q" "P" "C" "I"
   CUU ACC UUU GAA AAA ACU GUC UUG UUU CUU UAA
   "L" "T" "F" "E" "K" "T" "V" "L" "F" "L" "*"

   frameshift by 2 nucleotides
   CCA AAG CGU CAG CUU UAA CGG GUA UGA UAU ACC
   "P" "K" "R" "Q" "L" "*" "R" "V" "*" "Y" "T"
   GGU AAG GAU GAA AUA UCU CUC AGC CCU GCA UAC
   "G" "K" "D" "E" "I" "S" "L" "S" "P" "A" "Y"
   UUA CCU UUG AAA AAA CUG UCU UGU UUC UUU
   "L" "P" "L" "K" "K" "L" "S" "C" "F" "F"

   Note how the protein sequence changes completely when a frameshift is introduced. Do you also see any proteins that would not be translated to the end?

8. Create a script that takes a random DNA, RNA or protein sequence as input and then decides the type of the sequence. For example: "The input sequence was a DNA sequence".

Using the EMBOSS package

For details, see http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/

1. Log into the server (use a terminal, or PuTTY on Windows):

    ssh [email protected]

2. Go to the folder bioinfo_course_2012/1_Introduction/

3. Create your own folder with a name that you can recognize. All your files and working scripts should be kept in that folder.

4. Copy the *.seq files from the folder bioinfo_course_2013/1_Introduction/ to your own folder.

5. Explore and run the following commands and try to explain what happens:

    wossname --help
    wossname dna
    infoseq --help
    infoseq dna.seq
    sixpack --help
    sixpack dna.seq

6. Try to find an EMBOSS function that counts the number of words of a specified length. Which word of size 4 occurs most often in the file dna.seq?

7. Analyze the protein.seq file with the function pepinfo. What do you see?

8. Try to find a function that creates the reverse complement of a nucleotide sequence. Explore the reverse complement sequence.

9. Find a function that creates a codon usage table from a nucleotide sequence. Use the dna.seq file as the nucleotide sequence and create the output of the function.

10. Explore the following websites. What can you find there?

    http://biit.cs.ut.ee/
    http://www.ebi.ac.uk/citexplore/
    http://www.ncbi.nlm.nih.gov/pubmed/
    http://www.ensembl.org/index.html
    http://www.rcsb.org/pdb/home/home.do
    http://www.ncbi.nlm.nih.gov/taxonomy
    http://tolweb.org/tree/
    http://string-db.org/
    http://emboss.sourceforge.net/
    http://emboss.sourceforge.net/docs/emboss_tutorial/emboss_tutorial.html
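
A note on the GC% hint from step 3 of the R part: the numbers of G's and C's are added first, and only then is the sum divided by the sequence length. The following is a minimal sketch in base R, not a reference solution. The names gc_percent and example_seq are hypothetical, and the sequence is rebuilt from the counts in the example table of step 2 rather than generated at random.

```r
# Hypothetical helper (not part of the practical): GC content in percent.
# `nucleotides` is a character vector with one nucleotide per element.
gc_percent <- function(nucleotides) {
  gc_count <- sum(nucleotides %in% c("G", "C"))  # No. of G's + No. of C's
  100 * gc_count / length(nucleotides)           # sum first, then divide by length
}

# Rebuild a sequence with the counts from the example table in step 2
# (A = 21, C = 23, G = 28, T = 28). The order of the letters does not
# affect the GC content, so the letters are simply repeated in blocks.
example_seq <- rep(c("A", "C", "G", "T"), times = c(21, 23, 28, 28))

gc <- gc_percent(example_seq)  # GC content in percent
at <- 100 - gc                 # AT content, assuming only A, C, G and T occur

cat("GC%:", gc, " AT%:", at, "\n")
```

For the example table this gives a GC content of 51% and an AT content of 49%. The same function should work unchanged on a sequence produced in step 1, provided it is stored as a character vector of single letters.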