Practical 1

If you haven't already downloaded and installed R on your computer, you should do it now. The easiest way is to go to http://www.rstudio.com/ and install the latest version of RStudio (RStudio is a development environment for R, so R itself must be installed as well). R is a language for statistical computing and graphics that provides a variety of statistical and graphical techniques for data analysis. The Bioconductor project (http://www.bioconductor.org/) provides extensions to R in the form of packages that can be used as tools for the analysis and comprehension of high-throughput genomic data. Throughout this course we encourage you to use R as the primary environment for data analysis.

1. Generate a random DNA sequence of length 100 nucleotides.
2. Calculate the count and percentage of every nucleotide in the generated sequence (similar to the output below). Which nucleotide(s) are the most overrepresented?

  Nucleotide Count Percentage
1          A    21       0.21
2          C    23       0.23
3          G    28       0.28
4          T    28       0.28

3. Calculate the GC and AT content of the generated sequence. Contemplate: is your sequence likely to be more or less stable, and why? Hint: GC% = ((No. of G's in the sequence + No. of C's in the sequence) / sequence length) * 100
4. Repeat steps 1 and 2 for an RNA sequence.
5. Repeat steps 1 and 2 for a protein sequence by generating a polypeptide of 100 amino acids and retrieving the most overrepresented amino acid in the sequence.
6. From the DNA sequence generated in step 1, generate the corresponding RNA sequence. Consider the DNA sequence to be given in the 3' to 5' direction, which means that the DNA acts as the template strand.
7. From the RNA sequence generated in step 6, generate the corresponding polypeptide sequence. Now make a frameshift by generating new polypeptide chains that start from the second and from the third nucleotide. Ignore the last nucleotides that do not add up to a codon. The result can look something like this:

no frameshift
CUC CAA AGC GUC
"L" "Q" "S" "V"
frameshift by 1
UCC AAA GCG UCA
"S" "K" "A" "S"
frameshift by 2
CCA AAG CGU CAG
"P" "K" "R" "Q"

no frameshift
AGC UUU AAC
"S" "F" "N"
frameshift by 1 nucleotide
GCU UUA ACG
"A" "L" "T"
frameshift by 2 nucleotides
CUU UAA CGG
"L" "*" "R"

no frameshift
GGG UAU GAU AUA CCG GUA AGG AUG AAA UAU CUC UCA GCC CUG CAU ACU UAC CUU UGA AAA AAC UGU CUU GUU UCU UUA
"G" "Y" "D" "I" "P" "V" "R" "M" "K" "Y" "L" "S" "A" "L" "H" "T" "Y" "L" "*" "K" "N" "C" "L" "V" "S" "L"
frameshift by 1 nucleotide
GGU AUG AUA UAC CGG UAA GGA UGA AAU AUC UCU CAG CCC UGC AUA CUU ACC UUU GAA AAA ACU GUC UUG UUU CUU UAA
"G" "M" "I" "Y" "R" "*" "G" "*" "N" "I" "S" "Q" "P" "C" "I" "L" "T" "F" "E" "K" "T" "V" "L" "F" "L" "*"
frameshift by 2 nucleotides
GUA UGA UAU ACC GGU AAG GAU GAA AUA UCU CUC AGC CCU GCA UAC UUA CCU UUG AAA AAA CUG UCU UGU UUC UUU
"V" "*" "Y" "T" "G" "K" "D" "E" "I" "S" "L" "S" "P" "A" "Y" "L" "P" "L" "K" "K" "L" "S" "C" "F" "F"

Note how the protein sequence changes completely when a frameshift is introduced. Also, do you see any proteins that would not be translated to the end?
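
The logic behind steps 3, 6 and 7 can be sketched in a few lines. The practical itself asks for R; the sketch below is written in Python purely as an illustration, and the function names (`gc_content`, `transcribe_template`, `translate`) and the short template strand are made up for this example. It assumes the standard genetic code and a template strand written in the 3' to 5' direction, so the RNA is simply the base-by-base complement.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def gc_content(seq):
    """GC% = (number of G + number of C) / sequence length * 100."""
    return (seq.count("G") + seq.count("C")) / len(seq) * 100


def transcribe_template(dna_3to5):
    """Complement a template strand written 3'->5' into RNA written 5'->3'."""
    pairs = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in dna_3to5)


def translate(rna, offset=0):
    """Translate codon by codon from `offset`, ignoring an incomplete last codon."""
    usable = len(rna) - (len(rna) - offset) % 3
    return [CODON_TABLE[rna[i:i + 3]] for i in range(offset, usable, 3)]


template = "GAGGTTTCGCAGTC"  # hypothetical template strand, 3'->5'
rna = transcribe_template(template)
print(rna, round(gc_content(template), 1))
for offset in range(3):
    # offset 0 = no frameshift, 1 and 2 = frameshift by one or two nucleotides
    print(offset, "".join(translate(rna, offset)))
```

With this template the three reading frames give LQSV, SKAS and PKRQ, which matches the first example above. A "*" in the output marks a stop codon, which is where translation would end.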
8. Create a script that takes a random DNA, RNA or protein sequence as input and then decides the type of the sequence. For example: "The input sequence was a DNA sequence".

Using the EMBOSS package

For details see http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/

1. Log into the server (use a terminal, or PuTTY on Windows)
ssh [email protected]
2. Go to folder
bioinfo_course_2012/1_Introduction/
3. Create your own folder with a name that you can recognize. All your files and
working scripts should be kept in that folder.
4. Copy *.seq files from the folder
bioinfo_course_2013/1_Introduction/ to your own folder.
5. Explore and run the following commands and try to explain what happens
wossname --help
wossname dna
infoseq --help
infoseq dna.seq
sixpack --help
sixpack dna.seq
6. Try to find an EMBOSS program that counts the number of words of a specified
length. What is the most frequent word of size 4 in the file dna.seq?
7. Analyze the protein.seq file with the program pepinfo. What do you see?
8. Try to find a program that creates the reverse complement of a nucleotide sequence.
Explore the reverse complement sequence.
9. Find a program that creates a codon usage table from a nucleotide sequence. Use the
dna.seq file as the nucleotide sequence and create the output of the program.
10. Explore the following websites. What can you find there?
http://biit.cs.ut.ee/
http://www.ebi.ac.uk/citexplore/
http://www.ncbi.nlm.nih.gov/pubmed/
http://www.ensembl.org/index.html
http://www.rcsb.org/pdb/home/home.do
http://www.ncbi.nlm.nih.gov/taxonomy
http://tolweb.org/tree/
http://string-db.org/
http://emboss.sourceforge.net/
http://emboss.sourceforge.net/docs/emboss_tutorial/emboss_tutorial.html
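
The results of steps 6 and 8 of the EMBOSS part can be cross-checked by hand on a small sequence. The following Python sketch is only an illustration and not part of EMBOSS: the function names and the toy sequence are made up, and it assumes that words are counted in an overlapping fashion and that the sequence contains only the letters A, C, G and T.

```python
from collections import Counter


def count_words(seq, size):
    """Count every overlapping word (k-mer) of the given size."""
    return Counter(seq[i:i + size] for i in range(len(seq) - size + 1))


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


toy = "ACGTACGTAC"  # made-up sequence, not the contents of dna.seq
print(count_words(toy, 4).most_common(1))
print(reverse_complement(toy))
```

For the toy sequence the words ACGT, CGTA and GTAC each occur twice and TACG once, and the reverse complement is GTACGTACGT.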