it_health_summary - Center for Biological Sequence Analysis
It & Health 2009 Summary
Thomas Nordahl Petersen

Teachers
Thomas Nordahl Petersen
Bent Petersen
Rasmus Wernersson
Ramneek Gupta
Lisbeth Nielsen Fink
Thomas Blicher
Anders Gorm Pedersen

Outline of the course
• Topics will cover a general introduction to bioinformatics
  – Evolution
  – DNA / Protein
  – Alignment and scoring matrices
    • How does it work & what are the numbers
  – Visualization of multiple alignments
    • Phylogenetic trees and logo plots
  – Commonly used databases
    • Uniprot/Genbank & Genome browsers
  – Protein 3D-structure
  – Artificial neural networks & case stories
  – Practical use of bioinformatics tools
    • Preparation for exam

Topics covered (some of them)

Information flow in biological systems

Amino Acids
Amine and carboxyl groups. The side chain ‘R’ is attached to the C-alpha carbon.
The amino acids found in living organisms are L-amino acids.

Amino Acids - peptide bond
N-terminal, C-terminal

1 and 3-letter codes
1. There are 20 naturally occurring amino acids
2. Normally the one/three-letter codes are used
Ala - A    Cys - C    Asp - D    Glu - E    Phe - F
Gly - G    His - H    Ile - I    Lys - K    Leu - L
Met - M    Asn - N    Pro - P    Gln - Q    Arg - R
Ser - S    Thr - T    Val - V    Trp - W    Tyr - Y

Theory of evolution
Charles Darwin 1809-1882
Phylogenetic tree

Global versus local alignments
Global alignment: align the full length of both sequences (the “Needleman-Wunsch” algorithm).
Local alignment: find the best partial alignment of two sequences (the “Smith-Waterman” algorithm).
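The difference between the two alignment modes can be made concrete with a minimal dynamic-programming sketch. It is an illustration only, not the course's implementation: the function name `alignment_score` and the simple scores (match +1, mismatch -1, gap -1) are assumptions chosen for readability, whereas real tools use substitution matrices such as Blosum62 and more elaborate gap penalties. The function returns only the best score, not the alignment itself.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score of strings a and b.

    local=False: global alignment (Needleman-Wunsch style recurrence).
    local=True:  local alignment (Smith-Waterman style recurrence).
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global: leading gaps are penalised, so the borders grow negative.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                # Local: a negative prefix is dropped by restarting at 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    # Global: score of aligning both full sequences (bottom-right cell).
    # Local: highest-scoring cell anywhere in the table.
    return best if local else score[-1][-1]


print(alignment_score("XXXGATTACAYYY", "GATTACA"))              # global
print(alignment_score("XXXGATTACAYYY", "GATTACA", local=True))  # local
```

With these assumed scores, the local alignment rewards the shared GATTACA segment in full, while the global alignment also pays for the six unmatched flanking residues.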
[Figure: local alignment of Seq 1 and Seq 2]

Pairwise alignment: the solution
“Dynamic programming” (the Needleman-Wunsch algorithm)

Sequence alignment - Blast

Blosum & PAM matrices
• Blosum matrices are the most commonly used substitution matrices.
• Blosum50, Blosum62, Blosum80
• PAM - Percent Accepted Mutations
• PAM-0 is the identity matrix.
• PAM-1: the diagonal has small deviations from 1, the off-diagonal has small deviations from 0.
• PAM-250 is the PAM-1 matrix raised to the 250th power.

Sequence profiles (1J2J.B)
>1J2J.B mol:aa PROTEIN TRANSPORT
NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK

Log-odds scores
• BLOSUM is a log-likelihood matrix:
• Likelihood of observing j given you have i is
  – $P(j|i) = P_{ij}/Q_i$
• The prior likelihood of observing j is
  – $Q_j$, which is simply the frequency of j
• The log-likelihood score is
  – $S_{ij} = 2\log_2\left(P(j|i)/Q_j\right) = 2\log_2\left(P_{ij}/(Q_i Q_j)\right)$
  – where $\log_2(x) = \log_n(x)/\log_n(2)$ for any base $n$
  – S has been normalized to half bits, hence the factor 2

BLAST Exercise

Genome browsers - UCSC
Intron - Exon structure
Single Nucleotide Polymorphism - SNP
SNPs

Protein 3D-structure

Protein structure
Primary structure: amino acid sequence
Secondary structure: helix / beta sheet
Tertiary structure: fold, 3D coordinates

Protein structure - α-helix
3-10 helix: 3 residues/turn - few, but not uncommon
α-helix: 3.6 residues/turn - by far the most common helix
π-helix (Pi-helix): 4.1 residues/turn - very rare

Protein structure - β-strand/sheet

Protein folds
Class: the 4th class is ‘few secondary structures’
Architecture: overall shape of a domain
Topology: domains share secondary structure connectivity

Neural Networks
From knowledge to information
Protein sequence → Biological feature

Use of artificial neural networks
• A data-driven method to predict a feature, given a set of training data
• In biology, input features could be amino acid or nucleotide sequences
• Secondary structure prediction
• Signal peptide prediction
• Surface accessibility
• Propeptide prediction

[Figure: protein from N- to C-terminus: signal peptide, propeptide, mature/active protein]

Prediction of biological features
Surface accessibility
Predict surface accessibility from the amino acid sequence only.

Logo plots
Information content: how is it calculated, and what does it mean?

Logo plots - Information Content
Calculate the Information Content:
$I = \sum_a p_a \log_2 p_a + \log_2(4)$; the maximal value is 2 bits.

Sequence-logo
[Figure labels: completely conserved; ~0.5 each]
• The total height at a position is the ‘Information Content’ measured in bits.
• The height of a letter is proportional to the frequency of that letter.
• A Logo plot is a visualization of a multiple alignment.
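The information content formula for logo plots can be checked with a short sketch. It assumes a DNA alignment column given as a plain string of letters without gaps (hence the alphabet size of 4 and the 2-bit maximum), and it applies no small-sample correction. The function names `information_content` and `letter_heights` are made up for this illustration.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=4):
    """I = sum_a p_a * log2(p_a) + log2(alphabet_size), in bits."""
    total = len(column)
    # Letters that never occur contribute nothing to the sum.
    entropy_term = sum(
        (n / total) * math.log2(n / total) for n in Counter(column).values()
    )
    return entropy_term + math.log2(alphabet_size)


def letter_heights(column, alphabet_size=4):
    """Split the column's total height among letters by their frequency."""
    total = len(column)
    ic = information_content(column, alphabet_size)
    return {letter: ic * n / total for letter, n in Counter(column).items()}


for col in ("AAAA", "AATT", "ACGT"):
    print(col, information_content(col), letter_heights(col))
```

A completely conserved column reaches the maximum of 2 bits, a column split evenly between two letters carries 1 bit (each letter about 0.5 bits high), and a column with all four letters equally frequent carries no information. For protein alignments the same sketch applies with `alphabet_size=20`.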