PROTEIN STRUCTURE CLASSIFICATION
SUMI SINGH (sxs5729)

Levels of Protein Structure

Traditional Architecture: material/building blocks are WOOD, BRICK etc.
Molecular Architecture: material/building blocks are AMINO ACIDS
Form fits function

• Number of amino acids found in eukaryotic proteins = 20 (found in the universal genetic code) + 2 (synthetically incorporated, therefore not included in this discussion)
• Possible number of protein sequences of length 300 = 20^300 (roughly 10^390)
• This number is greater than the total number of atoms in the universe (commonly estimated at about 10^80)

Thanks to: Frank Lloyd Wright for graphics

Evolution

• Evolution has selected a very small subset of those protein sequences (< 30,000 in humans) and an even smaller number of protein structures (1000–5000). Ratio: 1:6
• Conserved structures are expected to reflect functional similarities (interaction with other molecules)

Why Compare Protein Structures?

Sequence -> Structure -> Function

• Low sequence similarity may still yield very similar structures
• Sometimes high sequence similarity yields different structures

Know your dataset
FOR THE CURRENT PROJECT

PDB: Protein Data Bank

• The Protein Data Bank (PDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids.
• These are the molecules of life that are found in all organisms including bacteria, yeast, plants, flies, other animals, and humans. Understanding the shape of a molecule helps to understand how it works.
• This knowledge can be used to help deduce a structure's role in human health and disease, and in drug development. The structures in the archive range from tiny proteins and bits of DNA to complex molecular machines like the ribosome.
• Web address: http://www.rcsb.org/pdb/home/home.do

SCOP Dataset

“Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin.
The SCOP database, created by manual inspection and abetted by a battery of automated methods, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. As such, it provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification.”

• http://scop.berkeley.edu/

Starting at the bottom, the hierarchy of SCOP domains comprises the following levels:

-- Species: representing a distinct protein sequence.
-- Protein: grouping together similar sequences of essentially the same functions.
-- Family: containing proteins with similar sequences but typically distinct functions.
-- Superfamily: bridging together protein families with common functional and structural features inferred to be from a common evolutionary ancestor.
-- Levels above Superfamily are classified based on structural features and similarity, and do not imply homology:
-- Fold: grouping structurally similar superfamilies.

Structural Fingerprints/Features

• Structure comparison is an NP-hard problem. There are no fast structural alignment algorithms that can guarantee optimality within any given similarity measure. Therefore, existing structure comparison methods employ heuristics.
• There are different approaches for extracting structural features. We use the Triangular Spatial Relationship to generate keys.

HUH!! LOOKS LIKE I HAVE DONE EVERYTHING... SO WHY AM I HERE?

FOR THE CURRENT PROJECT

What do you get from me?

• A file of keys representing each protein structure.
• Each of these key files has been correctly classified into its respective superfamily.
• That gives you the class information for the files. The hypothesis is that files belonging to the same class have similar keys. You must be able to test this hypothesis.
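One way to start testing the hypothesis above is to compare how similar key sets are within a superfamily versus across superfamilies. The sketch below is only an illustration under stated assumptions: it assumes each key file can be read into a set of integer keys, it uses Jaccard similarity as the similarity measure (the slides do not prescribe one), and the protein names, labels, and key values are made-up toy data.

```python
from itertools import combinations
from statistics import mean


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def within_between(keys_by_protein: dict, label_of: dict) -> tuple:
    """Mean pairwise similarity for same-class pairs and different-class pairs."""
    same, diff = [], []
    for p, q in combinations(sorted(keys_by_protein), 2):
        score = jaccard(keys_by_protein[p], keys_by_protein[q])
        (same if label_of[p] == label_of[q] else diff).append(score)
    return (mean(same) if same else float("nan"),
            mean(diff) if diff else float("nan"))


# Toy data, invented for illustration (not real TSR keys or SCOP labels).
toy_keys = {
    "prot_a": {101, 102, 103, 104},
    "prot_b": {101, 102, 103, 205},
    "prot_c": {901, 902, 903, 104},
    "prot_d": {901, 902, 903, 777},
}
toy_labels = {"prot_a": "sf1", "prot_b": "sf1", "prot_c": "sf2", "prot_d": "sf2"}

within, between = within_between(toy_keys, toy_labels)
print(f"mean within-class similarity:  {within:.3f}")
print(f"mean between-class similarity: {between:.3f}")
```

On this toy data the within-class mean is 0.600 and the between-class mean is about 0.036. If the hypothesis holds on the real key files, the within-class value should likewise be clearly higher. Comparing every pair of proteins is quadratic in the number of files, so for a large dataset a random sample of pairs may be more practical.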
Biggest Challenge

COMBINATORIAL EXPLOSION

[Figure: key count versus number of amino acids per protein. x-axis: 0 to 600 amino acids; y-axis: 0 to 30,000,000 keys; one point is labelled 4,410,549.]

For the current project

• Develop SIGNATUREs for the PROTEIN KEYS.
• These SIGNATUREs must be used to CLASSIFY the proteins correctly into their respective SUPERFAMILIES.
• Performance and speed are important.

Signature for Keys

• Accurately and concisely represent the keys.
• Signatures can be simple statistics of the keys, such as the mean or median, or a complex combination of features.
• Whatever signature or signatures are chosen, they must support extremely fast and accurate classification of the proteins.

Choice of Classifier/Tool

• Criteria:
1. ACCURACY
2. SPEED

Final Product

Software that takes “keys” as input and classifies the protein correctly. It must also check whether the keys of a “new” protein already exist in the system.
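The pieces listed above (a signature, a classifier, and a check for already-known proteins) can be sketched end to end as follows. Everything here is an assumption made for illustration: the signature is a handful of simple statistics, the classifier is a nearest-centroid rule, the duplicate check hashes the sorted key list, and the names `signature`, `fingerprint`, and `CentroidClassifier` as well as the toy keys are hypothetical. The slides leave all of these choices open.

```python
import hashlib
import math
from statistics import mean, median, pstdev


def signature(keys: list) -> tuple:
    """Fixed-length summary of a non-empty key list: (log count, mean, median, std dev)."""
    return (math.log(len(keys)), mean(keys), median(keys), pstdev(keys))


def fingerprint(keys: list) -> str:
    """Order-independent hash of the keys, used to detect already-seen proteins."""
    joined = ",".join(str(k) for k in sorted(keys))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


class CentroidClassifier:
    """Nearest-centroid classifier over signatures (an illustrative choice)."""

    def __init__(self):
        self.centroids = {}  # superfamily label -> mean signature
        self.known = {}      # fingerprint -> superfamily label

    def fit(self, training: list) -> None:
        """training is a list of (keys, label) pairs."""
        grouped = {}
        for keys, label in training:
            grouped.setdefault(label, []).append(signature(keys))
            self.known[fingerprint(keys)] = label
        for label, sigs in grouped.items():
            # Average each signature component across the class.
            self.centroids[label] = tuple(mean(col) for col in zip(*sigs))

    def classify(self, keys: list) -> tuple:
        """Return (label, already_known)."""
        fp = fingerprint(keys)
        if fp in self.known:
            return self.known[fp], True
        sig = signature(keys)
        label = min(self.centroids,
                    key=lambda lab: math.dist(sig, self.centroids[lab]))
        return label, False


# Toy training data, invented for illustration (not real TSR keys).
clf = CentroidClassifier()
clf.fit([
    ([100, 110, 120, 130], "sf1"),
    ([105, 115, 125, 135], "sf1"),
    ([900, 910, 920, 930], "sf2"),
    ([905, 915, 925, 935], "sf2"),
])

print(clf.classify([102, 112, 122, 132]))  # unseen keys, nearest centroid decides
print(clf.classify([930, 900, 920, 910]))  # same keys as a training protein
```

The signature has constant size regardless of how many keys a protein has, which is what keeps classification fast as key counts grow into the millions. The components are not rescaled here, so the one with the largest numeric range dominates the distance; normalizing each component would be a sensible next step, and whether such simple statistics separate real superfamilies is exactly what has to be measured.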