Download Lecture_note_463BI

BCH463 Bioinformatics
Md. Ashrafuzzaman, D.Sc. (known as Dr. Ashraf)
Email: [email protected]
Emergency contact cell: 0564174931
Office: 2B10, Bldg # 5, KSU

Bioinformatics
Bio-Informatics: management of biological information using computer technology.
Biological information? Huge!

What kind of info? (structure and mechanism)
• Discovered aspects related to biology
• Literature search using various routes
• Data bank exploration from different international sources
• Biological network data
• Biological structure data
• Data that will help understand the working mechanisms of biological systems
• etc.

Searching Data
• Why search?
• How to search?
• Where to search?
• What is usually done with searched data?
• Who should be a bioinformatician?

A case study
• Bioinformatics-driven search for metabolic biomarkers in disease
• http://www.jclinbioinformatics.com/content/1/1/2
• The search and validation of novel disease biomarkers requires the complementary power of professional study planning and execution, modern profiling technologies and related bioinformatics tools for data analysis and interpretation. Biomarkers have considerable impact on the care of patients and are urgently needed for advancing diagnostics, prognostics and treatment of disease. This survey article highlights emerging bioinformatics methods for biomarker discovery in clinical metabolomics, focusing on the problem of data preprocessing and consolidation, the data-driven search, verification, prioritization and biological interpretation of putative metabolic candidate biomarkers in disease. In particular, data mining tools suitable for the application to omic data gathered from the most frequently used types of experimental designs, such as case-control or longitudinal biomarker cohort studies, are reviewed and case examples of selected discovery steps are delineated in more detail. This review demonstrates that clinical bioinformatics has evolved into an essential element of biomarker discovery, translating new innovations and successes in profiling technologies and bioinformatics to clinical application.

Data sequencing: GenBank
What is GenBank?
GenBank® is the National Institutes of Health (NIH) genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at the National Center for Biotechnology Information (NCBI). These three organizations exchange data on a daily basis.
• As of 2008, there are approximately 100 billion bases in 100 million sequences.
• It started in 1982 with 680,338 base pairs in 606 sequences.
• Consider the growth rate!

How GenBank works
Submissions to GenBank
• Many journals require submission of sequence information to a database prior to publication so that an accession number may appear in the paper. Sequin, NCBI's stand-alone submission software for Mac, PC, and UNIX platforms, is available. When using Sequin, the output files for direct submission should be sent to GenBank by electronic mail.
Updating or Revising a Sequence
• Revisions or updates to GenBank entries can be made at any time and are accepted as BankIt or Sequin files or as the text of an e-mail message.
Access to GenBank
• GenBank is available for searching at NCBI via several methods.
• The GenBank database is designed to provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information. Therefore, NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted.
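The growth figures quoted earlier for GenBank (680,338 base pairs in 1982, roughly 100 billion bases in 2008) can be turned into an approximate doubling time. The Python sketch below assumes smooth exponential growth over those 26 years, which is a simplification, and the function name is made up for this illustration.

```python
import math

def doubling_time_years(start_bases, end_bases, start_year, end_year):
    """Doubling time in years, assuming a constant exponential growth rate."""
    # Number of times the database size doubled between the two dates.
    doublings = math.log2(end_bases / start_bases)
    return (end_year - start_year) / doublings

# Figures from the notes: 680,338 bp in 1982, about 100 billion bases in 2008.
t = doubling_time_years(680_338, 100e9, 1982, 2008)
print(f"Approximate doubling time: {t:.2f} years ({t * 12:.0f} months)")
```

Under this assumption the database doubled about every year and a half, which is the sense of "consider the growth rate" in the notes.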
New Developments
• NCBI is continuously developing new tools and enhancing existing ones to improve both submission and access to GenBank. The easiest way to keep abreast of these and other developments is to check the "What's New" section of the NCBI Web page and to read the NCBI News, which is also available by free subscription.

Various bases of Bioinformatics
• Count Bases at the Fraunhofer IGB, Germany
This system consists of modules that cover sequence analysis (Count Bases – Next-Gen Sequence Assistant), statistics, and visualization (Count Bases Viewer). In a single run, 10^6 to 10^9 DNA fragments with an average sequence length of 30–800 bases are sequenced simultaneously. This results in huge amounts of data that require a storage volume of 10–100 gigabytes.
Sources: genome and proteomic databases

Major research areas
• Sequence analysis
• Genome annotation
• Literature
• Analysis of gene expression and regulation
• Analysis of protein expression
• Mutations in cancer
• etc.

Organisms in GenBank
• 260,000 different species
• 1,000 new species added per month
• The top two species are:
  Human (Homo sapiens): 11,551,000 entries with 13,149,000,000 bases
  Mouse (Mus musculus): 7,256,000 entries with 8,361,230,000 bases

GenBank Format
GenBank format (GenBank Flat File Format) consists of an annotation section and a sequence section.
Annotation section: the start of the annotation section is marked by a line beginning with the word "LOCUS". The only rule now applied in assigning a locus name is that it must be unique.
Sequence section: the start of the sequence section is marked by a line beginning with the word "ORIGIN", and the end of the section is marked by a line containing only "//".

GenBank Flat File Format
LOCUS       AF068625 200 bp mRNA linear ROD 06-DEC-1999
DEFINITION  Mus musculus DNA cytosine-5 methyltransferase 3A (Dnmt3a) mRNA, complete cds.
ACCESSION   AF068625 REGION: 1..200
VERSION     AF068625.2 GI:6449467
KEYWORDS    .
SOURCE      Mus musculus (house mouse)
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muroidea; Muridae; Murinae; Mus.
REFERENCE   1 (bases 1 to 200)
            AUTHORS, TITLE, JOURNAL, etc.
REFERENCE   2 (bases 1 to 200)
            AUTHORS, TITLE, JOURNAL, etc.
  REMARK    Sequence update by submitter
COMMENT     On Nov 18, 1999 this sequence version replaced gi:3327977.
FEATURES             Location/Qualifiers
     source          1..200
                     /organism="Mus musculus"
                     /mol_type="mRNA"
                     /db_xref="taxon:10090"
                     /chromosome="12"
                     /map="4.0 cM"
     gene            1..>200
                     /gene="Dnmt3a"
ORIGIN
        1 gaattccggc ctgctgccgg gccgcccgac ccgccgggcc acacggcaga gccgcctgaa
       61 gcccagcgct gaggctgcac ttttccgagg gcttgacatc agggtctatg tttaagtctt
      121 agctcttgct tacaaagacc acggcaattc cttctctgaa gccctcgcag ccccacagcg
      181 ccctcgcagc cccagcctgc
//

GenBank sequence format
It is a rich format for storing sequences and associated annotations. It shares a feature table vocabulary and format with the EMBL and DDBJ formats.
LOCUS       CAA89576 109 aa linear PLN 11-AUG-1997
DEFINITION  CYC1 [Saccharomyces cerevisiae].
ACCESSION   CAA89576
VERSION     CAA89576.1 GI:1015707
DBSOURCE    embl locus SCYJR048W, accession Z49548.1
KEYWORDS    5-10 or as many as needed
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1 (residues 1 to 109)
            AUTHORS, TITLE, JOURNAL, etc.
REFERENCE   2 (residues 1 to 109)
            AUTHORS, TITLE, JOURNAL, etc.
FEATURES             Location/Qualifiers
     source          1..109
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="X"
     Protein         1..109
                     /name="CYC1"
     CDS             1..109
                     /gene="CYC1"
                     /coded_by="Z49548.1:954..1283"
                     /note="ORF YJR048w"
                     /db_xref="GOA:P00044"
                     /db_xref="SGD:S0003809"
                     /db_xref="UniProtKB/Swiss-Prot:P00044"
ORIGIN
        1 mtefkagsak kgatlfktrc lqchtvekgg phkvgpnlhg ifgrhsgqae gysytdanik
       61 knvlwdennm seyltnpkky ipgtkmafgg lkkekdrndl itylkkace
//

Online Mendelian Inheritance in Man (OMIM) Database
• OMIM (since the 1960s) catalogues all the known diseases with a genetic component and tries to link them to the relevant genes in the human genome.
• In 2004 there were 15,000 records.
• One can request to download the mim2gene.txt file from OMIM here: http://www.omim.org/downloads

The OMIM code
• Every disease and gene is assigned a six-digit number, of which the first digit classifies the method of inheritance.
• If the initial digit is 1, the trait is deemed autosomal dominant; if 2, autosomal recessive; if 3, X-linked.
• Wherever a trait defined in this dictionary has a MIM number, the number from the 12th edition of MIM is given in square brackets, with or without an asterisk as appropriate, e.g., Pelizaeus-Merzbacher disease [MIM #312080] is an X-linked recessive disorder. An asterisk indicates that the mode of inheritance is known; a number symbol (#) before an entry number means that the phenotype can be caused by mutation in any of two or more genes.
• For further studies visit http://www.omim.org

OMIM Example: http://www.omim.org/entry/189911
*189911 TRANSFER RNA GLYCINE 1; TRNAG1
Alternative titles; symbols: TRANSFER RNA GLYCINE-CCC-1; TRG1
Cytogenetic location: Chr.16
Genomic coordinates (GRCh37): 16:0 - 90,354,753 (from NCBI)
TEXT
Mapping
McBride et al. (1989) assigned a glycine tRNA(CCC) gene (TRG1) to human chromosome 1 (1pter-p34) on the basis of Southern analysis of a panel of hybrid cell DNAs.
They also assigned a cloned DNA fragment encompassing a glycine tRNA gene (tRNA-GCC) and pseudogene to human chromosome 16 by the same method.
Evolution
There are about 1,300 tRNA genes in the haploid human genome (Hatlen and Attardi, 1971) encoding 60 to 90 tRNA isoacceptors (Lin and Agris, 1980). The studies by McBride et al. (1989) as well as studies by others (see, e.g., 180620, 189930, 189920, 180640, 189880) indicated that tRNA genes and pseudogenes are dispersed on at least 7 human chromosomes and suggested that these sequences would probably be found on most if not all human chromosomes. McBride et al. (1989) described short, 8-12 nucleotide, direct terminal repeats flanking many of the dispersed tRNA genes. This finding, combined with the dispersion of tRNA genes, suggests that many of these genes may have arisen by an RNA-mediated retroposition mechanism. There may have been selection for reiteration of genes encoding isoaccepting tRNAs, since a single mutation in a single-copy tRNA gene could be devastating. Moreover, even a mutation in the anticodon of a single tRNA gene might not be crucial if competition was provided by the normal 'wildtype' tRNA isoacceptor produced by multiple copies of the normal tRNA gene still present in the genome. Dispersion of multiple copies of each tRNA gene could provide diversity of 5-prime-flanking sequences, which are known to modulate the expression of some human tRNA genes. Tissue-specific or differentiation-specific expression of tRNA isoacceptors might be provided for by this mechanism. The recombination and unequal crossing over that can occur with tandem tRNA sequences can result in homogenization of the sequences with disastrous consequences.

Nucleotide Database
NCBI's sequence databases accept genome data from sequencing projects from around the world and serve as the cornerstone of bioinformatics research.
• GenBank: An annotated collection of all publicly available nucleotide and amino acid sequences.
• EST database: A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA (cDNA).
• GSS database: A database of genome survey sequences, or short, single-pass genomic sequences.
• HomoloGene: A gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.
• HTG database: A collection of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences.
• SNP database: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms.
• RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs, and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support the data-gathering efforts.
• STS database: A database of sequence tagged sites, or short sequences that are operationally unique in the genome.
• UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).
• UniGene: A collection of ESTs and full-length mRNA sequences organized into clusters, each representing a unique known or putative human gene annotated with mapping and expression information and cross-references to other sources. UniGene computationally identifies transcripts from the same locus; analyzes expression by tissue, age, and health status; and reports related proteins (protEST) and clone resources.

Single Nucleotide Polymorphism (SNP) database
What is it?
The SNP Database (also known as dbSNP) is an archive for genetic variation within and across different species, developed and hosted by NCBI in collaboration with the National Human Genome Research Institute (NHGRI).
Polymorphism in biology occurs when two or more clearly different phenotypes exist in the same population of a species; it is related to biodiversity, genetic variation, and adaptation.
• dbSNP accepts apparently neutral polymorphisms, polymorphisms corresponding to known phenotypes, and regions of no variation.
• It was created in September 1998 to supplement GenBank (NCBI’s collection of nucleic acid and protein sequences).

Goal
Its goal is to act as a single database that contains all identified genetic variation, which can be used to investigate a wide variety of genetically based natural phenomena. Specifically, access to the molecular variation cataloged within dbSNP aids basic research such as physical mapping, population genetics, and investigations into evolutionary relationships, and makes it possible to quickly and easily quantify the amount of variation at a given site of interest.

Application
Applied research, genetic engineering, drug discovery, etc.

Submitting
Every submitted variation receives a submitted SNP ID number (“ss#”). This accession number is a stable and unique identifier for that submission. Unique submitted SNP records also receive a reference SNP ID number (“rs#”; "refSNP cluster").

Section Types for Submissions to dbSNP

Contact
TYPE: CONT
HANDLE: EGREEN
NAME: Eric Green
EMAIL: [email protected]
LAB: Biophysics laboratory
INST: King Saud University
ADDR: PO Box 2455, Riyadh 11451, Kingdom of Saudi Arabia

Publication section
TYPE: PUB
HANDLE: EGREEN
MEDUID: Medline unique identifier. Not obligatory
TITLE: Human chromosome 7 STS
AUTHORS: Ashrafuzzaman,M.
YEAR: 2012
STATUS: 1 (unpublished) / 2 (submitted) / 3 (in press) / 4 (published)

Population class
TYPE: POPULATION
HANDLE: WHOEVER
ID: YOUR_POP
POP_CLASS: EUROPE
POPULATION: Continent: Europe; Nation: Some Nation; Phenotype: You name it

How to Submit
To submit variations to dbSNP, one must first acquire a submitter handle, which identifies the laboratory responsible for the submission.
Next, the author is required to complete a submission file containing the relevant information and data. Submitted records must contain the ten essential pieces of information listed in the following table. Other information required for submissions includes contact information, publication information (title, journal, authors, year), molecule type (genomic DNA, cDNA, mitochondrial DNA, chloroplast DNA), and organism. A sample submission sheet can be found at: http://www.ncbi.nlm.nih.gov/SNP/get_html.cgi?whichHtml=how_to_submit#SECTION_TYPES

Element: Explanation
• Flanking DNA (region of DNA that is not transcribed to RNA, region of DNA adjacent to 5’ end of the gene): Variations from assays must have 25 bp of flanking sequence on either side of the polymorphism and must be 100 bp overall.
• Alleles: Alleles must be defined using A, G, C, or T nomenclature; IUPAC nomenclature will only be accepted in flanking regions. See: http://www.ncbi.nlm.nih.gov/sites/entrez?db=snp
• Method: A description of how the variation was detected (e.g. DNA sequencing) or how the allele frequencies were calculated. A table of method classes is provided.
• Population: A description of the initial group from which the variation was found or from which the allele frequency was calculated. A table of population classes is provided.
• Sample size: The number of chromosomes used to find the variation and the number of chromosomes used to calculate allele frequencies.
• Population-specific allele frequency: The allele frequency of the surveyed population.
• Population-specific genotype frequency: The genotype frequency of the surveyed population.
• Population-specific heterozygosity: The proportion of individuals who are heterozygous for the variation.
• Individual genotypes: The genotype of individuals from the study.
• Validation information: The validation status lists the categories of evidence supporting the variation.
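The flanking-sequence and allele rules listed above can be checked mechanically before a submission file is prepared. The Python sketch below is only an illustration: the function name and the input layout (5' flank, list of alleles, 3' flank) are assumptions made here and are not dbSNP's actual submission format. It also reads "100 bp overall" as a minimum for the two flanks plus the longest allele, which is an interpretation rather than something the notes state.

```python
IUPAC_CODES = set("ACGTRYSWKMBDHVN")  # ambiguity codes, allowed only in flanks
PLAIN_BASES = set("ACGT")             # the only letters allowed in alleles

def check_snp_record(flank5, alleles, flank3, min_flank=25, min_total=100):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    for name, flank in (("5' flank", flank5), ("3' flank", flank3)):
        if len(flank) < min_flank:
            problems.append(f"{name} is shorter than {min_flank} bp")
        if not set(flank.upper()) <= IUPAC_CODES:
            problems.append(f"{name} contains non-IUPAC characters")
    for allele in alleles:
        if not allele or not set(allele.upper()) <= PLAIN_BASES:
            problems.append(f"allele {allele!r} must use only A, C, G, T")
    # Assumed reading of "100 bp overall": both flanks plus the longest allele.
    total = len(flank5) + max((len(a) for a in alleles), default=0) + len(flank3)
    if total < min_total:
        problems.append(f"total length {total} bp is below {min_total} bp")
    return problems

# Hypothetical records, not real dbSNP data.
print(check_snp_record("ACGT" * 13, ["A", "G"], "TTGCA" * 10))
print(check_snp_record("ACGTN" * 4, ["A", "R"], "TTGCA" * 10))
```

The first call describes a record that satisfies every rule; the second has a short 5' flank, an ambiguity code used as an allele, and too little sequence overall, so each of those is reported.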
Example of SNP submission
View SNP Submission Batch
Submitter Handle: OMIM-CURATED-RECORDS
Submitter Batch ID: 590095_batch
Submitter Method ID: CLINICAL_SNP_SUBMISSION
Citation: not supplied
Comment: not supplied
Batch Total SubSNP(ss) Count: 4

Columns: SubSNP(ss), Submitter SNP_ID, Allele, Samplesize, RefSNP(rs), ss2rs Orien, Chr, ChrPos, Contig Accession, Contig Pos
ss49214876 8804 6   A/G    N.D   rs19947467 3   0   MT   5521   NC_012920.1   5521
ss49214877 8805 0   A/G    N.D   rs19947467 4   0   MT   5532   NC_012920.1   5532
ss49214876 8803 2   AG/T   N.D   rs19947467 2   0   MT   5537   NC_012920.1   5537
ss49214875 8802 3   A/G    N.D   rs19947467 1   0   MT   5549   NC_012920.1   5549

Homo sapiens
Taxonomy ID: 9606
GenBank common name: human
Inherited blast name: primates
Rank: species
Genetic code: Translation table 1 (Standard)
Mitochondrial genetic code: Translation table 2 (Vertebrate Mitochondrial)
Other names: common name: man; authority: Homo sapiens Linnaeus, 1758

Entrez records
Database name       Subtree links    Direct links
Nucleotide          9,892,226        9,892,201
Nucleotide EST      8,315,296        8,315,296
Nucleotide GSS      1,695,452        1,694,126
Protein             599,454          599,358
Structure           19,444           19,444
Genome              51               50
Popset              22,309           22,309
SNP                 60,480,978       60,480,978
Domains             10               10
GEO Datasets        402,695          402,695
UniGene             129,493          129,493
UniSTS              328,584          328,584
PubMed Central      11,220           11,214
Gene                42,139           42,102
HomoloGene          18,431           18,431
SRA Experiments     72,649           72,647
Probe               9,033,473        9,033,473
Bio Project         694              693
Bio Sample          550,346          550,343
Bio Systems         2,219            2,219
dbVar               795,936          795,936
Epigenomics         1,987            1,987
GEO Profiles        27,034,750       27,034,750
Protein Clusters    13               13
Taxonomy            2                1

Protein structure presentation
• Ribbon diagram
[Figure: PyMol ribbon of the unusual structure of the "tubby" brain protein]
[Figure: Computer-drawn ribbon diagram of two CuZn superoxide dismutase dimers]

Hollow 1.1 – Illustration software for proteins
HOLLOW facilitates the production of surface images of proteins.
Hollow generates fake atoms that identify voids, pockets, channels, and depressions in a protein structure specified in the PDB format.
• interior pathway surfaces
• channel surfaces (and electrostatic surfaces)
• ligand-binding surfaces

Software helps in addressing protein functions
Molecular dynamics (MD) (mimicking the structure/conformations)
Purpose: To understand the statistical nature of conformations

MD requires the following parameters:
• i. Dimensions and parameters related to the state of the platform (initial conditions)
• ii. Dimensions of the participating atoms
• iii. Structure of the individual molecules or sections of the whole structure
• iv. Physical properties like charges on the atoms

MD makes it possible to locate the agents/atoms involved in a structure by providing the following:
• i. Coordinates (in most cases time dependent)
• ii. Projection

Importantly, MD results can be converted into energetics:
• i. Interactions between participating agents/atoms
• ii. Interactions with the background

MD on DNA-lipid interaction
An example of MD on interactions between biomolecules
Important illustration in drug discovery
Certain programs can convert these data into energy: Information → Energy

Swiss-Prot Database
• UniProtKB/Swiss-Prot is the manually annotated and reviewed section of the UniProt Knowledgebase (UniProtKB). It is a high-quality annotated and non-redundant protein sequence database, which brings together experimental results, computed features and scientific conclusions.
• Since 2002, it has been maintained by the UniProt consortium and is accessible via the UniProt website http://www.uniprot.org/ .
• It deals with interactions, protein modelling, proteomics, protein structure and function, genome analysis and annotation, etc.

UniProtKB
• The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation.
The UniProt Knowledgebase consists of two sections: a section containing manually annotated records with information extracted from literature and curator-evaluated computational analysis, and a section with computationally analyzed records that await full manual annotation. For the sake of continuity and name recognition, the two sections are referred to as "UniProtKB/Swiss-Prot" (reviewed, manually annotated) and "UniProtKB/TrEMBL" (unreviewed, automatically annotated), respectively.

• Why is UniProtKB composed of 2 sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL?
• Where do the protein sequences come from?
  About 85% of the protein sequences provided by UniProtKB are derived from the translation of the coding sequences (CDS) which have been submitted to the public nucleic acid databases, the EMBL-Bank/GenBank/DDBJ databases (INSDC). All these sequences, as well as the related data submitted by the authors, are automatically integrated into UniProtKB/TrEMBL.
• Where do the UniProtKB protein sequences come from?
• Does UniProtKB contain all protein sequences?
• What are the differences between UniProtKB/Swiss-Prot and UniProtKB/TrEMBL?
  UniProtKB/TrEMBL (unreviewed) contains protein sequences associated with computationally generated annotation and large-scale functional characterization. UniProtKB/Swiss-Prot (reviewed) is a high-quality, manually annotated and non-redundant protein sequence database, which brings together experimental results, computed features and scientific conclusions.

PCR: Polymerase Chain Reaction
• Polymerase chain reaction (PCR) enables researchers to produce millions of copies of a specific DNA sequence in approximately two hours. This automated process bypasses the need to use bacteria for amplifying DNA.
• PCR is a technique in molecular biology for amplifying a single copy or a few copies of a piece of DNA across several orders of magnitude, generating thousands to millions of copies of a particular DNA sequence. Developed in 1983 by Kary Mullis, PCR is now a common and often indispensable technique used in medical and biological research labs for a variety of applications. These include DNA cloning for sequencing, DNA-based phylogeny, or functional analysis of genes; the diagnosis of hereditary diseases; the identification of genetic fingerprints (used in forensic sciences and paternity testing); and the detection and diagnosis of infectious diseases. Mullis shared the 1993 Nobel Prize in Chemistry with Michael Smith; Mullis's share was awarded for his work on PCR. The method relies on thermal cycling, consisting of cycles of repeated heating and cooling of the reaction for DNA melting and enzymatic replication of the DNA. Primers (short DNA fragments) containing sequences complementary to the target region, along with a DNA polymerase (after which the method is named), are key components that enable selective and repeated amplification. As PCR progresses, the DNA generated is itself used as a template for replication, setting in motion a chain reaction in which the DNA template is exponentially amplified. PCR can be extensively modified to perform a wide array of genetic manipulations.
• http://www.youtube.com/DNALearningCenter

FASTA and BLAST
• FASTA is a suite of programs for searching the EBI protein databases using local or global similarity.
• In bioinformatics, the Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.
A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. Different types of BLAST are available according to the query sequences. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. The BLAST program was designed by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David J. Lipman at the NIH and was published in the Journal of Molecular Biology in 1990.

Phylogenetic tree tutorial
All life on Earth is united by evolutionary history; we are all evolutionary cousins, twigs on the tree of life. Phylogenetic systematics is the formal name for the field within biology that reconstructs evolutionary history and studies the patterns of relationships among organisms. Unfortunately, history is not something we can see. It has only happened once and only leaves behind clues as to what happened. Systematists use these clues to try to reconstruct evolutionary history.
See the attached tutorial (PDF file provided).
A phylogeny, or evolutionary tree, represents the evolutionary relationships among a set of organisms or groups of organisms, called taxa (singular: taxon). The tips of the tree represent groups of descendant taxa (often species) and the nodes on the tree represent the common ancestors of those descendants. Two descendants that split from the same node are called sister groups. In the tree below, species A & B are sister groups: they are each other's closest relatives. Many phylogenies also include an outgroup, a taxon outside the group of interest. All the members of the group of interest are more closely related to each other than they are to the outgroup.
Hence, the outgroup stems from the base of the tree. An outgroup can give you a sense of where on the bigger tree of life the main group of organisms falls. It is also useful when constructing evolutionary trees. Evolutionary trees depict clades. A clade is a group of organisms that includes an ancestor and all descendants of that ancestor. You can think of a clade as a branch on the tree of life.
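The definitions of clade and sister group above map naturally onto a nested data structure. The Python sketch below is a toy illustration that assumes a made-up four-taxon tree written as nested tuples, with "Out" standing in for an outgroup; the taxon names and helper functions are hypothetical and are not taken from the tutorial.

```python
# Toy rooted tree: each internal node is a tuple of its children and
# each tip is a taxon name. "Out" plays the role of the outgroup.
TREE = ("Out", (("A", "B"), "C"))

def tips(node):
    """All tip taxa descending from a node (the members of its clade)."""
    if isinstance(node, str):
        return {node}
    members = set()
    for child in node:
        members |= tips(child)
    return members

def clades(node):
    """Yield the tip set of every internal node, one per clade."""
    if isinstance(node, str):
        return
    yield frozenset(tips(node))
    for child in node:
        yield from clades(child)

def are_sisters(node, x, y):
    """True if tips x and y are the two children of the same node."""
    if isinstance(node, str):
        return False
    if len(node) == 2 and set(node) == {x, y}:
        return True
    return any(are_sisters(child, x, y) for child in node)

for clade in clades(TREE):
    print(sorted(clade))
print(are_sisters(TREE, "A", "B"), are_sisters(TREE, "A", "C"))
```

In this toy tree, A and B form a clade and are sister taxa, while B and C alone do not form a clade because their common ancestor also gave rise to A. The sister check here only handles single tips, not larger sister groups.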
