Download Powerpoint - School of Engineering and Computer Science

What Is Bioinformatics?
• Using computers to solve problems in biology.
• Advances in biology have generated large amounts of data; it is no longer possible to categorize or search it all manually.
• Advances in computers have made it possible to investigate problems that were formerly too computationally intensive to tackle.

Some Areas Of Bioinformatics
• Creation and maintenance of databases of DNA and protein sequences.
• Prediction of protein structure and function.
• Construction of ancestry trees of organisms.

Why Study Bioinformatics?
• Solve interesting scientific and technological problems.
• Help cure diseases.
• Job opportunities.

The Central Dogma
Proposed by Francis Crick in 1958 to describe the flow of information in a cell.
DNA → RNA → Protein
Information stored in DNA is transferred residue-by-residue to RNA, which in turn transfers the information residue-by-residue to protein.
The Central Dogma was proposed by Crick to help scientists think about molecular biology. It has undergone numerous revisions in the past 45 years.

Central Dogma
Replication: duplication of DNA using DNA as the template
Transcription: synthesis of RNA using DNA as the template
Translation: synthesis of proteins using RNA as the template

Central Dogma: DNA
DNA (deoxyribonucleic acid) stores the blueprint for cell-specific synthesis of proteins necessary for life.
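
The transcription and translation steps named above can be sketched in a few lines of Perl. This is a simplified illustration, not how a cell or any particular library does it: the helper names transcribe and translate are hypothetical, the input is assumed to be the coding strand (so transcription reduces to replacing T with U), and the codon table is deliberately partial.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Transcription (simplified): read the coding strand and replace
# thymine (T) with uracil (U) to obtain the RNA message.
sub transcribe {
    my ($dna) = @_;
    (my $rna = uc $dna) =~ tr/T/U/;
    return $rna;
}

# A deliberately partial codon table (assumption: just enough entries
# for this example). '*' marks a stop codon.
my %codon_table = (
    AUG => 'M',              # methionine (start)
    UUU => 'F', UUC => 'F',  # phenylalanine
    GGU => 'G', GGC => 'G',  # glycine
    UAA => '*', UAG => '*', UGA => '*',
);

# Translation: read the RNA three bases at a time and look up each codon.
# Codons missing from the partial table are reported as 'X'.
sub translate {
    my ($rna) = @_;
    my $protein = '';
    for (my $i = 0; $i + 3 <= length $rna; $i += 3) {
        my $aa = $codon_table{ substr($rna, $i, 3) } // 'X';
        last if $aa eq '*';
        $protein .= $aa;
    }
    return $protein;
}

my $dna = 'ATGTTTGGCTAA';
my $rna = transcribe($dna);
print "DNA:     $dna\n";
print "RNA:     $rna\n";
print "Protein: ", translate($rna), "\n";
```

A full translator would carry all 64 codons; the partial table only keeps the sketch short.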
Molecular Structure of DNA
[Figure labels: base: thymine (pyrimidine); monophosphate; sugar: 2’-deoxyribose; base: adenine (purine)]

DNA: Nucleoside Structure
[Figure labels: nucleoside = base + sugar; nucleotides (nucleoside mono-, di-, and triphosphates) = base + sugar + phosphate(s)]

DNA: Molecule Structure
• DNA is double stranded.
• DNA strands are antiparallel.
• G-C pairs have 3 hydrogen bonds.
• A-T pairs have 2 hydrogen bonds.
• One strand is the complement of the other.
• Major and minor grooves present different surfaces for interaction.

Central Dogma: RNA
RNA (ribonucleic acid) carries the instructions from the cell nucleus to the cytoplasm for synthesis of proteins.

RNA Terminology
Base        Nucleoside (RNA)      Deoxynucleoside (DNA)
Adenine     Adenosine             Deoxyadenosine
Guanine     Guanosine             Deoxyguanosine
Cytosine    Cytidine              Deoxycytidine
Uracil      Uridine               (not usually found)
Thymine     (not usually found)   (Deoxy)thymidine
[Figure labels: base, sugar, nucleoside]

RNA: Structure
• RNA can be single or double stranded
• G-C pairs have 3 hydrogen bonds
• A-U pairs have 2 hydrogen bonds
• Single-stranded, double-stranded, and loop RNA present different surfaces

Central Dogma: Proteins
Proteins are synthesized in the cytoplasm using the message on messenger RNA. Proteins serve as the basis for the cellular structure, function, communications and metabolism.

Amino Acids: Protein Building Blocks
[Figure: The 20 Amino Acids; labels: carboxyl group, amino group]

Protein Structure
[Figure labels: α-helix, antiparallel β-sheet]

Central Dogma
Replication: duplication of DNA using DNA as the template
Transcription: synthesis of RNA using DNA as the template
Translation: synthesis of proteins using RNA as the template

Biological Databases

Problems of Biological Databases
• Biologists have sequenced (determined the base pairs of) a large amount of DNA.
• In addition to the raw sequence data, ancillary information about each sequence must also be stored, such as what species it is from, who discovered it, what protein it encodes, and the function of the protein.

GenBank
• To handle all this information, and make it available to researchers, biologists have set up several different databases, each specializing in one aspect of the data.
• The main one in the United States that deals with DNA sequences is called GenBank.
• It is maintained by the NCBI, a branch of the National Institutes of Health.

EMBL and DDBJ
• There are two other DNA databases: EMBL in Europe and DDBJ in Japan.
• Researchers can submit their sequences to any of these; they exchange information daily to keep the databases synchronized.
• The amount of data is huge: over 30 million sequences and almost 40 billion base pairs (as of February 2004).

Other Databases
• In addition to the sequence, other databases track the proteins that result from the transcription and translation of the DNA, the function of these proteins (ontology), the species the DNA came from (taxonomy), and the authors and journals in which reports about the sequence are published.

Accession Numbers
• The key to tying together the records that appear in different databases is the accession number.
• This is assigned to the sequence when it is first submitted, and is subsequently used by all other databases.
• It is either a six-character field (one letter followed by five digits) or an eight-character field (two letters followed by six digits).

Database Schema
Taxonomy:      SeqAccession, Version, Kingdom, Phylum, Species
Feature:       FeatureId, SeqAccession, Version, StartLoc, EndLoc, Date, Author
Sequence:      SeqAccession, Version, Source, Author, Date, Length
Publications:  SeqAccession, Version, PubMed-Id
Feature Data:  FeatureId, Details, URL
Sequence Data: SeqAccession, Version, Data

GenBank Format
• GenBank uses a flat file format.
An example:

GenBank Schema (portion)

Usable via:
• Web interface at NCBI
• http://www.ncbi.nlm.nih.gov/BLAST/
• Local web server
• Download database and search engine to personal computer

Types of Sequence Searches
• DNA
• Protein
• Translated searches
• Pairwise
• Genomic
• Specialized
• Existing searches

Nucleotide Databases
• NR: All non-redundant
• Month: Last 30 days
• EST: Expressed Sequence Tags
• EST_Human
• EST_Mouse
• HTG: High Throughput Genomic
• Yeast
• Ecoli

GenBank Format
• The heading gives the accession number, a brief description (if known) and the date submitted.

GenBank Format
• The next section indicates which organism the DNA was obtained from; in this case, a human. Clicking on the hyperlink will take us to the taxonomy database, for more details.

Human Taxonomy Detail

GenBank Format
• This portion names the researchers who sequenced the DNA, and tells in what journal the paper describing it may be found. The hyperlink will take us to the actual article.

Journal Article
• 1: Genomics. 1998 Dec 15;54(3):542-55.
– A long terminal repeat of the human endogenous retrovirus ERV-9 is located in the 5' boundary area of the human beta-globin locus control region.
Long Q, Bengra C, Li C, Kutlar F, Tuan D.
Department of Medicine, Medical College of Georgia, Augusta, Georgia, 30912, USA.
Transcription of the human beta-like globin genes in erythroid cells is regulated by the far-upstream locus control region (LCR). In an attempt to define the 5' border of the LCR, we have cloned and sequenced 5 kb of new upstream DNA. We found an LTR retrotransposon belonging to the ERV-9 family of human endogenous retroviruses in the apparent 5' boundary area of the LCR. This ERV-9 LTR contains an unusual U3 enhancer region composed of 14 tandem repeats with recurrent GATA, CACCC, and CCAAT motifs. This LTR is conserved in human and gorilla, indicating its evolutionary stability in the genomes of the higher primates.
In both recombinant constructs and the endogenous human genome, the LTR enhancer and promoter activate the transcription of cis-linked DNA preferentially in erythroid cells. Our findings suggest the possibility that this LTR retrotransposon may serve a relevant host function in regulating the transcription of the beta-globin LCR. Copyright 1998 Academic Press.

GenBank Format
• Here is a portion of the actual sequence, in this case for a gene that encodes a part of the hemoglobin molecule. The letters a, c, t, and g represent adenine, cytosine, thymine, and guanine, respectively.

Protein, Ontology, Etc
• [slides]

DB Relations

Other Biological Databases
• TIGR dbs, e.g. Chlamydia trachomatis
• ACEdb
• Globin Gene Server
• DAS: Distributed Annotation System

The Institute for Genomic Research
• [history, Venter, etc]
• [collection of microbial databases]

TIGR: Chlamydia
• [screen shot of query page]

TIGR: Chlamydia

ACEdb
• http://www.wormbase.org/
• Originally designed for Caenorhabditis elegans (a small worm), now used for many organisms.
• Object oriented.
ACEdb BNF Definition
• BNF Grammar for the ACEDB Models

    <models> ::= <model>
               | <model> <models> ;

    <model> ::= ?<model name> <unique> <tag column>    /* For classes */
              | #<model name> <unique> <tag column>    /* For constructed types */ ;

    <tag column> ::= <tag node>
                   | <tag node> NL <tag column> ;

    <tag node> ::= <tag>
                 | <tag> <unique> <data cluster>
                 | <tag> <unique> START_INDENT <tag column> END_INDENT
                   /* In addition, in ACEDB 1-x, we allowed */
                 | <tag> <unique> START_INDENT <data cluster> NL <tag column> END_INDENT
                   /* This construction however can lead to ambiguities when parsing data and */
                   /* will be forbidden in release 2-x */ ;

    <data cluster> ::= <data type list>
                     | <data type list> REPEAT
                     | <data type list> #<model name>
                     | #<model name> ;

    <data type list> ::= <data type>
                       | <data type> <unique> <data type list> ;

    <unique> ::= <null> | UNIQUE ;

    <data type> ::= <primitive data type>
                  | <class reference>
                  | ANY    /* reserved for kernel use */ ;

    <primitive data type> ::= Int | Text | Float ;

    <class reference> ::= ? <class name>
                        | ? <class name> XREF <tag name> ;

• Etc.

ACEdb
• [screen shots]
• [live links]

Globin Gene Server
• Catalogs variants in hemoglobin genes that can cause diseases such as sickle-cell anemia and beta-thalassemia.
• [live links]

Globin Server Schema
TABLE 1.
A Synopsis of the Schema for HbVar: a Database of Human Hemoglobin Variants and Thalassemias
• Name
• Category
• Type of Thalassemia
• Description:
• Chain
• Residue number
• Substitutions
• Insertions
• Deletions
• Fusion gene Hbs
• Contact
• Haplotype
• Hematology:
• Genotype
• Hematological findings
• Modifier
• Condition
• Laboratory findings
• Assay
• Range
• Units
• Other factors
• Electrophoresis
• Method
• Quantitative result
• Chromatography
• Method
• Stability
• Relative stability
• Dissociation
• Other stability information
• Occurrence
• Ethnic background
• Frequency
• Structure studies
• Separation of hemoglobins
• Separation of globin chains
• Methods
• Protein analysis
• DNA analysis
• Functional studies
• Study
• Result
• What the study covered
• Comments on the variant
• References
• Authors/editors
• Journal articles
• Other references

Globin Gene Server

Distributed Annotation System - DAS
• [discussion]
• [import and modify slides from project]

APIs for Database Access

Bio-PERL
• Officially organized in 1995
• The Bio-PERL Project is an open source project using PERL tools for bioinformatics, genomics and life science research.
• http://www.bioperl.org

Retrieving a Sequence from GenBank

    use strict;
    use warnings;
    use Bio::Perl;
    use Bio::DB::EMBL;

    # BioPerl API for sequence retrieval.
    # Bio::DB::EMBL fetches the record from EMBL, which mirrors GenBank;
    # Bio::DB::GenBank provides the same get_Seq_by_acc call.
    my $gb  = Bio::DB::EMBL->new();
    my $id  = 'AF162692';                 # Accession number
    my $seq = $gb->get_Seq_by_acc($id);

    # Member functions of the sequence object
    my $desc        = $seq->desc();
    my $description = $seq->description();
    my $len         = $seq->length();
    my $dna         = $seq->seq();

    print "Sequence Desc = $desc \n";
    print "Sequence Description = $description \n";
    print "Sequence Len = $len \n";
    print "Seq = $dna \n";

Clones and Contigs

    #!/usr/bin/perl -w
    use strict;
    use Bio::EnsEMBL::DBSQL::DBAdaptor;

    # Connect to the Ensembl core database (named arguments all take a leading dash)
    my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
        -host   => 'kaka.sanger.ac.uk',
        -user   => 'anonymous',
        -dbname => 'homo_sapiens_core_19_34a',
    );

    # Fetch the first 10 Mb of chromosome 18 and list the genes it contains
    my $slice_adaptor = $db->get_SliceAdaptor;
    my $slice = $slice_adaptor->fetch_by_chr_start_end('18', 1, 10000000);

    my $count = 0;
    my @genes = @{ $slice->get_all_Genes };
    foreach my $gene (@genes) {
        $count++;
        print "Gene # $count- " . $gene->stable_id . ":" . $gene->description . "\n";
    }
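
Before an identifier such as 'AF162692' is passed to a retrieval call, its shape can be checked against the accession-number format described on the Accession Numbers slide (one letter followed by five digits, or two letters followed by six digits). The following is a minimal sketch in plain Perl; is_accession is a hypothetical helper, not part of BioPerl, and it assumes upper-case letters and only the two formats listed in these slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): returns 1 if the string
# matches one of the two accession formats described in the slides.
# Assumption: letters are upper case; any other format is rejected.
sub is_accession {
    my ($id) = @_;
    return $id =~ /\A(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\z/ ? 1 : 0;
}

for my $id ('AF162692', 'U12345', 'AF1626', '12345678') {
    printf "%-10s %s\n", $id,
        is_accession($id) ? 'matches the format' : 'does not match';
}
```

A check like this only validates the shape of the identifier; whether a record with that accession actually exists is still decided by the database.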
