Download Chapter 2 PowerPoint Slides

Assignment: Due Friday: READ: Ensembl API Tutorial: http://www.ensembl.org/info/software/core/core_tutorial.html READ: Encode Project paper (on Icon) 1 Databases • Reference: Bioinformatics Computer Skills – Gibas and Jambeck 2 Database Intro (for Ensembl Modules) Database Design for Mere Mortals, M Hernandez, Addison-Wesley 3 Databases • What is a database? 4 Databases • Vital part of genomics, bioinformatics, genetics, and biology • Genomic data – sequence, genes, annotation, markers, literature, diseases, structures, SNPs, mutations, etc, etc, etc. – Ensembl, UCSC, NCBI -- the big "three" • Knowing how to find and download from central biological repositories is an important skill • However, this is a tool building class, and databases on the web becomes more integral to sharing information – build our own DBs 5 Databases • This portion of the course will not make you an expert in databases (there are other courses for that), nor will you be able to design and implement a major database • However, you will have a basic understanding of the steps involved, and you will be able to instantiate rudimentary databases from which you may expand with efforts beyond this class • Additionally, you will be better equipped to query existing databases and develop applications/interfaces to interact with DBs 6 Databases • Steps – – – – design a data model select a database management system implement data model interface 7 DBMS • types – flat file index – relational – object-oriented 8 Database Introduction: terms • Field, attribute, tuple, record – Individual data items within a database – Often “field” describes the “place holder” in a database table, and “record” describes that actual data value • Table – One or more attributes grouped together • Relation, link, pointer – A connection between two fields (usually between tables) • Other terms – meta data -- data that describes attributes (like the column titles in a spread sheet) – data elements -- a "class" and attribute • ex) subject, name • subject, address • subject, affection_status 9 Databases • Operational – Used whenever there is need to collect, maintain, and modify data – Dynamic data • Changes constantly • Reflects up-to-the minute info – Examples (Biology/Genetics) • Ensembl*, NCBI (GenBank, PubMed, dBest, UniGene*, LocusLink*), UCSC* • Analytical – Store and track historical and time-dependent data – Static • Many of the biological databases (Ensembl, UniGene, UCSC) are the result of substantial analyses and are relatively static – with a release period of between 1 week and 6 months. • These do not change between releases – something between operational and analytical 10 Flat File Databases • an ordered collection of similar files • usually conforming to a standard format (but not always) • a flat file database means that rules can be applied to find data within the files (rather than remembering the locations of all individual files) • made searchable by indexing • an index is a pairing of an "attribute" from within a file and the file name/location 11 Flat File Example >gnl|UG|Hs#S593306 zr74d01.r1 Soares_NhHMPu_S1 Homo sapiens cDNA clone IMAGE:669121 5', mRNA sequence /clone=IMAGE:669121 /clone_end=5' /gb=AA234466 /gi=1858985 /ug=Hs.68879 /len=276 ATTAAGATCTGAAAACTGTGATGCGTCCTTTCTGCAGAGACGCCTCTTTCTGAATCTGCC CGGAGCTTCGAGCCCCGGCGTCTGTCCCTCAGCCTGGCATGGCTTCTTCGGGGGTCTGCT TTGCATGGGGAGAGGGGCCACGCAGCGGACGGACTAGGTTTGGGGATTCTCGGTAATGGA CCCGGAGCAATGACTAACAGCCGCTCCCTCTCACTTTCCCACAGCGATCACCCTCTAACA CCCTCCCTCCCATTCCCGGCCCCGCGCGTGACAAGG Index: /HomoSapiens/EST/5-prime/zr74d01.r1 Hs.68879 12 LOCUS DEFINITION AA234466 276 bp mRNA linear EST 06-AUG-1997 zr74d01.r1 Soares_NhHMPu_S1 Homo sapiens cDNA clone IMAGE:669121 5', mRNA sequence. ACCESSION AA234466 VERSION AA234466.1 GI:1858985 KEYWORDS EST. SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo. REFERENCE 1 (bases 1 to 276) AUTHORS Hillier,L., Allen,M., Bowles,L., Dubuque,T., Geisel,G., Jost,S., Kucaba,T., Lacy,M., Le,N., Lennon,G., Marra,M., Martin,J., Moore,B. , Schellenberg,K., Steptoe,M., Tan,F., Theising,B., White,Y., Wylie ,T., Waterston,R. and Wilson,R. TITLE WashU-Merck EST Project 1997 JOURNAL Unpublished (1997) COMMENT Contact: Wilson RK Washington University School of Medicine 4444 Forest Park Parkway, Box 8501, St. Louis, MO 63108 Tel: 314 286 1800 Fax: 314 286 1810 Email: [email protected] This clone is available royalty-free through LLNL ; contact the IMAGE Consortium ([email protected]) for further information. Insert Length: 871 Std Error: 0.00 Seq primer: -28m13 rev2 ET from Amersham High quality sequence stop: 266. FEATURES Location/Qualifiers source 1..276 /organism="Homo sapiens" /mol_type="mRNA" 13 Flat File Databases • as a flat file database grows larger, the onedimensional aspect of indices makes it difficult to make connections between attributes • many biological databases began as flat file databases, and this legacy has influenced file formats and applications (PDB, ESTs, GeneBank) • because of the ease of browsing a file system, many systems adopt a hybrid approach -incorporating both a relational database and flat file system 14 Relational Databases • Just like flat file systems, relational databases are a way of assembling, storing, and retrieving information/data • in a relational database, data is stored in "tables" • a flat file database that describes protein structure is like a book – chapters about the origin of the sample, how the data was collected, the sequence, secondary structure, positions of the atoms • in a relational database, this information is spread across tables – a table for experimental conditions, sequence, secondary structure, etc. 15 Relational Databases • rows in the tables may refer to unique proteins • connections between the proteins can be made ( they aren't bound together like a book) • the form of the tables follows rules that are uniform so that you can access different attributes from different tables • you can freely access all entries for atomic position at once, or all attributes for a particular protein • for one protein, not inconvenient to "look it up" and read the entry in the "book" (I.e., flat file DB) • protein name, author who deposited it, can be put in an index (like a card catalogue) to help find the book • but how do you extract the secondary structure out of every "book" in the library -- you can't! 16 Relational Databases • a relational database management system (RDMS) allows you to "look" at the database from many different "views" • a RDMS essentially allows you to dynamically "create" the contents of your "book" by querying for the information that you want 17 Table organization • data in a table are organized in "rows" • each row represents one "record" • a row may contain separate pieces of information -- "fields" or "attributes" • tables are not glorified flat files -- although they may look that way if you examine them in their entirety 18 Example • We want to keep track of genes for the purpose of mutation screening using SSCP or sequencing • We want to keep track of: – – – – lists of genes sequence of genes (to design primers) exons and introns multiple transcripts? • Since we are putting this together for Pfizer (who grossed $52 billion in 2006), we'll be cute and refer to genes as "targets" -- since they are looking for druggable "targets" 19 project (table name) A first attempt id project_name gene_name date description 4 glaucoma BBS4 12/12/04 Glaucoma is… 5 glaucoma ABCA6 Glaucoma is… target (table name) id name sequence primerF primerR (fields) 5 BBS4 ATGCGG… GCTAGT ATGACCCT 6 BMP4 ATGCCC… GGTATG TGAATGAG … exon id gene_id sequence 1 5 ATGC… 2 5 CCGG… 20 Observations • values in the "target" table are related to values in the "project" table • however, it doesn't make sense to put all of the data into one big table (we could -- but it would be very redundant) • why are there multiple entries in project for the same project??? • does it really make sense to put primer information in the "target" table??? • the two data types ("project" and "target") are related but "orthogonal" • anywhere in a grouping of data where it is not sensible to include fields in a table, a new table should be created – normalization -- the process of separating complex data into a collection of mutually orthogonal related tables 21 What's wrong with this? • No inherent flaws, but may not be the most efficient – duplication (exon sequences in "exon" table are subsets of the full gene sequence in "target' table) – gene names are duplicated between "project" and "target" table – this isn't a major issue -- but for the sake of argument -what happens if the gene name for BMP4 is changed to BMO4? – Description of "project"s is duplicated 22 exon transcript_id exon_num sequence_start sequence_stop intron primer_pair transcript_id id intron_num transcript_id sequence_start left_primer_id sequence_stop right_primer_id project id name description transcript sequence id id sequence_id target_id source type source_id sequence chr_name target date id date set_table set_target_join id set_id project_id target_id name rank date cas_rank description cas_options gene_name description accession strand genomic_start genomic_stop source source_id refresh status 23 Sample Data exon transcript_id exon_num = 3 sequence_start sequence_stop intron primer_pair transcript_id id intron_num = 3 transcript_id sequence_start left_primer_id sequence_stop right_primer_id project id name = pro1 description transcript sequence id id sequence_id target_id source = Ensembl type = nucleotide source_id sequence = ATG… chr_name = 15 target date id date set_table set_target_join id set_id project_id target_id name = testset rank = 5 date cas_rank description cas_options gene_name = BBS4 description accession strand = 1 genomic_start = 15,123,120 genomic_stop = 16,378,131 source source_id refresh status 24 Database Schema • network of tables and relationships defines the "database schema" • generally you would carefully develop a schema so that it keeps utility over time – test data – example queries 25 Relational Database Model (basis for modern databases) • Codd, E. “A Relational Model of Data for Large Shared Databanks.” Communications of the ACM, 1970, 377-387. – Based on set theory, and predicate logic – “Relation” term is derived from set theory (not from notion that tables are “related” to each other). – Physical order of records is immaterial – Each record in a table is identified by a field that contains a unique value (sound like anything else we know???) – User does not need to know the physical location of a record to retrieve data (unlike other models, hierarchical, network). 26 End 27 Database Intro: An early model – Hierarchical Database Model Agents Entertainers Schedule Clients Engagements Features: - inverted tree with Root node - relationships are parent/child - each child has 1 parent, but parents may have multiple children - tables linked by a “pointer” - access to data requires familiarity with the structure Payments 28 Database Intro: An early model – Hierarchical Database Model Agents Entertainers Schedule Clients Engagements Payments Advantages: - quick retrieval - referential integrity is built in and automatically enforced - example: record in child table must be linked to existing record in parent table - if record deleted in parent table, all associated records in child tables are deleted as well 29 Database Intro: An early model – Hierarchical Database Model Agents Entertainers Schedule Clients Engagements Payments Disadvantages: - difficult to handle child entries that are initially unrelated to parent - example: - Want to add a new entertainer, but cannot add entertainer until it is associated with an Agent - Have to insert a “dummy” record - does not support complex relationships - redundant data: many to many relationship between Clients and Entertainers, so Schedule (date, time) will have same entries as Engagements (date, time) 30 Relational Database Model Agents Clients Entertainers Engagements Music Style Advantages: -built in multilevel data integrity: table: records are not duplicated, and detect missing primary keys relationship: ensure connection between tables is valid -logical and physical data independence from database application: changes in physical implementation of database would not (in theory) adversely affect applications built to use DB -easy data retrieval: data may be retrieved from any table or from any number of related tables 31 Relational Database Model Agents Clients Entertainers Engagements Music Styles -Relationships -may be: one-to-one, one-to-many, and many-to-many -Established through matching values of a shared field -Example: Agents are associated to Clients by an AgentsID field in Clients table -Data may be accessed in an almost unlimited combination of ways 32

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Chapter 2 PowerPoint Slides