Generic Model Organism Database
Lavanya Rishishwar
4/7/2016

Outline
• Purpose
• Genome database
• Basics of webserver & database
• GMOD

Presentation Assumption
What do we understand:
• Sequencing and computational genomics process
• The output from different groups
What we do not understand:
• The end goal
• Database (DBMS) and web service technologies

Purpose
• By the time the functional annotation group finishes their part, you will have a lot of data to deal with
• For a biologist, this is more information than they will ever care for
• It doesn’t mean what you did was useless. The data now requires presentation.
• This is where you have to start thinking like a real bioinformatician – balancing the biology and computer science

What is needed:
• An intuitive platform that can help a biologist pose questions, inspect the data and make inferences
• Doesn’t mean it has to be fancy
• But it does mean that it needs to have all the data in a neatly organized fashion that doesn’t overwhelm the end user
• This is not 2007, when you did everything from scratch
• The field has matured a lot; explore before developing
• Do not reinvent the wheel

• You want to create something that is useful
• Not something that is purely ornamental and that no one will ever use
• Whatever you create should have a purpose
• What is the purpose here?
• What is your client (CDC/Dr. Xin Wang) looking for?
• How can you give the client exactly what they want and in a way that they will understand?

Database
• Organized collection of data
• Can range from flat files to more sophisticated systems
• What was wrong with flat files? Why did we require advanced systems?

Genome Database
• A database specializing in genomic information
• What is different from other databases:
  • What is stored?
  • Who uses it?
  • How do you query it?
  • The natural relationships within the dataset
  • Linkouts to existing databases
  • Information presentation manner – less text, more visual
Q: Okay, so how do I make a database?

What do you require?
• Data (and knowledge of what that data is)
  Output from different groups
• A Database Management System (DBMS)
  MySQL hosted at compgenomics server
• A database developer/administrator
  Developer – one of you; Administrator – Troy Hilley
• A frontend designer
  One of you

Database Management System (DBMS)
Q: What is a DBMS?
A: Collection of tools to help in creating, storing, modifying and extracting information from a database
Q: Why not something like Excel?
A: Issues with scalability, consistency, redundancy, simultaneous access and within data connectivity

• Scalability
  Spreadsheets/flat files store data in a single file. As data grows, basic operations become unviable.
• Consistency
  The data needs to be in the same format for pattern searching. Difficult to achieve in spreadsheets, not possible in flat files.
• Redundancy
  Redundant information uselessly increases data size and may interfere with pattern searching.
• Simultaneous access
  Limited simultaneous access in spreadsheets/flat files.
• Within data connectivity
  Difficult to maintain in spreadsheets and flat files.
  E.g. hypothetical database for outbreaks. Three types of data:
  • Strain information
  • Hospital information
  • CDC personnel information
  If any data requires an update, every single copy needs to be updated!

DBMS was specifically designed to resolve these issues
Q: What are my options for a DBMS?
A: Many!
Relational DBMS: MySQL, Oracle, Microsoft Access, PostgreSQL
Non-relational: MongoDB, CouchDB, Google Spanner

Relational DBMS
[Figure from http://www.ibm.com/]

Relational DBMS
• Defined and queried using Structured Query Language (SQL)
• Looks like this:
    CREATE TABLE STATION
    (ID INTEGER PRIMARY KEY,
     CITY CHAR(20),
     STATE CHAR(2),
     LAT_N REAL,
     LONG_W REAL);
    INSERT INTO STATION VALUES (13, 'Phoenix', 'AZ', 33, 112);
• Why do I point this out? Ideally, you will be implementing dynamic browsers that fire these queries

Database Basics
[Figure: DBMS (Backend) and Graphical User Interface (GUI)]

Webserver Basics
• Webserver – any computer connected to the internet that provides some sort of service
• These services are provided through specific protocols, e.g. HTTP and FTP
  HTTP: Hypertext Transfer Protocol; FTP: File Transfer Protocol
• Special software on the server facilitates this communication (answers the request) – HTTPD (HTTP Daemon), i.e., the web server
• Most widely used web server – Apache
• This will be set up for you already. You will most probably require specific changes to the setup once you are ready with the browser

Security
Two levels:
• Security at access to the server
  Responsible: Sys-admin and frontend designers/developers
• Security at access to the database
  Responsible: Database administrators
• How much should you care?
  Do not create a security hole in the server. This will not only open your browser to attacks but will also open the whole Georgia Tech system to attacks.
• What if you (un)intentionally create a security hole?
  Troy has some fool-proofing in place. If you bypass it somehow, Georgia Tech’s IT staff monitors the network constantly; they typically detect an attack within an hour and block it. And then they will find you. And they will make sure you never do such an act ever again.

Frontend
• Your normal webpage
• HTML – Hypertext Markup Language
• Styling – CSS (Cascading Style Sheets)
• More library functions – JavaScript/jQuery
• Fancy templates – Bootstrap

Server-side scripting
• Special type of programming language
• Executes on and by the webserver. Can’t run locally.
• E.g. PHP (originally: Personal Home Page. Now: PHP: Hypertext Preprocessor. Don’t ask how.), JSP, Perl via CGI, R, Python
• I recommend PHP for you – easy to pick up, constructs very similar to Perl/Java/C, GMOD uses it, this is a class project

Who is this crazy looking guy?
William James (Jim) Kent. Know that name. Respect this guy. He is one of the greatest, perhaps the greatest, bioinformatics programmers ever. He was deeply involved in the assembly of the public human genome project. Along with his PI and a cluster of 50 PCs, this guy raced with Celera (J. Craig Venter’s company, which was using one of the most powerful civilian supercomputers) and finished the draft human genome 3 days before them. Bam! If you were in the fall class, you compiled the James Kent source tree. Almost all his. He speaks nothing but the truth.

He knows what a genome browser should be
“Genome browsers facilitate genomic analysis by presenting alignment, experimental and annotation data in the context of genomic DNA sequences.”
Melissa S Cline & James W Kent, 2009
Genome browsers aggregate data

The UCSC Genome Browser
[Figure: UCSC Genome Browser view of CDKN2A]
Clicking on any of these takes you to a page full of details

Tracks don’t have to be genes
There are many kinds of genomic information
• Homologous sequences
• Conservation
• Protein domains
• Signaling peptide
• CRISPR elements
• Regulatory elements
• RNAs
• Transcription factor binding sites
• Conservation of genomic sequences
• Extremely important in modern times are tracks displaying *-seq data

What’s good about the UCSC GB?
• Arguably the most advanced genome browser, it is much more than a tool for looking at genomes
• It integrates a huge amount of data for each gene it displays.
• The UCSC also has a graphical front end for downloading from its huge backend database

This UCSC browser does so much more
• It hosts the ENCODE project, one of the largest, probably the largest, assemblies of functional genomic data.
• It lets you jump between orthologous regions in different genomes: CDKN2A
• It’s a massive, massive database backend of over 6500 tables.

So why aren’t there dozens of UCSC implementations?
It’s really, really, really hard to install. It’s impossible to understand unless you’ve tried to do it.
The UCSC genome browser works so well for the genomes that it has because it is so very, very specialized for those genomes. Each track in the UCSC browser has been lovingly crafted.

Browser choices
There are a number of choices out there for a genome browser
There are really just 2 big ones:
• UCSC
• GMOD & GBrowse/JBrowse
We already discussed why you don’t use the UCSC browser for projects

GMOD
• Model Organism Database (MOD) groups realized this issue and developed the Generic Model Organism Database (GMOD)
• Generic => Yes, valid for any type of organism
• Model Organism => That is a lie
• Database => Core purpose

GMOD
• GMOD is open-source software
• Has a number of tools that fall into three main components:
  • Data Management
  • Annotation
  • Visualization
• Next set of slides discusses the different utilities in it

GMOD – Data Management
[Figure slides; images not included]

GMOD – Annotation
[Figure slides; images not included]

GMOD – Visualization
[Figure slides; images not included]

Genome Browser
[Figure slide; image not included]

You can use colors for information
• This shows genes that we thought were horizontally transferred
• The darker the gene, the more programs indicated that it was horizontally transferred

You can also have specialized tracks
• We had a track of virulence factors in the first year
• Clicking on any of them took you to details for the gene, a link to VFDB, etc.

This goes beyond colors
You can alter how tracks are shown in other ways
Add and remove tracks, change the link that appears over a feature in the genome.

You can do even more customization
You can make a transcriptome browser. It doesn’t have to be a genome.
Not super exciting from this view. Just the predicted coding region of an assembled contig (mRNA)

All of this is in the conf

Configuration file (.conf)
Making a new Track
    ### TRACK CONFIGURATION ###
    [ExampleFeatures]
    feature  = remark
    glyph    = generic
    stranded = 1
    bgcolor  = orange
    height   = 10
    key      = Example Features

Genome Browser
• The images are not created beforehand – they are generated at runtime
• The server does the processing and sends out images to the user
1. The user types in the URL: gbrowse2012.biology.gatech.edu
2. Browser interprets and sends the request to the HTTP server
3. Web server receives the request and “serves” the client, i.e., starts GBrowse
4. In case of success, relevant hypertext and multimedia are generated by accessing the database
5. The output traverses the same path back
6. The whole process repeats again when the user interacts with the browser

GBrowse & JBrowse
• The previously discussed process is GBrowse’s implementation
• JBrowse is another implementation of the same browser but in JavaScript
• Pros: Reduced server load, smoother UI, better images
• Cons: Still lacks some of the features of GBrowse

Populating a Genome Database
• Most of the genome browsers (including GBrowse/JBrowse) expect input files in a specific format known as GFF
• GFF => General Feature Format
• Text based, table like format
• Capable of storing any type of information

General Feature Format (GFF)
Format description: GFF (General Feature Format) is a tab-separated format used for describing different omics features
The format has 9 mandatory columns as shown below:
Seqname | Source | Feature | Start | End | Score | Strand | Frame | Attribute

Conclusion of GBrowse/JBrowse
Will it ever be the greatest genome browser? No. That will always be the UCSC browser
Will it remain the easiest to install for some time? Probably – JBrowse seems more promising though
Will you get the best return on time spent? Yep

Conclusion of the talk
• A database is an efficient and organized method of storing data
• It allows faster mining of patterns from the data
• Two component system: backend storage and frontend interface
• Most widely used option: GMOD
• Web based genome browser options: GBrowse/JBrowse, UCSC

Questions?
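
Example: within data connectivity in a relational DBMS
The hypothetical outbreak database from the DBMS slides (strain, hospital and CDC personnel information) can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the class server uses MySQL rather than SQLite, and every table name, column name and value below is made up. The point it shows is that the hospital name is stored once, so one UPDATE is reflected in every strain record.

```python
import sqlite3


def build_outbreak_db():
    """Create a small in-memory database (all names and values are hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE hospital (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
        CREATE TABLE personnel (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE strain (
            id INTEGER PRIMARY KEY,
            label TEXT,
            hospital_id INTEGER REFERENCES hospital(id),
            personnel_id INTEGER REFERENCES personnel(id)
        );
        INSERT INTO hospital VALUES (1, 'General Hospital', 'Atlanta');
        INSERT INTO personnel VALUES (1, 'Investigator A');
        INSERT INTO strain VALUES (10, 'strain-10', 1, 1);
        INSERT INTO strain VALUES (11, 'strain-11', 1, 1);
    """)
    return conn


def strain_report(conn):
    """Join the three tables so each strain row shows its hospital and investigator."""
    query = """
        SELECT s.label, h.name, p.name
        FROM strain AS s
        JOIN hospital AS h ON h.id = s.hospital_id
        JOIN personnel AS p ON p.id = s.personnel_id
        ORDER BY s.id
    """
    return conn.execute(query).fetchall()


conn = build_outbreak_db()
# The hospital name lives in exactly one row, so a single UPDATE
# changes what every strain record reports.
conn.execute("UPDATE hospital SET name = ? WHERE id = ?", ("City General", 1))
report = strain_report(conn)
print(report)
```

In a spreadsheet or flat file the hospital name would be repeated on every strain row, and each copy would have to be edited by hand.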
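
Example: reading the nine GFF columns
The nine-column layout from the General Feature Format slide can be read with a few lines of Python. This is a minimal sketch, not a full GFF parser: the names GffRecord, parse_gff_line and parse_gff are hypothetical, the example rows are made up, and skipping lines that start with "#" assumes the usual GFF comment/directive convention. Score, strand, frame and attribute are left as plain strings because the slides do not specify how they should be interpreted.

```python
from typing import NamedTuple


class GffRecord(NamedTuple):
    """One GFF row; field names follow the nine column names on the slide."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attribute: str


def parse_gff_line(line):
    """Split one tab-separated GFF line into its nine columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attribute = fields
    return GffRecord(seqname, source, feature, int(start), int(end),
                     score, strand, frame, attribute)


def parse_gff(text):
    """Parse every line that is neither blank nor a '#' comment/directive."""
    return [
        parse_gff_line(line)
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]


# Made-up rows for illustration only (not real annotation data).
example = (
    "##gff-version 3\n"
    "contig1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=hypothetical\n"
    "contig1\texample\tCDS\t100\t900\t.\t+\t0\tID=cds1;Parent=gene1\n"
)
records = parse_gff(example)
for record in records:
    print(record.seqname, record.feature, record.start, record.end, record.strand)
```

Checking the column count up front is a deliberate choice: output files from different groups often differ in subtle ways, and a row with the wrong number of columns is easier to fix before it is loaded into the browser.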