Gottfried Thesis First Draft
The last fifteen years have seen a renaissance in database software, as developers and consumers have finally agreed to branch out beyond the standard of relational, SQL-based database software. One of the new database types that has come to the forefront (under the umbrella of "NoSQL databases") is the graph database. This paper will examine the performance of a specific graph database (Neo4j) on a specific problem, and draw more general conclusions about whether a graph database is the right choice for a prospective problem.

Relational Databases and NoSQL

The general history of database software can be traced back to the 1960s, when computing power became efficient and cheap enough to encourage storing data using software instead of physical methods (Vaughn). This period saw the rise of two main data models: the Hierarchical System and the Network System. The Hierarchical System organizes data into a tree structure, where each node (aside from the root) has one (and only one) parent node and zero or more children; the most widely used Database Management System (DBMS) today that utilizes a Hierarchical Model is IBM's IMS (Vaughn). The Network System, by contrast, relaxes the restriction on how many parents a node may have, allowing for a general graph structure where any node is connected to zero or more other nodes by one or more relationships from a defined set of relationships (Maurer).

In the 1970s, E. F. Codd produced a paper detailing the benefits of a "Relational" model over a Hierarchical or Network model (Codd). The Relational model separates data into "relations", which are sets of tuples (usually represented as rows in a table), where each relation has a fixed schema that determines the components (and the possible values for those components) of every tuple in the relation. The relational model can be seen in most major database projects today, such as Oracle Database, Microsoft's SQL Server, and MySQL (Ramakrishnan).
The relational database became the dominant model over the next 30 years, until the beginnings of the NoSQL movement in the late 1990s. The need for NoSQL databases became apparent as businesses began collecting large amounts of data that didn't easily fit into a relational model; with data too complex to fit into a few tables, many businesses started seeing extremely poor performance using relational databases, due to the expensive cost of multiple joins in one query. NoSQL databases (literally, "Not only SQL", though most NoSQL databases are non-relational) have been around since the 1960s as well, but only began to flourish in the late 1990s and early 2000s, when development began on several different NoSQL databases, including Google's BigTable, Memcached, and Neo4j (Haugen). As the name states, NoSQL databases don't normally use SQL as a query language and usually have a non-relational model; instead, they have a data model that emphasizes scalability, to handle large amounts of data, and flexibility, to handle complex or sparse data types.

By 2010, academics had started to separate NoSQL databases into four types: key-value stores, column stores, document databases, and graph databases. Key-value stores are databases whose data model consists of a value and a key to access it; document databases are similar to key-value stores, except that they store documents (complex data objects) instead of simple values. Column stores are like key-value stores, except they are organized into a vertical structure, with values sharing similar columns or families. Graph databases are built on graph theory, and their underlying data model consists of nodes, which represent objects, and relationships between those nodes (Vardanyan).

Graph Databases and Neo4j

This paper is specifically concerned with graph databases, as the testing done in this research was on an instance of Neo4j, a graph database software that was created, or at least conceived of, in 2000 (Haugen).
A graph database uses a data model where data is stored as nodes and relationships: the nodes represent entities in the database, and data about an entity's relationship to another entity is represented in the relationship between the two. Graph databases are the logical and spiritual successors to the Network Database model that was created in the 1960s. The Network Model was replaced by the Relational Model as hardware improvements made the Network Model's superior performance obsolete and businesses realized the flexibility of a relational model (Vaughn). The Relational Model made sense so long as businesses were keeping track of disconnected or only loosely connected data, but as businesses evolved and started gathering data that more closely approximates the real world (where most data is interconnected in one or more ways), the Relational Model became a hindrance (Eifrem). In order to model those relationships in a Relational Model, you must perform joins on the tables involved; as tables get larger, joins become rapidly more expensive to perform. A specific example of this, complete with numbers, can be seen in the background of the Chemotext project (Baker).

Neo4j was created in response to these issues and is probably the most widely used graph database software in development today (DB-Engines Ranking). Neo4j follows the basic graph database model, having two primitive types: nodes and relationships. A node can have one or more identifying attributes, and every node has an immutable internal Node ID that serves as its primary identifier. A relationship exists only between two nodes and can have its own attributes; the only required attribute is a Type attribute.
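The node-and-relationship model just described can be illustrated with a short, purely in-memory Python sketch. The class and method names here are invented for illustration and are not Neo4j's actual API:

```python
# A minimal in-memory sketch of the graph data model described above.
# Class and method names are illustrative, not Neo4j's actual API.

class GraphDB:
    def __init__(self):
        self._next_id = 0        # mimics Neo4j's immutable internal Node IDs
        self.nodes = {}          # node id -> attribute dict
        self.relationships = []  # (start id, end id, type, attribute dict)

    def create_node(self, **attrs):
        node_id = self._next_id  # assigned once and never changed
        self._next_id += 1
        self.nodes[node_id] = attrs
        return node_id

    def create_relationship(self, start, end, rel_type, **attrs):
        # a relationship exists only between two nodes and always has a Type
        assert start in self.nodes and end in self.nodes
        self.relationships.append((start, end, rel_type, attrs))

    def neighbors(self, node_id, rel_type=None):
        for start, end, rtype, _ in self.relationships:
            if start == node_id and (rel_type is None or rtype == rel_type):
                yield end

db = GraphDB()
aspirin = db.create_node(name="Aspirin")
article = db.create_node(pmid=123456)
db.create_relationship(article, aspirin, "mentions")
print(list(db.neighbors(article, "mentions")))  # -> [0] (aspirin's node id)
```

Note that traversing from a node to its neighbors is a local operation on the node's relationships, with no table-wide join involved; that locality is the property the rest of this paper exploits.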
Neo4j allows for indexing on relationships or nodes; its indexing system is built upon a Lucene core. (Lucene is a full-text search engine built in Java by the Apache Lucene project team; it is built upon user-created indexes, which are made from the documents, or database objects, a user would like to search over.) The legacy version of indexing would index on a node or relationship attribute and required that a node be manually added to the index (aside from one auto_index for nodes and one for relationships, which would automatically adjust on addition, deletion, or update of the appropriate nodes). This system was replaced in Neo4j 2.0 with the advent of labels: special attributes that can be applied to a node or relationship as an identifier (like, but separate from, the Type attribute of a relationship). The new indexing system allows for indexing on an attribute of all nodes or relationships with a specific label and automatically maintains the index as creations, deletions, and edits are made. There are many more specifics to the Neo4j system than those outlined above, but these are the only parts that were truly necessary to build the Chemotext system. An example detailing how indexing works in Neo4j 2.0 is shown in Figure 1.

Figure 1: An example Neo4j database. Every node is of type Person (labels and relationships not shown for clarity). The index on Person(id) lists every node's location. The figure on the right shows the same database after inserting a new node, and the effect on the index. The added node and its entry in the index are highlighted in red.

Chemotext Background and Prior Work

Chemotext was originally conceived of by Nancy Baker during her term as a PhD student at the University of North Carolina.
She wanted a tool that could harness the approximately twenty-three million articles in the MEDLINE database to discover new implicit connections between drugs and diseases for potential new therapeutic drug treatments. To discover these connections, Nancy built a system that utilized Swanson's ABC paradigm: the idea that a drug (A in the diagram) may have an as-yet-unknown connection to a disease (C in the diagram) through a set of intermediate terms (B in the diagram) (Ijaz). Figure 2 shows the theoretical structure of Swanson's ABC paradigm, while Figure 3 shows what it looks like implemented in a Neo4j database.

Figure 2: The theoretical structure of Swanson's ABC paradigm. Chemical A is linked to Disease C through some intermediate terms B.

Figure 3: An example of Swanson's ABC paradigm as it would be modeled in Neo4j.

The system that Nancy built involved two main data sources: the MEDLINE library of journal articles and the associated Dictionary of MeSH Terms (US National Library of Medicine). MEDLINE is the National Library of Medicine's online database of references and abstracts for all medical and many scientific journals published in the US from 1965 onward; the database contains limited records from before this point, stretching back all the way to 1806 (Lindberg). MEDLINE is most easily accessed through PubMed, the online search utility developed in 1996 to allow easy and open access to the MEDLINE library; the National Library of Medicine allows downloading an XML file containing the entirety of the MEDLINE/PubMed article database by FTP (National Library of Medicine). Nancy took the raw XML file and did extensive filtering and processing on the records to obtain her base of articles.

The Dictionary of MeSH Terms is the set of concepts the National Library of Medicine uses as a controlled vocabulary thesaurus.
It is a set of approximately twenty-seven thousand concepts that serve as the overall vocabulary for indexing articles in MEDLINE; for the purposes of the Chemotext system, each MeSH term has been classified by Nancy as a chemical, disease, or protein. By combining the Dictionary of MeSH Terms with the MEDLINE database, Nancy had the set of all chemicals/effects recognized by the National Library of Medicine and over 22 million articles relating the elements of that set.

The initial implementation of Chemotext used a SQL database as the back end and a PHP front end (developed by Samuel Gass, Nick Bartlett, and Chris Allen) to deliver the data. The PHP front end simply made SQL requests to the database using user-provided information and returned the results of the query in a browser-digestible format; the majority of the work in this initial implementation was the structure of the SQL database. The structure consisted of several tables: a table each for chemicals, diseases, and proteins; three tables relating articles to chemicals, diseases, and proteins; and a table of articles. This structure is modeled in Figure 4 (Baker).

Figure 4: A diagram of the relational database for the original implementation of Chemotext.

The Chemotext system performed one main query: provided with a single MeSH Term, it would return a listing of all MeSH Terms related to that MeSH Term by a linking article.
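As a concrete illustration of this table structure and the query it must support, the sketch below rebuilds a tiny version of the relational schema in SQLite (a stand-in for the original SQL back end), with simplified, assumed column names and one fabricated row per table:

```python
# A runnable sketch of the relational layout described above, using SQLite as
# a stand-in for the original SQL back end. Column names are simplified
# assumptions, and the rows are fabricated for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE chemicals (uid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE diseases  (uid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE articles  (pmid INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE chemart   (uid INTEGER, pmid INTEGER);  -- chemical <-> article
    CREATE TABLE disart    (uid INTEGER, pmid INTEGER);  -- disease  <-> article
""")
con.execute("INSERT INTO chemicals VALUES (1, 'Aspirin')")
con.execute("INSERT INTO diseases  VALUES (1, 'Migraine')")
con.execute("INSERT INTO articles  VALUES (100, 'Some article')")
con.execute("INSERT INTO chemart   VALUES (1, 100)")
con.execute("INSERT INTO disart    VALUES (1, 100)")

# The core of the Chemotext query: chemical -> article -> disease needs a
# chain of joins, and every join grows more expensive as the tables grow.
rows = con.execute("""
    SELECT d.name, COUNT(ca.pmid) AS pmid_ct
    FROM chemicals c
    JOIN chemart  ca ON c.uid  = ca.uid
    JOIN disart   da ON ca.pmid = da.pmid
    JOIN diseases d  ON da.uid = d.uid
    WHERE c.name = 'Aspirin'
    GROUP BY d.name
    ORDER BY pmid_ct DESC
""").fetchall()
print(rows)  # -> [('Migraine', 1)]
```

With one row per table the joins are trivial; the point of this section is what happens to that same join chain when the tables hold tens of millions of rows.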
The query format for this database is shown in Figure 5:

SELECT c.name, Count(a.pmid) AS pmid_ct, a.uid, b.uid
FROM chemicals e
JOIN chemart a ON e.uid = a.uid
JOIN disart b ON a.pmid = b.pmid
JOIN diseases c ON b.uid = c.uid
JOIN articles d ON a.pmid = d.pmid
WHERE a.chemname = ""
GROUP BY a.uid, c.name
ORDER BY pmid_ct DESC

Figure 5: The structure of the Chemotext query in the relational version of the database.

An example of this query (using "aspirin" as the starting chemical) across the tables is seen in Figure 6:

Figure 6: An example of how a query issued on the relational version of Chemotext would actually be executed. Source: NoSQL Databases and their Applications: A Case Study by Ian Kim.

The main bottleneck in the system was the multiple joins required to perform a single query; when the database was fully loaded, these joins could occur across several tables with tens of millions of rows, resulting in a total query time of up to fifteen minutes in some cases (Kim). This performance was unacceptable in a web application and rendered the system largely useless.

The next version of Chemotext was developed largely by Ian Kim. Recognizing that the performance of the database was limited by the inherent structure of a relational database, he decided to move the back end of the system to a database built on Neo4j. This involved a change in the data model.

Figure 7: A diagram of the schema of Ian's Neo4j implementation of Chemotext.

The new data model (pictured in Figure 7) placed both chemicals/effects and articles as nodes, with relationships between chemicals/effects and articles.
This vastly simplified the query syntax; the equivalent query in the Neo4j database to the one shown in Figure 5 is shown in Figure 8:

START chem=node:node_auto_index(lc_name="aspirin")
MATCH chem <-- article --> other
WHERE other.type = "discond"
RETURN other.name

Figure 8: The query structure for the standard Chemotext query under Ian's Neo4j implementation.

Ian then implemented this data model using only articles from 2002, taking the data from the original relational database developed by Nancy's team and ending up with approximately fifty thousand terms and articles and four hundred thousand relationships (Kim). Using this new system, Ian was able to reduce query times down to 60 milliseconds for his benchmark query, the query for all terms related to aspirin shown in Figure 8.

Creating the Full Database

Having taken the project over from Ian, our first goal was to create a working instance of the database that included all the articles from the original relational database and could be considered up to date. Initial attempts to do so involved downloading the raw XML file from the NLM and processing it using Python. The Python module lxml (Behnel), a Pythonic wrapper for the C libraries libxml2 and libxslt, proved very useful in this regard; the native speed of the C libraries, combined with code from Liza Daly created specifically to handle large XML files, turned the process of parsing a 90+GB file into a matter of hours (approximately 18-24) instead of days (Daly). However, even with improved processing time, we could not reproduce the filtering and labeling of data that Nancy had created in the original relational system: during her processing, she trimmed the number of articles from the approximately 20 million in PubMed to just over 11 million relevant articles, and she added type (chemical, disease, or protein) and other annotations to the MeSH Terms.
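The pipeline used lxml, but the streaming pattern from Liza Daly's article can be sketched with the standard library's ElementTree as well: iterate over "end" events and clear each finished element, so the full 90+GB file never has to reside in memory at once. The two-record XML string below is a fabricated stand-in for a MEDLINE export:

```python
# Sketch of the streaming-parse pattern used for the MEDLINE XML file.
# The thesis used lxml; the stdlib ElementTree shown here supports the same
# iterparse-and-clear idiom. The sample XML is fabricated for illustration.
import io
import xml.etree.ElementTree as ET

medline_sample = io.BytesIO(b"""
<MedlineCitationSet>
  <MedlineCitation><PMID>100</PMID></MedlineCitation>
  <MedlineCitation><PMID>101</PMID></MedlineCitation>
</MedlineCitationSet>
""")

pmids = []
for event, elem in ET.iterparse(medline_sample, events=("end",)):
    if elem.tag == "MedlineCitation":
        pmids.append(elem.findtext("PMID"))
        elem.clear()  # free the finished subtree; this keeps memory flat

print(pmids)  # -> ['100', '101']
```

On a real MEDLINE dump the same loop runs record by record, so memory use stays proportional to one citation rather than to the whole file.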
Thankfully, contacting Nancy revealed that she had continued to curate and update her data up to the current year, so we used her records of the MeSH Terms and articles (current up through 2013) as the basis of the new Neo4j database.

Creating the Neo4j database presented its own problems: attempting to insert data into an empty Neo4j database using the REST API and a "CREATE NODE" or "CREATE RELATIONSHIP" statement involves a separate transaction for each CREATE statement (Espeed). However, Neo4j offers a batch import tool that is specifically designed for importing large amounts of data quickly. The native Batch Importer is written in Java (like the rest of Neo4j), but Michael Hunger, one of the Neo4j developers, has written a Pythonic wrapper for it; Ian used this to create his version of the Neo4j database (Kim). The batch import tool takes in a configuration file, along with one or more tab-separated CSV files for the nodes and zero or more tab-separated CSV files for the relationships (Hunger). Using the batch import tool, we created two CSV files for nodes and two for relationships. The format of the database (nodes as circles and relationships as arcs) is shown in Figure 9:

Figure 9: The new schema for the Neo4j implementation, including the addition of labels as well as the new isMasterTerm relationship. Note that the "Type" attributes for the relationships are not labels and have fixed values (isMasterTerm for the relationship between a master term and its synonym, and mentions for the relationship between an article and a MeSH Term it mentions).

Creation of the database takes approximately six hours with the batch import tool, and the database is ready to be used once the tool is done.
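Generating the tab-separated input files for such a batch importer can be sketched as follows; the exact header and column conventions vary between importer versions, so the headers below are illustrative assumptions rather than the tool's documented format:

```python
# Sketch of producing tab-separated node and relationship files for a batch
# importer. Header names ("name", "type", "l:label", "start", "end") are
# assumptions for illustration, not the importer's documented syntax.
import csv
import io

nodes = io.StringIO()
writer = csv.writer(nodes, delimiter="\t", lineterminator="\n")
writer.writerow(["name", "type", "l:label"])      # properties plus a label column
writer.writerow(["Aspirin", "chemical", "MeSHTerm"])
writer.writerow(["Migraine", "disease", "MeSHTerm"])

rels = io.StringIO()
writer = csv.writer(rels, delimiter="\t", lineterminator="\n")
writer.writerow(["start", "end", "type"])         # node references plus rel Type
writer.writerow([1, 2, "mentions"])

print(nodes.getvalue().splitlines()[0])           # header row, tab-separated
```

In the real pipeline the writers stream millions of rows to files on disk; the importer then builds the store directly, avoiding one REST transaction per CREATE statement.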
Once appropriate performance measures have been applied to the database (see the section on indexing), the benchmark query for all terms related to aspirin (but with a search space roughly twenty times the size) in the full implementation runs in 5080 ms, approximately 100 times slower than the same query run on Ian's implementation. However, an average speed test of the benchmark query on 1000 random terms resulted in a mean query time of 650 ms and a median query time of 42 ms, implying that the aspirin query is not a good indicator of the average query time.

Synonyms

Once we had upgraded the database to include the most up-to-date data from Nancy, the next step was to add some functionality to the database. There are approximately 20,000 terms in the MeSH Dictionary, but each term has a number of alternate names (referred to from now on as "synonyms") that it may be known by. In order to prevent confusion if a researcher enters a synonym and the database returns zero results (since the articles reference the master term, or canonical term, not its synonyms), we built the "isMasterTerm" relationship into the database: it connects every synonym with the canonical term it refers to. In order to avoid errors when a researcher uses the canonical term in a search (and to avoid having to create a second query syntax), the canonical term has an "isMasterTerm" relationship from itself to itself. The query syntax for getting related terms after this change to the database is shown in Figure 10, and a graph representation of the query is shown in Figure 11:

MATCH (synonym:MeSHTerm) <-[:isMasterTerm]- (canonicalTerm:MeSHTerm)
      <-[:mentions]- (article:Article) -[:mentions]-> (relatedTerm:MeSHTerm)
WHERE synonym.name = "<name>"
RETURN relatedTerm.name

Figure 10: The updated query syntax, including the addition of labels and the isMasterTerm relationship.

Figure 11: A graph representation of the query in Figure 10.
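The effect of the self-referencing isMasterTerm relationship can be sketched with plain dictionaries: because a canonical term points at itself, synonym lookups and canonical lookups share a single code path. The terms below are fabricated for illustration:

```python
# Sketch of the synonym scheme described above: every term points at its
# canonical ("master") term, and a canonical term points at itself, so one
# lookup path handles both cases. All terms here are fabricated.

is_master_term = {
    "Acetylsalicylic Acid": "Aspirin",  # synonym -> canonical term
    "ASA": "Aspirin",
    "Aspirin": "Aspirin",               # canonical -> itself (self-relationship)
}

related_terms = {"Aspirin": {"Migraine", "Stroke"}}  # canonical -> related terms

def related(term):
    canonical = is_master_term[term]    # a single hop, never a special case
    return related_terms.get(canonical, set())

print(sorted(related("ASA")))      # -> ['Migraine', 'Stroke']
print(sorted(related("Aspirin")))  # same result, same code path
```

Without the self-relationship, the query would need either a second syntax for canonical terms or a conditional branch before the traversal; the self-loop removes both.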
Performance Testing

Before we managed any performance testing, we discovered one very important fact about the database: due to its size (>100 million primitives), the system we were running it on did not have enough available RAM to run queries efficiently. According to the Neo4j documentation, a system with approximately 100 million primitives needs a minimum of 4GB of available RAM for Neo4j; working with less than that requires paging out the database constantly, just as happened with the relational version of the database (Neo4j). Accordingly, we moved the Neo4j implementation to a system with 7.8GB of RAM and eight i7-3770 (3.40GHz) CPUs.

Our next step was to determine if we could improve the performance of the database in any way. Our first test was to see if reintroducing SQL to a portion of the process could improve performance. We created a standard relational database in MySQL that contained a single table, which held the synonym relationship captured in the "isMasterTerm" relationship. We then tested the performance of getting the synonym through the relational database and querying for related terms to the canonical term in Neo4j, versus doing the entire query (synonym relationship included) in Neo4j. The thinking was that the SQL lookup would be very fast and would save time searching in the Neo4j database; however, this was not borne out in the testing. The results of 5 tests of 100 random terms from the database are shown in Figure 12:

                        Mean Time (s)    Median Time (s)
Native Neo4j request        2.93              0.59
Neo4j + SQL request         5.09              0.65

Figure 12: A table showing the results of the two implementations of the synonym function. The implementation using both SQL and Neo4j had a slower mean and median time than the implementation using only Neo4j.
The minuscule difference in median time meant that either solution was acceptable; we opted to continue with the synonym relationships in Neo4j, as this not only simplified the query process by eliminating an entire system, but also freed up system memory for Neo4j.

The next performance step was to try including a middleman to limit requests to the Neo4j database. We chose to use Memcached, a high-performance distributed caching system originally developed by Danga Interactive (Memcached). Memcached not only allows you to use excess memory for caching, but also allows you to spread the load across multiple systems, creating a consistent virtual memory across all of them instead of duplicating a cache on each machine separately (Memcached). The idea was that with so many related terms, a batch of queries would request a small section of localized, repeated nodes, and we could store those in a cache instead of going to the database. Further reflection revealed that this cache would be of limited usefulness: it could not efficiently store the relationships between a MeSH Term and its articles and then the relationships between those articles and other terms without duplicating a portion of the Neo4j database, which it would do a poorer job of and which would be an inefficient use of memory. However, the caching system could still be useful for requesting information on a single node, so we decided to test it.

The testing protocol required picking 100 random terms and alternating randomly between getting the related terms for a term and getting the term's node. If it got the related terms for the node, it would cache any of the terms that weren't currently cached in Memcached; if it was getting the node of a term, it would check Memcached first and only request the node from Neo4j if Memcached was not storing the node.
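The cache-aside pattern at the heart of this protocol can be sketched as follows, with a plain dict standing in for the Memcached client and a counter standing in for round trips to Neo4j; all names are illustrative:

```python
# Sketch of the cache-aside lookup used in the Memcached test. A dict stands
# in for the Memcached client and a counter for Neo4j round trips; the names
# and the fetched "node" are illustrative, not real client APIs.

cache = {}
db_hits = 0

def fetch_node_from_db(name):
    global db_hits
    db_hits += 1                     # stands in for a round trip to Neo4j
    return {"name": name}

def get_node(name):
    if name in cache:                # check Memcached first ...
        return cache[name]
    node = fetch_node_from_db(name)  # ... fall through to Neo4j on a miss ...
    cache[name] = node               # ... and populate the cache for next time
    return node

get_node("Aspirin")
get_node("Aspirin")                  # second lookup is served from the cache
print(db_hits)  # -> 1
```

The pattern only pays off when the fetch it avoids is expensive; as the results below show, fetching a tiny Neo4j node was already cheaper than the extra hop to the cache.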
That protocol was compared to one where the same 100 random terms were searched (in the same order), but using only Neo4j's native caching agent. The test results can be seen below:

                              Mean Time (ms)    Median Time (ms)
Native Neo4j caching               0.142             0.129
Memcached-assisted caching         0.221             0.216

Figure 13: A table showing the performance differences between caching using Memcached and caching using Neo4j's native caching utility.

Since the nodes in the Neo4j database are so small (the amount of data they contain is tiny), querying for them directly is so inexpensive that going out to another system for caching actually slows the system down. We also considered implementing an ElasticSearch client to sit on top of the Neo4j database, but further examination revealed that ElasticSearch, which is designed largely for full-text indexing and search, would only hinder the performance of the database (ElasticSearch).

The single largest improvement we made to the database was indexing, using Neo4j's built-in schema indexes. Adding a single index to the database, on the name attribute of nodes labeled "MeSHTerm", improved performance considerably. Attempting to run the query shown in Figure 10 without an index on MeSHTerm(name) resulted in queries that would time out after close to 30 minutes; the same query would run in an average of 18 ms with an active index on MeSHTerm(name). We presumed that without an index to locate the initial MeSH Term node, the sheer size of the database resulted in many swaps of memory in order to load all of the nodes labeled MeSHTerm, resulting in long run times, just as in the relational implementation (though the paging to memory was happening there for a different reason: the large joins).

Summary

The overall conclusions we drew from this research were the following. Neo4j (and arguably graph databases overall) is no longer beta software.
When you can create a database of over a hundred million primitives that performs at sub-second query times, the software is mature enough to be usable in a production environment. The specifics of your problem determine whether a graph database will perform well. The Chemotext problem was uniquely suited to a graph database, because it represents a "friend of a friend", or second-degree separation, connection. This second-degree connection, if solved in a relational database, requires multiple joins and becomes prohibitively expensive as soon as the tables of data reach any reasonably large size. When the problem is right, a standalone graph database is the best option for performance, and trying to add other software to improve performance will either produce no improvement or actually degrade performance, as evidenced by the tests using MySQL and Memcached.

Lastly, it cannot be overstated that this research would not have been possible without the prior work of Nancy Baker, who developed the initial concept and database for Chemotext; Samuel Gass, Nick Bartlett, and Chris Allen, who developed the PHP-based front end for Chemotext; Ian Kim, who developed the initial Neo4j implementation of Chemotext; and Diane Pozefsky, who advised all of those people through the project's life. I am greatly indebted to them for laying the groundwork for this project and giving me an opportunity to research this topic.

Works Cited

Baker, Nancy C., and Bradley M. Hemminger. "Abstract." National Center for Biotechnology Information. U.S. National Library of Medicine, 27 Mar. 2010. Web. 10 Apr. 2014. <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2902698/>.

Behnel, Stefan. "Lxml - XML and HTML with Python." Lxml. N.p., n.d. Web. 13 Apr. 2014. <http://lxml.de/index.html#introduction>.

Codd, Edgar F. "A Relational Model of Data for Large Shared Data Banks." N.p., n.d. Web. 10 Apr. 2014. <http://technology.amis.nl/wp-content/uploads/images/RJ599.pdf>.

Daly, Liza.
"High-performance XML Parsing in Python with Lxml." IBM DeveloperWorks, 24 Mar. 2011. Web. 21 Apr. 2014. <http://www.ibm.com/developerworks/xml/library/x-hiperfparse/>.

"DB-Engines Ranking." DB-Engines. N.p., Apr. 2014. Web. 10 Apr. 2014. <http://db-engines.com/en/ranking>.

Eifrem, Emil. "Neo4j -- or Why Graph Dbs Kick Ass." N.p., 22 Nov. 2008. Web. 10 Apr. 2014. <http://www.slideshare.net/emileifrem/neo4j-presentation-at-qcon-sf-2008-presentation>.

"Open Source Distributed Real Time Search & Analytics | Elasticsearch." Elasticsearch.org. Elasticsearch BV, n.d. Web. 19 Apr. 2014. <http://www.elasticsearch.org/>.

Espeed. "Fastest Way to Perform Bulk Add/insert in Neo4j with Python?" Stack Overflow. N.p., 1 Oct. 2012. Web. 13 Apr. 2014. <http://stackoverflow.com/questions/12643662/fastest-way-to-perform-bulk-add-insert-in-neo4j-with-python>.

Guzunda, Leon, and Nick Quinn. "An Introduction to Graph Databases." N.p., n.d. Web. 10 Apr. 2014. <http://www.slideshare.net/infinitegraph/an-introduction-to-graph-databases>.

Haugen, Knut. "A Brief History of NoSQL." All About the Code. N.p., May 2010. Web. 10 Apr. 2014. <http://blog.knuthaugen.no/2010/03/a-brief-history-of-nosql.html>.

Horne, Christopher. "IQube Marketing Limited » Glossary of Big Data Terminology." IQube Marketing Limited. N.p., n.d. Web. 10 Apr. 2014. <http://www.iqubemarketing.com/glossary-big-data-terminolgy/>.

Hunger, Michael. "Public Jexp/batch-import." GitHub. N.p., n.d. Web. 13 Apr. 2014. <https://github.com/jexp/batch-import>.

Ijaz, Ali Z., Min Song, and Doheon Li. "MKEM: A Multi-level Knowledge Emergence Model for Mining Undiscovered Public Knowledge." BMC Bioinformatics. N.p., n.d. Web. 24 Apr. 2014. <http://www.biomedcentral.com/1471-2105/11/S2/S3>.

Lindberg, Donald. "Internet Access to the National Library of Medicine." ACP Online. N.p., Sept. 2000. Web. 10 Apr. 2014.
<http://www.acponline.org/clinical_information/journals_publications/ecp/sepoct00/nlm.pdf>.

Maurer, H., and N. Scherbakov. "1. Network (CODASYL) Data Model." N.p., n.d. Web. 10 Apr. 2014.

"Memcached - a Distributed Memory Object Caching System." Memcached.org. Dormando, n.d. Web. 19 Apr. 2014. <http://memcached.org/about>.

Neo4j. "22.8. JVM Settings." N.p., n.d. Web. 13 Apr. 2014. <http://docs.neo4j.org/chunked/stable/configuration-jvm.html>.

Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. Boston: McGraw-Hill, 2003. Print.

"Using PubMed." National Center for Biotechnology Information. U.S. National Library of Medicine, n.d. Web. 19 Apr. 2014. <http://www.ncbi.nlm.nih.gov/pubmed>.

Vardanyan, Mikayel. "Picking the Right NoSQL Database Tool." Monitis Blog. N.p., 22 May 2011. Web. 10 Apr. 2014. <http://blog.monitis.com/2011/05/22/picking-the-right-nosql-database-tool/>.

Vaughn, John. "CPSC 343: A Sketch of Database History." N.p., n.d. Web. 10 Apr. 2014.