Download Gottfried Thesis First Draft

The last fifteen years have seen a renaissance in database software, as developers and
consumers have finally agreed to branch out beyond the standard of relational, SQL-based database
software. One of the new database types that has come to the forefront (under the umbrella of “NoSQL
databases”) is the graph database. This paper will examine the performance of a specific graph database
(Neo4j) on a specific problem, and draw more general conclusions about whether a graph database
is the right choice for a prospective problem.
Relational Databases and NoSQL
The general history of database software can be traced back to the 1960s, when computing
power became efficient and cheap enough to encourage storing data using software instead of physical
methods (Vaughn). This saw the rise of two main data models: the Hierarchical System and the
Network System. The Hierarchical System organizes data into a tree structure, where each node (aside
from the root) has one (and only one) parent node and zero or more children; the most widely used
Database Management System (DBMS) today that utilizes a Hierarchical Model is IBM’s IMS
(Vaughn). The Network System, by contrast, relaxes the restriction on how many parents a node may
have, allowing for a general graph structure where any node is connected to zero or more other nodes
by one or more relationships from a defined set of relationships (Maurer).
In 1970, E. F. Codd produced a paper detailing the benefits of a “Relational” model over a
Hierarchical or Network model (Codd). The Relational model separated data into “relations”, which are
sets of tuples (the tuples are usually represented as rows in a table), where each relation has a fixed
schema that determines the components (and the possible values for those components) for every tuple
in the relation; the relational model can be seen in most major database projects today, like Oracle
Database, Microsoft’s SQL Server, MySQL, etc. (Ramakrishnan). The relational database became the
dominant model over the next 30 years, until the beginnings of the NoSQL movement in the late 1990s.
Gottfried 2
The need for NoSQL databases became apparent as businesses began collecting large amounts of data
that didn’t easily fit into a relational model; with data too complex to fit into a few tables, many
businesses started seeing extremely poor performance using relational databases, due to the high
cost of multiple joins in one query. NoSQL databases (literally, “Not only SQL”, though most NoSQL
databases are non-relational) have been around since the 1960s as well, but only began to flourish in
the late 1990s and early 2000s, when development began on several different NoSQL databases,
including Google’s Bigtable, Memcached, and Neo4j (Haugen). As the name suggests, NoSQL databases
don’t normally use SQL as a query language and usually have a non-relational model; instead, they
have a data model that emphasizes scalability, to handle large amounts of data, and flexibility, to
handle complex or sparse data types. By 2010, academics had started to separate NoSQL databases into
four types: key-value stores, column stores, document databases, and graph databases. Key-value stores
are databases whose data model is a value and a key used to access it; document databases are similar to
key-value stores, except that they store documents (complex data objects) instead of simple values.
Column stores are like key-value stores, except they are organized into a vertical structure, with values
sharing similar columns or families. Graph databases are built on graph theory, and their underlying
data model consists of nodes, which represent objects, and relationships between those nodes.
Graph databases and Neo4j
This paper is specifically concerned with graph databases, as the testing done in this research
was on an instance of Neo4j, a graph database software that was created, or at least conceived of, in
2000 (Haugen). A graph database uses a data model where data is stored as nodes and relationships; the
nodes represent entities in the database, and data about an entity's relationship to another entity is
represented in the relationship between the two. Graph databases are the logical and spiritual
successors to the Network Database model that was created in the 1960s. The Network Model was
replaced by the Relational Model as hardware improvements made the Network Model’s performance
advantage less important and businesses realized the flexibility of a relational model (Vaughn). The
Relational Model made sense so long as businesses were keeping track of disconnected or only loosely
connected data, but as businesses evolved and started gathering data that more closely approximates the
real world (where most data is interconnected in one or more ways), the Relational Model became a
hindrance (Eifrem). In order to model those relationships in a Relational Model, you must perform
joins on the tables involved; as tables get larger, joins become increasingly expensive to perform.
A specific example of this, complete with numbers, can be seen in the background of the Chemotext
project (Baker). Neo4j was created in response to these issues with databases, and is probably the
most widely used graph database software in development today (DB-Engines Ranking).
Neo4j follows the basic graph database model, having two primitive types: nodes and
relationships. Nodes can carry one or more attributes; in addition, every node has a unique internal
Node ID that cannot be changed and that serves as its primary identifier. Relationships exist only between
two nodes and can have their own attributes; the only required attribute is a Type attribute.
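As a rough illustration of these two primitives, the sketch below models nodes and relationships in plain Python. The class and field names (Node, Relationship, rel_type) and the article ID are hypothetical and are not part of Neo4j's API; the sketch only mirrors the constraints described above, namely a fixed internal ID per node and a required type on every relationship.

```python
from dataclasses import dataclass, field
from itertools import count

_next_id = count()  # stand-in for an internal node ID allocator


@dataclass(frozen=True)
class Node:
    """A node with free-form attributes and an internal ID that cannot be reassigned."""
    properties: dict
    node_id: int = field(default_factory=lambda: next(_next_id))


@dataclass(frozen=True)
class Relationship:
    """A typed, directed edge between exactly two nodes."""
    start: Node
    end: Node
    rel_type: str  # the only required attribute of a relationship
    properties: dict = field(default_factory=dict)


aspirin = Node({"name": "aspirin"})
article = Node({"pmid": 12345})  # hypothetical article identifier
mention = Relationship(article, aspirin, "mentions")
```

Because both classes are frozen, an attempt to reassign `aspirin.node_id` raises an error, which loosely mirrors the rule that the internal ID of a node cannot be changed.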
Neo4j allows for indexing (its indexing system is built upon a Lucene¹ core) on relationships or
nodes; the legacy version of indexing would index upon a node or relationship attribute and required
that a node be manually added to the index (aside from one auto_index for nodes and one for
relationships; these indexes would automatically adjust on addition/deletion/update of appropriate
nodes). This system was replaced in Neo4j 2.0 with the advent of labels. Labels are special identifiers
that can be applied to a node (like, but separate from, the Type attribute of a relationship); the new
indexing system allows for indexing on an attribute of all nodes with a specific label and automatically
maintains the index as nodes are created, deleted, and updated. There are many more specifics to the
Neo4j system than those outlined above, but those are the only parts that were required to build the
Chemotext system.
¹ Lucene is a full-text search engine built in Java by the Apache Lucene project team. It is built upon user-created indexes,
which are made from the documents (or database objects) a user would like to search over.
An example detailing how indexing works in Neo4j 2.0 is shown in Figure 1:
Figure 1: An example Neo4j database. Every node is of type Person (labels and relationships not shown for clarity). The index on Person(id) lists
every node’s location. The figure on the right shows the same database after inserting a new node, and the effect on the index. The added node
and its entry in the index are highlighted in red.
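The behaviour described for Figure 1 can be sketched in a few lines of Python. This is a toy model, not Neo4j's implementation: the class names, the use of a list position as a node "location", and the sample people are all assumptions made for illustration. The point is only that the index entry is written as part of the insert, so no manual index maintenance is needed.

```python
class PersonIndex:
    """Toy index on Person(id): maps an id value to the node's storage location."""

    def __init__(self):
        self._by_id = {}

    def add(self, person_id, location):
        self._by_id[person_id] = location

    def lookup(self, person_id):
        return self._by_id.get(person_id)


class Database:
    def __init__(self):
        self.nodes = []  # the position in this list plays the role of "location"
        self.index = PersonIndex()

    def create_person(self, person_id, name):
        self.nodes.append({"id": person_id, "name": name})
        location = len(self.nodes) - 1
        self.index.add(person_id, location)  # index updated as part of the insert
        return location


db = Database()
db.create_person(7, "Ann")
db.create_person(3, "Bob")
new_location = db.create_person(9, "Cho")  # the newly inserted node
```

After the third insert, `db.index.lookup(9)` returns the location of the new node without scanning `db.nodes`.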
Chemotext Background and Prior Work
Chemotext was originally conceived of by Nancy Baker during her term as a PhD student at the
University of North Carolina. She wanted a tool that could harness the approximately twenty-three
million articles in the MEDLINE database to discover new implicit connections between drugs and
diseases for potential new therapeutic drug treatments.
To discover these connections, Nancy built a system that utilized Swanson’s ABC paradigm:
that a drug (A in the diagram) may have an as-yet-unknown connection to a disease (C in the diagram)
through a set of intermediate terms (B in the diagram) (Ijaz). Figure 2 shows the theoretical structure of
Swanson’s ABC Paradigm, while Figure 3 shows what it looks like implemented in a Neo4j database.
Figure 2: This is the theoretical structure of Swanson’s ABC paradigm. Chemical A is linked to Disease C through some intermediate
terms B.
Figure 3: An example of Swanson’s ABC paradigm as it would be modeled in Neo4j
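A minimal sketch of the ABC idea, under stated assumptions, is given below. The article-to-term data, the term names, and the function names are all hypothetical, and the sketch ignores the chemical/disease/protein typing that Chemotext uses; it only shows how a C term can be reached from A through intermediate B terms when no article links A and C directly.

```python
from collections import defaultdict

# Hypothetical co-occurrence data: article ID -> set of terms it mentions.
articles = {
    1: {"drug_A", "term_B1"},
    2: {"drug_A", "term_B2"},
    3: {"term_B1", "disease_C"},
    4: {"term_B2", "disease_C"},
    5: {"drug_A", "disease_D"},
}


def neighbours(term, articles):
    """Terms that share at least one article with `term`."""
    found = set()
    for terms in articles.values():
        if term in terms:
            found |= terms - {term}
    return found


def implicit_connections(a_term, articles):
    """Map each C term to the B terms linking it to A.

    C terms that already co-occur with A in some article are skipped,
    since those connections are explicit rather than implicit.
    """
    direct = neighbours(a_term, articles)
    links = defaultdict(set)
    for b_term in direct:
        for c_term in neighbours(b_term, articles):
            if c_term != a_term and c_term not in direct:
                links[c_term].add(b_term)
    return dict(links)


result = implicit_connections("drug_A", articles)
```

With this sample data, `disease_C` is reported with both intermediate terms, while `disease_D` is left out because article 5 already links it to `drug_A` directly.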
The system that Nancy built involved two main data sources: the MEDLINE library of journal
articles and the associated Dictionary of MeSH Terms (US National Library of Medicine). MEDLINE
is the National Library of Medicine's online database of references and abstracts for all medical and
many scientific journals published in the US from 1965 onward; the database contains limited records
from before this point, stretching back all the way to 1806 (Lindberg). MEDLINE is most easily
accessed through PubMed, the online search utility developed in 1996 to allow for easy and open
access to the MEDLINE library; the National Library of Medicine allows for downloading an XML file
containing the entirety of the MEDLINE/PubMed article database by FTP (National Library of
Medicine). Nancy took the raw XML file and did extensive filtering and processing on the records to
obtain her base of articles.
The dictionary of MeSH Terms is the set of concepts the National Library of Medicine uses as a
controlled vocabulary thesaurus. It is a set of approximately twenty-seven thousand concepts that serve as
the overall vocabulary for indexing articles in MEDLINE; each MeSH term has been classified by
Nancy as a chemical, disease, or protein for the purposes of the Chemotext system. By combining the
Dictionary of MeSH Terms with the MEDLINE database, Nancy had the set of all chemical/effects
recognized by the National Library of Medicine and over 22 million articles relating the elements of that
set.
The initial implementation for Chemotext used a SQL database as the back-end database and a
PHP front-end (developed by Samuel Gass, Nick Bartlett, and Chris Allen) to deliver the data. The
PHP front-end just made SQL requests to the database using user-provided information and returned
the results of the query in a browser-digestible format; the majority of the work in this initial
implementation was the structure of the SQL database. The structure consisted of several tables: a table
each for chemicals, diseases, and proteins, three tables relating articles to chemicals, diseases, and
proteins, and a table of articles; this structure is modeled in Figure 4 (Baker).
Figure 4: A diagram of the relational database for the original implementation of Chemotext
The Chemotext system performed one main query: if provided with a single MeSH Term, it
would return a listing of all MeSH Terms related to that MeSH Term by a linking article. The query
format for this database is shown in Figure 5:
SELECT Count(a.pmid) AS pmid_ct,
       ...
FROM   chemicals e
       JOIN chemart a
         ON e.uid = a.uid
       JOIN disart b
         ON a.pmid = b.pmid
       JOIN diseases c
         ON b.uid = c.uid
       JOIN articles d
         ON a.pmid = d.pmid
WHERE  a.chemname = ""
GROUP  BY a.uid,
          ...
Figure 5: The structure of the Chemotext query in the relational version of the database
An example of this query (using “aspirin” as the starting chemical) across the tables is seen in Figure 6:
Figure 6: An example of how a query issued on the relational version of Chemotext would actually be executed. Source: NoSQL Databases and
their Applications: A Case Study by Ian Kim
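To make the join chain concrete, the following sketch rebuilds a tiny version of the schema in an in-memory SQLite database and runs a query of the same shape. The table names and the uid, pmid, and chemname columns follow Figure 5; everything else (the name columns, the sample rows, the selected columns, and the grouping by disease) is an assumption made for illustration and is not the original Chemotext query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chemicals (uid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE diseases  (uid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE articles  (pmid INTEGER PRIMARY KEY);
    CREATE TABLE chemart   (uid INTEGER, pmid INTEGER, chemname TEXT);
    CREATE TABLE disart    (uid INTEGER, pmid INTEGER);

    INSERT INTO chemicals VALUES (1, 'aspirin');
    INSERT INTO diseases  VALUES (10, 'headache'), (11, 'fever');
    INSERT INTO articles  VALUES (100), (101), (102);
    INSERT INTO chemart   VALUES (1, 100, 'aspirin'), (1, 101, 'aspirin'), (1, 102, 'aspirin');
    INSERT INTO disart    VALUES (10, 100), (10, 101), (11, 102);
""")

# Four joins are needed to walk chemical -> article -> disease.
query = """
    SELECT c.name, COUNT(a.pmid) AS pmid_ct
    FROM chemicals e
    JOIN chemart  a ON e.uid  = a.uid
    JOIN disart   b ON a.pmid = b.pmid
    JOIN diseases c ON b.uid  = c.uid
    JOIN articles d ON a.pmid = d.pmid
    WHERE a.chemname = ?
    GROUP BY c.uid, c.name
    ORDER BY pmid_ct DESC, c.name
"""
rows = conn.execute(query, ("aspirin",)).fetchall()
```

On these few rows the query is instantaneous; the cost discussed in the text only appears when each joined table holds tens of millions of rows.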
The main bottleneck in the system was the multiple joins required to perform a single query;
when the database was fully loaded, these joins could occur across several tables with tens of millions
of rows, resulting in a total query time of up to fifteen minutes in some cases (Kim). This performance
was unacceptable in a web application and rendered the system largely useless.
The next version of Chemotext was developed largely by Ian Kim. Recognizing that the
performance of the database was limited by the inherent structure of a relational database, he decided to
move the back end of the system to a database built on Neo4j. This involved a change in the data model.
Figure 7: A diagram of the schema of Ian’s Neo4j implementation of Chemotext
The new data model (pictured in Figure 7) placed both chemicals/effects and articles as nodes,
with relationships between chemicals/effects and articles. This vastly simplified the query syntax; the
equivalent query in the Neo4j database to the one shown in Figure 5 is shown in Figure 8:
START chem=node:node_auto_index(lc_name="aspirin")
MATCH chem<--articles-->other
WHERE other.type = "discond"
Figure 8: The query structure for the standard Chemotext query under Ian’s Neo4j implementation
Ian then implemented this data model using only articles from 2002, taking the data from the
original Relational database developed by Nancy’s team, ending up with approximately fifty thousand
terms and articles and four hundred thousand relationships (Kim). Using this new system, Ian was able
to reduce query times down to 60 milliseconds for his benchmark query, which is the query for all
terms related to aspirin that was shown in Figure 8.
Creating the Full Database
Having taken the project over from Ian, our first goal was to create a working instance of the
database that included all the articles from the original Relational database and could be considered
up-to-date. Initial attempts to do so involved downloading the raw XML file from the NLM and
processing it using Python. The Python module lxml (Behnel), a Pythonic wrapper for the C libraries
libxml2 and libxslt, proved very useful in this regard; the native speed of the C libraries, combined with
custom code from Liza Daly created specifically to handle large XML files, turned the process of
parsing a 90+GB file into a matter of hours (approximately 18-24) instead of days (Daly). However,
even with improved processing time, we could not reproduce the filtering and labeling of data that
Nancy had created in the original relational system, as during her processing, she trimmed the number
of articles from the approximately 20 million in PubMed to just over 11 million relevant articles, as
well as added type (chemical, disease, or protein) and other annotations to the MeSH Terms.
Thankfully, contacting Nancy revealed that she had continued to curate and update her data up to the
current year, so we used her records of the MeSH Terms and Articles (current up through 2013) as the
basis of the new Neo4j database.
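The streaming approach that makes a file of this size tractable can be sketched as follows. This is not the code used in the project: it substitutes the standard library's xml.etree.ElementTree.iterparse for lxml so that the example has no external dependency, and the element names (MedlineCitation, PMID, DescriptorName) and the sample records are assumptions about the input layout. The idea is to handle one record at a time and release it before reading the next.

```python
import io
import xml.etree.ElementTree as ET


def iter_citations(source, record_tag="MedlineCitation"):
    """Yield (pmid, [descriptor names]) for one record at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        pmid = elem.findtext("PMID")
        terms = [d.text for d in elem.iter("DescriptorName")]
        yield pmid, terms
        elem.clear()  # drop the record's children so the parsed subtree can be freed


sample = io.BytesIO(b"""<MedlineCitationSet>
  <MedlineCitation><PMID>1</PMID>
    <MeshHeadingList><MeshHeading><DescriptorName>Aspirin</DescriptorName></MeshHeading></MeshHeadingList>
  </MedlineCitation>
  <MedlineCitation><PMID>2</PMID>
    <MeshHeadingList><MeshHeading><DescriptorName>Fever</DescriptorName></MeshHeading>
      <MeshHeading><DescriptorName>Aspirin</DescriptorName></MeshHeading></MeshHeadingList>
  </MedlineCitation>
</MedlineCitationSet>""")

records = list(iter_citations(sample))
```

The same loop works on a file path or an open file handle in place of the in-memory sample, since iterparse accepts either.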
Creating the Neo4j database presented its own problems: attempting to insert data into an empty
Neo4j database using the REST API and a “CREATE NODE” or “CREATE RELATIONSHIP”
statement involves a separate transaction for each CREATE statement, which adds considerable overhead
when loading a database of over a hundred million primitives (Espeed). However, Neo4j offers
a batch import tool that is specifically designed for importing large amounts of data quickly; the native
Batch Importer is written in Java (like the rest of Neo4j), but Michael Hunger, who is one of the Neo4j
developers, has written a Pythonic wrapper for the Batch Importer; Ian used this to create his version
of the Neo4j database (Kim). The batch import tool takes in a configuration file, along with one or
more tab-separated CSV files for the nodes and zero or more tab-separated CSV files for the relationships
(Hunger). For the batch import tool, we created two CSV files for nodes and two for relationships.
The format of the database (nodes as circles and relationships as arcs) is shown in Figure 9:
Figure 9: The new schema for the Neo4j implementation, including the addition of labels as well as the new isMasterTerm relationship.
Note that the “Type” attributes for the relationships are not labels and they have fixed values
(isMasterTerm for the relationship between a master term and its synonym and mentions for the
relationship between an article and a MeSH Term it mentions).
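Producing tab-separated input files of this kind is straightforward with Python's csv module, as the sketch below suggests. The column headers (name, type, start, end) and the sample rows are hypothetical; the exact header format expected by the batch import tool is not reproduced here and should be taken from its documentation. The relationship rows show a canonical term pointing at a synonym and at itself.

```python
import csv
import io


def write_tsv(handle, header, rows):
    """Write a header line and data rows as tab-separated values."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# In-memory buffers stand in for the files that would be written to disk.
nodes = io.StringIO()
write_tsv(
    nodes,
    ["name", "type"],  # hypothetical column names
    [("aspirin", "chemical"), ("acetylsalicylic acid", "chemical")],
)

rels = io.StringIO()
write_tsv(
    rels,
    ["start", "end", "type"],  # hypothetical column names; nodes referenced by row position
    [(0, 1, "isMasterTerm"), (0, 0, "isMasterTerm")],
)
```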
Creation of the database takes approximately six hours with the batch import tool, and the
database is ready to be used once the tool is done. Once appropriate performance measures have been
applied to the database (see the section on indexing), the benchmark query for all terms related to
aspirin (but with a search space of roughly twenty times the size) in the full implementation of Neo4j
runs in 5080 ms, roughly 85 times slower than the same query run on Ian’s implementation of
Neo4j. However, an average speed test of the benchmark query on 1000 random terms resulted in a
mean query time of 650 ms and a median query time of 42 ms. The large gap between the mean and the
median suggests that a small number of heavily connected terms dominate the mean, and that the aspirin
query is not a good indicator of the typical query time.
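The effect of a few slow queries on the mean can be seen with a small, invented sample. The timings below are hypothetical and are not measurements from the Chemotext database; they only illustrate how one expensive query pulls the mean far above the median.

```python
from statistics import mean, median

# Hypothetical timings in milliseconds: most queries are fast, one hub term is slow.
timings_ms = [40] * 8 + [45, 5000]

mean_ms = mean(timings_ms)      # pulled upward by the single slow query
median_ms = median(timings_ms)  # reflects the typical query
```

Here the mean is 536.5 ms while the median is 40 ms, the same kind of gap observed between the 650 ms mean and the 42 ms median reported above.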
Once we had upgraded the database to include the most up to date data from Nancy, the next
step was to add some functionality to the database. There are approximately 20,000 terms in the MeSH
Dictionary, but each term has a number of alternate names (referred to from now on as “synonyms”)
that it may be known by. In order to prevent confusion if a researcher enters a synonym and the
database returns zero results (since the articles reference the master term, or canonical term, not
synonyms), we built the “isMasterTerm” relationship into the database: it connects every synonym with
the canonical term it refers to. In order to avoid errors when a researcher used the canonical term in a
search (or having to create a second query syntax), the canonical term has an “isMasterTerm”
relationship from itself to itself. The query syntax for getting related terms after this change to the
database is shown in Figure 10 and a graph representation of the query is shown in Figure 11:
MATCH (synonym:MeSHTerm)<-[:isMasterTerm]-(canonicalTerm:MeSHTerm)<-[:mentions]-(article:Article)
      -[:mentions]->(relatedTerm:MeSHTerm)
WHERE synonym.name = "<name>"
Figure 10: The updated query syntax, including the addition of labels and the addition of the isMasterTerm relationship
Figure 11: A graph representation of the query in Figure 10.
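The reason for the self-referencing “isMasterTerm” relationship can be illustrated with plain Python dictionaries. The data and the function name are hypothetical, and the dictionaries merely stand in for the graph; the sketch shows that one lookup path serves both synonyms and canonical terms once every canonical term is also listed under itself.

```python
# Canonical term -> names reachable over isMasterTerm (the canonical term includes itself).
is_master_term = {
    "aspirin": {"aspirin", "acetylsalicylic acid"},
    "fever": {"fever", "pyrexia"},
}

# Article ID -> canonical terms the article mentions.
mentions = {100: {"aspirin", "fever"}, 101: {"aspirin"}}


def related_terms(name):
    """Resolve `name` to its canonical term, then collect terms sharing an article with it."""
    canonical = [c for c, names in is_master_term.items() if name in names]
    related = set()
    for term in canonical:
        for terms in mentions.values():
            if term in terms:
                related |= terms - {term}
    return related
```

Calling `related_terms("acetylsalicylic acid")` and `related_terms("aspirin")` gives the same answer, so no second query form is needed for canonical names.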
Performance Testing
Before we managed any performance testing, we discovered one very important fact about the
database: due to the size of the database (>100 million primitives), the system we were running it on
did not have enough available RAM to run queries efficiently. According to the Neo4j documentation,
for a system with approximately 100 million primitives, you need a minimum of 4GB of available
RAM for Neo4j; working with less than that requires paging out the database constantly, just like what
happened with the Relational version of the database (Neo4j). Accordingly, we moved the Neo4j
implementation to a system with 7.8GB of RAM, and eight i7-3770 (3.40GHz each) CPUs.
Our next step was to determine if we could improve the performance of the database in any
way. Our first test was to see if reintroducing SQL to a portion of the process could improve
performance. We created a standard Relational database in MySQL that contained a single table, which
held the synonym relationship captured in the “isMasterTerm” relationship. We then tested the
performance of getting the synonym through the Relational database and querying for related terms to
the canonical term in Neo4j versus doing the entire query (synonym relationship included) in Neo4j.
The thinking on this was that the SQL lookup would be very fast and would save time searching in the
Neo4j database; however, this was not borne out in the testing.
The results of 5 tests of 100 random terms from the database are shown in Figure 12:
                        Mean Time (s)    Median Time (s)
Native Neo4j request
Neo4j + SQL request
Figure 12: A table showing the results of the two implementations of the Synonym function. The implementation using both SQL and Neo4j had a
slower median and mean time than the implementation using only Neo4j.
The minuscule difference in performance time meant that either solution was acceptable; we
opted to continue with the synonym relationships in Neo4j, as this not only simplified the query
process by eliminating an entire system, but it also freed up system memory for Neo4j.
The next performance step was to try and include a middle-man to limit requests to the Neo4j
database. We chose to use Memcached, a high-performance distributed caching system originally
developed by Danga Interactive (Memcached). Memcached not only allows you to use excess memory
for caching, but also allows you to spread the load across multiple systems, creating a consistent virtual
memory across all of them instead of duplicating a cache on each machine separately (Memcached).
The idea was that with so many related terms, a batch of queries would request a small section
of localized, repeated nodes, and we could store those in a cache instead of going to the database.
Further reflection revealed that this cache would be of limited usefulness: it could not
efficiently store the relationships between a MeSH Term and its articles, and then the relationships
between those articles and other terms, without duplicating a portion of the Neo4j database, a job it
would do less well than Neo4j itself and at the cost of extra memory. However, the caching system could be
useful in requesting information on a node, so we decided to test it.
The testing protocol required picking 100 random terms and alternating randomly between
getting the related terms for that term and getting the term’s node. If the test got the related terms for the
term, it would cache any of those terms that weren’t currently cached in Memcached; if it was getting the
node of a term, it would check Memcached first and only request the node from Neo4j if Memcached
was not storing the node. That protocol was compared to one where the same 100 random terms were
searched (in the same order), but only using Neo4j’s native caching agent. The test results can be seen
in Figure 13:
                              Mean Time (ms)    Median Time (ms)
Native Neo4j Caching
Memcached-Assisted Caching
Figure 13: A table showing the performance differences between caching using Memcached and caching using Neo4j’s native caching utility
Since the nodes in the Neo4j database are so small (the amount of data they contain is tiny), querying
for them directly is so inexpensive that going out to another system for caching actually slows the
system down.
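The check-the-cache-first protocol described above can be sketched as follows. A plain dictionary stands in for the Memcached client, and the class name, method names, and the fetch callback are hypothetical; the sketch follows the protocol as described, so a node fetched from the database on a cache miss is not itself written back to the cache.

```python
class CachedNodeReader:
    """Reads nodes through a cache, falling back to the database on a miss."""

    def __init__(self, fetch_from_db):
        self._fetch = fetch_from_db
        self._cache = {}  # dict standing in for a Memcached client
        self.db_requests = 0

    def cache_terms(self, nodes):
        """Store related-term nodes that are not already cached."""
        for name, node in nodes.items():
            self._cache.setdefault(name, node)

    def get_node(self, name):
        """Check the cache first; only go to the database if the node is absent."""
        if name in self._cache:
            return self._cache[name]
        self.db_requests += 1
        return self._fetch(name)


reader = CachedNodeReader(lambda name: {"name": name})
first = reader.get_node("aspirin")             # miss: goes to the database
reader.cache_terms({"fever": {"name": "fever"}})
second = reader.get_node("fever")              # hit: served from the cache
```

In a real deployment every cache check is itself a network round trip, which is consistent with the observation that the extra hop cost more than it saved for nodes this small.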
We also considered implementing an ElasticSearch client to sit on top of the Neo4j database,
but further examination revealed that ElasticSearch, which is designed largely for full-text indexing and
search, would only hinder the performance of the database (ElasticSearch).
The single largest improvement we made to the database was indexing, using Neo4j’s built-in
schema indexes. Adding a single index to the database, on the name attribute of nodes labeled
“MeSHTerm”, improved performance considerably. Attempting to run the query shown in Figure 10
without an index on MeSHTerm(name) resulted in queries that would time out after close to 30 minutes
of querying; the same query would run in an average of 18 ms if run with an active index on
MeSHTerm(name). We presumed that without an index to locate the initial MeSHTerm node, the
sheer size of the database resulted in many swaps of memory in order to load all of the nodes labeled
MeSHTerm, resulting in long run times, just like in the Relational implementation (though in that case
the paging was happening for a different reason: the large joins).
The overall conclusions we drew from this research were the following:
Neo4j (and arguably graph databases overall) is no longer beta software. When you can create
a database of over a hundred million primitives that performs at sub-second query times, the software is
mature enough to be usable in a performance-sensitive environment.
The specifics of your problem determine whether a graph database will perform well. The
Chemotext problem was well suited to a graph database, because it represents a “friend of a
friend”, or second-degree separation, connection. This second-degree connection, if solved in a
relational database, requires multiple joins and becomes prohibitively expensive as soon as the
data reaches any reasonably large size.
When the problem is right, a standalone graph database is the best option for performance, and
trying to add other software to improve performance will either produce no improvement or actually
degrade performance. This was evidenced by the tests using MySQL and Memcached.
Lastly, it cannot be overstated that this research would not have been possible without the prior
work of Nancy Baker, who developed the initial concept and database for Chemotext, Samuel Gass,
Nick Bartlett, and Chris Allen, who developed the PHP-based front-end for Chemotext, Ian Kim, who
developed the initial Neo4j implementation of Chemotext, and Diane Pozefsky, who advised all of
those people through the project’s life. I am greatly indebted to them for laying the groundwork for this
project and giving me an opportunity to research this topic.
Works Cited
Baker, Nancy C., and Bradley M. Hemminger. "Abstract." National Center for Biotechnology
Information. U.S. National Library of Medicine, 27 Mar. 2010. Web. 10 Apr. 2014.
Behnel, Stefan. "Lxml - XML and HTML with Python." Lxml. N.p., n.d. Web. 13 Apr. 2014.
Codd, Edgar F. "A Relational Model of Data for Large Shared Data Banks." N.p., n.d. Web. 10
Apr. 2014.
Daly, Liza. "High-performance XML Parsing in Python with Lxml." High-performance XML
Parsing in Python with Lxml. IBM DeveloperWorks, 24 Mar. 2011. Web. 21 Apr. 2014.
"DB-Engines Ranking." DB-Engines. N.p., Apr. 2014. Web. 10 Apr. 2014.
Eifrem, Emil. "Neo4j -- or Why Graph Dbs Kick Ass." Neo4j -- or Why Graph Dbs Kick Ass.
N.p., 22 Nov. 2008. Web. 10 Apr. 2014.
"Open Source Distributed Real Time Search & Analytics | Elasticsearch."
Elastic Search BV, n.d. Web. 19 Apr. 2014.
Espeed. "Fastest Way to Perform Bulk Add/insert in Neo4j with Python?" Stack Overflow. N.p.,
1 Oct. 2012. Web. 13 Apr. 2014.
Guzunda, Leon, and Nick Quinn. "An Introduction to Graph databases." An Introduction to
Graph databases. N.p., n.d. Web. 10 Apr. 2014.
Haugen, Knut. "A Brief History of NoSQL." All About the Code. N.p., May 2010. Web. 10 Apr.
2014.
Horne, Christopher. "IQube Marketing Limited » Glossary of Big Data Terminology." IQube
Marketing Limited. N.p., n.d. Web. 10 Apr. 2014.
Hunger, Michael. "Public Jexp/batch-import." GitHub. N.p., n.d. Web. 13 Apr. 2014.
Ijaz, Ali Z., Min Song, and Doheon Li. "MKEM: A Multi-level Knowledge Emergence Model
for Mining Undiscovered Public Knowledge." BMC Bioinformatics. N.p., n.d. Web. 24
Apr. 2014.
Lindberg, Donald. "Internet Access to the National Library of Medicine." ACP Online. N.p.,
Sept. 2000. Web. 10 Apr. 2014.
Maurer, H., and N. Scherbakov. "1. Network (CODASYL) Data Model." 1. Network
(CODASYL) Data Model. N.p., n.d. Web. 10 Apr. 2014.
"Memcached - a Distributed Memory Object Caching System." Dormando,
n.d. Web. 19 Apr. 2014.
Neo4j. "22.8. JVM Settings." 22.8. JVM Settings. N.p., n.d. Web. 13 Apr. 2014.
Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. Boston:
McGraw-Hill, 2003. Print.
"Using PubMed." National Center for Biotechnology Information. U.S. National Library of
Medicine, n.d. Web. 19 Apr. 2014.
Vardanyan, Mikayel. "Home Industry Info Picking the Right NoSQL Database Tool." Uptime
Performance Tips Tips for SysAdmin Webmaster Network Admin. N.p., 22 May 2011.
Web. 10 Apr. 2014.
Vaughn, John. "CPSC 343: A Sketch of Database History." CPSC 343: A Sketch of Database
History. N.p., n.d. Web. 10 Apr. 2014.