PDQ-Wizard Prototype 1.0 Installation Guide
University of Edinburgh 2005
GTI and eDIKT
1. Introduction
This document is for users who want to set up the PDQ-Wizard system. It explains how to configure the
environment, compile and build the source code, and deploy the system.
1.1. What is PDQ-Wizard
PDQ-Wizard is a web-based application that makes it easy to query the PubMed literature database.
1.2. Prerequisites
Java 5.0
Axis 1.2 RC1
Ant 1.6.5
JUnit
Tomcat 5.5
MySQL 5.0 and JDBC connector
Python 2.4.2
PubMed web service library
2. Prepare the tools
2.1. Java
The Java 2 Standard Edition (J2SE) 5.0 SDK is used to compile and run the application. It can be
downloaded from http://java.sun.com/j2se/1.5.0/download.jsp.
Earlier versions of the Java SDK do not work.
2.1.1. JavaServer Faces
JavaServer Faces is the component that supports dynamic JSP web pages using the MVC (Model-View-Controller)
pattern. We used release 1.1 (file jsf-1_1_01.zip), the latest at the time of writing, which can be downloaded from
http://java.sun.com/j2ee/javaserverfaces/download.html.
2.1.2. Other components
The project uses the Jakarta Taglibs Standard 1.0 tag library, available from http://jakarta.apache.org/taglibs/.
The Jakarta Commons DBCP and Collections components are also used, to support database connection pooling in
Tomcat. Download these jar files from http://jakarta.apache.org/commons/ and save them in the common/lib
folder of Tomcat (see section 2.6).
2.2. Ant
Ant is the build tool, and PDQ-Wizard relies on it heavily. Ant version 1.6.5 is needed to
handle Tomcat deployment as a build target (tools.types.RedirectElement is required).
It can be downloaded from http://ant.apache.org/.
2.3. JUnit
JUnit is a tool for unit testing.
2.4. Axis
Axis provides web service support. Version 1.2 RC1 has been tested; other versions may not work. Axis
can be downloaded from http://www.apache.org/dyn/closer.cgi/ws/axis/1_2RC1/.
After downloading, unpack the file into any folder.
2.5. MySQL
MySQL is the relational database management system that PDQ-Wizard uses to store its local data cache. It
can be downloaded from http://dev.mysql.com/downloads/mysql/5.0.html.
Note that MySQL version 5.0 is required because it supports views.
Installation of MySQL is straightforward and no configuration is necessary. The PDQ-Wizard database
still needs to be created; this is covered in section 4.
The JDBC driver for MySQL is also required so that Java can access the database. It can be downloaded
from http://dev.mysql.com/downloads/connector/j/3.1.html.
2.6. Tomcat
Tomcat is the web server that PDQ-Wizard uses to present query results as web pages. It can be downloaded
from http://tomcat.apache.org/download-55.cgi and is easy to install.
2.7. Python
The Python interpreter is used to run the data extraction programs that populate
the alias database. It can be downloaded from http://www.python.org/. Installation is straightforward.
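Once the tools in this section are installed, a quick check that their commands can be found on the PATH may save time later. The following is a minimal sketch, not part of the PDQ-Wizard package. It is written for a current Python 3 interpreter rather than the Python 2.4.2 listed in the prerequisites, and the command names in REQUIRED_TOOLS are assumptions that may differ on your system.

```python
import shutil

# Assumed command names; adjust them to match your installation.
REQUIRED_TOOLS = ["java", "javac", "ant", "mysql", "python"]


def find_missing(tools, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```

The check only confirms that a command exists; it does not verify version numbers, so the versions listed in section 1.2 still need to be checked by hand.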
3. Build the PubMed library
PDQ-Wizard accesses the PubMed database through the Entrez Utilities web service interface. To use it,
we first download the web service definition file and then generate the Java stub source code.
3.1. Download PubMed WSDL
The Entrez Utilities web service WSDL (Web Services Description Language) file can be downloaded from
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/eutils.wsdl. It is used to generate the stub code that connects to
the Entrez database for data queries. The version currently used by PDQ-Wizard is 1.3.
Save this file in <PDQ-Wizard folder>\entrez for later use.
3.2. Generate the stub code library
To make use of the Entrez web service, we need to generate stub source code from the WSDL file, compile
this source code, and package the class files into a jar file. These steps are built into the Ant build
script as a single target:
<target name="gen-entrez" description="Generate ncbi.jar file from eutils.wsdl">
    <java classname="org.apache.axis.wsdl.WSDL2Java" fork="true" dir="entrez">
        <classpath refid="classpath" />
        <arg value="eutils.wsdl"/>
    </java>
    <mkdir dir="entrez/build" />
    <javac destdir="entrez/build" source="1.4">
        <src path="entrez" />
        <classpath refid="classpath" />
    </javac>
    <jar destfile="ncbi.jar" basedir="entrez/build" />
</target>
To run this target from the command line:
ant gen-entrez
As a result, a gov folder is created in the entrez folder, and all the Java source files
generated from the WSDL file are placed in the gov directory tree. The Ant target also creates
the ‘build’ folder, which contains all the compiled Java class files. The file we need is the jar file named
‘ncbi.jar’, which is placed in the current (PDQ-Wizard root) folder.
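A jar file is an ordinary zip archive, so its contents can be inspected to confirm that the stub classes were packaged. The sketch below is illustrative only and assumes Python 3. The function name list_jar_classes is hypothetical, and the gov/ prefix is an assumption based on the gov source folder described above.

```python
import zipfile


def list_jar_classes(jar_path, prefix="gov/"):
    """Return the .class entries in the jar whose path starts with prefix."""
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            name for name in jar.namelist()
            if name.startswith(prefix) and name.endswith(".class")
        )
```

Calling list_jar_classes("ncbi.jar") from the PDQ-Wizard root folder should return a non-empty list if the target ran successfully; an empty list suggests that the compile step produced no classes.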
4. Preparing the data source
This section explains how to prepare the database that holds the local data cache and the alias collection.
4.1. Create the database
After MySQL has been installed successfully, we can create the PDQ-Wizard database. To create the
database manually, run mysql from the command console like this:
mysql -u root --password=<yourpassword>
Then issue the command to create the database:
mysql> CREATE DATABASE IF NOT EXISTS PDQDB;
mysql> exit
4.2. Create data tables
Once the database exists, the data tables and indexes can be created. The build script
does this work. In the PDQ-Wizard folder, run the Ant target that creates the tables:
ant create-tables
To verify that the tables were created successfully, start mysql and try the following commands:
mysql> use pdqdb;
mysql> describe aliasmaps;
You should see the following if the tables were created.
+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| geneid      | varchar(16)  | NO   | PRI |         |       |
| symbol      | varchar(64)  | NO   | MUL |         |       |
| category    | varchar(2)   | NO   |     | GE      |       |
| description | varchar(128) | NO   |     |         |       |
+-------------+--------------+------+-----+---------+-------+
4 rows in set (0.24 sec)
4.3. Download the alias datasets
Several datasets are used to look up synonyms or aliases for gene IDs and protein IDs.
They can be downloaded from these sites:
ftp://ftp.ncbi.nlm.nih.gov/gene/
http://www.pir.uniprot.org/database/downloads.shtml
Download gene2accession, gene2refseq and gene_info from the FTP site, and uniprot_sprot.dat
from the HTTP site. The gene2unigene file read in section 4.4 is needed as well.
Save these files in the db folder under the PDQ-Wizard folder.
4.4. Extract and Load the datasets
The downloaded datasets must be extracted before they can be stored in the database tables. Several
programs written in Python extract the datasets from the original files into text files that can
be loaded directly into the MySQL database. The programs are listed below.
extract.py
Extracts the gene2refseq and gene2accession files.
python extract.py gene2refseq refseq.txt
python extract.py gene2accession accession.txt
Output files: refseq.txt, accession.txt
exgeneinfo.py
Extracts datasets from gene_info.
python exgeneinfo.py 128 256
Output files: geneinfo.txt, aliasmaps.txt
extractunigene.py
Extracts data from gene2unigene.
python extractunigene.py unigene.txt Hs Mm
Output file: unigene.txt
exswissprot.py
Extracts AC and DE from the SwissProt text file.
Output files: swissprot.txt, swissprotsyns.txt
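To give an idea of the kind of work such an extractor does, here is a minimal sketch that pulls gene ID and alias pairs out of gene_info lines. It is not the code of exgeneinfo.py, which is not reproduced in this guide. It assumes Python 3 and the usual tab-delimited gene_info layout (tax_id, GeneID, Symbol, LocusTag, Synonyms, ...), with synonyms separated by '|' and '-' standing for an empty value. The function name and the shape of the output are hypothetical.

```python
def extract_aliases(lines):
    """Yield (gene_id, alias) pairs from tab-delimited gene_info lines.

    Assumed column layout: tax_id, GeneID, Symbol, LocusTag, Synonyms, ...
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header comment and blank lines
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed rows
        gene_id, symbol, synonyms = fields[1], fields[2], fields[4]
        yield gene_id, symbol  # the official symbol counts as an alias too
        if synonyms != "-":
            for alias in synonyms.split("|"):
                yield gene_id, alias
```

Because the function accepts any iterable of lines, it can be fed an open file object directly, which avoids reading the whole gene_info file into memory.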
After running these extractor programs, we have a few text files that are ready to be loaded into the
database. The file loaddata.sql contains the commands to populate the database tables with these files.
LOAD DATA LOCAL INFILE 'geneinfo.txt' INTO TABLE geneinfo LINES TERMINATED BY '\r\n';
LOAD DATA LOCAL INFILE 'swissprot.txt' INTO TABLE geneinfo LINES TERMINATED BY '\r\n';
LOAD DATA LOCAL INFILE 'aliasmaps.txt' INTO TABLE aliasmaps LINES TERMINATED BY '\r\n';
LOAD DATA LOCAL INFILE 'swissprotsyns.txt' INTO TABLE aliasmaps LINES TERMINATED BY '\r\n';
LOAD DATA LOCAL INFILE 'refseq.txt' INTO TABLE aliasmaps LINES TERMINATED BY '\r\n';
LOAD DATA LOCAL INFILE 'accession.txt' INTO TABLE aliasmaps LINES TERMINATED BY '\r\n';
LOAD DATA LOCAL INFILE 'unigene.txt' INTO TABLE aliasmaps LINES TERMINATED BY '\r\n';
To run this file, issue the command:
mysql -u root --password=<password> --database=pdqdb < loaddata.sql
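The LOAD DATA statements in loaddata.sql specify only the line terminator ('\r\n'), so MySQL falls back to its default field separator, the tab character. Any text file loaded this way therefore needs tab-separated fields and CRLF line endings. The sketch below shows one way to write such a file from Python 3. The function name write_load_file is hypothetical, and the sketch does not escape tabs or backslashes inside field values.

```python
def write_load_file(path, rows):
    """Write rows as tab-separated fields, one CRLF-terminated line per row."""
    # newline="" stops Python from translating line endings, so the
    # explicit "\r\n" is written unchanged on every platform.
    with open(path, "w", newline="", encoding="utf-8") as out:
        for row in rows:
            out.write("\t".join(str(field) for field in row) + "\r\n")
```

Writing the terminator explicitly keeps the output identical on Windows and Unix, which matters if the extraction is run on one platform and the load on another.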
4.5. Data Cleansing
The alias table contains synonyms or alias names of genes and proteins. These terms are used as query
terms when searching the PubMed literature database. However, some of them are common English
words or very short acronyms that occur too frequently in articles, and they should be removed from
the alias table. We rely on dictionaries to identify the terms to delete. The resources we
use include:
English dictionary: OpenBSD, usr/share/words
Cell-line dictionary: http://www.biotech.ist.unige.it/cldb/indexes.html
Abbrev and Acronyms in Biomedical Research and Practice: http://focosi.altervista.org/abbreviations.html
The words are put in a SQL script file ‘delwords.sql’ with commands like:
DELETE FROM AliasMaps WHERE LENGTH(Alias)<3;
DELETE FROM AliasMaps WHERE Alias='aand';
DELETE FROM AliasMaps WHERE Alias='able';
DELETE FROM AliasMaps WHERE Alias='about';
……
Then run:
mysql -u root --password=<password> --database=pdqdb < delwords.sql
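A script such as delwords.sql can be generated from a word list rather than typed by hand. The following sketch shows one possible approach in Python 3; it is not the procedure actually used to build delwords.sql, and the function name is hypothetical. Words shorter than three characters are skipped because the LENGTH(Alias)<3 statement already removes them.

```python
def delete_statements(words, table="AliasMaps", column="Alias"):
    """Return one DELETE statement per distinct word, in sorted order."""
    statements = []
    for word in sorted(set(w.strip() for w in words)):
        if len(word) < 3:
            continue  # already covered by the LENGTH(Alias)<3 rule
        # Escape backslashes and single quotes for a SQL string literal.
        escaped = word.replace("\\", "\\\\").replace("'", "''")
        statements.append(
            "DELETE FROM %s WHERE %s='%s';" % (table, column, escaped)
        )
    return statements
```

For example, delete_statements(["about", "able"]) returns the two statements for 'able' and 'about', which can then be written to the script file one per line.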
5. Building the system
This section describes how to build the PDQ-Wizard system from the source code.
5.1. Download the package
The PDQ-Wizard application is distributed as a single package containing the source code, web pages, JSP
pages, and other related files.
Download the package from http://forge.nesc.ac.uk/
Unpack it into a folder, for example C:\pdq
5.2. Directory structure
After unpacking the application, the directory structure looks like this:
pdq/
    src/
        pdq/*.java
        test/*.java
    db/*.sql
    web/*.html,*.jsp
        css/*.css
        images/*.jpg,*.gif
        WEB-INF/web.xml, faces-config.xml
5.3. Modify the build properties
Under the pdq folder (the root of PDQ-Wizard) there is a file named build.properties. It is a text file
that sets properties holding the paths of the libraries, for example:
tomcat.home=C:\\Program Files\\Apache Software Foundation\\Tomcat 5.5
deploy.path=${tomcat.home}/webapps
jsf-api.jar=${tomcat.home}/jsf-1_1_01/lib/jsf-api.jar
jsf-impl.jar=${tomcat.home}/jsf-1_1_01/lib/jsf-impl.jar
commons-logging.jar=${tomcat.home}/jsf-1_1_01/lib/commons-logging.jar
……
The directory paths must be changed to match the folders where those components are actually installed.
For example, the JavaServer Faces components (such as jsf-impl.jar) may not be installed under the Tomcat home
folder. The ${name} notation refers to the value of a property defined earlier in the file.
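The way these references resolve can be illustrated with a short Python 3 sketch that reads key=value lines and substitutes earlier values. It is a simplified illustration, not Ant's own implementation: it handles only comments, the '=' separator and a doubled backslash, and the function name load_properties is hypothetical.

```python
import re

_REFERENCE = re.compile(r"\$\{([^}]+)\}")


def load_properties(text):
    """Parse simple key=value lines and expand ${name} references.

    Only a small subset of the properties format is handled here.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!" or "=" not in line:
            continue  # skip blanks, comments and lines without a value
        key, value = line.split("=", 1)
        # A doubled backslash in the file stands for a single backslash.
        value = value.strip().replace("\\\\", "\\")
        # Substitute properties defined earlier; leave unknown names as-is.
        value = _REFERENCE.sub(
            lambda m: props.get(m.group(1), m.group(0)), value
        )
        props[key.strip()] = value
    return props
```

Printing the result of load_properties for your own build.properties is a quick way to spot a path that still points at a folder that does not exist on your machine.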
5.4. Build the source code
To check whether the source code compiles, use the Ant script:
ant compile
If the compile fails, check the tools and libraries and try again.
After a successful compile, run the build:
ant build
The build script creates a ‘build’ folder under the pdq folder. It contains everything to be deployed into
the web container.
6. Deploying the system
PDQ-Wizard needs to be deployed into the web container, in this case the Tomcat web server.
Deploying an application into Tomcat can be as simple as copying the files into a particular folder,
namely webapps under the Tomcat home folder. The easiest way is to use the Ant build script.
6.1. Deploy using build script
To deploy the application, run the Ant deploy target:
ant deploy
After a successful deployment, a folder has been created in the webapps folder of Tomcat. Its content
is a copy of the build folder produced by building the application.
[Tomcat]/
    webapps/
        pdq/*.html,*.jsp
            css/*.css
            images/*.jpg,*.gif
            WEB-INF/web.xml,faces-config.xml
                classes/
                    pdq/*.class
                lib/*.jar
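Since deployment amounts to copying the build folder into webapps, the same result can be sketched in a few lines of Python 3. This is only an illustration of the copy step, under the assumption that the application folder is named pdq as in the listing above. The Ant deploy target itself is not shown in this guide and may do more than this.

```python
import os
import shutil


def deploy(build_dir, webapps_dir, app_name="pdq"):
    """Copy build_dir to webapps_dir/app_name, replacing any old copy."""
    target = os.path.join(webapps_dir, app_name)
    if os.path.isdir(target):
        shutil.rmtree(target)  # remove the previously deployed version
    shutil.copytree(build_dir, target)
    return target
```

Removing the old folder first avoids leaving stale files behind when a page or class has been deleted from the source tree.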
6.2. Restart the application
Whenever something is modified, such as a web page or a Java bean, the application must be rebuilt and
redeployed into the web container. For the update to show immediately, the
container must reload the updated application. With Tomcat, the build script can do this:
ant stop
ant start
7. Running the system
7.1. Unit testing
Some components have unit tests. To run them, issue the command:
ant runtest
7.2. Start from a browser
Start a browser such as Internet Explorer and go to http://<host>:8080/pdq. On the local machine, use
http://localhost:8080/pdq.
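The same check can be made without a browser, which is convenient on a server with no desktop. The sketch below, for Python 3, builds the URL in the form given above and reports whether the server answers. The function names are hypothetical, and port 8080 and the pdq context path are taken from the URLs in this section.

```python
import urllib.error
import urllib.request


def app_url(host="localhost", port=8080, context="pdq"):
    """Build the application URL in the form used in this section."""
    return "http://%s:%d/%s" % (host, port, context)


def is_reachable(url, timeout=5):
    """Return True if the server answers the request successfully."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts and HTTP error statuses.
        return False
```

A call such as is_reachable(app_url()) returning False usually means that Tomcat is not running or that the application has not been deployed yet.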