Response to NIH Database Structure Questionnaire
Southeast Collaboratory for Structural Genomics
1. What kinds of data models (e.g. UML) and database management systems (e.g. Oracle,
MySQL) are being used?
The various labs in our center (individual labs) use a range of RDBMSs (Access,
MySQL, Oracle 8i & 9i, PostgreSQL, and Microsoft SQL Server 2000). The Center’s
database for data integration, reporting and PDB deposition is being developed in Oracle
9i and some software packages in use are bundled with Oracle 8i (e.g. NuGenesis,
Beckman-Coulter iLIMS). Some groups use UML for application design while others
use it for entity relationship descriptions.
2. Is the data stored in a single database or multiple distributed databases? If a distributed
database scheme is used, how do these databases interoperate?
Individual labs store data in their own database system(s), and we have an exchange layer
based on XML. Currently, the information to be integrated from different databases is
first translated into XML, and then combined for different applications, such as internal
progress reports, XML files for NIH, etc. With the Center database development, an
expanded set of data will also be integrated by database federation and/or materialized
views.
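As a loose illustration of the translation step, consider this minimal Python sketch; the
target table, column names, and XML element names are hypothetical, not the Center's
actual schema:

import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical lab table; each real lab schema is mapped individually.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (target_id TEXT, status TEXT, updated TEXT)")
conn.execute("INSERT INTO target VALUES ('SECSG-0001', 'crystallized', '2003-01-15')")

# Translate rows into a common XML form for downstream applications.
root = ET.Element("targets")
for target_id, status, updated in conn.execute(
        "SELECT target_id, status, updated FROM target"):
    t = ET.SubElement(root, "target", id=target_id)
    ET.SubElement(t, "status").text = status
    ET.SubElement(t, "updated").text = updated

print(ET.tostring(root, encoding="unicode"))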
3. Is there a role for Excel spreadsheets in your data management system?
Yes, experimenters are familiar with Excel and prefer its format for organizing and
analyzing data, and in some cases as an intermediate step prior to data entry. Its use is
limited to end-user input/output because it requires little retraining; however, it is
impractical for long-term data management, and most of the individual labs have
migrated, or are migrating, to RDBMSs.
4. How are the data maintained? Is information regularly archived using a physical backup?
In summary, both relational database and file servers are employed to store and maintain
data and these are periodically backed up to other systems and/or tape archives. For
example, a Tivoli Management System (tape archive) is employed to back up all Center
file servers according to a predefined schedule. Also, strategies employing differential
and complete relational database backups have been implemented. Copies of backups are
maintained off-site.
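As a loose sketch of the periodic physical-copy idea (the Center itself relies on Oracle
backups and the Tivoli tape archive; this Python/sqlite3 example, with hypothetical file
names, only mirrors the concept):

import sqlite3

# Open the live database and a destination file for the physical copy.
src = sqlite3.connect("lab.db")
dst = sqlite3.connect("lab-backup.db")
with dst:
    src.backup(dst)  # complete physical copy of the database
src.close()
dst.close()

A scheduler such as cron would run a copy like this on the predefined schedule, with
differential backups taken between the complete ones.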
5. How do you address the issue of database security to prevent unauthorized access?
The Center’s individual labs use a range of approaches that include password
authentication, IP restriction, secure shell and 128-bit SSL. Also, several labs regularly
monitor network traffic and/or database connections.
6. How much effort are you putting into LIMS construction, maintenance, improvement, and
support of users? Is that too little, or too much – what effort will be needed in the future?
This varies with the requirements of each individual lab. For example, one of the protein
production labs (Adams lab) has made a considerable investment in the development of a
custom LIMS (Beckman-Coulter iLIMS). This lab has concluded that efficient
data management is essential to the high-throughput protein production process. They
believe that more support and development will be necessary.
Database Content
1. List the types of data that you are managing and storing in the database:
a. Administrative data, accounting, inventory, shipping.
Internally, the data include administrative data (a limited amount), production
accounting, inventory of products/pre-products, and sample requests/shipping between
groups. The University of Georgia maintains separate databases on purchases,
personnel, accounting, and chemical inventory.
b. Experimental data; bioinformatics information, images, spectral data, structure
factors, coordinates….
Bioinformatics Groups: mirroring of public databases, including the NCBI non-redundant
sequence database, complete genome database, protein domain database, PDB, and
SwissProt, as well as data generated by various bioinformatics calculations, such as
BLAST runs, transmembrane prediction, signal peptide prediction, 3D structure prediction, etc.
Also, the Center database for data integration, reporting and PDB deposition is being
developed, and it will contain metadata to aid the data integration functions.
Crystallography Group: crystal images, file names, instrument parameters, as well as
experimental conditions (chemical/physical), such as summaries of crystallization
conditions, crystal storage locations, data collection parameters, data processing output,
phasing techniques/outcomes, and refinement and validation output.
Production Labs: experimental data are tracked using defined protocols with the
assumption that the protocol is carried out properly. New protocols can be easily
created as necessary. A large majority of quality control/analytical data are stored,
such as mass spectrometry and metal analysis results. In the Adams lab, for example,
some of the bioinformatics data are compiled in the production database. These include,
but are not limited to, membrane analysis, morf analysis, cys count, PDB hits, predicted pI,
and predicted MW. Images are stored in NuGenesis as part of a person’s lab report.
However, plans are being made to store gel images and purification traces in the
production database.
Data Representation
1. Do you have a project wide data representation or independently developed schemas for
different parts of the structure solution process?
The individual lab databases are based on independently developed schemas that will be
modeled by the Center database for the purposes of data integration.
2. How do you maintain consistency across the data items?
We use column constraints and/or interface programming to check and/or limit the
options for data items, and validation software to check data prior to integration.
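For example, a column constraint that limits the options for a data item can be
sketched as follows (table and column names are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
# A CHECK constraint limits the allowed values, analogous to the
# column constraints used in the lab databases.
conn.execute("""
    CREATE TABLE crystal_score (
        target_id TEXT NOT NULL,
        score     INTEGER CHECK (score BETWEEN 0 AND 9)
    )
""")
conn.execute("INSERT INTO crystal_score VALUES ('SECSG-0001', 7)")  # accepted
try:
    conn.execute("INSERT INTO crystal_score VALUES ('SECSG-0002', 42)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the out-of-range score is refused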
3. What logical inconsistencies of data are acceptable and how are they tracked? For example,
is it possible to enter data about protein purification when cloning information is missing? If
this is allowed, how is such a state tracked and flagged?
Very few logical inconsistencies are currently allowed by our procedures. Logical
inconsistencies are typically tracked at the lab level by query and, in some cases,
prevented at the interface level. Our development of the Center database includes ways
of checking for inter-lab inconsistencies.
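A lab-level tracking query of the kind described, flagging purification entries whose
cloning information is missing, might look like this (the two-table schema is
hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cloning      (target_id TEXT PRIMARY KEY);
    CREATE TABLE purification (target_id TEXT, fraction TEXT);
    INSERT INTO cloning      VALUES ('SECSG-0001');
    INSERT INTO purification VALUES ('SECSG-0001', 'F3'), ('SECSG-0002', 'F1');
""")
# Flag purification records that have no matching cloning record.
orphans = conn.execute("""
    SELECT p.target_id
    FROM purification p
    LEFT JOIN cloning c ON c.target_id = p.target_id
    WHERE c.target_id IS NULL
""").fetchall()
print(orphans)  # [('SECSG-0002',)] -- flagged for follow-up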
4. How do you maintain consistency of annotation? Do you use predefined dictionaries for
annotation or can users sometimes enter arbitrary text strings?
This varies by lab. In some settings, free-form comment fields are available, with newly
identified "regular" data items being added to the database structure. In other labs,
almost all text strings are predefined and are only selectable at runtime; the term set
is expanded through the database. Most field terms are defined, but not all are
standardized.
Data Entry
1. What are the major sources of data in the database(s)?
 Robots
 Software packages
 Human input
 External (for example, collaborators, web sites or other centers)
Data entry comes from a range of sources, including human input, robots
(purification, crystallization, crystal scoring, and diffraction screening), data from
external analytical labs, harvesting of crystallographic program/pipeline output,
programs, and external web sites. Also, there are plans to include gel image and
protein purification data.
2. Is the experimental data recorded automatically or selectively entered by people, or a
combination of both?
This varies with the data source. Some data (program output/robots) are recorded
automatically, while other sources require manual input for all but interpretive data.
3. Do you offer a number of different alternative ways to import data into the database:
 Large or small volumes of data en bloc (manual or automatic)
 Bar-coded
 Excel spreadsheets
 Text files
Again, this varies with the data source. Options include web-based and client/server-based
interfaces for human data entry, importation of XML files, data harvesting from program
output, scripts that allow users to automatically calculate 96 new clones at a time and store
them in the database, importation of robot-generated text files, and barcode-generated
data.
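Importing a robot-generated, tab-delimited text file can be sketched as follows (the
column names and file contents are hypothetical):

import csv
import io
import sqlite3

# Stand-in for a robot-generated, tab-delimited scoring file.
robot_file = io.StringIO("barcode\twell\tscore\nPL0001\tA1\t5\nPL0001\tA2\t0\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drop_score (barcode TEXT, well TEXT, score INTEGER)")
reader = csv.DictReader(robot_file, delimiter="\t")
conn.executemany(
    "INSERT INTO drop_score VALUES (:barcode, :well, :score)", reader)
print(conn.execute("SELECT COUNT(*) FROM drop_score").fetchone())  # (2,)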
4. Is a rollback to an earlier state of the database possible, in case an update has errors?
Yes. Some databases require restoration from backups, while others have rollback
enabled via archive/transaction logs.
5. How do you convince users to enter their data into the database, especially given that early
versions may be imperfect and painful to use?
A number of approaches have been taken and the following three responses from lab
informatics personnel indicate their range.
 Make the process as painless as possible: lots of pull-downs and/or population from
well-defined protocols.
 Very patiently.
 The boss tells them to use it! It is true there are growing pains and bugs to be worked
out. Training is very important but is still not enough to make them use it right.
However, they have no alternative, so they learn to use it as best they can.
6. How do you convince users that your work can make the life of the experimentalists much
easier and increase the overall efficiency and productivity?
By showing them that data management can simplify their lives, organize their tasks,
identify trends, and keep track of their data through pre-compiled reports that use
dynamic data from the database.
Data Harvesting and Exchange
1. Have you developed tools for data harvesting?
Yes, we have done the following:
1. Pipeline tools for genomic data harvesting from both internal and external data
sources.
2. The Center database in development will contain a number of report-generation tools
for data harvesting.
3. Individual lab efforts:
i. reports and graphs showing breakdowns of what is produced when, how, and by
whom.
ii. tools to view and analyze small-scale expression data for the best growth
conditions and best fraction to produce the protein.
2. How do you, or do you plan to, generate submissions to the public databases such as the
PDB?
We plan to use an openMMS-generated schema to store these data in Oracle. We are
looking into methods for generating mmCIF files from the RDBMS schema. However, we
would like to suggest that the PDB develop an openMMS-schema-to-mmCIF tool for PDB
submission. We believe that the latter would greatly simplify the automated PDB
submission process.
3. Is the database used for the purposes of internal data mining, report generation, and progress
reviews? If so, is this done automatically or manually?
Yes; some queries are done automatically, while others are done manually. We intend
to automate most of the manual procedures.
4. How do you exchange data among databases or applications?
Currently, via the XML exchange layer; with the Center database development,
materialized views will be added. In addition, an e-mail tabulation of changes to the
XML files is generated and sent to the Center’s various project managers weekly.
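The weekly tabulation step can be sketched as a diff of successive XML snapshots (the
snapshot content and labels here are hypothetical; the actual reports are richer):

import difflib

prev = "<target id='SECSG-0001'><status>cloned</status></target>\n"
curr = "<target id='SECSG-0001'><status>purified</status></target>\n"

# Summarize what changed between last week's and this week's snapshots.
diff = difflib.unified_diff(
    prev.splitlines(keepends=True), curr.splitlines(keepends=True),
    fromfile="last_week.xml", tofile="this_week.xml")
print("".join(diff))  # in production, a summary like this is e-mailed weekly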
5. Are there any manual steps in the production of XML files for the NIH target/status
database?
No. The production of XML files for the NIH target/status database has been fully
automated at our Center for the past several years.
Interaction Between Centers
1. To what extent does data from other centers influence your target selection and decision to
stop work on a target?
We scan the other Centers’ XML reports and PDB status weekly. If our targets overlap,
we reconsider our priorities.
2. How many of the other PSI centers are you cooperating with? What kind of tools are you
sharing?
We have some exchange of experiences with the Berkeley center on operating the IBM
Linux cluster and data storage system.
3. Do you have data management tools that could be used by other projects?
We believe our system could benefit other projects. However, the system is not mature
enough for distribution at this time.
4. What tools could the public databases, such as the PDB, provide to make your efforts more
productive?
A tool to translate openMMS schema to mmCIF for structure deposition.
5. What parts of a LIMS should, or can, we eventually share?
 Nothing
 Data representations
 Interfaces
 Architecture
 Whole implementations
Data representations, interfaces to some degree, and architecture.
6. How flexible/general (versus specialized) does a LIMS have to be, to optimally balance
unspecified/unexpected future needs, complexity, cost, and usability?
The LIMS should be both specialized and easily modifiable, because every center has a
different approach scientifically and experimentally.
Successes and Bottlenecks
1. What are the most challenging technical and administrative issues on data management you
are dealing with?
 Determining the key subset of information that is needed to best describe an
experiment
 Variations amongst the lab views on which information is important
 Putting data into an electronic form
 Meeting design goals of easy maintenance and comprehensive coverage
 Determining the best utilization of personnel and hardware resources
 Variations in training, available time, and parallel duties for team members
 The lack of streamlined tools/utilities that can be easily incorporated into relational
database systems for automated PDB deposition
 Providing the expected service with the limited personnel available due to budget
constraints
2. Can you identify bottlenecks within or arising from data management?
 Defining data representations for new experiments.
 The lack of tools to analyze the efficiency of each component in the process, as well as
the overall process itself.
3. What bottlenecks have you encountered in acquiring the data and entering it into the database
system?
 Resolving inter-lab format differences and user data-entry compliance issues.
 Changing production lines as a result of new analytical techniques and more complex
experimentation.
 Integration of robotics and binary data types has been challenging.
4. Have you developed any particularly successful solutions to specific bottlenecks?
 An XML-based exchange layer to integrate different sources of information has
resolved many of the inter-lab variations. Ongoing organizational and interface
development efforts are attacking the other bottlenecks.
5. What's the biggest lesson you learned and the most successful experience you have so far?
The following responses from lab informatics personnel provide a representative answer:
 An exchange data layer is essential for stable applications based on integrated
data/databases.
 This is not a true production environment. There are frequent changes occurring with
the data management; it is dynamic data management. For this reason there is no
one-size-fits-all solution to this problem.
 Prioritize, compromise, and learn.
 I am happy that we have already built a solid foundation in data management, but
more work needs to be done.
 Data pipelines have been very helpful in a number of areas.
 Data mining is now a routine process and has yielded some unexpected results.