Download Andrew Moss - Department of Computer Science

Lab I – READ Product Description
Andrew Moss
CS411
Janet Brunelle
March 18, 2013
Version 2
Table of Contents
1 Introduction
2 READ Product Description
2.1 Key Product Features and Capabilities
2.2 Major Components (Hardware/Software)
3 Identification of Case Study
4 READ Product Prototype Description
4.1 Prototype Architecture (Hardware/Software)
4.2 Prototype Features and Capabilities
4.3 Prototype Development Challenges
Glossary
References
List of Figures
Figure 1 - Major Functional Component Diagram
List of Tables
Table 1 - Side-by-side Comparison of Real World Product and Prototype
1 Introduction
Publications are the primary method of distributing the results that come from conducting
research. There are approximately 4,600 universities (NCES, 2011) that “account for more than
half of the basic research conducted in the United States” (McRobbie, 2012). Unfortunately,
many of these institutions lack an efficient online resource for organizing and displaying both the
publications resulting from their research and information about the grants that helped finance it.
Such a system would provide research universities and the departments therein, as well as the
students and professors performing the research, with increased recognition and awareness of
their work.
One example of a university in need of an improved publication system is Old Dominion
University (ODU) in Norfolk, Virginia. Their Computer Science Department (ODUCS), in
particular, would benefit a great deal from having a well-maintained online system for
publications and grants as it lacks one entirely. This department’s professors are burdened with
manually updating their own web pages to provide awareness of their recent publications. In the
past there was a single web page for the entire department that was maintained by an individual
member of their Systems Group. However, this page was last updated in 2008, likely a result of
the slow, tedious, and manual nature of the process.
The team behind READ, a Repository for Electronic Aggregation of Documents, intends
to alleviate the lack of quality online resources for displaying publications and grants. The
READ system will use a scraper to provide researchers with a method of organizing their
publications and grants in a format that allows for easy searching, sorting, filtering, and
browsing. Additionally, content authors will be able to verify that the listed publications are
actually their own work in the event that READ mistakenly shows something written by another
researcher with the same name.
There will be a prototype READ system developed for ODUCS as a proof of concept to
display its most basic capabilities. A prototype is necessary due to time constraints placed on
development. This prototype will provide public and private user interfaces to publication and
grant databases, user controls for publication verification, and most importantly, a scraper that
will gather links to publications automatically at set intervals to minimize manual effort.
2 READ Product Description
READ is an automated system using a database to store links to articles, the publications
themselves and information about grants involved. It will allow anyone with Internet access to
browse the lists of publications and filter them by author, date, keywords, and publication type. It
will minimize the need for manual effort on the part of the author by automatically finding their
publications, making it easier to manage the work they have already done. This will allow faculty
to spend more time actually conducting the research that attracts new students and funding alike.
Lastly, it will maximize the amount of information available.
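The filtering described above can be illustrated with a minimal Python sketch. It is not taken from the READ code base: the Publication record, its field names, and the filter_publications function are hypothetical, and the sketch only assumes that each publication carries the four attributes named in this section (author, date, keywords, and publication type).

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Publication:
    # Hypothetical record layout; the field names are assumptions.
    title: str
    authors: list[str]
    published: date
    pub_type: str                      # e.g. "journal" or "conference"
    keywords: set[str] = field(default_factory=set)


def filter_publications(pubs, author=None, year=None, keyword=None, pub_type=None):
    """Return the publications that match every filter that was supplied."""
    results = []
    for pub in pubs:
        if author is not None and author not in pub.authors:
            continue
        if year is not None and pub.published.year != year:
            continue
        # Keyword matching is case-insensitive.
        if keyword is not None and keyword.lower() not in {k.lower() for k in pub.keywords}:
            continue
        if pub_type is not None and pub.pub_type != pub_type:
            continue
        results.append(pub)
    return results
```

Filters that are left as None are ignored, so a viewer could combine any subset of them, for example filter_publications(pubs, author="J. Doe", pub_type="journal").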
2.1 Key Product Features and Capabilities
The most essential element of the READ solution is the Schaefer Scraper. This is an
algorithm developed by Andrew Schaefer, a graduate student at ODU, with the help of several
other ODUCS graduate students. The Schaefer Scraper combs external websites looking for
publications written by a specific author. It then extracts relevant information such as the title,
attributed authors, date, and the type of publication. This information is then inserted into a
database along with a link to the page where the publication can be accessed. After the database
has been updated, READ will then send a notification e-mail to the associated author so that
newly found documents can be verified with a single mouse click. Based on the responses to
these verification e-mails, READ will learn when a publication is likely to have been written by
a different author of the same name and decide on its own whether the publication truly belongs
to the associated author.
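As a rough illustration of the store-then-verify flow described above, the following Python sketch keeps scraped records in an SQLite table and updates a status column when the author responds. The table layout, the column names, the status values, and the function names are all assumptions made for this example; the actual READ schema is not specified in this document, and the learning step is not modeled here.

```python
import sqlite3

# Hypothetical schema: one row per (author, link) pair found by the scraper.
SCHEMA = """
CREATE TABLE IF NOT EXISTS publication (
    id       INTEGER PRIMARY KEY,
    author   TEXT NOT NULL,
    title    TEXT NOT NULL,
    pub_date TEXT,
    pub_type TEXT,
    url      TEXT NOT NULL,
    status   TEXT NOT NULL DEFAULT 'pending',
    UNIQUE (author, url)
)
"""


def open_db(path=":memory:"):
    """Open the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_scraped(conn, author, title, pub_date, pub_type, url):
    """Insert a scraped record; return True only if it was not already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO publication (author, title, pub_date, pub_type, url) "
        "VALUES (?, ?, ?, ?, ?)",
        (author, title, pub_date, pub_type, url),
    )
    conn.commit()
    return cur.rowcount == 1


def record_verification(conn, pub_id, is_mine):
    """Record the author's one-click answer to the verification e-mail."""
    status = "verified" if is_mine else "rejected"
    conn.execute("UPDATE publication SET status = ? WHERE id = ?", (status, pub_id))
    conn.commit()
```

A notification e-mail would only need to be sent when store_scraped returns True. The queries use placeholders rather than string concatenation, which also guards against the SQL injection risk listed in the Glossary.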
The Schaefer Scraper will also extract information about grants from external websites.
In addition to filtering the displayed publications as discussed in Section 2, READ will also
allow viewers to filter the grants displayed by the amount, the status of the grant, the funding
agency, and the principal investigator. Each faculty member will have a publicly available
profile that will display his or her name, title, organization, and homepage. The profile will also
contain graphical representations of the number of publications created and amount of grant
funding received along with a filterable list of the author’s publications.
2.2 Major Components (Hardware/Software)
As can be seen in Figure 1, the foundation of the READ solution consists of a single
server (this can be physical or virtual). It will be home to three main software components: a web
interface, a publication and grant link database, and Schaefer's Scraper. In order to implement
this solution, a web server and SQL database server software will also be required but can be
kept on the same host if desired.
Figure 1 – Major Functional Component Diagram
The web interface contains both public and private sections. The latter will be accessible
only to document authors and administrative staff. Access to this section will be strictly
protected by requiring user authentication before it can be viewed. Figure 1 shows Dr. Michele
Weigle, the READ team’s mentor, as an example author using the READ system.
The next component of the READ solution is the publication link database. The
database's primary function will be to provide links to externally located publications and grant
information. It will also contain files uploaded directly by the authors.
The last and, undoubtedly, most important element of READ is Schaefer's Scraper. This
is an automated tool that will regularly comb a specific list of external web sites for new
publications submitted by a known list of authors. Both of these lists are provided as input in
XML files. The algorithm consists of nested loops where for each author, and for each external
web site, the Scraper will search for publications by the author, parse the results, and export them
to the READ link database.
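The nested-loop structure described above might look like the following Python sketch. It is only an outline under stated assumptions: the XML element names (authors/author and sites/site) are invented for the example, and the search and export steps are passed in as functions because the real search, parsing, and database code are not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical input files, shown inline; the element names are assumptions.
AUTHORS_XML = "<authors><author>Jane Doe</author><author>John Roe</author></authors>"
SITES_XML = "<sites><site>https://example.org/search</site></sites>"


def read_list(xml_text, tag):
    """Return the text of every <tag> element in the given XML document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(tag) if el.text]


def run_scraper(authors, sites, search, export):
    """For each author and each site, search, then export every parsed record.

    `search(site, author)` must return an iterable of parsed records and
    `export(record)` must store one record; both are supplied by the caller.
    Returns the number of records exported.
    """
    count = 0
    for author in authors:
        for site in sites:
            for record in search(site, author):
                export(record)
                count += 1
    return count
```

Passing the search and export steps in as arguments keeps the loop itself testable without network access or a live database.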
3 Identification of Case Study
The initial customer for READ is Old Dominion University's Computer Science
Department. This is primarily due to the fact that the solution was requested by the team’s
mentor, Dr. Michele Weigle, a prominent professor in the department. In addition to Dr. Weigle,
according to its website, the department features a total of 37 faculty members, 11 currently
enrolled Ph.D. students, and 111 currently enrolled Master's students. That is potentially 159
authors who could benefit from a system that would make it easy for others to find their research.
After successful testing at ODUCS, READ could then be used by other departments at Old
Dominion University. Even further into the future, this solution could potentially be utilized by
other universities, governments or non-profit research institutions. Numerous organizations
could benefit from making their publications easier to manage.
4 READ Product Prototype Description
As mentioned in Section 1, due to time constraints, it will be necessary to develop a
prototype to display the most basic functionality of the READ solution. The READ prototype
will use data from real authors from ODUCS and the database will be populated with their
publications by Schaefer’s Scraper. It will offer nearly the same functionality as the Real World
Product (RWP). Due to time constraints, the prototype will not feature graphical representation
of data about publications and grants, nor will it implement a learning algorithm to automatically
decide whether a publication does or does not likely belong to a specific author. This is shown in
Table 1.
Table 1 – Side-by-side Comparison of Real World Product and Prototype
Features | Real World Product | Prototype
Browsing Capabilities | Ability to browse all grants and publications | Ability to browse all grants and publications
Publication Filtering Capabilities | Filtered by title, publisher, authors, publication date, date added, and keywords. | Filtered by title, publisher, authors, publication date, date added, and keywords.
Grant Filtering Capabilities | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state.
Add, edit, and delete publications and grants | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document.
Faculty page | Lists faculty and provides a link to each person’s profile page | Not included.
Login interface | Linked to Old Dominion University Computer Science accounts | Linked to Old Dominion University Computer Science accounts
Profile Page | Displays authors’ profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Displays graphs | Displays authors’ profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Graphs not included.
Scraper | Will update the system with new publications and grants and alert users when one is added to the system under their name. | Will update the system with publications only and alert users when one is added to the system under their name.
Prediction algorithm | Predicts if the consumer has enough space to use the READ system. | Not included
Administrative Privileges | Administrators are able to edit, add, or remove anything in the system. | Administrators are able to edit, add, or remove anything in the system.
4.1 Prototype Architecture (Hardware/Software)
The prototype will use the same major functional components as the RWP, as seen in
Figure 1. The server will be a virtual machine (VM) running Debian Linux. The web interface
will be served via Open Source web server software running on the VM. The database will be
stored and served from the same virtual machine using Open Source database software.
Schaefer’s Scraper is software that has already been written in PHP with which READ will
interface. The scraper has a list of external sites that it searches for publications. This list is hard
coded in the scraper itself. The results of the search are output in HTML. A method of exporting
the scraper’s output to a format friendly to the database will need to be developed. Finally, the
user interfaces will need to be coded as described in Section 2.2.
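One possible shape for the export step is sketched below in Python, using only the standard library's html.parser module. The layout of the scraper's HTML output is not documented here, so the sketch assumes, purely for illustration, that each publication appears as a link whose text is the title; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect one {"title", "url"} record per <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._href = None     # href of the link currently being read, if any
        self._text = []       # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            title = " ".join("".join(self._text).split())
            self.records.append({"title": title, "url": self._href})
            self._href = None


def html_to_records(html_text):
    """Convert scraper HTML output into a list of database-friendly dicts."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.records
```

Each resulting dictionary could then be handed to whatever routine inserts rows into the publication link database. Anchors without an href attribute are skipped.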
4.2 Prototype Features and Capabilities
The primary feature of the prototype is the automation provided by Schaefer’s Scraper.
Manual effort is the most significant hindrance to maintaining up-to-date lists of grants and
publications. Automating the search also removes much of the capacity for human error.
The prototype will allow anyone on the Internet to browse all grants and publications.
While browsing, the viewer will be able to apply a variety of filters to the information displayed
as described in Section 2. The viewer will also have access to thumbnail images associated with
publications and grants. Authors will be able to log in to a private interface using their ODUCS
Unix/Linux credentials. Through this private interface they will be able to manage the
publications and grants associated with their accounts.
4.3 Prototype Development Challenges
The biggest obstacle in the development of the prototype will be the fact that Schaefer’s
Scraper, as provided to the team, is completely non-functional. It is poorly documented PHP that
should be re-written in Python. The scraper will also need to be integrated with the publication
and grant links database. Lastly, the limited amount of time available for the development of the
prototype will also be a formidable obstacle.
Glossary
Administrator/Administrative User: a user with increased privileges for editing database content
Author: A person who is able to add publications and grants to the system under their name and
edit them.
BibTeX: A plain-text file format for bibliographic reference information. It will be used to automatically
fill in key information when uploading or editing publications and grants.
Computer Science (CS): An academic discipline based on advancing computing theory and
algorithm development, that sometimes includes theory about software engineering methods.
Client application: In a client/server architecture, the module that takes input and creates queries
to be processed by a server, and receives the results from the server.
Client/Server Architecture: A software engineering paradigm that separates functionality into a
“client” application and a “server” application that interact.
CSS: Cascading Style Sheets, a style sheet language used to specify the presentation of HTML pages.
Data Mining: The act of going through a source of input to find specific information.
Database Schema: A description of the structure of a database.
Funding Agency: The source of funds for research grants. These organizations usually have a
limited amount of money to award to principal investigators who submit an accepted
application for research funds.
Git: A software system for controlling and organizing software versioning.
GoogleScholar (http://scholar.google.com): Google Scholar provides a simple way to broadly
search for scholarly literature. From one place, you can search across many disciplines and
sources: articles, theses, books, abstracts and court opinions, from academic publishers,
professional societies, online repositories, universities and other web sites. Google Scholar helps
you find relevant work across the world of scholarly research.
Graphical User Interface (GUI): A computer interface composed of icons, text fields, menus, etc
that can be interacted with via a mouse and keyboard, through which a user interacts with a
software application. Used to differentiate from a “command-line interface”, in which a user
interacts with a software application solely through a text terminal.
Internet scraper / web scraper: Web scraping focuses on the transformation of unstructured
data on the web, typically in HTML format, into structured data that can be stored and analyzed
in a central local database or spreadsheet (Wikipedia).
jQuery Sparklines: A development library for the visualization of data.
ODU: Old Dominion University.
MicrosoftAcademic (http://academic.research.microsoft.com/): Microsoft Academic Search is a
free service developed by Microsoft Research to help scholars, scientists, students, and
practitioners quickly and easily find academic content, researchers, institutions, and activities.
Microsoft Academic Search indexes not only millions of academic publications, it also displays
the key relationships between and among subjects, content, and authors, highlighting the critical
links that help define scientific research. Microsoft Academic Search makes it easy for you to
direct your search experience in interesting and heretofore hidden directions with its suite of
unique features and visualizations.
MySQL: An open-source relational database management system that is queried using SQL.
Parse: A technical term usually used to describe the processing of a statement written in a
programming language. May be used generally to describe the processing of any statement for
specific meaning.
Perl: A widely-used programming language on the server-side of web applications.
PHP: A widely-used programming language on the server-side of web applications.
Principal Investigator (PI): The primary researcher that a research grant is bestowed upon,
responsible for documenting the work and publishing research results.
Publication or Academic Publication: A document created by a faculty member to share
research. Publications usually appear in academic journals, technical reports, and
conference proceedings.
Query: A request sent to the database to either change its contents or retrieve results.
READ: Repository for Electronic Aggregation of Documents
RSS: A system for subscribing to and distributing news.
Scraper: An automated application designed to scan a source of input such as a document or a
website for pertinent information.
Server application: In a client/server architecture, the module that takes queries or requests from
a client module, processes them, and returns the result to the client.
Software Compatibility: A description of whether different software products, or versions of the
same software, can communicate/interact.
SQL: A widely-used programming language used to query databases.
SQL injection: An attack in which malicious SQL is inserted into an application's input so that unauthorized queries are run on its database.
User Authentication: The process of verifying the access credentials of a user of an automated
system, usually accomplished by requesting a username and password combination.
Viewer: In the scope of our project, an outside person who wishes to query the information
contained in the READ database.
Version Control: A method for organizing and recording different versions of documents that
have been created over time.
Virtual Private Server (VPS): A software version of a hardware server. Used to create
independent servers (....) on a single piece of hardware.
Webserver: A group of applications run on a computer or VPS in order to serve webpages and provide
server-side computation for browser-based client applications. A web server is a constantly “on”
resource whose sole or main job is to respond to HTTP requests from browsers.
XML: Extensible markup language.
References
McRobbie, M. A. (2012, December 19). The Multibillion-Dollar Threat to Research
Universities. The Chronicle of Higher Education.
http://chronicle.com/article/The-Multibillion-Dollar-Threat/136363/
National Center for Education Statistics. (2011). Degree-granting institutions and branches, by controls
and level of institution and state or jurisdiction, 2010-11. Digest of Education
Statistics. http://nces.ed.gov/programs/digest/d11/tables/dt11_280.asp