Download harvesting - WebArchiv

WebArchiv – Archive of the Czech Web
Mgr. Jan HUTAŘ
http://www.webarchiv.cz
Why did we start WebArchiv?

- The number of documents published on the Internet is growing dramatically, and their average lifespan is 40 days. If the documents are not archived, a part of the national cultural heritage will disappear forever.
- We need to save the documents on the CZ web and keep them accessible.
- About 90% of the documents on the web exist only in electronic form.
- It is a worldwide trend (Australia, Sweden, Internet Archive, etc.).
- NK ČR (the National Library of the Czech Republic) is supposed to do it: it is a deposit library.
- The main mission of the NK is to collect, catalog and permanently preserve documents published in the territory and to make them available to the general public.
The beginning

- started in 2000 as an R&D grant project running until 2002: "Registration, preservation and access of national electronic resources in the Internet"
- cooperation with the Moravian Library in Brno and the Institute of Computer Science at Masaryk University in Brno
  - they are our "IT department" ;-)
- funded by grant money only
- we are still going!
Main Aims

- to implement the best solution for archiving the national web, i.e. online-born Bohemical documents
- to prepare tools, methods and conditions for collecting, archiving and preserving web resources
- to provide long-term access to them
- large-scale automated harvesting of the entire national web and selective archiving are both being carried out, including thematic "event-based" collections
- to solve current legal issues (legal deposit legislation and the Copyright Act, CA): the Legal Deposit Act does not cover online-born documents, and under the Copyright Act it is not possible to make archived data available to the public
- to set selection criteria for the selective approach / harvest
- to establish conditions for cooperation between libraries and publishers of electronic documents
Selection Criteria

The number of documents on the Internet is very large; for the selective approach we need to find the ones with "research value".
For acquisition (harvesting) there are 2 approaches:
1. selective approach: only selected documents are harvested and archived, according to the selection criteria
2. complete harvest of an entire national domain, for example .cz. We only need to configure the harvester…

- approaches differ from country to country
- the trend is to do both (Australia, Denmark)
Criteria – selective approach

- setting the selection criteria was very difficult, and the process is still under way
- we are coordinators of the "Web Cultural Heritage" project (within the EU Culture 2000 programme)
- Content
- Resource type
- Original form
- Access
- Format
- Domain
- National aspect
Criteria – selective approach (continued)
1. Contents
Web resources of artistic or research value, news stories and feature articles, and resources produced by government and other offices. Promotional material of an individual or a corporation is omitted.
2. Resource type
Serials, monographs, conference proceedings, research and other reports, academic works, etc.
3. Original form
Only resources originally published on the web, meaning they have no traditional/printed copy.
4. Access
Only freely accessible resources are collected.
Criteria – selective approach (continued)
5. Format
Resources are collected if they are available in formats that common web browsers can interpret without installing plug-ins.
6. Domain
Resources accessible at servers under the top level domain .cz and at servers under the other domains …
7. National aspect
Resources are judged by "author's nationality", "national language" and "country or nation as a subject".
What we have done…

- continuous testing of:
  - SW tools
  - applications for harvesting, archiving, indexing and accessing web pages
- only open source SW
- effort / push to change legislation
- international cooperation (R&D activities within IIPC, even though we are not members)
- we have opened part of our archive to the public (since autumn 2005)
- we are going to open the rest of the archive in 1 month (locally only)
Harvest of the .cz domain

- 2001: first attempt at a harvest of the whole .cz domain, with 1 PC + tape robot; cz2001 includes over 3 million unique URLs (107 GB); not completed
- 2002: harvest interrupted by lack of space on the data storage and by floods; cz2002 includes 315.5 GB; from 10,263,855 URLs, over 10 million documents were harvested
- 2003: no harvest
- 2004, March to October: from 32,149,396 URLs, 32.5 million files were harvested = 1.2 TB
- all harvests were executed by the NEDLIB harvester, depth 2550 links
- from 2004: new harvester, Heritrix
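The harvest figures can be turned into a rough average size per harvested item. The short Python sketch below does that arithmetic. It assumes decimal units (1 GB = 10^9 bytes), uses the label "cz2004" only for illustration, and treats "over 3 million" and "over 10 million" as exact counts, so the first two averages are upper bounds.

```python
# (label, total bytes, item count) taken from the harvest figures above.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 TB = 10**12 bytes).
HARVESTS = [
    ("cz2001", 107e9, 3_000_000),     # "over 3 million" URLs: count is a lower bound
    ("cz2002", 315.5e9, 10_000_000),  # "over 10 million" documents: also a lower bound
    ("cz2004", 1.2e12, 32_500_000),   # label "cz2004" is hypothetical
]


def average_kb(total_bytes: float, count: int) -> float:
    """Average size of one harvested item in kilobytes (1 KB = 1000 bytes)."""
    return total_bytes / count / 1000


if __name__ == "__main__":
    for label, total, count in HARVESTS:
        print(f"{label}: about {average_kb(total, count):.1f} KB per item")
```

All three harvests come out at roughly 30 to 40 KB per item, which suggests the average document size stayed fairly stable while the volume grew about tenfold.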
Present state of the project

- a collection of selected resources (agreement with NK), about 110 servers, is harvested 4-6 times a year; each harvest adds around 10 GB of data, and this is still rising
- harvesting "small" amounts of data is successful
- an analysis of the .cz domain was done: servers suspected of being irrelevant (mail, mysql, etc.) were rejected, as were duplicates; the number of URLs decreased from 540 to 378 thousand

BUT …
- since 2004 we have not been able to keep a harvest of the whole .cz domain running, because of Heritrix's problems with memory usage; a new release of Heritrix should solve this
- we plan to start a harvest of the entire .cz domain this year
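The domain analysis described above (rejecting service hosts such as mail and mysql, then dropping duplicates) could look roughly like the following Python sketch. The rejected labels come from the slide, but the real list was longer ("etc."); treating `www.` variants as duplicates is an assumption of this example, not a documented rule of the project.

```python
# Labels named on the slide; the project's real list is not known here.
REJECTED_LABELS = {"mail", "mysql"}


def normalize(host: str) -> str:
    """Lower-case the host and drop a leading "www." (assumed duplicate rule)."""
    host = host.strip().lower().rstrip(".")
    return host[4:] if host.startswith("www.") else host


def filter_hosts(hosts):
    """Return hosts in first-seen order, without rejected or duplicate entries."""
    seen = set()
    kept = []
    for raw in hosts:
        host = normalize(raw)
        if host.split(".")[0] in REJECTED_LABELS:
            continue  # service host, unlikely to carry documents
        if host in seen:
            continue  # duplicate of an already accepted host
        seen.add(host)
        kept.append(host)
    return kept


if __name__ == "__main__":
    sample = ["www.example.cz", "example.cz", "mail.example.cz", "knihovna.cz"]
    print(filter_hosts(sample))
```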
Present state of the project (continued)

- at present the archive holds about 1.7 TB of data ≈ 50 million unique documents
- the goal is to harvest the whole domain 1-2 times a year
- the main standards are used (MARC21, DC, ISSN and URN)
- selected documents are catalogued in the ALEPH library system, which supports the Z39.50 and OAI-PMH protocols
- selected resources (with agreements) are harvested at least 4 times a year
- at the end of 2006 all data will be moved to the new data repository
- in 2007 the project's archive should become part of the planned "National Digital Library" project at the National Library (together with Kramerius and Manuscriptorium)
Software changes

- 2004: development and support of the NEDLIB harvester was cancelled; we replaced it with Heritrix
- 2004-2005: gradual changeover to SW developed by IIPC (International Internet Preservation Consortium)
- the nedlib archival file format was replaced by the ARC format (used by Heritrix)
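To give an idea of what the ARC format stores per document, the Python sketch below parses one record header line. It assumes the version 1 header layout of five space-separated fields (URL, IP address, 14-digit capture date, content type, record length); the sample line in the usage part is invented, and the `ArcHeader` name is hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class ArcHeader(NamedTuple):
    url: str
    ip: str
    captured: datetime
    content_type: str
    length: int


def parse_arc_header(line: str) -> ArcHeader:
    """Parse one ARC record header line (assumed version 1 layout)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    url, ip, date, content_type, length = fields
    captured = datetime.strptime(date, "%Y%m%d%H%M%S")
    return ArcHeader(url, ip, captured, content_type, int(length))


if __name__ == "__main__":
    # Invented sample line, for illustration only.
    sample = "http://www.webarchiv.cz/ 192.0.2.1 20060401120000 text/html 1234"
    print(parse_arc_header(sample))
```

The URL and capture date in each header are what later make access "by URL and time" possible.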
Harvester Heritrix – advantages

- system modularity, extensibility, continual development (v. 1.8), very good and fast support from the Internet Archive developers
- open source code and modularity allow third parties to cooperate on its development
- 2 parts: framework and add-on modules
  - framework: basic control over harvests, user interface, process management, harvest settings
  - modules: used for a specific harvest implementation, setting up each harvest step by step
Harvester Heritrix - problems

- it is not possible to leave the whole harvesting process without expert supervision
- trap detection
- extraction of links from websites (Java)
- memory problems (whole-domain harvest)
- incremental harvest and change detection
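Trap detection is hard in general, but a simple heuristic illustrates the idea: crawler traps (endless calendars, recursive links) tend to produce very deep paths or paths that repeat the same segment. The Python sketch below is one such heuristic. It is not Heritrix's own mechanism, and both thresholds are arbitrary values chosen for this example.

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_SEGMENTS = 20  # arbitrary limit on path depth
MAX_REPEATS = 3    # arbitrary limit on how often one segment may occur


def looks_like_trap(url: str) -> bool:
    """Flag URLs whose path is suspiciously deep or repetitive."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_SEGMENTS:
        return True
    counts = Counter(segments)
    return any(n >= MAX_REPEATS for n in counts.values())


if __name__ == "__main__":
    print(looks_like_trap("http://example.cz/a/b/c.html"))
    print(looks_like_trap("http://example.cz/cal/next/cal/next/cal/next/"))
```

A heuristic like this produces false positives and misses traps hidden in query strings, which is one reason a harvest still needs expert supervision.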
SW for access

- everything comes from IIPC (IA)
- NutchWAX: full-text document indexing; an extension built on top of the Nutch search engine
- WERA (successor of the NWA tools): user interface for accessing documents on the web; it can deal with Czech diacritics (accents etc.: display them, search by them, sort)
- ARCWayback: builds an index over the whole archive and allows access to the archive by URL and time
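Access "by URL and time" can be pictured as a lookup in an index that maps each URL to a sorted list of capture times. The following Python sketch shows that idea with the standard `bisect` module. It is a toy model, not ARCWayback's implementation, and the `CaptureIndex` class and its method names are hypothetical.

```python
import bisect
from collections import defaultdict
from datetime import datetime


class CaptureIndex:
    """Toy URL + time index: url -> sorted list of capture datetimes."""

    def __init__(self):
        self._by_url = defaultdict(list)

    def add(self, url: str, captured: datetime) -> None:
        bisect.insort(self._by_url[url], captured)

    def closest(self, url: str, wanted: datetime):
        """Return the capture time nearest to `wanted`, or None if unknown."""
        stamps = self._by_url.get(url)
        if not stamps:
            return None
        i = bisect.bisect_left(stamps, wanted)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = stamps[max(i - 1, 0):i + 1]
        return min(candidates, key=lambda s: abs(s - wanted))


if __name__ == "__main__":
    index = CaptureIndex()
    for year, month in ((2004, 3), (2005, 10), (2006, 2)):
        index.add("http://www.webarchiv.cz/", datetime(year, month, 1))
    print(index.closest("http://www.webarchiv.cz/", datetime(2005, 9, 1)))
```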
Nutch and NutchWAX

Nutch
- open source search engine, used by IA
- built on the Apache Lucene architecture
Nutch is able to:
- download and process millions of sites in a month, manage and control their index, and search this index 1000 times/second
NutchWAX
- an extension built on top of the Nutch search engine, made for indexing documents archived by Heritrix
- a set of indexing and query plug-ins, which add some needed metadata to the index
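A full-text index of archived documents differs from an ordinary web index mainly in the metadata it has to carry: the same URL can appear many times with different capture dates. The Python sketch below is a toy inverted index that stores (URL, capture date) pairs per word. It only illustrates the principle, shares no code with Nutch or NutchWAX, and all names in it are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Split text into lower-cased words; \\w also matches Czech letters."""
    return re.findall(r"\w+", text.lower())


class FullTextIndex:
    """Toy inverted index: word -> set of (url, capture date) pairs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, url: str, captured: str, text: str) -> None:
        """Index one archived version; can be called again for later harvests."""
        for word in set(tokenize(text)):
            self._postings[word].add((url, captured))

    def search(self, query: str) -> set:
        """Return the versions that contain every word of the query."""
        words = tokenize(query)
        if not words:
            return set()
        hits = set(self._postings.get(words[0], set()))
        for word in words[1:]:
            hits &= self._postings.get(word, set())
        return hits


if __name__ == "__main__":
    index = FullTextIndex()
    index.add("http://example.cz/a", "20060101", "Národní knihovna archivuje web")
    index.add("http://example.cz/b", "20060101", "Městská knihovna")
    print(index.search("národní knihovna"))
```

Because `add` extends the existing postings, new harvests can be indexed without rebuilding the index from scratch.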
WERA

- WEb aRchive Access
- cooperation between IIPC, Internet Archive and NWA
- uses some parts of NWA
- very easy navigation, friendly user interface (timeline showing a document's versions over time)
- search hits are displayed in URL form, very concisely; each hit has a link to the timeline to get a different version of the same URL
- possibility to search by URL address (like the Wayback Machine)
- archived documents and WERA are linked by the NutchWAX index
How does it actually work?

1. harvest of documents by the Heritrix crawler; documents are saved to data storage in the ARC format
2. to make archived documents accessible we have to build an index + an interface that displays search hits
- building the full-text index over the collection of selected resources, for searching by words: NutchWAX
- building a global index to provide access to the whole archive: ARCWayback
- displaying documents from the archive: WERA and Wayback
WERA - demo
Our future

- main aim for 2006: to start the harvest of the whole .cz domain and keep it running
- continue with the selective collection and increase the number of resources in it
- provide legal access to the whole archive, locally, in accordance with the new CA (searching by URL and by the time of harvest)
- implementation of incremental harvest (identification of changes in repeatedly harvested documents)
- harvesting of Bohemical resources outside the .cz domain: some language recognition tool
- adaptive incremental harvesting
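Identifying changes in repeatedly harvested documents can be sketched by comparing content digests between two harvest rounds. The Python example below does this with SHA-1. It is a minimal illustration under the assumption that byte-identical content means "unchanged"; the function and key names are hypothetical, and this is not how any particular harvester implements it.

```python
import hashlib


def digest(content: bytes) -> str:
    """Content fingerprint used to compare two versions of a document."""
    return hashlib.sha1(content).hexdigest()


def classify(previous: dict, current: dict) -> dict:
    """Compare harvest rounds.

    previous: url -> digest stored from the last harvest
    current:  url -> raw bytes fetched in this harvest
    """
    result = {"new": [], "changed": [], "unchanged": [], "gone": []}
    for url, body in current.items():
        if url not in previous:
            result["new"].append(url)
        elif previous[url] != digest(body):
            result["changed"].append(url)
        else:
            result["unchanged"].append(url)
    result["gone"] = sorted(set(previous) - set(current))
    return result


if __name__ == "__main__":
    last = {"http://example.cz/a": digest(b"old"), "http://example.cz/b": digest(b"same")}
    now = {"http://example.cz/a": b"new text", "http://example.cz/b": b"same",
           "http://example.cz/c": b"fresh"}
    print(classify(last, now))
```

An incremental harvest would then store only the "new" and "changed" documents, and an adaptive one could revisit frequently changing URLs more often.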
Our future (continued)

- identification of duplicate (or rather very similar) documents
- incremental indexing: adding new documents to an existing index instead of building a new one every time
- full-text indexing of the whole archive
- selective harvesting on demand
- permanent linking into the archive
- access limitations set by the new copyright law
- OAI-PMH implementation on top of the registration database
- building METS structures on top of the archive
- integration of the archive into the proposed NDL in 2007
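For "very similar" rather than identical documents, exact digests are not enough. One common technique is to compare sets of word shingles with the Jaccard coefficient, sketched below in Python. This is a generic illustration, not the method the project chose, and the shingle length of 3 words is an arbitrary assumption.

```python
def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Similarity in [0, 1]: shared shingles divided by all shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    first = "the national library archives the czech web every year"
    second = "the national library archives the czech web every month"
    print(jaccard(first, second))
```

Documents whose similarity exceeds a chosen threshold could be treated as near-duplicates; picking that threshold is a policy decision the sketch leaves open.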
Useful links – in English ;-)

WebArchiv homepage
http://en.webarchiv.cz/

Petr Žabička
Digital Cultural Heritage and the Cooperation of National Memory Institutes
Archiving the Czech Web: Issues and Challenges

this presentation
http://www.webarchiv.cz/files/dokumenty/konference/hutarENG.ppt