Multimedia search engine
Michal Krsek, UISK Charles University at Prague & CESNET
Ivan Doležal, CESNET
Michal Illich, Jyxo
Electronic Media
• TV & radio
• Organized in channels
• Zero democracy in programming (by channel management)
• Centralized production (big guys business)
Internet
• Not only web (audio/video and others)
– remember archie.sura.net?
• IPTV / Live / Video on demand
• Navigation only via web
=> not easy to find a specific program in A/V content
Search options I
• Voice recognition
– Language identification
– Accents
• Video recognition
– Text interpretation (bush vs. Bush)
– Low video quality
Search options II
• Indexing of web pages
– Yahoo! does (google bomb target)
• Metadata
– “Out-of-band” metadata (as in the librarian world)
– Metadata in files (added during editing or encoding)
Project description
• Started in 2003 (oh yes, one year before Truveo)
• “Google for audio and video on Internet”
• No support from content owners
• Modular concept
• Start with .cz Internet
Technical description I
• Crawler
– Crawls web and collects addresses (URL)
– Exports URL of multimedia files
– Software written by Jyxo (Linux console app)
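The export step of the crawler can be illustrated with a minimal sketch. It assumes that multimedia files are recognized by file extension; the slides do not say how the Jyxo crawler actually decides, and the extension list and function names below are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical extension list; the slides do not name the matched formats.
MEDIA_EXTENSIONS = {
    ".mp3", ".ogg", ".wav", ".avi", ".mpg", ".mpeg",
    ".wmv", ".asf", ".mov", ".rm",
}


def is_multimedia_url(url: str) -> bool:
    """Return True if the URL path ends with a known A/V extension."""
    path = urlparse(url).path  # drops query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in MEDIA_EXTENSIONS


def export_multimedia_urls(urls):
    """Keep multimedia URLs only, in first-seen order, without duplicates."""
    seen = set()
    exported = []
    for url in urls:
        if url not in seen and is_multimedia_url(url):
            seen.add(url)
            exported.append(url)
    return exported
```

A real crawler would probably also look at the Content-Type header, since streaming links often carry no useful extension.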
Technical description II
• Distiller
– Imports addresses of multimedia files
– Distills metadata (and makes XML files)
– Makes screenshots (if video in file)
– C# software and mplayer (Windows apps)
– Runs in a distributed environment
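The "makes XML files" step could look roughly like the sketch below. The real distiller is C# software and its XML schema is not shown in the slides, so the element and attribute names here (`media`, `field`, `screenshot`) are purely illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(url, metadata, screenshots=()):
    """Serialize distilled metadata for one multimedia URL to an XML string.

    The schema is hypothetical: one <field> per metadata key and
    one <screenshot> per saved image file.
    """
    root = ET.Element("media", {"url": url})
    for name in sorted(metadata):
        field = ET.SubElement(root, "field", {"name": name})
        field.text = str(metadata[name])
    for path in screenshots:
        ET.SubElement(root, "screenshot", {"file": path})
    return ET.tostring(root, encoding="unicode")
```

Keeping one XML document per URL matches the modular concept: the database side only has to import files and needs no knowledge of the players used for distillation.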
Technical description III
• Database
– Imports XML metadata files to full text DB
– Answers back-end queries for web queries
– And other full-text things (e.g. language)
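As an illustration of what the full-text back end does with imported metadata, here is a toy inverted index. It is only a sketch of the idea; the slides do not describe the actual full-text database, and the class and method names are hypothetical.

```python
import re
from collections import defaultdict


class FullTextIndex:
    """Toy inverted index: token -> set of URLs whose metadata contains it."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _tokens(text):
        # Lowercasing is a simplification; it loses case distinctions.
        return re.findall(r"\w+", text.lower())

    def add(self, url, text):
        """Index the metadata text distilled for one URL."""
        for token in self._tokens(text):
            self._postings[token].add(url)

    def search(self, query):
        """Return URLs that contain every token of the query (AND search)."""
        tokens = self._tokens(query)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result
```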
[Diagram: processing pipeline]
• Crawling: crawls web pages, gets addresses, filters A/V addresses
• Distillation: gets metadata from multimedia files
• Indexing and search: holds full-text database, provides back end for queries
Distillation
• Process description
– Get URL from DB
– Get metadata from file available at URL
– Get screenshots at 1, 30 and 50 sec
– Save metadata & screenshots
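The screenshot step can be sketched as a function that builds one mplayer command line per offset. The slides only say that mplayer is used; the exact options are an assumption (`-ss`, `-frames`, `-vo jpeg:outdir=` and `-nosound` are standard mplayer options), and the function name and directory layout are hypothetical.

```python
def screenshot_commands(url, out_dir, offsets=(1, 30, 50)):
    """Build one mplayer argument list per screenshot offset (in seconds).

    Each run seeks to the offset, decodes a single frame and writes it
    as JPEG into its own sub-directory of out_dir (assumed layout).
    """
    commands = []
    for offset in offsets:
        commands.append([
            "mplayer",
            "-nosound",              # audio is not needed for a still image
            "-ss", str(offset),      # seek to the offset in seconds
            "-frames", "1",          # stop after one decoded frame
            "-vo", f"jpeg:outdir={out_dir}/{offset:03d}",
            url,
        ])
    return commands
```

Each list can be passed to `subprocess.run(command, timeout=...)`; a timeout is advisable because slow servers may never deliver the requested position.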
Distillation
• Use of win32 applications
– Native players (WMP, RP, Qt) for metadata
– Mplayer for screenshots
• Takes one minute per URL on average
– Slow servers/bandwidth
– Streaming without fast forward
DistillerGRID
• <= would need 16 years to distill 8.500.000 URLs
• Ideal application for GRID computing
– No need for real-time response
– Huge amount of computing time needed
• Two ways to create GRID
– Build dedicated system
– Use of current capacities
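The 16-year figure follows from one minute per URL on a single machine. The short calculation below reproduces it and shows, under the idealized assumption of perfectly parallel workers running around the clock, how the time shrinks with about 100 machines (the number of lab PCs given on the Configuration slide).

```python
URLS = 8_500_000
MINUTES_PER_URL = 1               # average distillation time from the slides
MINUTES_PER_YEAR = 60 * 24 * 365


def years_needed(urls, minutes_per_url, workers=1):
    """Wall-clock years to distill all URLs with ideal parallel workers."""
    return urls * minutes_per_url / (workers * MINUTES_PER_YEAR)


single_machine_years = years_needed(URLS, MINUTES_PER_URL)
grid_days = years_needed(URLS, MINUTES_PER_URL, workers=100) * 365

print(f"{single_machine_years:.1f} years on one machine")  # about 16.2
print(f"{grid_days:.0f} days on 100 machines")             # about 59
```

Real throughput will be lower, since the lab machines run only on demand.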
Computing machines
• PC/Windows based
• HW independent
• Secure environment
– Security of hosting system
– Security of distillation process
• Well connected
• No need to run 24x7
• Easy to manage
Configuration
• ~100 PCs in student labs
• Running on demand during weekends
• Virtual machines (MS VPC 2004) in hosting
system (Win XP)
• Three different HW configurations
• Peak rate about 5000 URLs per minute
• SQL database as back end -> pull distribution of work
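Pull distribution means that each distillation station asks the database for its next URL instead of being assigned work by a central scheduler. A minimal sketch with SQLite follows; the slides do not describe the real schema, so the `jobs` table, its columns and the function names are hypothetical.

```python
import sqlite3


def create_queue(conn, urls):
    """Create the hypothetical jobs table and enqueue URLs as pending."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY,"
        " url TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT)"
    )
    conn.executemany("INSERT INTO jobs (url) VALUES (?)", [(u,) for u in urls])
    conn.commit()


def pull_job(conn, worker):
    """Claim the oldest pending job for this worker; return (id, url) or None."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, url FROM jobs WHERE status = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        claimed = conn.execute(
            "UPDATE jobs SET status = 'running', worker = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker, row[0]),
        )
        # Another worker may have claimed the job between SELECT and UPDATE.
        return row if claimed.rowcount == 1 else None


def finish_job(conn, job_id):
    """Mark a claimed job as done."""
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
```

With pull distribution a machine that is switched off simply stops asking for work, which suits lab PCs that are available only on demand.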
Actual status I
• HW
– 20 crawlers
– 2 servers for fulltext DB (<1.400 USD)
– Distillation stations (X office PCs)
– Connected by 1 Gb/s to CESNET2 -> GEANT2
Actual status II
• Database
– EU + .com, .edu
– > 13.000.000 URLs
– > 8.000.000 valid
– > 2.800.000 with screenshots
Live show?
Want to test?
• URLs
– http://multimedia.jyxo.cz
– http://videoserver.cesnet.cz/videoarchiv_en.php
– For the XML interface, send me an e-mail
Questions ?
Comments ?
Michal Krsek, [email protected] (academic service, cooperation)
Michal Illich, [email protected] (business service)