Multimedia search engine
Michal Krsek, UISK Charles University at Prague & CESNET
Ivan Doležal, CESNET
Michal Illich, Jyxo

Electronic Media
• TV & radio
• Organized in channels
• Zero democracy in programming (decided by channel management)
• Centralized production (big guys' business)

Internet
• Not only web (audio/video and others); remember archie.sura.net?
• IPTV / Live / Video on demand
• Navigation only via web => not easy to find a specific program in A/V

Search options I
• Voice recognition
  – Language identification
  – Accents
• Video recognition
  – Text interpretation (bush vs. Bush)
  – Low video quality

Search options II
• Indexing of web pages
  – Yahoo! does (google bomb target)
• Metadata
  – "Out-of-band metadata" (as in the librarian world)
  – Metadata in files (added during editing or encoding)

Project description
• Started in 2003 (oh yes, one year before Truveo)
• "Google for audio and video on the Internet"
• No support from content owners
• Modular concept
• Start with the .cz Internet

Technical description I
• Crawler
  – Crawls the web and collects addresses (URLs)
  – Exports URLs of multimedia files
  – Software written by Jyxo (Linux console app)

Technical description II
• Distiller
  – Imports addresses of multimedia files
  – Distills metadata (and makes XML files)
  – Makes screenshots (if the file contains video)
  – C# software and mplayer (Windows apps)
  – Runs in a distributed environment

Technical description III
• Database
  – Imports XML metadata files into the full-text DB
  – Answers back-end queries for web queries
  – Other full-text features (e.g. language)

[Diagram: processing pipeline]
• Crawling
  – Crawls web pages
  – Gets addresses
  – Filters A/V addresses
• Distillation
  – Gets metadata from multimedia files
• Indexing
  – Holds the full-text database
• Search
  – Provides the back end for queries

Distillation
• Process description
  – Get URL from DB
  – Get metadata from the file available at the URL
  – Get screenshots at 1, 30, 50 sec
  – Save metadata & screenshots

Distillation
• Use of win32 applications
  – Native players (WMP, RP, Qt) for metadata
  – Mplayer for screenshots
• Takes one minute on average
  – Slow servers/bandwidth
  – Streaming without fast forward

DistillerGRID
• => 16 years needed to distill 8,500,000 URLs
• Ideal application for GRID computing
  – No need for real-time response
  – Huge amount of computing time needed
• Two ways to create a GRID
  – Build a dedicated system
  – Use existing capacity

Computing machines
• PC/Windows based
• HW independent
• Secure environment
  – Security of the hosting system
  – Security of the distillation process
• Well connected
• No need to run 24x7
• Easy to manage

Configuration
• ~100 PCs in student labs
• Running on demand during weekends
• Virtual machines (MS VPC 2004) in the hosting system (Win XP)
• Three different HW configurations
• Peak rate about 5000 URLs per minute
• SQL as back end -> pull distribution of work

Current status I
• HW
  – 20 crawlers
  – 2 servers for the full-text DB (<1,400 USD)
  – Distillation stations (X office PCs)
  – Connected by 1 Gb/s to CESNET2 -> GEANT2

Current status II
• Database
  – EU + .com, .edu
  – > 13,000,000 URLs
  – > 8,000,000 valid
  – > 2,800,000 with screenshots

Live show? Want to test?
• URLs
  – http://multimedia.jyxo.cz
  – http://videoserver.cesnet.cz/videoarchiv_en.php
  – For the XML interface, send me an e-mail

Questions? Comments?
Michal Krsek, [email protected] (academic service, cooperation)
Michal Illich, [email protected] (business service)
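
Note on the DistillerGRID estimate: the 16-year figure follows from applying the one-minute average distillation time serially to 8.5 million URLs. The sketch below reproduces that arithmetic. The function names are hypothetical, and the parallel case assumes perfectly divisible work on machines that run around the clock, which the deck does not claim (the lab PCs run only on demand during weekends).

```python
# Back-of-the-envelope estimate for distillation time.
# Assumptions (not from the project): 365-day year, no overhead,
# work splits evenly across workers that never go offline.

MINUTES_PER_DAY = 60 * 24
MINUTES_PER_YEAR = MINUTES_PER_DAY * 365


def serial_years(url_count, minutes_per_url=1.0):
    """Years needed to distill all URLs one after another on one machine."""
    return url_count * minutes_per_url / MINUTES_PER_YEAR


def parallel_days(url_count, workers, minutes_per_url=1.0):
    """Days needed if the same work is split evenly across `workers`."""
    return url_count * minutes_per_url / workers / MINUTES_PER_DAY


if __name__ == "__main__":
    print(f"{serial_years(8_500_000):.1f} years on one machine")
    print(f"{parallel_days(8_500_000, 100):.0f} days on 100 always-on machines")
```

With the deck's numbers the serial case comes out at about 16 years. The idealized 100-machine case comes out at roughly 59 days, which should be read as a lower bound given the weekend-only schedule.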