WMES3103: INFORMATION RETRIEVAL
TEXT AND MULTIMEDIA LANGUAGES AND PROPERTIES

INTRODUCTION
* Text is the main form of communicating data and information
* Text is also supplemented with multimedia elements to make the contents of an information retrieval system (IRS) more attractive and interactive
* A website that combines text and multimedia tends to attract more visitors than one that is text-only
* In an IRS, text and multimedia are represented via special languages

METADATA
* Metadata is a newer concept in information handling
* Information about data arrangement, the data domain, and the relationship between the two
* Data about data
* 2 types: descriptive and semantic
* Descriptive metadata: metadata that describes the document or unit of information itself
* Commonly used descriptive metadata:
  * Authors
  * Date of publication
  * Source of publication
  * Length of document
  * Type of document
* Semantic metadata: characterizes the subject matter that can be obtained from the contents of the document
  * Subject headings
  * Keywords
  * LC code

TEXT
* With computers, we need to code text into binary digits
* First coding schemes: EBCDIC and ASCII; ASCII originally used 7 bits to code each symbol
* ASCII was then extended to 8 bits to accommodate other languages, accents and diacritical marks
* Oriental languages: Unicode, designed around 16 bits per symbol

TEXT FORMATS
* There is no single format for a text document
* A good IRS should be able to retrieve information from any format
* Initially, an IRS would convert a document to an internal format, but this had many disadvantages
* Now, many new formats have been developed for document interchange
* RTF: Rich Text Format, for word processing
* PDF: Portable Document Format, for displaying and printing documents
* PostScript: a powerful programming language for drawing
* MIME: Multipurpose Internet Mail Extensions, to encode e-mail
* Compression formats: Compress (Unix), ARJ (PCs), ZIP
* Converting binary files to ASCII text: uuencode/uudecode, binhex

MARKUP LANGUAGES
* Markup = extra textual syntax that can be used to describe formatting actions, structure information, text semantics, attributes, etc.
* Formal markup languages are more structured
* Marks = tags: an initial and an ending tag surround the marked text
* Standard metalanguage = SGML
* New metalanguage for the Web = XML (eXtensible Markup Language), a subset of SGML
* Most popular markup language used for the Web = HTML (HyperText Markup Language)

MULTIMEDIA
* Applications that handle different types of digital data originating from distinct types of media
* Text, sound, images, video
* The digital data differ in volume, format, and processing requirements
* Different formats are necessary for storing each type of media
* Formats commonly used on the Web and in digital libraries cover:
  * Images
  * Audio
  * Moving images
  * Textual images
  * Graphics and virtual reality

IMAGES
* XBM, BMP, PCX: direct representation of a bitmapped (pixel-based) image
* GIF (Graphics Interchange Format): includes compression; good for black-and-white pictures and for pictures with a small number of colours or gray levels (256)
* JPEG (Joint Photographic Experts Group): includes compression
* TIFF (Tagged Image File Format): used to exchange documents between different applications and different computer platforms
* TGA (Truevision Targa image file): associated with video game boards
* Various other image formats

AUDIO
* Must be digitized before storage
* AU, MIDI (standard format to interchange music between electronic instruments and computers), WAVE: for small pieces of digital audio
* Audio libraries: RealAudio or CD formats

Animation or moving pictures
* MPEG (Moving Picture Experts Group): related to JPEG
* Others: AVI, FLI, QuickTime

TEXTUAL IMAGES
* Images that contain mainly typed or typeset text
* Obtained by scanning the documents
* Used for archival purposes
* Saved as images, but with further compression
* Textual and non-textual parts are stored and compressed separately and, when needed, can be combined and displayed together

GRAPHICS AND VIRTUAL REALITY
* 3-dimensional graphics found on the Web
* CGM (Computer Graphics Metafile) standard
  * Metafile = collection of elements
  * The CGM standard specifies which elements are allowed to occur in which positions in a metafile
* VRML (Virtual Reality Modeling Language)
  * File format for describing interactive 3D objects and worlds
  * Universal interchange format for 3D graphics and multimedia
  * Can be used for various applications

MULTIMEDIA DOCUMENTS MARKUP
* HyTime = Hypermedia/Time-based Structuring Language: the standard defined for multimedia document markup
* An SGML architecture that specifies the generic hypermedia structure of documents
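
As an illustration of the text coding points made earlier (7-bit ASCII, 8-bit extensions for accents, 16-bit Unicode, and converting binary data to ASCII text with uuencode), here is a minimal Python sketch. It is not part of the original slides: the function names are hypothetical, Latin-1 stands in for "8-bit ASCII", and UTF-16 stands in for "16-bit Unicode". Both stand-ins are assumptions made for the example.

```python
import binascii


def fits_in_7_bits(text):
    """True if every character has a code below 128 (plain 7-bit ASCII)."""
    return all(ord(ch) < 128 for ch in text)


def encoded_size_bits(text, encoding):
    """Number of bits needed to store `text` in the given encoding."""
    return len(text.encode(encoding)) * 8


def uu_round_trip(data):
    """uuencode up to 45 bytes into one printable ASCII line, then decode it.

    Returns (encoded_line, decoded_bytes).
    """
    line = binascii.b2a_uu(data)
    return line, binascii.a2b_uu(line)


if __name__ == "__main__":
    # Plain English text fits in 7-bit ASCII; accented text does not.
    print(fits_in_7_bits("text"), fits_in_7_bits("café"))

    # Assumed stand-ins: Latin-1 for the 8-bit scheme, UTF-16 for 16 bits.
    print(encoded_size_bits("café", "latin-1"))
    print(encoded_size_bits("文字", "utf-16-be"))

    # Arbitrary binary bytes become an ASCII-only line and decode back.
    line, decoded = uu_round_trip(b"\x00\xff\x10binary")
    print(line, decoded)
```

Each Latin-1 character occupies one 8-bit byte, while each of the two CJK characters occupies one 16-bit unit in UTF-16. The uuencoded line contains only ASCII bytes, which is what lets binary files travel through text-only channels such as early e-mail.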