COP 4814
Florida International University
Kip Irvine

Character Encoding
Updated: 8/29/2013

Irvine COP 4814

Contents
• ASCII
• Unicode
• UTF-8
• XML Entity References

ASCII
• American Standard Code for Information Interchange
  • Developed around 1960, based on telegraphic codes (7 bits)
  • Too limited to display international characters: range is only 0-127, biased toward English
  • Still used by older programming languages (C, C++, COBOL)
• Hexadecimal ranges:
  • 0x00 – 0x1F are the 32 control characters, not visible
  • 0x30 – 0x39 are digits, 0x41 – 0x5A are uppercase letters, 0x61 – 0x7A are lowercase letters; the rest are punctuation and symbols

ASCII Control Characters
• This is a subset:
  • 0x0D – carriage return
  • 0x0A – line feed
  • 0x08 – backspace
  • 0x09 – horizontal tab
  • 0x0C – form feed (new page)
  • 0x1A – ^Z, or end of file

Unicode
• Computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.
• Includes more than 109,000 characters covering 93 scripts, plus many symbols and other glyphs.
• Designed to accommodate all ancient, modern, and future character sets.
• Implemented by different character encodings:
  • UTF-8 (1 to 4 bytes per character)
  • UTF-16 (2 or 4 bytes per character)

Unicode Code Points
• Unicode uses a code point (numeric value) for each character
• A code point does not directly provide a glyph (visual rendering); that task is handled by software such as a Web browser or text editor.
• Range: 0 to 0x10FFFF (over 1 million)
  • Notated as U+h, where h is a hexadecimal number
  • For values in the Basic Multilingual Plane, 4 hex digits are used. For example, "X" is U+0058

Unicode Transformation Format (UTF)
• Maps a range of code points to sequences of values in some fixed-size range (called code values).
• UTF-16 is used by Microsoft Windows
• UTF-8 is used in Web pages and Email

UTF-8 Encoding
• UCS Transformation Format, 8-bit: a variable-width encoding that can represent every character in the Unicode character set.
• First presented in 1993; became the dominant encoding for Web pages and Email
• Uses 1 to 4 octets (8-bit bytes) per character
  • More commonly used characters have smaller code points and need fewer bytes; ASCII characters need only one
  • Every ASCII byte value is automatically valid in UTF-8
  • Range: U+0000 to U+10FFFF

Byte Order Mark
• Many Windows programs (including Notepad) add an encoded Unicode byte order mark (BOM) to the beginning of UTF-8 files.
• The BOM bytes are:
  • 0xEF, 0xBB, 0xBF

XML Entity References
The XML standard predefines five entities, which let special characters be written by name:

Name      Character
&amp;     &
&lt;      <
&gt;      >
&apos;    '
&quot;    "

Sample HTML Character Entities
• 252 named entities!

Examples of Encoding Choices in Internet Explorer
How IE displays the table from the previous slide after selecting "Chinese Simplified" (screenshot not reproduced).

Encoding Characters in HTML
Unicode code points can be represented in either of the following ways on a Web page:
  &#dddd; or &#xHHHH;
(where d is a decimal digit, and H is a hexadecimal digit)
Example: &#165; or &#xA5; (both display ¥)

Summary
• Computer software has evolved since 1970
• Original emphasis was on reducing storage and memory requirements, resulting in simplistic character sets
• International use of computers forced the use of expanded character sets
• Use of different font collections complicates the use of encoding
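
The byte counts, BOM bytes, and numeric character references described in these slides can be checked with a short Python sketch. It is only an illustration, not part of the original course material: the helper names (utf8_bytes, utf16_length, numeric_refs, strip_utf8_bom) are made up for this example, and it relies on Python's built-in str.encode and the standard codecs module.

```python
import codecs


def utf8_bytes(ch):
    """Return the UTF-8 encoding of one character as a list of hex strings."""
    return [f"0x{b:02X}" for b in ch.encode("utf-8")]


def utf16_length(ch):
    """Number of bytes UTF-16 needs for one character (no BOM added)."""
    return len(ch.encode("utf-16-le"))


def numeric_refs(ch):
    """Return the decimal and hexadecimal HTML references for a character."""
    cp = ord(ch)  # the Unicode code point
    return f"&#{cp};", f"&#x{cp:X};"


def strip_utf8_bom(data):
    """Drop a leading UTF-8 byte order mark (0xEF 0xBB 0xBF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    # Sample characters (chosen for illustration): "X", yen sign,
    # euro sign, and an emoji outside the Basic Multilingual Plane.
    for ch in ["X", "\u00A5", "\u20AC", "\U0001F600"]:
        dec_ref, hex_ref = numeric_refs(ch)
        print(f"U+{ord(ch):04X}", utf8_bytes(ch),
              "UTF-16 bytes:", utf16_length(ch), dec_ref, hex_ref)
```

Running it shows that "X" takes one byte in UTF-8, the yen sign two, the euro sign three, and a character outside the Basic Multilingual Plane four, while UTF-16 uses two bytes for the first three and four for the last.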