COP 4814
Florida International University
Kip Irvine
Character Encoding
Updated: 8/29/2013
Contents
• ASCII
• Unicode
• UTF-8
• XML Entity References
ASCII
• American Standard Code for Information Interchange
    • Developed around 1960, based on telegraphic codes (7 bits)
    • Too limited to display international characters: the range
      is only 0-127, biased toward English
    • Still used by older programming languages (C, C++, COBOL)
• Hexadecimal ranges:
    • 0x00 - 0x1F are the 32 control characters, not visible
    • 0x30 - 0x39 are digits, 0x41 - 0x5A and 0x61 - 0x7A are
      letters, the rest are punctuation and symbols
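The hexadecimal ranges on this slide can be checked with a short Python sketch. The function name `classify_ascii` and the category labels are illustrative choices, not part of the course material; the sketch also treats 0x7F (delete) as a control character, which the slide does not mention.

```python
def classify_ascii(code):
    """Return a rough category for a 7-bit ASCII code (0-127)."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    if code <= 0x1F or code == 0x7F:
        return "control"
    if 0x30 <= code <= 0x39:
        return "digit"
    if 0x41 <= code <= 0x5A or 0x61 <= code <= 0x7A:
        return "letter"
    # Everything else, including the space (0x20)
    return "punctuation or symbol"


for ch in ("A", "z", "7", "~", "\t"):
    print(f"0x{ord(ch):02X} {classify_ascii(ord(ch))}")
```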
ASCII Control Characters
• This is a subset:
    • 0x0D - carriage return
    • 0x0A - line feed
    • 0x08 - backspace
    • 0x09 - horizontal tab
    • 0x0C - form feed (new page)
    • 0x1A - ^Z, or end of file
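As a small illustration, the following Python sketch locates the control characters listed above inside a string. The table `CONTROL_NAMES` and the helper `find_controls` are hypothetical names introduced only for this example.

```python
# Subset of ASCII control characters from the slide
CONTROL_NAMES = {
    0x08: "backspace",
    0x09: "horizontal tab",
    0x0A: "line feed",
    0x0C: "form feed",
    0x0D: "carriage return",
    0x1A: "^Z (end of file)",
}


def find_controls(text):
    """List (index, hex code, name) for each known control character."""
    return [
        (i, f"0x{ord(ch):02X}", CONTROL_NAMES[ord(ch)])
        for i, ch in enumerate(text)
        if ord(ch) in CONTROL_NAMES
    ]


# A line ending in carriage return + line feed
for entry in find_controls("name\tvalue\r\n"):
    print(entry)
```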
Unicode
• Computing industry standard for the consistent
  encoding, representation, and handling of text
  expressed in most of the world's writing systems.
• Includes more than 109,000 characters covering 93
  scripts (as of Unicode 6.0), plus many symbols and
  other glyphs.
• Designed to accommodate all ancient, modern, and
  future character sets.
• Implemented by different character encodings:
    • UTF-8 (1 to 4 bytes per character)
    • UTF-16 (2 or 4 bytes per character)
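The per-character sizes of the two encodings can be observed directly in Python. This is a minimal sketch; `encoded_sizes` is a name made up for the example, and the "utf-16-le" codec is used so that no byte order mark is counted.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Characters outside the first 65,536 code points, such as U+1F600, are the ones that need 4 bytes in UTF-16.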
Unicode Code Points
• Unicode uses a code point (numeric value) for each
  character
    • does not directly provide a glyph (visual rendering);
      instead, that task is handled by software such as
      a Web browser or text editor.
• Range: 0 to 0x10FFFF (over 1 million)
    • notated as U+h, where h is a hexadecimal number
    • for values in the Basic Multilingual Plane, 4 hex digits
      are used. For example, "X" is U+0058
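The U+h notation is easy to reproduce in Python, where `ord` returns a character's code point. The helper name `code_point_label` is an assumption for this sketch.

```python
def code_point_label(ch):
    """Format a character's code point in U+h notation (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"


print(code_point_label("X"))           # a character in the Basic Multilingual Plane
print(code_point_label("\U0001F600"))  # a character beyond it needs 5 hex digits
```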
Unicode Transformation Format (UTF)
• Maps each code point to a sequence of one or more
  fixed-size values (called code values, or code units):
  8 bits each in UTF-8, 16 bits each in UTF-16.
• UTF-16 is used by Microsoft Windows
• UTF-8 is used in Web pages and Email
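To make the idea of code values concrete, the sketch below splits one character into its 8-bit values under UTF-8 and its 16-bit values under UTF-16. The function `code_units` is hypothetical and exists only for illustration.

```python
def code_units(ch, encoding):
    """Return the list of code values for one character in UTF-8 or UTF-16."""
    if encoding == "utf-8":
        return list(ch.encode("utf-8"))
    if encoding == "utf-16":
        raw = ch.encode("utf-16-be")  # big-endian, no byte order mark
        return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    raise ValueError("expected 'utf-8' or 'utf-16'")


for ch in ("\u20ac", "\U0001F600"):
    print([hex(u) for u in code_units(ch, "utf-8")],
          [hex(u) for u in code_units(ch, "utf-16")])
```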
UTF-8 Encoding
• UCS Transformation Format: variable-width
  encoding that can represent every character in the
  Unicode character set.
    • first presented in 1993, became the dominant
      encoding for Web pages and email
• Uses 1 to 4 octets per character (octets are 8-bit bytes)
    • more commonly used characters have small code point
      values; the ASCII characters require only one byte
    • every ASCII byte value is automatically valid in UTF-8.
    • range: U+0000 to U+10FFFF
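The variable-width scheme can be sketched by hand. The function below, `utf8_encode` (a name chosen for this example), follows the standard UTF-8 bit layout: a lead byte whose high bits indicate the sequence length, followed by continuation bytes of the form 10xxxxxx. It is a teaching sketch; in practice Python's built-in `str.encode("utf-8")` does this work.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:        # 1 byte:  0xxxxxxx (same as ASCII)
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x58, 0xA5, 0x20AC, 0x1F600):
    print(f"U+{cp:04X} ->", utf8_encode(cp).hex(" "))
```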
Byte Order Mark
• Many Windows programs (including Notepad) add a
  Unicode byte order mark (BOM) encoded sequence to the
  beginning of UTF-8 files.
• BOM values are:
    • 0xEF, 0xBB, 0xBF
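A program that reads such files may want to detect and skip the mark. The sketch below uses the constant `codecs.BOM_UTF8` from Python's standard library; the helper name `strip_utf8_bom` is an assumption for this example.

```python
import codecs


def strip_utf8_bom(data):
    """Remove a leading UTF-8 byte order mark from raw bytes, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


raw = b"\xEF\xBB\xBFhello"          # bytes as Notepad might save them
print(strip_utf8_bom(raw))          # raw bytes with the mark removed
print(raw.decode("utf-8-sig"))      # the "utf-8-sig" codec skips the mark itself
```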
XML Entity References
The XML standard includes five predefined entities:
special characters that can be represented by a
special name.

Name      Character
&amp;     &
&lt;      <
&gt;      >
&apos;    '
&quot;    "
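One way to apply these entities in code is with `escape` and `unescape` from Python's `xml.sax.saxutils` module. By default those functions handle only `&`, `<`, and `>`, so this sketch passes extra mappings for the two quote characters; the wrapper names `xml_escape` and `xml_unescape` are made up for the example.

```python
from xml.sax.saxutils import escape, unescape


def xml_escape(text):
    """Replace the five special characters with their XML entity references."""
    return escape(text, {"'": "&apos;", '"': "&quot;"})


def xml_unescape(text):
    """Reverse xml_escape, turning entity references back into characters."""
    return unescape(text, {"&apos;": "'", "&quot;": '"'})


original = "<a href=\"x\">Tom & Jerry's</a>"
escaped = xml_escape(original)
print(escaped)
print(xml_unescape(escaped) == original)
```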
Sample HTML Character Entities
252 named entities!
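A few of the named entities can be inspected from Python, whose standard `html.entities.name2codepoint` dictionary maps HTML 4 entity names to code points. The sample names chosen here and the helper `entity_table` are only for illustration.

```python
from html.entities import name2codepoint


def entity_table(names):
    """Return (entity reference, code point label, character) for each name."""
    return [
        (f"&{name};", f"U+{name2codepoint[name]:04X}", chr(name2codepoint[name]))
        for name in names
    ]


for row in entity_table(["amp", "copy", "yen", "euro"]):
    print(row)
```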
Examples of Encoding Choices in Internet Explorer
After selecting "Chinese Simplified", this is how IE
displays the table from the previous slide:
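The same effect can be reproduced without a browser: bytes written under one encoding are misread when decoded under another. The sketch below assumes the page was saved as windows-1252 and then decoded as GB2312 (Python's codec name for a Simplified Chinese encoding); the sample text and the helper `decode_as` are hypothetical.

```python
def decode_as(raw, encoding):
    """Decode bytes with the given encoding, marking undecodable bytes."""
    return raw.decode(encoding, errors="replace")


original = "caf\u00e9 \u00a9"              # contains two non-ASCII characters
raw = original.encode("windows-1252")      # how the page was actually saved

print(decode_as(raw, "windows-1252"))      # correct choice: text is intact
print(decode_as(raw, "gb2312"))            # wrong choice: non-ASCII bytes are misread
```

Only the bytes above 0x7F are affected, which is why the ASCII portion of a page usually stays readable under the wrong encoding.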
Encoding Characters in HTML
Unicode code points can be represented in either of
the following ways on a Web page:
&#dddd;
&#xHHHH;
(where d is a decimal digit, and H is a hexadecimal digit)
Example:
&#x00A5; or &#165; (the yen sign, ¥)
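Both numeric forms can be generated and decoded in Python. This sketch uses `html.unescape` from the standard library for decoding; the helper `numeric_references` is a name invented for the example.

```python
import html


def numeric_references(ch):
    """Return the decimal and hexadecimal character references for ch."""
    code_point = ord(ch)
    return f"&#{code_point};", f"&#x{code_point:04X};"


decimal_ref, hex_ref = numeric_references("\u00a5")
print(decimal_ref, hex_ref)
# Both forms decode back to the same character
print(html.unescape(decimal_ref) == html.unescape(hex_ref) == "\u00a5")
```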
Summary
• Computer software has evolved since 1970
• Original emphasis was on reducing storage and
memory requirements, resulting in simplistic
character sets
• International use of computers forced the use of
expanded character sets
• Use of different font collections complicates the use
of encoding