Unicode Support in ICU for Java
Doug Felt
Globalization Center of Competency, San Jose, CA
Overview
• What is ICU4J?
• ICU and the JDK, a brief history
• Benefits and tradeoffs of ICU4J
• Features of ICU4J
• Performance of ICU4J
• Using ICU4J
• Conclusion and References
What is ICU4J?
• Internationalization Library
– Sister project of ICU (C/C++)
– Open-source, non-viral license
– Sponsored by IBM
• Unicode Standard compliant, up-to-date
• 100% Pure Java
• Enhances and extends JDK functionality
• Over five years of continuous development
ICU and Java, a History
• Started with Java 1.1 internationalization
– Much code contributed by IBM/Taligent
– IBM provided support, bug fixes, enhancements
• Became open-source project in 2000
– ICU4C code started with port from Java
• Continued contributions to Java since then
– TextLayout, OpenType layout, Normalization
Collaboration with Java Teams
• We continue to work with the Java internationalization and Graphics2D teams
• We participate in Java expert groups (e.g., JSR 204, Supplementary Support)
• Differences
– perspectives (conformance, features versus size)
– processes (open source versus corporate/JSR)
– timetable (twice a year versus every two years)
Benefits
• Fully implements current standards
– Unicode collation, normalization, break iteration
– Updated more frequently than Java
• Full CLDR data
• Improved performance
• Open source, open license, customizable
• Compatible with ICU C/C++ libraries and data
• Runs on JDK 1.4
– Get supplementary character support without moving to 1.5
Tradeoffs
• Not built-in, unlike Java i18n support
• Some API differences
– But generally a superset of the Java API
– Some differences unavoidable due to class restrictions
– Rule syntax differs to varying degrees
• Data differences
– ICU4J uses its own CLDR data, not the JVM’s data
• Size
– Can trim ICU4J, but it will always be larger than 0K
Features of ICU4J
• Collation
• Normalization
• Break Iteration
• UnicodeSet and Transforms
• Character Properties
• Locale data
• Other
– Calendars, Formatters, IDNA, StringPrep, IMEs
Collation
• Full UCA (Unicode Collation Algorithm)
– Java does not implement UCA collation
• Locale data
– Over 60 tailorings for locale-specific collation
– Variants: Pinyin, stroke, traditional, etc.
• Performance
– sorting: 2 to 20 times faster than Java
– sort key generation: 1.5 to 4 times faster than Java
– sort key length: 2/3 to 1/4 the length of Java sort keys
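The difference between a plain binary sort, a collated sort, and a sort over precomputed sort keys can be sketched as follows. So that it runs without the ICU4J jar, this sketch uses the JDK's java.text.Collator as a stand-in; the assumption is that the parallel class com.ibm.icu.text.Collator can be swapped in by changing the import. The class and method names (CollationSketch, sortWithCollator, sortWithKeys) are made up for illustration.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Illustrative sketch only; names are hypothetical.
// Uses the JDK collator as a stand-in for the ICU4J parallel class.
class CollationSketch {

    // Sort by comparing the strings directly with the collator.
    static String[] sortWithCollator(String[] input, Locale locale) {
        Collator col = Collator.getInstance(locale);
        String[] copy = input.clone();
        Arrays.sort(copy, col); // Collator is a Comparator
        return copy;
    }

    // Sort by computing one sort key per string, then comparing keys.
    // This tends to pay off when the same strings are compared many times.
    static String[] sortWithKeys(String[] input, Locale locale) {
        Collator col = Collator.getInstance(locale);
        CollationKey[] keys = new CollationKey[input.length];
        for (int i = 0; i < input.length; i++) {
            keys[i] = col.getCollationKey(input[i]);
        }
        Arrays.sort(keys); // CollationKey is Comparable
        String[] out = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            out[i] = keys[i].getSourceString();
        }
        return out;
    }

    public static void main(String[] args) {
        String[] words = {"cherry", "Banana", "apple"};
        String[] binary = words.clone();
        Arrays.sort(binary); // code unit order puts "Banana" first
        System.out.println(Arrays.toString(binary));
        System.out.println(Arrays.toString(sortWithCollator(words, Locale.ENGLISH)));
        System.out.println(Arrays.toString(sortWithKeys(words, Locale.ENGLISH)));
    }
}
```

Both collated variants should agree with each other; which one is faster depends on how often each string takes part in a comparison.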
Normalization
• Java does not provide public normalization APIs
– Java uses ICU’s implementation internally
– Useful for searching, string equivalence, simplifying processing of text
• Full implementation of Unicode standard
– NFC, NFD, NFKC, NFKD
– Also provides FCD ‘quick check’ for optimization
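Why normalization matters for string equivalence can be shown with a short sketch. It uses java.text.Normalizer, which was added to the JDK in Java 6, later than the JDK versions this talk discusses, purely so that the example runs without the ICU4J jar; ICU4J ships its own normalizer in com.ibm.icu.text. The class and method names here are hypothetical.

```java
import java.text.Normalizer;

// Illustrative sketch only; names are hypothetical.
class NormalizationSketch {

    // Two strings are treated as canonically equivalent
    // when their NFD forms are identical.
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // e with acute as one code point
        String decomposed = "e\u0301"; // e followed by combining acute accent

        // A plain comparison sees two different strings.
        System.out.println(composed.equals(decomposed));
        // After normalization they compare as equal.
        System.out.println(canonicallyEquivalent(composed, decomposed));
        // NFKC additionally folds compatibility characters,
        // such as the "fi" ligature U+FB01.
        System.out.println(Normalizer.normalize("\uFB01", Normalizer.Form.NFKC));
    }
}
```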
Break Iteration
• Fully conforms to Unicode specifications
– supplementary characters, Hangul
• Tags
– e.g., “what kind of word was this”
• Title case iteration
• Rule-based, dictionary-based for Thai
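A minimal sketch of word iteration, assuming the JDK's java.text.BreakIterator as a stand-in for the ICU4J parallel class so that it runs without the ICU4J jar. The letter-or-digit check is a JDK-only substitute for the kind of question that tags are meant to answer ("what kind of word was this"); the class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; names are hypothetical.
class WordBreakSketch {

    // Returns the segments between word boundaries that start with a
    // letter or digit, skipping whitespace and punctuation segments.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH));
    }
}
```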
Unicode Set and Transforms
• UnicodeSet
– collections of characters based on properties
– logical set operations, flexible
– "[[:mark:]&[\u0600-\u067f]]"
• Transliterator
– general transformations, with chaining and editing
– converts between scripts, e.g. Greek/Latin, Devanagari/Gujarati
– rule-based, rules for common conversions supplied
• UScriptRun
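The set pattern on this slide reads as "marks, intersected with the range U+0600 to U+067F". UnicodeSet itself is an ICU4J class (com.ibm.icu.text), so the sketch below does not use it; as a loose JDK-only analogy it expresses the same intersection with a java.util.regex character class, where the syntax differs (\p{M} and && rather than [:mark:] and &). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative analogy only; this is java.util.regex, not UnicodeSet.
class SetIntersectionSketch {

    // General category M (marks) intersected with U+0600..U+067F.
    static final Pattern ARABIC_MARKS =
            Pattern.compile("[\\p{M}&&[\\u0600-\\u067F]]");

    // True if the single code point is a member of the intersection.
    static boolean inSet(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return ARABIC_MARKS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(inSet(0x064B)); // Arabic fathatan: a mark, in range
        System.out.println(inSet(0x0627)); // Arabic alef: in range, not a mark
        System.out.println(inSet(0x0301)); // combining acute: a mark, out of range
    }
}
```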
Character Properties
• All Unicode character properties
– Over 80 properties; Java provides access to about 10
• All defined code points
• Current with latest Unicode release
– ICU4J 3.0 uses Unicode 4.0.1 data
• Fast access to character data
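Property lookup by code point can be sketched with the handful of properties the JDK exposes on java.lang.Character (the int-based overloads arrived in Java 5). The assumption is that the ICU4J counterpart, com.ibm.icu.lang.UCharacter, is used the same way for these basic calls. The class and method names are hypothetical.

```java
// Illustrative sketch only; names are hypothetical.
// Uses java.lang.Character as a stand-in for the ICU4J class.
class PropertySketch {

    // Counts the letters in a string, walking it by code point so that
    // supplementary characters are counted once rather than twice.
    static int countLetters(String text) {
        int count = 0;
        for (int pos = 0; pos < text.length(); ) {
            int cp = text.codePointAt(pos);
            if (Character.isLetter(cp)) {
                count++;
            }
            pos += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // U+1D504 is a supplementary letter, stored as two chars.
        System.out.println(countLetters("a\uD835\uDD04 1"));
        System.out.println(Character.getType('A') == Character.UPPERCASE_LETTER);
        System.out.println(Integer.toHexString(Character.toLowerCase(0x0391)));
        System.out.println(Character.getDirectionality(0x05D0)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT);
    }
}
```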
Locale Data
• Standard data, included with ICU4J
– CLDR (Common Locale Data Repository)
– Ensures same data is available everywhere
– Can share resource data with ICU4C applications
• More locales, more kinds of data
– ~230 locales, compared to ~130 for Java
– Can modularize to include only the data you need
• RFC3066bis support (language_script_region)
– e.g., zh_Hans, zh_Hant
– keywords (orthogonal variants)
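The language_script_region structure and keywords can be illustrated with java.util.Locale, which has accepted script subtags and Unicode extension keywords since Java 7. This is a JDK-only stand-in: it uses hyphenated BCP 47 tags such as "es-ES-u-co-trad", whereas the ICU4J identifiers in this deck use the form "es_ES@collation=traditional". The class and method names are hypothetical.

```java
import java.util.Locale;

// Illustrative sketch only; names are hypothetical.
class LocaleTagSketch {

    // Script subtag of a language tag, or "" if none is present.
    static String script(String languageTag) {
        return Locale.forLanguageTag(languageTag).getScript();
    }

    // Value of the collation keyword ("co"), or null if none is present.
    static String collationKeyword(String languageTag) {
        return Locale.forLanguageTag(languageTag).getUnicodeLocaleType("co");
    }

    public static void main(String[] args) {
        System.out.println(script("zh-Hans"));
        System.out.println(script("zh-Hant-TW"));
        System.out.println(collationKeyword("es-ES-u-co-trad"));
    }
}
```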
Performance of ICU4J
• Instantiation times are comparable
– Common instantiate and reuse model
– ICU4J and Java both use caches to limit impact
• Collation performance faster
– faster sorting, smaller sort keys
• Performance is difficult to measure
– JVM makes a difference
– ICU4J performs well in spot tests
– Use a scenario that matters to you to test
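A spot test of the kind suggested above might look like the following rough sketch. It times a JDK property call only; timing the ICU4J equivalent would mean passing a different operation. It is not the harness behind the numbers in this talk, the names are hypothetical, and results from a loop like this vary with the JVM and JIT warm-up, so a dedicated benchmark harness is more reliable for serious measurement.

```java
import java.util.function.IntUnaryOperator;

// Rough spot-test sketch only; names are hypothetical.
class SpotTimingSketch {

    // Returns the average nanoseconds per call of op over BMP code points.
    static double nanosPerOp(IntUnaryOperator op, int iterations) {
        int sink = 0;
        // Warm-up pass so the timed pass is more likely to run compiled code.
        for (int i = 0; i < iterations; i++) {
            sink += op.applyAsInt(i & 0xFFFF);
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += op.applyAsInt(i & 0xFFFF);
        }
        long elapsed = System.nanoTime() - start;
        // Use the accumulated value so the loops cannot be discarded.
        if (sink == Integer.MIN_VALUE) {
            System.out.println(sink);
        }
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        double ns = nanosPerOp(cp -> Character.getType(cp), 1_000_000);
        System.out.println("getType: " + ns + " ns/op");
    }
}
```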
Property Data Timings
JVM           ICU4J       Java        (J-I)/I
Sun 1.4.1     89 ns/op    101 ns/op   13%
Sun 1.5.0b2   117 ns/op   102 ns/op   -13%
IBM 1.4.1     50 ns/op    66 ns/op    32%
Test machine: 1.13 GHz PIII, Win2K
Nanoseconds/operation for character property access (getType, toLowerCase, getDirectionality) on three JVMs.
Sizes of ICU4J
• Full jar file: 2,700K
• Modular builds for common subsets
– normalizer: 420K
– collator: 1,400K
– calendar: 1,300K
– break iterator: 1,300K
– basic properties: 500K
– full properties: 1,200K
– formatting: 2,200K
– transforms: 1,500K
Using ICU4J
• Jar file, just add to class path
– Or roll into your distribution, it’s Open Source!
– Modular builds help you to trim ICU4J’s code
– Data can be trimmed to further reduce size
• Parallel APIs
– APIs on parallel classes are generally a superset
– Change import (one line change) or change class name
– Some differences unavoidable (our supplementary support for Java 1.4 can’t add API to String)
Code Examples (1)
import com.ibm.icu.text.BreakIterator;

// Visit every word boundary in text
BreakIterator b = BreakIterator.getWordInstance();
b.setText(text);
for (int pos = b.first();
     pos != BreakIterator.DONE;
     pos = b.next()) {
    doSomething(pos);
}
Code Examples (2)
import com.ibm.icu.lang.UCharacter;

// Walk text by code point; a code point of type SURROGATE
// is a surrogate code unit without a partner
int cp, pos = 0;
while (pos < text.length()) {
    cp = UCharacter.codePointAt(text, pos);
    if (UCharacter.getType(cp) == UCharacter.SURROGATE) return true;
    pos += UCharacter.charCount(cp);
}
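A self-contained variant of this loop, read here as a check for unpaired surrogates, is sketched next. It swaps in the JDK's java.lang.Character (Java 5 or later) for UCharacter so that it runs without the ICU4J jar; the enclosing class and method names are hypothetical, and the false result at the end of the loop is an assumption about the code the slide leaves out.

```java
// Illustrative sketch only; names are hypothetical.
// Uses java.lang.Character in place of the ICU4J UCharacter class.
class SurrogateCheckSketch {

    // Returns true if text contains a surrogate code unit
    // that is not part of a valid surrogate pair.
    static boolean hasUnpairedSurrogate(String text) {
        int pos = 0;
        while (pos < text.length()) {
            // For a valid pair this returns the supplementary code point;
            // for a lone surrogate it returns the surrogate itself.
            int cp = text.codePointAt(pos);
            if (Character.getType(cp) == Character.SURROGATE) {
                return true;
            }
            pos += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasUnpairedSurrogate("a\uD835\uDD04b")); // valid pair
        System.out.println(hasUnpairedSurrogate("a\uD800b"));       // lone surrogate
    }
}
```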
Code Examples (3)
import com.ibm.icu.util.ULocale;
import com.ibm.icu.text.Collator;
import java.util.Arrays;

// Sort with the traditional Spanish collation tailoring
ULocale ulocale = new ULocale("es_ES@collation=traditional");
Collator col = Collator.getInstance(ulocale);
String[] list = ...
Arrays.sort(list, col);
Conclusion
• ICU4J is not for you if
– you have tight size constraints
– you require the Java runtime behavior
• ICU4J is for you if
– you need full compliance with current standards
– you need current or additional locale and property data
– you need customizability
– you need features missing from Java (normalization)
– you need additional performance
References
• ICU4J
– http://oss.software.ibm.com/icu4j/
• Java
– http://java.sun.com/
– http://www.ibm.com/java/
• Unicode, CLDR
– http://www.unicode.org/
– http://www.unicode.org/cldr/