Level: Introductory
Mark Davis, IBM
Helena Shih, IBM
01 Oct 1998
The designers of Java made the important design decision that all text would be stored in Unicode. This solves the problem inherent in most other text-handling schemes, of always having to juggle multiple, limited character encodings. It puts all languages on an equal footing, and makes the whole process of designing for worldwide products far easier.
But proper support of international text requires far more than just storing characters in Unicode. IBM's wholly owned subsidiary, Taligent, had a great deal of previous experience in
Unicode software internationalization. In 1996, Sun contracted with Taligent to design and
develop classes for the proper handling of multilingual text for JDK 1.1. Our goals were
to provide an architecture that supplied the required functionality, was fully object
oriented, could easily be extended to add additional features or to support additional
countries, and would scale well across both small and large projects written in Java.
There are some aspects of the architecture that we frequently get questions or complaints
about, so we'll explain why we made some of the decisions we did.
In JDK 1.1 we focused on the "server level" support: that is, on the
mid-level internationalization services. Most of the low level text services were already
in JDK 1.0 and only needed some enhancement. The high level international services (for
input and output) utilized the host platform services in JDK 1.1. In JDK 1.2 the
high-level international services have been greatly improved and no longer depend on the
host platform services.
In this paper, we will preview some of the most important new features under
development in the mid-level internationalization services. Many of these features are
coming in JDK 1.2. IBM is making others available through different channels, including
classes available on the IBM alphaWorks Web site at http://www.alphaWorks.ibm.com/.
Date and time support
The Calendar class contains an API that allows you to interpret a
Date according to a local calendar system, including non-Gregorian ones. It also contains routines to support GUI requirements, such as rolling, adding, and subtracting dates and times. The TimeZone class enables conversion between universal time (UTC) and local time. It also contains rules for determining daylight savings time according to local conventions.
Why is January zero?
We probably get more complaints about this than any other issue. Here's what happened.
The JDK 1.0 Date API and implementation were very specific to the Gregorian
calendar, and were not terribly Y2K friendly. Although the Gregorian calendar is used in most of the world, many countries use other calendar systems. For instance, businesses in Europe often use a calendar that measures by day, week, and year, rather than day, month, and year. There are also a large number of traditional calendars in widespread use in the Middle East and Asia.
To deal with these problems, we split out the date computations into
another class, Calendar, and retained Date purely as a storage class. The zero-based month numbers in Date were a vestige of
old C-style programming; originally month names were stored in a zero-based array, and
the months were numbered accordingly for convenience. JavaSoft felt that consistency with the old Date APIs was important, so we needed to keep this convention in Calendar. So calling calendar.set(1998, 3, 15)
gives you April 15th, not March 15th.
Time zone display names
Every abstract class in the internationalization frameworks except for TimeZone
has a getDisplayName() function. This means there's no easy way to get the displayable name for a time zone in JDK 1.1. This has led to some confusion, since some people thought that TimeZone.getID()
returned a displayable name, when it actually
returns an internal programmatic ID, one that should not be displayed to end users.
Moreover, the internal IDs themselves were too short and confusing: "AST" could
stand for either "Atlantic Standard Time" or "Alaska Standard Time".
To remedy this, the new method getDisplayName() has been added
to TimeZone in JDK 1.2, and longer, more descriptive internal IDs are available.
// In JDK 1.1
TimeZone zone = TimeZone.getTimeZone("EST");
// "zzzz" requests the full zone name; a single "z" yields the abbreviation
SimpleDateFormat sdf = new SimpleDateFormat("zzzz", Locale.ENGLISH);
sdf.getCalendar().setTimeZone(zone);
String name = sdf.format(new Date());
// name is "Eastern Standard Time"
|
// In JDK 1.2
TimeZone zone = TimeZone.getTimeZone("America/New_York");
String name = zone.getDisplayName(Locale.ENGLISH);
// name is "Eastern Standard Time"
|
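As a self-contained sketch of the JDK 1.2 approach, the following wraps the getDisplayName(boolean, int, Locale) overload to fetch the long standard and daylight names. The class name ZoneNames and the helper longName are illustrative assumptions, not part of the JDK.

```java
import java.util.Locale;
import java.util.TimeZone;

class ZoneNames {
    /** Returns the long display name of a zone, in standard or daylight form. */
    static String longName(String zoneId, boolean daylight, Locale locale) {
        TimeZone zone = TimeZone.getTimeZone(zoneId);
        return zone.getDisplayName(daylight, TimeZone.LONG, locale);
    }

    public static void main(String[] args) {
        // Names suitable for showing to an end user
        System.out.println(longName("America/New_York", false, Locale.US));
        System.out.println(longName("America/New_York", true, Locale.US));
        // The programmatic ID, which should not be displayed
        System.out.println(TimeZone.getTimeZone("America/New_York").getID());
    }
}
```

The ID stays a stable programmatic key, while the display name varies with the locale passed in.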
Better Y2K support
Should "01/01/00" be year 2000 or year 1900? JDK 1.1 used the 80-20 rule.
This amounts to adding 1900 to the two-digit year, and if the result was more than 80
years in the past, adding another 100 (for the Gregorian calendar). The JDK 1.2 method SimpleDateFormat.set2DigitYearStart() provides more specific
control. This method sets the exact start of a 100-year range in which 2-digit years are
interpreted.
// In JDK 1.1: There is no way to specify when the 2 digit year starts.
|
// In JDK 1.2:
GregorianCalendar cal = new GregorianCalendar(1952, Calendar.SEPTEMBER, 13);
SimpleDateFormat fmt = new SimpleDateFormat("M-d-yy");
fmt.set2DigitYearStart(cal.getTime()); // window runs 9-13-1952 through 9-12-2052
fmt.parse("9-12-52"); // returns 9-12-2052
fmt.parse("9-14-52"); // returns 9-14-1952
|
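The 80-20 rule described above can be sketched as plain arithmetic. This is a simplification that works at whole-year granularity and takes the current year as a parameter; the class TwoDigitYear and the method expand are hypothetical names used only for illustration, and the actual JDK code may compare full dates rather than years.

```java
class TwoDigitYear {
    /**
     * Expands a two-digit year as the rule is worded in the text:
     * add 1900, and if that lands more than 80 years before
     * currentYear, add another 100.
     */
    static int expand(int twoDigitYear, int currentYear) {
        int year = 1900 + twoDigitYear;
        if (currentYear - year > 80) {
            year += 100;
        }
        return year;
    }

    public static void main(String[] args) {
        System.out.println(expand(0, 1998));  // 1900 is 98 years back, so 2000
        System.out.println(expand(52, 1998)); // 1952 is 46 years back, so 1952
        System.out.println(expand(17, 1998)); // 1917 is 81 years back, so 2017
    }
}
```

Because the window slides with the current year, the same input string can parse differently as time passes, which is the motivation for fixing the start of the range explicitly.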
Improved Daylight Savings switchover
In JDK 1.1, SimpleTimeZone allows the start and end dates for Daylight Savings Time to
be specified in only one way, as the Nth or Nth-from-last occurrence of a given weekday in
a given month, e.g., the last Sunday in October. However, some time zones have more
complicated rules for the switchover dates. For example, in Brazil Eastern Time, DST ends
on the first Sunday on or after February 11th, which cannot be expressed with the JDK 1.1
APIs. This was resolved in JDK 1.2 by adding several new types of DST start and end rules.
The following rule types will handle all known modern and historical time zones and
provide more flexibility for the future:
- A fixed date in a given month, e.g. the 1st of April.
- The first occurrence of a given day of the week on or after a certain date in the month, e.g. the first Sunday on or after February 18th, or equivalently, the first Sunday after the second Thursday.
- The first occurrence of a given day of the week on or before a certain date in the month.
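A minimal sketch of two of these rule types, using the SimpleTimeZone methods that take a fixed date and an "on or after" flag. The end rule is the Brazil example from the text; the zone ID "Example/BrazilEast" and the October 15 start date are assumptions made up for the illustration, as are the class and helper names.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.SimpleTimeZone;

class DstRules {
    /** Builds a UTC-3 zone whose DST ends on the first Sunday on or after Feb 11. */
    static SimpleTimeZone brazilLikeZone() {
        SimpleTimeZone zone = new SimpleTimeZone(-3 * 60 * 60 * 1000, "Example/BrazilEast");
        // Assumed start rule (not from the text): fixed date, October 15, at midnight
        zone.setStartRule(Calendar.OCTOBER, 15, 0);
        // End rule from the text: first Sunday on or after February 11, at midnight
        zone.setEndRule(Calendar.FEBRUARY, 11, Calendar.SUNDAY, 0, true);
        return zone;
    }

    /** Reports whether local noon on the given date falls in daylight time. */
    static boolean inDstAtNoon(SimpleTimeZone zone, int year, int month, int day) {
        GregorianCalendar cal = new GregorianCalendar(zone);
        cal.clear();
        cal.set(year, month, day, 12, 0, 0);
        return zone.inDaylightTime(cal.getTime());
    }

    public static void main(String[] args) {
        SimpleTimeZone zone = brazilLikeZone();
        // In 1999, February 11 is a Thursday, so DST ends on Sunday the 14th
        System.out.println(inDstAtNoon(zone, 1999, Calendar.FEBRUARY, 13));
        System.out.println(inDstAtNoon(zone, 1999, Calendar.FEBRUARY, 14));
    }
}
```

Since the start month falls later in the year than the end month, the daylight period wraps around the new year, as it does for zones in the southern hemisphere.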
Correct rolling
There is no way to implement a date widget with arrow buttons correctly with the
Calendar class in JDK 1.1. Suppose the user has selected the MONTH value for Jan 30, 1997,
and hits the up arrow twice. The first click correctly
sets the control to February 28, 1997, but the second click sets it to March 28, 1997,
instead of March 30, 1997.
The correct implementation is to remember the original date and roll the month field
the proper number of steps from that original date for each click of the arrows.
Unfortunately, in JDK 1.1, you can only roll a field a single unit at a time. JDK 1.2
fixes this problem by adding the ability to roll a field multiple units in a single
operation.
// In JDK 1.1: no workaround
|
// In JDK 1.2:
myCalendar.setTime(aDate);
myCalendar.roll(Calendar.MONTH, numberOfArrowClicks);
|
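The difference between the two approaches can be reproduced with the January 30, 1997 example. The sketch below contrasts rolling from the remembered original date with rolling one step per click; RollDemo and its helper methods are illustrative names only.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

class RollDemo {
    /** Formats a calendar as year-month-day, converting the zero-based month. */
    static String describe(Calendar cal) {
        return cal.get(Calendar.YEAR) + "-" + (cal.get(Calendar.MONTH) + 1)
                + "-" + cal.get(Calendar.DAY_OF_MONTH);
    }

    /** JDK 1.2 style: roll the original date by the total number of clicks. */
    static String rollFromOriginal(int clicks) {
        Calendar cal = new GregorianCalendar(1997, Calendar.JANUARY, 30);
        cal.roll(Calendar.MONTH, clicks);
        return describe(cal);
    }

    /** JDK 1.1 style: roll the current value one unit per click. */
    static String rollOneAtATime(int clicks) {
        Calendar cal = new GregorianCalendar(1997, Calendar.JANUARY, 30);
        for (int i = 0; i < clicks; i++) {
            cal.roll(Calendar.MONTH, true);
        }
        return describe(cal);
    }

    public static void main(String[] args) {
        System.out.println(rollFromOriginal(2)); // 1997-3-30
        System.out.println(rollOneAtATime(2));   // 1997-3-28
    }
}
```

The single-step version loses the original day of the month once February pins it to 28, which is why the widget has to keep the starting date.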
International Calendar classes
Although the Calendar class is architected to allow for multiple calendars, both JDK
1.1 and 1.2 only include support for the Gregorian calendar. However, IBM is previewing a
large set of international calendars on the alphaWorks Web site, currently including
Hebrew, Islamic, Buddhist, and Japanese calendars.
Locales and resources
A locale in the JDK is merely an identifier. This identifier is made up of the ISO
language code and country code, plus optional variants (for information on the ISO codes,
see http://www.unicode.org/unicode/onlinedat/online.html).
Since Locale is just a lightweight identifier, there is no need for validity checking when
you construct a locale. Whenever you construct an international object, you have the
opportunity to supply an explicit Locale, or you can use whatever the current default
locale is on your system:
// With an explicit locale
Collator col = Collator.getInstance(Locale.FRANCE);
if (col.compare(string1, string2) < 0) {
    ... // based on the French locale's sort sequence
}
// With the default locale
Collator col = Collator.getInstance();
if (col.compare(string1, string2) < 0) {
    ... // based on the default locale's sort sequence
}
|
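To illustrate both points, a locale being only an identifier and the locale being supplied explicitly, here is a small sketch. The class LocaleDemo and the made-up codes "xx" and "YY" are assumptions for the example; the Locale constructor is used as in this article, although newer JDKs mark it as deprecated.

```java
import java.text.NumberFormat;
import java.util.Locale;

class LocaleDemo {
    /** Builds a locale from arbitrary codes; no validity check is performed. */
    static String describe(String language, String country) {
        return new Locale(language, country).toString();
    }

    /** Formats a number using the conventions of an explicitly supplied locale. */
    static String format(double value, Locale locale) {
        return NumberFormat.getInstance(locale).format(value);
    }

    public static void main(String[] args) {
        System.out.println(describe("xx", "YY"));          // xx_YY
        System.out.println(format(1234.5, Locale.US));      // 1,234.5
        System.out.println(format(1234.5, Locale.GERMANY)); // 1.234,5
    }
}
```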
The ResourceBundle class provides a way to isolate translatable text or localizable
objects from your core source code. For example, resource bundles can be used for
translatable error messages, or building translatable components. The JDK also uses
resource bundles to hold its own localized data. For example, when you ask for a
NumberFormat object, the necessary formatting information is retrieved from a resource
bundle.
Why can't you set the default locale in Applets?
People frequently ask for the ability to call Locale.setDefault()
within an applet. The problem is that a single JVM can run more than one applet at a time in the
same address space. Locale.setDefault() would change the default locale for the whole
address space, which means that all of the applets would be affected; this is considered a
security violation. To work around this, set the applet's locale instead of using
Locale.setDefault().
When you need an international class, supply the locale explicitly:
NumberFormat nf = NumberFormat.getInstance(myApplet.getLocale());
|
ResourceBundle fallback detection
The ResourceBundle implementation currently includes a fallback mechanism: if the
specified resource can't be found in the specified locale, ResourceBundle searches:
- first in the resource bundle for the specified language and country,
- then in the resource bundle for the specified language,
- then in the resource bundle for the default locale's language and country,
- then in the resource bundle for the default locale's language,
- finally in the root resource bundle.
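The search order listed above can be sketched as a function that produces the candidate bundle names in sequence. This is not the ResourceBundle implementation; the class FallbackOrder and its methods are hypothetical, the "_" naming scheme is assumed from the usual bundle naming convention, and variants are ignored for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class FallbackOrder {
    /** Lists candidate bundle names in the order described in the text. */
    static List<String> candidates(String baseName, Locale requested, Locale defaultLocale) {
        List<String> names = new ArrayList<>();
        addLocale(names, baseName, requested);
        addLocale(names, baseName, defaultLocale);
        names.add(baseName); // the root bundle is always the last resort
        return names;
    }

    /** Adds language_country, then language, skipping empty parts and duplicates. */
    private static void addLocale(List<String> names, String baseName, Locale locale) {
        String language = locale.getLanguage();
        String country = locale.getCountry();
        if (language.isEmpty()) {
            return;
        }
        if (!country.isEmpty()) {
            addOnce(names, baseName + "_" + language + "_" + country);
        }
        addOnce(names, baseName + "_" + language);
    }

    private static void addOnce(List<String> names, String name) {
        if (!names.contains(name)) {
            names.add(name);
        }
    }

    public static void main(String[] args) {
        System.out.println(candidates("MyResources", new Locale("fr", "BE"), Locale.US));
    }
}
```

For a French Belgian request on a US English system this yields MyResources_fr_BE, MyResources_fr, MyResources_en_US, MyResources_en, and finally MyResources.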
Sometimes this is not what you want, or at least you may want to be able to detect when
a particular piece of data came from a fallback locale rather than the specified one. For
example, suppose you wanted a specific resource from the French Belgian locale, and there
is only a French locale installed -- you'll get the wrong resource. In JDK 1.2 we added a
method, getLocale(), to find out the actual locale that a
resource bundle comes from, so that you can determine if a fallback was used.
// In JDK 1.2
Locale frBE_Locale = new Locale("fr", "BE");
ResourceBundle rb = ResourceBundle.getBundle("MyResources", frBE_Locale);
if (!rb.getLocale().equals(frBE_Locale)) {
    // French Belgian resources not available, report an error
}
|
Comparison and boundaries
In JDK 1.1, Collator allows you to compare strings in a language-sensitive way. The
standard comparison in String will just do a binary comparison. For strings that will be
displayed to the user, this is almost always incorrect! Wherever the ordering or equality
of strings is important to the user, such as when presenting an alphabetized list, then
use a Collator instead. Otherwise a German, for example, will find that you don't
equate two strings that she thinks are equal.
if (string1.compareTo(string2) < 0) {... // bitwise comparison
Collator col = Collator.getInstance();
if (col.equals(string1, string2)) {...
...
if (col.compare(string1, string2) < 0) {...
|
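A short runnable comparison makes the difference visible. The sketch sorts the same words with the binary String ordering and with a Collator, and checks equality at primary strength; SortDemo and its methods are illustrative names, and the word list is made up.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class SortDemo {
    /** Sorts by String.compareTo, i.e. by raw character values. */
    static List<String> binarySort(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    /** Sorts with the collation rules of the given locale. */
    static List<String> collatedSort(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    /** Compares at primary strength, so case differences are ignored. */
    static boolean equalAtPrimaryStrength(String a, String b, Locale locale) {
        Collator col = Collator.getInstance(locale);
        col.setStrength(Collator.PRIMARY);
        return col.equals(a, b);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "Zebra", "banana");
        System.out.println(binarySort(words));              // [Zebra, apple, banana]
        System.out.println(collatedSort(words, Locale.US)); // [apple, banana, Zebra]
        System.out.println(equalAtPrimaryStrength("abc", "ABC", Locale.US)); // true
    }
}
```

The binary sort places "Zebra" first only because uppercase letters have smaller code values than lowercase ones, which is rarely what a user expects.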
Why have CharacterIterator?
The CharacterIterator interface is used in BreakIterator and a few other places in the JDK, and is used even more in JDK 1.2. Sometimes we are asked why we didn't use String or
StringBuffer instead.
String and StringBuffer are simple classes that store their characters contiguously.
Insertion or deletion of characters in a StringBuffer ends up shifting all the characters
that follow, which works fine for reasonably small numbers of characters. However, this
model doesn't scale well. Consider a word processor, for example, where shifting many
kilobytes of characters just to insert or delete one character involves far too much extra
work. For acceptable performance in these circumstances, text needs to be stored in data
structures that use internally discontiguous chunks of storage.
We needed some way to have a more abstract representation of text that could be used
both for String and for larger-scale text models. Unfortunately, we couldn't change String
and StringBuffer to descend from an abstract class that would provide this sort of
representation. To resolve this problem, we added a very minimal interface,
CharacterIterator. This interface allows both sequential (forward and backward) and
random access to characters from any source, not just from a String or StringBuffer.
Rule-based BreakIterator
The BreakIterator class finds character, word, line and sentence boundaries, which may vary depending on the locale. The JDK 1.1 BreakIterator implementation uses a state
machine, which makes it very fast. However, it does not allow the behavior to vary
depending on the locale. If the built-in classes don't support behavior the clients want,
they must create a completely new BreakIterator subclass of their own -- they can't leverage
the JDK code at all.
Therefore, we undertook an extensive revision of the BreakIterator framework. The new
RuleBasedBreakIterator class essentially works the same way the old class did, but it
builds the category and state tables from a textual description, which is essentially a
string of regular expressions. This description can be loaded from a resource -- allowing
different breaking rules for different languages -- or supplied by the client -- allowing
runtime customization. This class is provided on the alphaWorks Web site.
Locale-sensitive searching
The CollationElementIterator class is intended for use in locale-sensitive text
searching. However, it is missing several methods in JDK 1.1, which makes it impossible to
use with fast string searching algorithms such as Boyer-Moore. The following new methods
were added in JDK 1.2 to fix this:
- The getOffset() method tells where a collation element was found.
- The previous() and setOffset() methods enable backing up and moving around in the text being searched.
- The new setText() method allows reuse of a CollationElementIterator. When collating or searching a large number of strings, it is much faster to reuse one CollationElementIterator than to construct a new one each time.
- The isIgnorable() method tells whether a collation element is ignorable.
- The getMaxExpansion() method returns the maximum length of
any expansion sequence producing a given character. A fast search algorithm needs to know
the maximum "shift" distance in looking for possible match sites. This is
complicated by the fact that in natural language, a match can occur with different numbers
of characters. If a search pattern for German text contains "oe", for example,
it can match the single character "ö" in the text being searched. With the
maximum expansion length, a fast search algorithm can compute the correct lower limit on
shift distances.
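As a rough sketch of how a search routine might walk collation elements, the code below reuses one CollationElementIterator through setText() and collects the primary order of each element. ElementDemo and its helpers are illustrative names, and the cast assumes that Collator.getInstance() returns a RuleBasedCollator, as it does in the standard JDK.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ElementDemo {
    /** Creates one iterator that can be reused for many strings. */
    static CollationElementIterator newIterator(Locale locale) {
        RuleBasedCollator collator = (RuleBasedCollator) Collator.getInstance(locale);
        return collator.getCollationElementIterator("");
    }

    /** Collects the primary order of every collation element in the text. */
    static List<Integer> primaryOrders(CollationElementIterator it, String text) {
        it.setText(text); // reuse the iterator instead of constructing a new one
        List<Integer> orders = new ArrayList<>();
        int element;
        while ((element = it.next()) != CollationElementIterator.NULLORDER) {
            orders.add(CollationElementIterator.primaryOrder(element));
        }
        return orders;
    }

    public static void main(String[] args) {
        CollationElementIterator it = newIterator(Locale.US);
        // "a" and "A" differ only below the primary level
        System.out.println(primaryOrders(it, "a").equals(primaryOrders(it, "A")));
        System.out.println(primaryOrders(it, "a").equals(primaryOrders(it, "b")));
    }
}
```

A locale-sensitive search would compare sequences like these rather than raw characters, so that text differing only in case or accents can still match.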
Unicode normalization
Unicode is more than just "wide ASCII". One of the principal operations on
Unicode is to normalize text, ensuring that you have a unique spelling for a given text.
Text normalization includes decomposition and composition forms of characters. Text can be
normalized to be a canonical equivalent to the original unnormalized text, or to be
a compatibility equivalent to the original unnormalized text. For more information,
please see Unicode Technical Report #15 at http://www.unicode.org/unicode/reports/tr15/.
One of the Unicode normalization forms is used internally as a part of JDK 1.1, but it
is not public. The Normalizer class incorporates this technology, and allows either batch
or incremental normalization of text. This class is provided on the alphaWorks Web site.
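The difference between canonical and compatibility equivalence can be seen in a few lines. Note the assumption: this sketch uses java.text.Normalizer, which became part of the standard library in a later JDK release, and not the alphaWorks class described here, whose API may differ. NormalizeDemo and its methods are illustrative names.

```java
import java.text.Normalizer;

class NormalizeDemo {
    /** Canonical decomposition (NFD). */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Canonical composition (NFC). */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Compatibility composition (NFKC). */
    static String compatCompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) decomposes to "e" plus U+0301, and composes back
        System.out.println(decompose("\u00e9").length());               // 2
        System.out.println(compose("e\u0301").equals("\u00e9"));        // true
        // The "fi" ligature U+FB01 changes only under compatibility forms
        System.out.println(compatCompose("\ufb01"));                    // fi
        System.out.println(compose("\ufb01").equals("\ufb01"));         // true
    }
}
```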
Formatting and parsing
JDK 1.1 provides a rich set of functionality for formatting values into strings and
parsing strings into values in a locale-sensitive way. These include numbers, dates,
times, and messages.
Number formatting supports spreadsheet-style patterns. For example, a format such as "#,##0.00#" will produce output like "1,234.567" or "5.00";
the pattern specifies that you have at least 2 decimal digits, but no more than 3. You can
also reset the decimals and other characteristics of the pattern programmatically. Number
formatting also provides powerful pattern parsing support for proportional font decimal
alignment.
Date/Time formatting supports similar features, and is fully integrated with Calendar.
Message formatting allows access to number, date, and time formatting within the context
of a localizable string.
Substitutable currencies
NumberFormat provides the factory method
getCurrencyInstance(), which creates an object that can convert numbers to and from strings in the currency
format of a given locale. In JDK 1.1, these formats were treated just like any other
number formats. They were constructed from strings that were fetched from ResourceBundles.
In 1.2, the currency symbol can be specified independently from the rules for decimal
places, thousands separator, and so on, and is supplied in the pattern with the
international currency symbol ("¤" = "\u00A4").
// In JDK 1.1: can't change currency symbols
|
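// In JDK 1.2
// Assumes fmt starts out as the US currency formatter
DecimalFormat fmt = (DecimalFormat) NumberFormat.getCurrencyInstance(Locale.US);
DecimalFormatSymbols us_syms = fmt.getDecimalFormatSymbols();
us_syms.setCurrencySymbol("US$ ");
fmt.setDecimalFormatSymbols(us_syms);
String result = fmt.format(1234.56); // result is "US$ 1,234.56"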
|
ISO currency codes
Additionally, we added an API to retrieve the 3-letter international currency codes
defined in ISO 4217. These are necessary in an application that deals with many different
currencies, because the regular, one-character currency symbols are often shared by many
different currencies. For example, both the US and Canada use "$" in their
default currency format. An application dealing with both currencies will probably want to
use "USD" and "CAD" instead. In JDK 1.2, this is now possible, using a
sequence of two international currency symbols ("¤¤" =
"\u00A4\u00A4") in the pattern.
// In JDK 1.1: can't get 3-letter currency codes
|
// In JDK 1.2:
fmt = new DecimalFormat("\u00a4\u00a4 #,##0.00;(\u00a4\u00a4 #,##0.00)");
result = fmt.format(1234.56); // result is "USD 1,234.56" when the default locale is US
|
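To make the result independent of the default locale, the symbols can be supplied explicitly. The sketch below is one way to do that, with the doubled currency sign in the pattern; CurrencyCodeDemo and withIsoCode are illustrative names.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class CurrencyCodeDemo {
    /** Formats an amount with the ISO 4217 code of the given locale's currency. */
    static String withIsoCode(double amount, Locale locale) {
        // Two currency signs in the pattern select the international code
        DecimalFormat fmt = new DecimalFormat("\u00a4\u00a4 #,##0.00",
                new DecimalFormatSymbols(locale));
        return fmt.format(amount);
    }

    public static void main(String[] args) {
        System.out.println(withIsoCode(1234.56, Locale.US));     // USD 1,234.56
        System.out.println(withIsoCode(1234.56, Locale.CANADA)); // CAD 1,234.56
    }
}
```

Both locales would show a plain "$" with a single currency sign, so the doubled form is what keeps the two amounts distinguishable.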
Parse error information
The abstract method parseObject() in java.text.Format is
used to parse strings and turn them into objects. In JDK 1.1, the program can find out how
far the parse got so that it can continue from that point on. However, it cannot find out
how far it got if there was an error. In JDK 1.2 a new field, errorOffset,
now contains that information. If an error occurs during parsing, the formatters set this
value before returning an error or throwing an exception.
In the following example, a text field is parsed for a number. If an error is found,
the text beyond the error is highlighted, a message is displayed, and a beep is played.
// In JDK 1.2:
String contents = textField.getText();
try {
NumberFormat fmt = NumberFormat.getInstance();
Number value = fmt.parse(contents);
} catch (ParseException foo) {
errorLabel.getToolkit().beep();
errorLabel.setText(myResourceBundle.getString("invalid number"));
textField.select(foo.getErrorOffset(), contents.length());
}
|
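The same information is available without any GUI. The following sketch shows both sides: using ParsePosition to see how far a successful parse got, and reading the error offset from the exception when nothing could be parsed. ParseDemo and its methods are illustrative names, and the input strings are made up.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.text.ParsePosition;
import java.util.Locale;

class ParseDemo {
    /** Returns the index where parsing stopped, or -1 if nothing was parsed. */
    static int parsedLength(String text) {
        NumberFormat fmt = NumberFormat.getInstance(Locale.US);
        ParsePosition pos = new ParsePosition(0);
        Number value = fmt.parse(text, pos);
        return value == null ? -1 : pos.getIndex();
    }

    /** Returns the error offset for unparseable text, or -1 on success. */
    static int errorOffset(String text) {
        try {
            NumberFormat.getInstance(Locale.US).parse(text);
            return -1;
        } catch (ParseException e) {
            return e.getErrorOffset();
        }
    }

    public static void main(String[] args) {
        System.out.println(parsedLength("1,234.5 kg")); // 7: parsing stops at the space
        System.out.println(errorOffset("kg"));          // 0: the error is at the first character
    }
}
```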
Number format enhancements
On the alphaWorks Web site we provide a class that correctly supports exponentials in
number formatting and parsing. The new number formatter supports formats such as
"1.2345E3", as well as engineering exponents, where the exponent is always a
power of 3. It also supports formatting and parsing BigInteger or BigDecimal values
without loss of precision, and "nickel-rounding": the ability to round to
multiples of a specified number, such as $0.05. (This is important for some countries
whose smallest coin is 5 units instead of 1. The implementation is not restricted to
nickels, however, and can be used to round to multiples of any given value.)
Here is an example using the class on the alphaWorks Web site.
NumberFormat fmt = new DecimalFormat("0.0000E00");
String result = fmt.format(123456789);
// result is "1.2346E08"
|
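For readers without the alphaWorks class, a rough equivalent can be sketched with standard classes. The assumptions here are that java.text.DecimalFormat in later JDK releases accepts the same exponent patterns, and that rounding to an increment is done by hand with BigDecimal; ExponentDemo, scientific, and roundToIncrement are illustrative names, not the alphaWorks API.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class ExponentDemo {
    /** Formats a value with an exponent pattern, using US symbols. */
    static String scientific(String pattern, double value) {
        DecimalFormat fmt = new DecimalFormat(pattern, new DecimalFormatSymbols(Locale.US));
        return fmt.format(value);
    }

    /** Rounds an amount to the nearest multiple of the given increment. */
    static BigDecimal roundToIncrement(BigDecimal amount, BigDecimal increment) {
        BigDecimal steps = amount.divide(increment, 0, RoundingMode.HALF_UP);
        return steps.multiply(increment);
    }

    public static void main(String[] args) {
        System.out.println(scientific("0.0000E00", 123456789)); // 1.2346E08
        // Up to three integer digits gives exponents that are multiples of 3
        System.out.println(scientific("##0.#####E0", 12345));   // 12.345E3
        // "Nickel rounding" to multiples of 0.05
        System.out.println(roundToIncrement(new BigDecimal("1.23"), new BigDecimal("0.05"))); // 1.25
    }
}
```

Working in BigDecimal keeps the increment arithmetic exact, which matters for currency amounts.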
Number formats in words
The ability to take a numeric value (such as 12,345) and translate it into words (such
as "twelve thousand three hundred forty-five") is often needed in business
applications, for example, to write out the amount on a check. Number spellout in English
is a relatively easy thing to do; good algorithms for this are well-known and widely used.
A number-spellout engine that can be customized for any language is another thing
altogether.
It's not enough to simply take the algorithm for English and read the literal string
values from a resource file. English separates all component parts of a number with
spaces; Italian and German do not. Some languages, such as Spanish and Italian, drop the
word for "one" from the phrases "one hundred" or "one
thousand". There are many other examples that show translating a number into words is
not a trivial task.
To solve these issues, we developed a class called RuleBasedNumberFormat. It's a
general, rule-based mechanism for converting numbers to spelled-out strings. This class is
available on the alphaWorks Web site, along with information on the usage and rule
syntax.
RuleBasedNumberFormat fmt = new RuleBasedNumberFormat(rules);
String result = fmt.format(1234);
// result is "one thousand two hundred thirty four"
|
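For contrast, here is a sketch of the "easy" English-only algorithm. It is not RuleBasedNumberFormat and does not use its rule syntax; the class SpellOut is a hypothetical illustration limited to 0 through 999,999, and it hard-codes exactly the English conventions (spaces between parts, an explicit "one" before "hundred") that do not carry over to other languages.

```java
class SpellOut {
    private static final String[] ONES = {
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"
    };
    private static final String[] TENS = {
        "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"
    };

    /** Spells out an integer from 0 to 999,999 in English. */
    static String spell(int n) {
        if (n < 0 || n > 999_999) {
            throw new IllegalArgumentException("out of range: " + n);
        }
        if (n < 20) {
            return ONES[n];
        }
        if (n < 100) {
            String tens = TENS[n / 10];
            return n % 10 == 0 ? tens : tens + "-" + ONES[n % 10];
        }
        if (n < 1000) {
            String hundreds = ONES[n / 100] + " hundred";
            return n % 100 == 0 ? hundreds : hundreds + " " + spell(n % 100);
        }
        String thousands = spell(n / 1000) + " thousand";
        return n % 1000 == 0 ? thousands : thousands + " " + spell(n % 1000);
    }

    public static void main(String[] args) {
        System.out.println(spell(12345)); // twelve thousand three hundred forty-five
    }
}
```

Every branch above encodes a language-specific choice, which is why a rule-driven engine is needed once more than one language has to be supported.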
Conclusion
The internationalization services in Java 1.1 provide a wide range of functionality,
and are easily extended to add additional features and to support additional countries.
We've had an opportunity to discuss some of the design decisions taken in developing these
classes, and some of the enhancements that are being included in future releases. A more
detailed discussion is available at
http://www.ibm.com/developer/unicode/,
and includes more about the JDK i18n classes and possible future internationalization
improvements that IBM is discussing with Sun. These future possibilities include the
following:
- A character conversion API, so that you can actually find out which character code
converters are supported on your system, and get understandable display names for them. For example,
instead of "Cp1089", displaying "Arabic (ISO 8859-6)" in English and
"Arabe (OSI 8859-6)" in French.
- Customized locales, so that you could have fine-grained control over default behavior.
For example,
NumberFormat.setInstance(new Locale("en", "US", "Acme Widgets, Inc."),
new DecimalFormat("#,##0.0#"));
- Convenience methods for common cases. For example,
String value = Number.formatCurrency(amount);
We are working on many future enhancements, some of which are available right now on
IBM's alphaWorks Web site at http://www.alphaWorks.ibm.com/. We encourage those interested to download versions from there -- any comments on the design
and implementation are welcome!
Acknowledgements
Our thanks to Kathleen Wilson, Rich Gillam, and Laura Werner for their extensive review
and suggestions for organization of the document. Many other people in IBM and Sun
contributed to the Java internationalization efforts.
About the authors
Dr. Mark Davis is a Senior Technical Staff Member responsible for international software
architecture. Mark co-founded the Unicode effort, and is the president of the Unicode
Consortium. He is a principal co-author and editor of the Unicode Standard, Versions 1.0 and
2.0. At various times, his department has included software groups covering text, international,
operating system services, Windows porting, and technical communications. Technically, he
specializes in object-oriented programming and in the architecture and implementation of
international and text software.
Helena Shih is the technical lead of the IBM Classes for Unicode at IBM's Center for Java Technology,
Cupertino. She previously was a member of the Java i18n team at Taligent, and contributed to the JDK 1.1
international classes. Helena has also worked for Dataware Technologies and Apple's Advanced Technology
Group. She holds an MSc degree from the University of Massachusetts. She is a native of Taipei, Taiwan.