Alessandro Lacava’s Blog

July 16, 2008

Mapping between charset name and Java name

Filed under: Computer, Java, Encoding — alessandrolacava @ 12:44 pm

The following table maps the XML names of character encodings to the corresponding Java names.

One class that accepts these Java encoding names is OutputStreamWriter.
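As a minimal sketch of that usage, the snippet below passes a Java encoding name to the OutputStreamWriter constructor and counts the bytes written. The class name EncodingNameDemo, the helper encode, and the sample text are made up for illustration, and the sketch assumes a JVM that still accepts the historical names 8859_1 and UTF8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class EncodingNameDemo {

    // Hypothetical helper: encodes text with the encoding identified by a Java name.
    public static byte[] encode(String text, String javaEncodingName) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // The constructor throws UnsupportedEncodingException for an unknown name.
        try (Writer writer = new OutputStreamWriter(bytes, javaEncodingName)) {
            writer.write(text);
        } // closing the writer flushes any buffered bytes
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "caf\u00e9"; // "café": three ASCII letters plus one accented letter

        System.out.println(encode(text, "8859_1").length); // prints 4
        System.out.println(encode(text, "UTF8").length);   // prints 5
    }
}
```

The accented letter takes one byte in Latin-1 and two bytes in UTF-8, which is why the two lengths differ.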

Table 1. Standard Character Sets and Encodings

| XML Name | Java Name | First supported in Java | Scripts and Languages |
| --- | --- | --- | --- |
| ISO-8859-1 | 8859_1 | 1.1 | Latin-1: ASCII plus the accented characters needed for most Western European languages including Albanian, Basque, Breton, Catalan, Cornish, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Frisian, Galician, German, Greenlandic, Icelandic, Irish, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Sorbian, Spanish, and Swedish as well as many non-European languages written in the Latin alphabet such as Swahili and Malaysian |
| ISO-8859-2 | 8859_2 | 1.1 | Latin-2: ASCII plus the accented characters needed for most Central European languages including Albanian, Croatian, Czech, Finnish, German, Hungarian, Latin, Polish, Romanian, Slovak, Slovenian, and Sorbian |
| ISO-8859-3 | 8859_3 | 1.1 | Latin-3: ASCII plus the accented characters needed for most Southern European languages including English, Esperanto, Finnish, French, German, Italian, Latin, Maltese, Portuguese, and Turkish |
| ISO-8859-4 | 8859_4 | 1.1 | Latin-4: ASCII plus the accented characters needed for most Northern European languages including Danish, English, Estonian, Finnish, German, Greenlandic, Latin, Latvian, Lithuanian, Norwegian, Sámi, Slovenian, and Swedish |
| ISO-8859-5 | 8859_5 | 1.1 | ASCII plus Cyrillic |
| ISO-8859-6 | 8859_6 | 1.1 | ASCII plus Arabic |
| ISO-8859-7 | 8859_7 | 1.1 | ASCII plus Greek |
| ISO-8859-8 | 8859_8 | 1.1 | ASCII plus Hebrew |
| ISO-8859-9 | 8859_9 | 1.1 | Latin-5: same as Latin-1 except the Turkish letters Ğ, ğ, İ, ı, Ş, and ş take the place of the Icelandic letters þ, Þ, ý, Ý, Ð, and ð |
| ISO-8859-13 | ISO8859_13 | 1.3 | Latin-7: ASCII plus the accented characters needed for most Baltic languages including Latvian, Lithuanian, Estonian, and Finnish, as well as English, Danish, Swedish, German, Slovenian, and Norwegian |
| ISO-8859-15 | ISO8859_15_FDIS | 1.2 | Latin-9: same as Latin-1 but with the Euro sign € instead of the international currency symbol ¤. It also replaces the infrequently used symbol characters ¦, ¨, ´, ¸, ¼, ½, and ¾ with the infrequently used French and Finnish letters Š, š, Ž, ž, Œ, œ, and Ÿ |
| UTF-8 | UTF8 | 1.1 | The default encoding of XML documents; each Unicode character is represented in between 1 and 4 bytes |
| UTF-16 | UnicodeBig or UnicodeLittle | 1.2 | An encoding of Unicode in which characters in the Basic Multilingual Plane are encoded in two bytes, and all other characters are encoded as two two-byte surrogates |
| ISO-10646-UCS-2 | N/A | N/A | A straightforward encoding in which each Unicode character is represented as a two-byte integer; cannot represent characters outside the Basic Multilingual Plane |
| ISO-10646-UCS-4 | N/A | N/A | A straightforward encoding in which each Unicode character is represented as a four-byte integer |
| ISO-2022-JP | JIS | 1.1 | Japanese |
| Shift_JIS | SJIS | 1.1 | Japanese |
| EUC-JP | EUCJIS | 1.1 | Japanese |
| US-ASCII | ASCII | 1.2 | English |
| GBK | GBK | 1.1 | Simplified Chinese |
| Big5 | Big5 | 1.1 | Traditional Chinese |
| ISO-2022-CN | ISO2022CN | 1.1 | Traditional Chinese |
| ISO-2022-KR | ISO2022KR | 1.1 | Korean |
Note: The table above is taken from the book Processing XML with Java(TM): A Guide to SAX, DOM, JDOM, JAXP, and TrAX.
I suggest you read it if you’re serious about processing XML using the Java programming language.
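
If you want to check how a given JVM treats the names in Table 1, one option is java.nio.charset.Charset, which was added in Java 1.4 and is not covered by the table. The sketch below is only an illustration: the class name CharsetNameLookup and its two helpers are hypothetical. It assumes a current JVM that registers the old Java names as aliases of the XML names; that is something to verify on your own runtime, not a guarantee for every row.

```java
import java.nio.charset.Charset;

public class CharsetNameLookup {

    // Resolves an XML name or a legacy Java name to the JVM's canonical charset name.
    // Throws UnsupportedCharsetException if the JVM does not know the name.
    public static String canonicalName(String name) {
        return Charset.forName(name).name();
    }

    // Returns true when both names are supported and resolve to the same charset.
    // Both names must be syntactically legal charset names (so not "N/A").
    public static boolean sameCharset(String xmlName, String javaName) {
        return Charset.isSupported(xmlName)
                && Charset.isSupported(javaName)
                && Charset.forName(xmlName).equals(Charset.forName(javaName));
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"ISO-8859-1", "8859_1"},
            {"UTF-8", "UTF8"},
            {"US-ASCII", "ASCII"},
            {"Shift_JIS", "SJIS"}, // result depends on the JDK's extended charsets
        };
        for (String[] pair : pairs) {
            System.out.println(pair[0] + " / " + pair[1] + " -> same charset: "
                    + sameCharset(pair[0], pair[1]));
        }
    }
}
```

Rows marked N/A in the table are left out on purpose, since "N/A" is not a legal charset name and Charset.isSupported would reject it with an exception.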