<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><link rel="SHORTCUT ICON" href="/favicon.ico" /><style type="text/css">
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
</style><title>Encodings support</title></head><body bgcolor="#8b7765" text="#000000" link="#a06060" vlink="#000000">
The XML C parser and toolkit of Gnome
Encodings support
If you are not really familiar with internationalization (usually abbreviated I18N), Unicode, characters and glyphs, I suggest you read a presentation by Tim Bray on Unicode and why you should care about it.
If you don't understand why it does not make sense to have a string without knowing what encoding it uses, then, as Joel Spolsky said, please do not write another line of code until you finish reading that article. It is a prerequisite to understanding this page, and to avoiding a lot of problems with libxml2, XML or text processing in general.
Table of Contents:
What does internationalization support mean?
The internal encoding, how and why
How is it implemented?
Default supported encodings
How to extend the existing support
XML was designed from the start to support any character set by using Unicode. Any conformant XML parser has to support the UTF-8 and UTF-16 default encodings, which can both express the full Unicode range. UTF-8 is a variable-length encoding whose greatest strengths are that it reuses the same encoding for ASCII and saves space for Western languages, but it is a bit more complex to handle in practice. UTF-16 uses 2 bytes per character (and sometimes combines two 2-byte units into a surrogate pair), which makes implementation easier but looks a bit overkill for encoding Western languages. Moreover, the XML specification allows a document to be encoded in other encodings on the condition that they are clearly labeled as such. For example, the following is a well-formed XML document encoded in ISO-8859-1 and using accented letters that we French like for both markup and content:
<?xml version="1.0" encoding="ISO-8859-1"?>
<très>là </très>
Having internationalization support in libxml2 means the following:
the document is properly parsed
information about its encoding is saved
it can be modified
it can be saved in its original encoding
it can also be saved in another encoding supported by libxml2 (for example straight UTF-8 or even an ASCII form)
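To make the difference between the encodings discussed above concrete, here is a minimal sketch of how a single Unicode code point is written as UTF-8. The function name utf8_encode is hypothetical and this is not libxml2 code; it only illustrates the variable-length layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a surrogate, or a value above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                 /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {              /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {            /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this sketch the letter "l" stays the single byte 0x6C, while "à" (U+00E0, the single byte 0xE0 in ISO-8859-1) becomes the two bytes 0xC3 0xA0.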
Another very important point is that the whole libxml2 API, with the exception of a few routines to read with a specific encoding or save to a specific encoding, is completely agnostic about the original encoding of the document.
It should be noted too that the HTML parser embedded in libxml2 now obeys the same rules. The following document will be (as of 2.2.2) handled in an internationalized fashion by libxml2 too:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html lang="fr">
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
</head>
<body>
<p>W3C crée des standards pour le Web.</body>
</html>
One of the core decisions was to force all documents to be converted to a default internal encoding, and that encoding to be UTF-8. Here are the rationales for those choices:
keeping the native encoding in the internal form would force libxml users (or the associated code) to be fully aware of the encoding of the original document. For example, when adding a text node to a document, the content would have to be provided in the document encoding, i.e. the client code would have to check it beforehand, make sure it conforms to the encoding, etc. This is very hard in practice, though in some specific cases it may make sense.
the second decision was which encoding. From the XML spec, only UTF-8 and UTF-16 really make sense, being the only two encodings for which support is mandatory. UCS-4 (a 32-bit fixed-size encoding) could be considered an intelligent choice too, since it is a direct mapping of Unicode. I selected UTF-8 on the basis of efficiency and compatibility with surrounding software:
    UTF-8, while a bit more complex to convert from/to (i.e. slightly more costly to import and export CPU-wise), is also far more compact than UTF-16 (and UCS-4) for a majority of the documents I see it used for right now (RPM RDF catalogs, advogato data, various configuration file formats, etc.), and the key point for today's computer architecture is efficient use of caches. If one nearly doubles the memory requirement to store the same amount of data, this will thrash caches (main memory/external caches/internal caches), and my take is that this harms the system far more than the CPU requirements needed for the conversion to UTF-8.
    Most libxml2 version 1 users were using it with straight ASCII most of the time; switching to an internal encoding that required all their code to be rewritten was a serious show-stopper for using UTF-16 or UCS-4.
    UTF-8 is being used as the de facto internal encoding standard for related code like pango, the upcoming Gnome text widget, and a lot of Unix code (yet another place where the Unix programmer base takes a different approach from Microsoft - they are using UTF-16).
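The compactness argument above can be checked with a little arithmetic. The following is a minimal sketch, with hypothetical helper names, that counts how many bytes a sequence of Unicode code points occupies in UTF-8, UTF-16 and UCS-4. It assumes the input holds valid code points and ignores byte order marks.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store n code points in UTF-8. */
size_t utf8_bytes(const uint32_t *cps, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = cps[i];
        if (cp < 0x80)
            total += 1;
        else if (cp < 0x800)
            total += 2;
        else if (cp < 0x10000)
            total += 3;
        else
            total += 4;
    }
    return total;
}

/* Bytes needed in UTF-16: one 2-byte unit, or a surrogate pair
 * (two units) for code points above U+FFFF. */
size_t utf16_bytes(const uint32_t *cps, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (cps[i] < 0x10000) ? 2 : 4;
    return total;
}

/* Bytes needed in UCS-4: always 4 per code point. */
size_t ucs4_bytes(size_t n)
{
    return 4 * n;
}
```

For the six characters of the tag "<très>" this gives 7 bytes in UTF-8, 12 in UTF-16 and 24 in UCS-4, which is the "nearly double" effect mentioned for mostly-ASCII documents.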
What does this mean in practice for the libxml2 user:
xmlChar, the libxml2 data type, is a byte; those bytes must be assembled as valid UTF-8 strings. The proper way to terminate an xmlChar * string is simply to append a 0 byte, as usual.
One just needs to make sure that when using characters outside the ASCII set, the values have been properly converted to UTF-8.
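As an illustration of the second point, here is a minimal sketch of converting ISO-8859-1 bytes into a 0-terminated UTF-8 byte string of the kind the API expects. The function name latin1_to_utf8 is hypothetical; libxml2's encoding module ships its own converters, and this standalone code is not taken from it.

```c
#include <stddef.h>

/* Convert inlen ISO-8859-1 bytes to a 0-terminated UTF-8 string.
 * out must have room for 2 * inlen + 1 bytes (worst case: every
 * input byte is non-ASCII). Returns the number of bytes written,
 * not counting the terminating 0. */
size_t latin1_to_utf8(const unsigned char *in, size_t inlen,
                      unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < inlen; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[o++] = c;            /* ASCII is copied unchanged */
        } else {
            /* U+0080..U+00FF always take two bytes in UTF-8 */
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    out[o] = 0;                      /* terminate as usual */
    return o;
}
```

The result can then be passed wherever an xmlChar * string is expected, since xmlChar is a byte and the buffer is valid, 0-terminated UTF-8.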
Let's describe how all this works within libxml. Basically, the I18N (internationalization) support gets triggered only during I/O operations, i.e. when reading a document or saving one. Let's look first at the reading sequence:
when a document is processed, we usually don't know the encoding; a simple heuristic allows us to distinguish UTF-16 and UCS-4 from encodings where the ASCII range (0-0x7F) maps to ASCII
the XML declaration, if available, is parsed, including the encoding declaration. At that point, if the autodetected encoding is different from the one declared, a call to xmlSwitchEncoding() is issued.
If there is no encoding declaration, then the input has to be in either UTF-8 or UTF-16. If it is not, then at some point when processing the input, the converter/checker of UTF-8 form will raise an encoding error. You may end up with a garbled document, or no document at all! Example:
~/XML -> ./xmllint err.xml
err.xml:1: error: Input is not proper UTF-8, indicate encoding !
<très>là </très>
^
err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
<très>là </très>
^
xmlSwitchEncoding() does an encoding name lookup, canonicalizes it, and then searches the default registered encoding converters for that encoding. If it's not within the default set and iconv() support has been compiled in, it will ask iconv for such an encoder. If this fails then the parser will report an error and stop processing:
~/XML -> ./xmllint err2.xml
err2.xml:1: error: Unsupported encoding UnsupportedEnc
<?xml version="1.0" encoding="UnsupportedEnc"?>
^
From that point the encoder progressively processes the input (it is plugged in as a front-end to the I/O module) for that entity. It captures and converts on the fly the document to be parsed to UTF-8. The parser itself just does UTF-8 checking of this input and processes it transparently. The only difference is that the encoding information has been added to the parsing context (more precisely to the input corresponding to this entity).
The result (when using DOM) is an internal form completely in UTF-8, with just encoding information on the document node.
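Two of the steps above can be sketched in a few lines of C. This is an illustration only, not libxml2's implementation: the byte patterns follow the autodetection appendix of the XML 1.0 specification, and every name here (guessed_encoding, guess_encoding, utf8_first_error) is hypothetical.

```c
#include <stddef.h>

typedef enum {
    ENC_ASCII_COMPATIBLE,  /* let the XML declaration decide */
    ENC_UTF8_BOM,
    ENC_UTF16_LE,
    ENC_UTF16_BE,
    ENC_UCS4_LE,
    ENC_UCS4_BE
} guessed_encoding;

/* Guess the encoding family from the first bytes of a document,
 * which are either a byte order mark or the start of "<?xml". */
guessed_encoding guess_encoding(const unsigned char *b, size_t len)
{
    if (len >= 4) {
        if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0x00 && b[3] == 0x3C) return ENC_UCS4_BE;
        if (b[0] == 0x3C && b[1] == 0x00 && b[2] == 0x00 && b[3] == 0x00) return ENC_UCS4_LE;
        if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return ENC_UCS4_BE;
        if (b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return ENC_UCS4_LE;
        if (b[0] == 0x00 && b[1] == 0x3C && b[2] == 0x00 && b[3] == 0x3F) return ENC_UTF16_BE;
        if (b[0] == 0x3C && b[1] == 0x00 && b[2] == 0x3F && b[3] == 0x00) return ENC_UTF16_LE;
    }
    if (len >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return ENC_UTF8_BOM;
    if (len >= 2) {
        if (b[0] == 0xFE && b[1] == 0xFF) return ENC_UTF16_BE;
        if (b[0] == 0xFF && b[1] == 0xFE) return ENC_UTF16_LE;
    }
    return ENC_ASCII_COMPATIBLE;
}

/* Return the offset of the first byte that starts an ill-formed
 * UTF-8 sequence, or len if the whole buffer is well-formed. */
size_t utf8_first_error(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t need;
        if (c < 0x80) { i++; continue; }
        if ((c & 0xE0) == 0xC0)      { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return i;               /* stray continuation or bad lead byte */
        if (len - i <= need) return i;   /* sequence is truncated */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return i;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* reject overlong forms, surrogates and out-of-range values */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return i;
        i += need + 1;
    }
    return len;
}
```

Fed the ISO-8859-1 bytes of the unlabeled test document, utf8_first_error stops at the 0xE8 byte of "très", which matches the "Bytes: 0xE8 0x73 0x3E 0x6C" line in the xmllint output above.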
OK, then what happens when saving the document (assuming you collected/built an xmlDoc DOM-like structure)? It depends on the function called: xmlSaveFile() will just try to save in the original encoding, while xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given encoding:
if no encoding is given, libxml2 will look for an encoding value associated with the document and, if it exists, will try to save to that encoding,
otherwise everything is written in the internal form, i.e. UTF-8
so if an encoding was specified, either at the API level or on the document, libxml2 will again canonicalize the encoding name and look up a converter in the registered set or through iconv. If none is found, the function will return an error code
the converter is placed before the I/O buffer layer, as another kind of buffer; libxml2 will then simply push the UTF-8 serialization through that buffer, which will progressively be converted and pushed onto the I/O layer.
It is possible that the converter code fails on some input; for example, trying to push a UTF-8 encoded Chinese character through the UTF-8 to ISO-8859-1 converter won't work. Since the encoders are progressive, they will just report the error and the number of bytes converted. At that point libxml2 will decode the offending character, remove it from the buffer, replace it with the associated charRef encoding &#123;, and resume the conversion. This guarantees that any document will be saved without losses (except for markup names, where this is not legal; this is a problem in the current version, so in practice avoid using non-ASCII characters for tag or attribute names). A special "ascii" encoding name, used to save documents to a pure ASCII form, can be used when portability is really crucial
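The fallback described in the last step can be sketched as follows. This is a simplified, non-progressive illustration with a hypothetical function name, not the libxml2 code: it assumes the input is already well-formed UTF-8 (as the internal form is) and writes a decimal character reference for every character that ISO-8859-1 cannot represent.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Convert well-formed UTF-8 to a 0-terminated ISO-8859-1 string.
 * Characters above U+00FF are written as &#N; references.
 * Stops early if out (outsize bytes) would overflow.
 * Returns the number of bytes written, not counting the final 0. */
size_t utf8_to_latin1_charref(const unsigned char *in, size_t inlen,
                              unsigned char *out, size_t outsize)
{
    size_t i = 0, o = 0;
    if (outsize == 0)
        return 0;
    while (i < inlen) {
        unsigned char c = in[i];
        unsigned long cp;
        size_t extra;
        /* decode one code point (input assumed well-formed) */
        if (c < 0x80)      { cp = c;        extra = 0; }
        else if (c < 0xE0) { cp = c & 0x1F; extra = 1; }
        else if (c < 0xF0) { cp = c & 0x0F; extra = 2; }
        else               { cp = c & 0x07; extra = 3; }
        if (inlen - i <= extra)
            break;                       /* truncated input: stop */
        for (size_t k = 1; k <= extra; k++)
            cp = (cp << 6) | (in[i + k] & 0x3F);
        i += extra + 1;

        if (cp <= 0xFF) {
            if (o + 1 >= outsize)
                break;
            out[o++] = (unsigned char)cp;    /* fits in ISO-8859-1 */
        } else {
            char ref[16];
            int n = snprintf(ref, sizeof ref, "&#%lu;", cp);
            if (o + (size_t)n >= outsize)
                break;
            memcpy(out + o, ref, (size_t)n); /* charRef fallback */
            o += (size_t)n;
        }
    }
    out[o] = 0;
    return o;
}
```

For the UTF-8 input "là" followed by the Chinese character U+4E2D, the output is the bytes 0x6C 0xE0 followed by the text &#20013;. As the list above notes, this trick is only legal in content, not in tag or attribute names.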
Here are a few examples based on the same test document and assuming a terminal using ISO-8859-1 as the text encoding:
~/XML -> ./xmllint isolat1
<?xml version="1.0" encoding="ISO-8859-1"?>
<très>là </très>
~/XML -> ./xmllint --encode UTF-8 isolat1
<?xml version="1.0" encoding="UTF-8"?>
<très>là </très>
~/XML ->
The same processing is applied (and reuses most of the code) for HTML I18N processing. Looking up and modifying the content encoding is a bit more difficult since it is located in a <meta> tag under the <head>, so a couple of functions, htmlGetMetaEncoding() and htmlSetMetaEncoding(), have been provided. The parser also attempts to switch encoding on the fly when detecting such a tag on input. Except for that, the processing is the same (and again reuses the same code).
libxml2 has a set of default converters for the following encodings (located in encoding.c):
UTF-8 is supported by default (null handlers)
UTF-16, both little and big endian
ISO-Latin-1 (ISO-8859-1) covering most western languages
ASCII, useful mostly for saving
HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML predefined entities like &copy; for the Copyright sign.
Moreover, when compiled on a Unix platform with iconv support, the full set of encodings supported by iconv can instantly be used by libxml. On a Linux machine with glibc-2.1 the list of supported encodings and aliases fills 3 full pages, and includes UCS-4, the full set of ISO-Latin encodings, and the various Japanese ones.
To convert the UTF-8 values returned from the API to another encoding, it is possible to use a function provided by the encoding module, like UTF8Toisolat1, or to use the POSIX iconv() API directly.
Encoding aliases
From 2.2.3, libxml2 has support for registering encoding name aliases. The goal is to be able to parse documents whose encoding is supported but where the name differs (for example from the default set of names accepted by iconv). The following functions allow registering and handling new aliases for existing encodings. Once registered, libxml2 will automatically look up the aliases when handling a document:
int xmlAddEncodingAlias(const char *name, const char *alias);
int xmlDelEncodingAlias(const char *alias);
const char * xmlGetEncodingAlias(const char *alias);
void xmlCleanupEncodingAliases(void);
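To show the idea behind such an alias table, here is a small standalone sketch. It is not the libxml2 implementation: the names add_encoding_alias and get_encoding_alias, the fixed table size, and the case-insensitive comparison are all assumptions made for this illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_ALIASES    16
#define ALIAS_NAME_LEN 32

struct alias_entry {
    char alias[ALIAS_NAME_LEN];  /* alternative name found in documents */
    char name[ALIAS_NAME_LEN];   /* encoding name it stands for */
};

static struct alias_entry alias_table[MAX_ALIASES];
static size_t alias_count;

/* Compare two encoding names, ignoring ASCII case. */
static int names_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (toupper((unsigned char)*a) != toupper((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;             /* both must end at the same time */
}

/* Register alias as another name for the encoding name.
 * Returns 0 on success, -1 on error. */
int add_encoding_alias(const char *name, const char *alias)
{
    if (name == NULL || alias == NULL)
        return -1;
    if (strlen(name) >= ALIAS_NAME_LEN || strlen(alias) >= ALIAS_NAME_LEN)
        return -1;
    for (size_t i = 0; i < alias_count; i++) {
        if (names_equal(alias_table[i].alias, alias)) {
            strcpy(alias_table[i].name, name);   /* replace mapping */
            return 0;
        }
    }
    if (alias_count == MAX_ALIASES)
        return -1;
    strcpy(alias_table[alias_count].alias, alias);
    strcpy(alias_table[alias_count].name, name);
    alias_count++;
    return 0;
}

/* Return the encoding name registered for alias, or NULL. */
const char *get_encoding_alias(const char *alias)
{
    if (alias == NULL)
        return NULL;
    for (size_t i = 0; i < alias_count; i++) {
        if (names_equal(alias_table[i].alias, alias))
            return alias_table[i].name;
    }
    return NULL;
}
```

After add_encoding_alias("ISO-8859-1", "latin1"), a lookup of "LATIN1" in this sketch returns "ISO-8859-1", so a document labeled with the alias could be routed to the existing converter.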
Adding support for a new encoding, or overriding one of the encoders (assuming it is buggy), should not be hard: just write input and output conversion routines to/from UTF-8 and register them using xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx); they will be called automatically if the parser(s) encounter such an encoding name (register it in uppercase, this will help). The encoders, their arguments and expected return values are described in the encoding.h header.
Daniel Veillard </body></html>