
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><link rel="SHORTCUT ICON" href="/favicon.ico" /><style type="text/css">

TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }

</style><title>Encodings support</title></head><body bgcolor="#8b7765" text="#000000" link="#a06060" vlink="#000000"> The XML C parser and toolkit of Gnome
Encodings support
If you are not really familiar with internationalization (usually abbreviated
I18N), Unicode, characters and glyphs, I suggest you read a presentation
by Tim Bray on Unicode and why you should care about it.
If you don't understand why it does not make sense to have a string
without knowing what encoding it uses, then, as Joel Spolsky said, please do not
write another line of code until you finish reading that article. It is
a prerequisite to understanding this page, and to avoiding a lot of problems with
libxml2, XML or text processing in general.
Table of Contents:
  What does internationalization support mean ?
  The internal encoding, how and why
  How is it implemented ?
  Default supported encodings
  How to extend the existing support

What does internationalization support mean ?
XML was designed from the start to support any character set
by using Unicode. Any conformant XML parser has to support the UTF-8 and
UTF-16 default encodings, which can both express the full Unicode range. UTF-8
is a variable-length encoding whose greatest strengths are that it reuses the same
encoding for ASCII and saves space for Western languages, but it is a bit
more complex to handle in practice. UTF-16 uses 2 bytes per character (and
sometimes combines two such 2-byte units into a pair); it makes implementation easier, but looks
a bit like overkill for encoding Western languages. Moreover, the XML specification
allows the document to be encoded in other encodings on the condition that
they are clearly labeled as such. For example, the following is a well-formed
XML document encoded in ISO-8859-1 and using accented letters that we
French like for both markup and content:
<?xml version="1.0" encoding="ISO-8859-1"?>
<très>là </très>
Having internationalization support in libxml2 means the following:
  the document is properly parsed
  information about its encoding is saved
  it can be modified
  it can be saved in its original encoding
  it can also be saved in another encoding supported by libxml2 (for
    example straight UTF-8 or even an ASCII form)
Another very important point is that the whole libxml2 API, with the
exception of a few routines to read with a specific encoding or save to a
specific encoding, is completely agnostic about the original encoding of the
document.
It should be noted too that the HTML parser embedded in libxml2 now obeys
the same rules; the following document will be (as of 2.2.2) handled in
an internationalized fashion by libxml2 too:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
                      "http://www.w3.org/TR/REC-html40/loose.dtd">
<html lang="fr">
<head>
  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
</head>
<body>
<p>W3C crée des standards pour le Web.</body>
</html>
The internal encoding, how and why
One of the core decisions was to force all documents to be converted to a
default internal encoding, and for that encoding to be UTF-8. Here is the
rationale for those choices:

  keeping the native encoding in the internal form would force the libxml
    users (or the associated code) to be fully aware of the encoding of the
    original document. For example, when adding a text node to a document,
    the content would have to be provided in the document encoding, i.e. the
    client code would have to check it beforehand, make sure it conforms
    to the encoding, etc. This is very hard in practice, though in some specific
    cases it may make sense.
  the second decision was which encoding. From the XML spec, only UTF-8 and
    UTF-16 really make sense, being the only two encodings for which there
    is mandatory support. UCS-4 (a 32-bit fixed-size encoding) could be
    considered an intelligent choice too, since it is a direct mapping of
    Unicode. I selected UTF-8 on the basis of efficiency and compatibility
    with surrounding software:
      UTF-8, while a bit more complex to convert from/to (i.e. slightly
        more costly to import and export CPU-wise), is also far more compact
        than UTF-16 (and UCS-4) for a majority of the documents I see it used
        for right now (RPM RDF catalogs, advogato data, various configuration
        file formats, etc.), and the key point for today's computer
        architectures is efficient use of caches. If one nearly doubles the
        memory requirement to store the same amount of data, this will thrash
        caches (main memory/external caches/internal caches), and my take is
        that this harms the system far more than the CPU requirements needed
        for the conversion to UTF-8
      Most libxml2 version 1 users were using it with straight ASCII
        most of the time; switching to an internal encoding
        that required all their code to be rewritten was a serious show-stopper
        for using UTF-16 or UCS-4.
      UTF-8 is being used as the de facto internal encoding standard for
        related code like pango (the
        upcoming Gnome text widget) and a lot of Unix code (yet another place
        where the Unix programmer base takes a different approach from Microsoft,
        which uses UTF-16)
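To make the size argument above concrete, here is a minimal sketch in plain C (it is not libxml2 code) that counts how many bytes the same text needs in UTF-8, UTF-16 and UCS-4. The names EncodedSizes and measure_sizes are made up for this illustration, and the input is assumed to be valid, NUL-terminated UTF-8.

```c
#include <stddef.h>

/* Byte counts needed to store the same text in three encodings. */
typedef struct {
    size_t utf8;
    size_t utf16;
    size_t ucs4;
} EncodedSizes;

/*
 * Hypothetical helper: walks a NUL-terminated string that is ASSUMED to be
 * valid UTF-8 and adds up the storage each encoding would need.
 */
EncodedSizes measure_sizes(const unsigned char *s)
{
    EncodedSizes r = {0, 0, 0};

    while (*s != 0) {
        size_t len;

        /* The lead byte tells how long the UTF-8 sequence is. */
        if (*s < 0x80)      len = 1;   /* ASCII */
        else if (*s < 0xE0) len = 2;   /* U+0080..U+07FF */
        else if (*s < 0xF0) len = 3;   /* U+0800..U+FFFF */
        else                len = 4;   /* beyond the Basic Multilingual Plane */

        r.utf8  += len;
        r.utf16 += (len == 4) ? 4 : 2; /* 4 bytes only when a pair is needed */
        r.ucs4  += 4;                  /* always 4 bytes per character */
        s += len;
    }
    return r;
}
```

For a 15-character ASCII string such as "<doc>text</doc>" this gives 15, 30 and 60 bytes respectively, which is the near doubling of memory discussed above.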

What does this mean in practice for the libxml2 user:
  xmlChar, the libxml2 data type, is a byte; those bytes must be assembled
    as valid UTF-8 strings. The proper way to terminate an xmlChar * string
    is simply to append a 0 byte, as usual.
  One just needs to make sure that when using characters outside the ASCII set,
    the values have been properly converted to UTF-8
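As a rough illustration of what "valid UTF-8" means for an xmlChar * buffer, the following sketch checks a byte sequence for well-formedness. It is plain C written for this page, not the checker libxml2 uses internally, and the function name is_valid_utf8 is an assumption.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8,
 * 0 otherwise. Rejects truncated sequences, stray continuation bytes,
 * overlong forms, surrogates and values above U+10FFFF.
 */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;       /* number of continuation bytes expected */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80)                { extra = 0; cp = c;        min = 0; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;      /* stray continuation byte or invalid lead byte */

        if (extra > len - i - 1)
            return 0;       /* sequence is cut off by the end of the buffer */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;   /* not a 10xxxxxx continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min) return 0;                      /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                 /* outside Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate range */

        i += extra + 1;
    }
    return 1;
}
```

With this check, the ISO-8859-1 bytes 0xE8 0x73 0x3E 0x6C ("ès>l") are rejected: 0xE8 announces a three-byte sequence, but 0x73 is not a continuation byte. Text in that state has to be converted to UTF-8 before being handed to the API.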

How is it implemented ?
Let's describe how all this works within libxml2. Basically, the I18N
(internationalization) support gets triggered only during I/O operations, i.e.
when reading a document or saving one. Let's look first at the reading
sequence:

  when a document is processed, we usually don't know the encoding; a
    simple heuristic allows telling UTF-16 and UCS-4 apart from encodings where
    the ASCII range (0-0x7F) maps to ASCII
  the XML declaration, if available, is parsed, including the encoding
    declaration. At that point, if the autodetected encoding is different
    from the one declared, a call to xmlSwitchEncoding() is issued.
  If there is no encoding declaration, then the input has to be in either
    UTF-8 or UTF-16; if it is not, then at some point when processing the
    input, the converter/checker of UTF-8 form will raise an encoding error.
    You may end up with a garbled document, or no document at all! Example:
    ~/XML -> ./xmllint err.xml
err.xml:1: error: Input is not proper UTF-8, indicate encoding !
<très>là </très>
err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
<très>là </très>
  xmlSwitchEncoding() does an encoding name lookup, canonicalizes it, and
    then searches the default registered encoding converters for that encoding.
    If it's not within the default set and iconv() support has been compiled
    in, it will ask iconv for such an encoder. If this fails, then the parser
    will report an error and stop processing:
    ~/XML -> ./xmllint err2.xml
err2.xml:1: error: Unsupported encoding UnsupportedEnc
<?xml version="1.0" encoding="UnsupportedEnc"?>
  From that point, the encoder progressively processes the input (it is
    plugged in as a front end to the I/O module) for that entity. It captures
    and converts on the fly the document to be parsed to UTF-8. The parser
    itself just does UTF-8 checking of this input and processes it
    transparently. The only difference is that the encoding information has
    been added to the parsing context (more precisely, to the input
    corresponding to this entity).
  The result (when using DOM) is an internal form completely in UTF-8
    with just the encoding information on the document node.
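The detection step at the top of this list can be sketched as follows. This is a simplified illustration based on the byte patterns given in the XML specification's appendix on autodetection of character encodings, not the code libxml2 actually runs. The enum and function names are assumptions, and UCS-4 byte order marks and unusual byte orders are deliberately left out.

```c
#include <stddef.h>

/* Possible outcomes of looking at the first bytes of a document. */
typedef enum {
    GUESS_UNKNOWN,      /* no evidence: treat as UTF-8 unless declared */
    GUESS_UTF8_BOM,     /* UTF-8 byte order mark */
    GUESS_UTF16_BE,
    GUESS_UTF16_LE,
    GUESS_UCS4_BE,
    GUESS_UCS4_LE,
    GUESS_ASCII_COMPAT  /* "<?xm" readable as ASCII: go read the declaration */
} EncodingGuess;

/* Hypothetical helper: guesses the encoding family from the leading bytes. */
EncodingGuess guess_encoding(const unsigned char *b, size_t len)
{
    if (len >= 4) {
        /* '<' stored as a 4-byte unit, big or little endian */
        if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0x00 && b[3] == 0x3C)
            return GUESS_UCS4_BE;
        if (b[0] == 0x3C && b[1] == 0x00 && b[2] == 0x00 && b[3] == 0x00)
            return GUESS_UCS4_LE;
        /* "<?" stored as 2-byte units without a byte order mark */
        if (b[0] == 0x00 && b[1] == 0x3C && b[2] == 0x00 && b[3] == 0x3F)
            return GUESS_UTF16_BE;
        if (b[0] == 0x3C && b[1] == 0x00 && b[2] == 0x3F && b[3] == 0x00)
            return GUESS_UTF16_LE;
        /* "<?xm" in any encoding whose 0-0x7F range maps to ASCII */
        if (b[0] == 0x3C && b[1] == 0x3F && b[2] == 0x78 && b[3] == 0x6D)
            return GUESS_ASCII_COMPAT;
    }
    if (len >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return GUESS_UTF8_BOM;
    if (len >= 2) {
        /* UTF-16 byte order marks */
        if (b[0] == 0xFE && b[1] == 0xFF) return GUESS_UTF16_BE;
        if (b[0] == 0xFF && b[1] == 0xFE) return GUESS_UTF16_LE;
    }
    return GUESS_UNKNOWN;
}
```

A document starting directly with <très> in ISO-8859-1 and carrying no declaration yields GUESS_UNKNOWN here, which matches the err.xml case above: nothing tells the parser that the input is not UTF-8, so the error only shows up when the byte 0xE8 is checked.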

OK, then what happens when saving the document (assuming you
collected/built an xmlDoc DOM-like structure)? It depends on the function
called: xmlSaveFile() will just try to save in the original encoding, while
xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given
encoding:

  if no encoding is given, libxml2 will look for an encoding value
    associated with the document and, if it exists, will try to save to that
    encoding;
    otherwise everything is written in the internal form, i.e. UTF-8
  so if an encoding was specified, either at the API level or on the
    document, libxml2 will again canonicalize the encoding name and look up a
    converter in the registered set or through iconv. If none is found, the
    function will return an error code
  the converter is placed before the I/O buffer layer, as another kind of
    buffer; libxml2 will then simply push the UTF-8 serialization through
    that buffer, which will then progressively be converted and pushed onto
    the I/O layer.
  It is possible that the converter code fails on some input; for example,
    trying to push a UTF-8 encoded Chinese character through the UTF-8 to
    ISO-8859-1 converter won't work. Since the encoders are progressive, they
    will just report the error and the number of bytes converted; at that
    point libxml2 will decode the offending character, remove it from the
    buffer, replace it with the associated charRef encoding &#123;, and
    resume the conversion. This guarantees that any document will be saved
    without losses (except for markup names, where this is not legal; this is
    a problem in the current version, so in practice avoid using non-ASCII
    characters for tag or attribute names). A special "ascii" encoding name
    can be used to save documents in a pure ASCII form when
    portability is really crucial
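The charRef fallback described in the last item can be illustrated with a small standalone converter. This is only a sketch of the idea, written in plain C for this page: it is not the libxml2 output code, the name utf8_to_latin1_charref is made up, and the input is assumed to be valid, NUL-terminated UTF-8.

```c
#include <stdio.h>
#include <string.h>
#include <stddef.h>

/*
 * Hypothetical helper: converts UTF-8 text to ISO-8859-1. Characters that
 * ISO-8859-1 cannot represent are written as numeric character references.
 * Returns the number of bytes written (not counting the final 0 byte),
 * or -1 if the output buffer is too small.
 */
int utf8_to_latin1_charref(const unsigned char *in, char *out, size_t outsize)
{
    size_t o = 0;

    while (*in != 0) {
        unsigned long cp;
        int extra;

        /* Decode one code point; the input is assumed to be valid UTF-8. */
        if (*in < 0x80)      { cp = *in;        extra = 0; }
        else if (*in < 0xE0) { cp = *in & 0x1F; extra = 1; }
        else if (*in < 0xF0) { cp = *in & 0x0F; extra = 2; }
        else                 { cp = *in & 0x07; extra = 3; }
        in++;
        while (extra-- > 0)
            cp = (cp << 6) | (*in++ & 0x3F);

        if (cp <= 0xFF) {
            /* Representable: ISO-8859-1 bytes equal the code point value. */
            if (o + 1 >= outsize) return -1;
            out[o++] = (char)cp;
        } else {
            /* Not representable: fall back to a decimal charRef. */
            char ref[16];
            int n = snprintf(ref, sizeof ref, "&#%lu;", cp);
            if (o + (size_t)n >= outsize) return -1;
            memcpy(out + o, ref, (size_t)n);
            o += (size_t)n;
        }
    }
    if (o >= outsize) return -1;
    out[o] = '\0';
    return (int)o;
}
```

Given the UTF-8 text "là" followed by a space and the Chinese character U+4E2D, the sketch writes the à as the single byte 0xE0 and the Chinese character as &#20013;, so nothing is lost in the ISO-8859-1 output.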

Here are a few examples based on the same test document and assuming a
terminal using ISO-8859-1 as the text encoding:
~/XML -> ./xmllint isolat1
<?xml version="1.0" encoding="ISO-8859-1"?>
<très>là </très>
~/XML -> ./xmllint --encode UTF-8 isolat1
<?xml version="1.0" encoding="UTF-8"?>
<trÃ¨s>lÃ  </trÃ¨s>
~/XML ->
The same processing is applied (and reuses most of the code) for HTML I18N
processing. Looking up and modifying the content encoding is a bit more
difficult since it is located in a <meta> tag under the <head>,
so a couple of functions, htmlGetMetaEncoding() and htmlSetMetaEncoding(), have
been provided. The parser also attempts to switch encoding on the fly when
detecting such a tag on input. Except for that, the processing is the same
(and again reuses the same code).
Default supported encodings
libxml2 has a set of default converters for the following encodings
(located in encoding.c):
  UTF-8 is supported by default (null handlers)
  UTF-16, both little and big endian
  ISO-Latin-1 (ISO-8859-1) covering most western languages
  ASCII, useful mostly for saving
  HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML
    predefined entities like &copy; for the Copyright sign.
Moreover, when compiled on a Unix platform with iconv support, the full
set of encodings supported by iconv can instantly be used by libxml2. On a
Linux machine with glibc-2.1 the list of supported encodings and aliases fills
3 full pages, and includes UCS-4, the full set of ISO-Latin encodings, and the
various Japanese ones.
To convert from the UTF-8 values returned from the API to another encoding,
it is possible to use a function provided by the encoding module, like
UTF8Toisolat1, or to use the POSIX iconv() API directly.
Encoding aliases
From 2.2.3, libxml2 supports registering encoding name aliases. The
goal is to be able to parse documents whose encoding is supported but where
the name differs (for example from the default set of names accepted by
iconv). The following functions allow registering and handling new aliases for
existing encodings. Once registered, libxml2 will automatically look up the
aliases when handling a document:
  int xmlAddEncodingAlias(const char *name, const char *alias);
  int xmlDelEncodingAlias(const char *alias);
  const char * xmlGetEncodingAlias(const char *alias);
  void xmlCleanupEncodingAliases(void);

How to extend the existing support
Adding support for a new encoding, or overriding one of the encoders
(assuming it is buggy), should not be hard: just write input and output
conversion routines to/from UTF-8, and register them using
xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx), and they will be
called automatically if the parser(s) encounter such an encoding name
(register it in uppercase, this will help). The encoders,
their arguments and expected return values are described in the encoding.h
header.
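To give an idea of what such a routine looks like, here is a sketch of an input converter (the xxxToUTF8 side) in plain C. For simplicity it re-implements ISO-8859-1 to UTF-8, which libxml2 already ships, so it is only a template for an encoding that is not covered. The argument layout (output buffer and length, input buffer and length, with both lengths updated on return) is my understanding of the shape declared in encoding.h, and the return convention used here is an assumption, so check the header before registering anything.

```c
/*
 * Hypothetical converter: ISO-8859-1 input to UTF-8 output.
 *
 * On entry, *outlen is the room available in out and *inlen is the number
 * of input bytes. On return, *outlen is the number of bytes produced and
 * *inlen the number of bytes consumed. The return value (bytes written)
 * is an ASSUMED convention; encoding.h is the authority.
 */
int latin1_to_utf8(unsigned char *out, int *outlen,
                   const unsigned char *in, int *inlen)
{
    int o = 0;
    int i = 0;

    while (i < *inlen) {
        unsigned char c = in[i];
        int need = (c < 0x80) ? 1 : 2;

        /* Progressive conversion: stop cleanly when the output is full,
         * the caller can come back with the remaining input. */
        if (o + need > *outlen)
            break;

        if (c < 0x80) {
            out[o++] = c;                    /* ASCII is copied unchanged */
        } else {
            out[o++] = 0xC0 | (c >> 6);      /* lead byte 110xxxxx */
            out[o++] = 0x80 | (c & 0x3F);    /* continuation byte 10xxxxxx */
        }
        i++;
    }

    *inlen = i;
    *outlen = o;
    return o;
}
```

Fed the two ISO-8859-1 bytes 0x6C 0xE0 ("là"), the routine produces the three UTF-8 bytes 0x6C 0xC3 0xA0 and reports two bytes consumed. Stopping early when the output buffer is full is what lets a converter of this kind be plugged in front of the I/O layer and called repeatedly.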
Daniel Veillard
</body></html>
