Blame doc/encoding.html

<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><link rel="SHORTCUT ICON" href="/favicon.ico" /><style type="text/css">
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
</style><title>Encodings support</title></head><body bgcolor="#8b7765" text="#000000" link="#a06060" vlink="#000000">

The XML C parser and toolkit of Gnome

Encodings support


If you are not really familiar with Internationalization (usually abbreviated
I18N), Unicode, characters and glyphs, I suggest you read a presentation
by Tim Bray on Unicode and why you should care about it.

If you don't understand why it does not make sense to have a string
without knowing what encoding it uses, then, as Joel Spolsky said, please do not
write another line of code until you finish reading that article. It is
a prerequisite for understanding this page, and for avoiding a lot of problems
with libxml2, XML or text processing in general.

Table of Contents:

  1. What does internationalization support mean ?
  2. The internal encoding, how and why
  3. How is it implemented ?
  4. Default supported encodings
  5. How to extend the existing support

    What does internationalization support mean ?

    XML was designed from the start to allow the support of any character set
    by using Unicode. Any conformant XML parser has to support the UTF-8 and
    UTF-16 default encodings, which can both express the full Unicode range. UTF-8
    is a variable-length encoding whose main strengths are that it encodes ASCII
    unchanged and saves space for Western languages, but it is a bit
    more complex to handle in practice. UTF-16 uses 2 bytes per character (and
    sometimes combines two such units into a surrogate pair); it makes
    implementation easier, but looks a bit overkill for encoding Western
    languages. Moreover, the XML specification
    allows the document to be encoded in other encodings on the condition that
    they are clearly labeled as such. For example, the following is a well-formed
    XML document encoded in ISO-8859-1 and using accented letters that we
    French like for both markup and content:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <très>là </très>
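
To make the size trade-off between the two mandatory encodings concrete, here is a minimal sketch in plain C that computes how many bytes one Unicode code point needs in each of them. The function names utf8_len and utf16_len are made up for this illustration and are not part of libxml2.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes needed to encode one Unicode code point in UTF-8.
 * Returns 0 for values that are not valid Unicode scalar values. */
size_t utf8_len(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are not characters */
    if (cp < 0x80)     return 1;  /* ASCII is stored unchanged */
    if (cp < 0x800)    return 2;  /* e.g. accented Latin letters */
    if (cp < 0x10000)  return 3;  /* rest of the Basic Multilingual Plane */
    if (cp < 0x110000) return 4;  /* supplementary planes */
    return 0;
}

/* Number of bytes needed to encode the same code point in UTF-16. */
size_t utf16_len(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x10000)  return 2;  /* one 16-bit unit */
    if (cp < 0x110000) return 4;  /* two 16-bit units (a surrogate pair) */
    return 0;
}
```

For the sample document above, every ASCII character costs 1 byte in UTF-8 against 2 in UTF-16, while the accented letters cost 2 bytes in both.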

    Having internationalization support in libxml2 means the following:

    • the document is properly parsed
    • information about its encoding is saved
    • it can be modified
    • it can be saved in its original encoding
    • it can also be saved in another encoding supported by libxml2 (for
      example straight UTF-8 or even an ASCII form)

      Another very important point is that the whole libxml2 API, with the
      exception of a few routines to read with a specific encoding or save to a
      specific encoding, is completely agnostic about the original encoding of the
      document.

      It should also be noted that the HTML parser embedded in libxml2 now obeys
      the same rules: the following document will be (as of 2.2.2) handled in
      an internationalized fashion by libxml2 too:

      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
                            "http://www.w3.org/TR/REC-html40/loose.dtd">
      <html lang="fr">
      <head>
        <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
      </head>
      <body>
      <p>W3C crée des standards pour le Web.</body>
      </html>

      The internal encoding, how and why

      One of the core decisions was to force all documents to be converted to a
      default internal encoding, and for that encoding to be UTF-8. Here is the
      rationale for those choices:

      • keeping the native encoding in the internal form would force libxml
        users (or the associated code) to be fully aware of the encoding of the
        original document. For example, when adding a text node to a document,
        the content would have to be provided in the document encoding, i.e. the
        client code would have to check it beforehand, make sure it conforms
        to the encoding, etc. This is very hard in practice, though in some
        specific cases it may make sense.
      • the second decision was which encoding. From the XML spec only UTF-8 and
        UTF-16 really make sense, being the only two encodings for which there
        is mandatory support. UCS-4 (a 32-bit fixed-size encoding) could be
        considered an intelligent choice too, since it is a direct mapping of
        Unicode. I selected UTF-8 on the basis of efficiency and compatibility
        with surrounding software:
        • UTF-8, while a bit more complex to convert from/to (i.e. slightly
          more costly to import and export CPU-wise), is also far more compact
          than UTF-16 (and UCS-4) for a majority of the documents I see it used
          for right now (RPM RDF catalogs, advogato data, various configuration
          file formats, etc.), and the key point for today's computer
          architecture is efficient use of caches. If one nearly doubles the
          memory requirement to store the same amount of data, this will thrash
          caches (main memory/external caches/internal caches), and my take is
          that this harms the system far more than the CPU requirements needed
          for the conversion to UTF-8
        • Most libxml2 version 1 users were using it with straight ASCII
          most of the time; converting to an internal encoding
          that required all their code to be rewritten was a serious show-stopper
          for using UTF-16 or UCS-4.
        • UTF-8 is being used as the de facto internal encoding standard for
          related code like pango, the
          upcoming Gnome text widget, and a lot of Unix code (yet another place
          where the Unix programmer base takes a different approach from Microsoft
          - they are using UTF-16)

          What does this mean in practice for the libxml2 user:

          • xmlChar, the libxml2 data type, is a byte; those bytes must be assembled
            as valid UTF-8 strings. The proper way to terminate an xmlChar * string
            is simply to append a 0 byte, as usual.
          • One just needs to make sure that when using chars outside the ASCII set,
            the values have been properly converted to UTF-8
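
As an illustration of the second point, here is a minimal sketch of converting an ISO-8859-1 string to UTF-8 before handing it to the API. The function latin1_to_utf8 is a hypothetical helper written for this page; libxml2's encoding module provides its own converters for this encoding, and iconv() can handle others.

```c
#include <stddef.h>

/* Convert a 0-terminated ISO-8859-1 string to a 0-terminated UTF-8 string.
 * In the worst case `out` needs 2 * strlen(in) + 1 bytes.
 * Returns the number of bytes written, not counting the final 0 byte,
 * or (size_t)-1 if `outsize` is too small. */
size_t latin1_to_utf8(const unsigned char *in, unsigned char *out,
                      size_t outsize)
{
    size_t n = 0;

    for (; *in != 0; in++) {
        if (*in < 0x80) {
            /* ASCII range: copied as is */
            if (n + 1 >= outsize) return (size_t)-1;
            out[n++] = *in;
        } else {
            /* 0x80-0xFF map to U+0080-U+00FF: always two UTF-8 bytes */
            if (n + 2 >= outsize) return (size_t)-1;
            out[n++] = (unsigned char)(0xC0 | (*in >> 6));
            out[n++] = (unsigned char)(0x80 | (*in & 0x3F));
        }
    }
    if (n >= outsize) return (size_t)-1;
    out[n] = 0;  /* the usual 0 byte terminates the string */
    return n;
}
```

The resulting buffer can then be passed wherever an xmlChar * is expected, typically through a cast.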

            How is it implemented ?

            Let's describe how all this works within libxml. Basically, the I18N
            (internationalization) support gets triggered only during I/O operations,
            i.e. when reading a document or saving one. Let's look first at the reading
            sequence:

            1. when a document is processed, we usually don't know the encoding; a
               simple heuristic allows UTF-16 and UCS-4 to be distinguished from
               encodings where the ASCII range (0-0x7F) maps to ASCII
            2. the XML declaration, if available, is parsed, including the encoding
               declaration. At that point, if the autodetected encoding is different
               from the one declared, a call to xmlSwitchEncoding() is issued.
            3. If there is no encoding declaration, then the input has to be in either
               UTF-8 or UTF-16; if it is not, then at some point when processing the
               input, the converter/checker of UTF-8 form will raise an encoding error.
               You may end up with a garbled document, or no document at all! Example:

              ~/XML -> ./xmllint err.xml
              err.xml:1: error: Input is not proper UTF-8, indicate encoding !
              <très>là </très>
                 ^
              err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
              <très>là </très>
                 ^

            4. xmlSwitchEncoding() does an encoding name lookup, canonicalizes it, and
               then searches the default registered encoding converters for that
               encoding. If it's not within the default set and iconv() support has
               been compiled in, it will ask iconv for such an encoder. If this fails
               then the parser will report an error and stop processing:

              ~/XML -> ./xmllint err2.xml
              err2.xml:1: error: Unsupported encoding UnsupportedEnc
              <?xml version="1.0" encoding="UnsupportedEnc"?>
                                                           ^

            5. From that point the encoder progressively processes the input (it is
               plugged in as a front-end to the I/O module) for that entity. It captures
               and converts on the fly the document to be parsed to UTF-8. The parser
               itself just does UTF-8 checking of this input and processes it
               transparently. The only difference is that the encoding information has
               been added to the parsing context (more precisely to the input
               corresponding to this entity).
            6. The result (when using DOM) is an internal form completely in UTF-8,
               with just the encoding information on the document node.
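
The heuristic of the first step can be sketched as a check on the first bytes of the input. The code below is an illustration in plain C following the autodetection appendix of the XML 1.0 specification, not a copy of libxml2's own code; the enum and the name guess_encoding are made up for this example. Within libxml2 the corresponding entry point is xmlDetectCharEncoding().

```c
#include <stddef.h>

typedef enum {
    ENC_GUESS_NONE,     /* ASCII-compatible: rely on the XML declaration */
    ENC_GUESS_UTF8_BOM, /* UTF-8 announced by a byte order mark */
    ENC_GUESS_UTF16LE,
    ENC_GUESS_UTF16BE,
    ENC_GUESS_UCS4LE,
    ENC_GUESS_UCS4BE
} enc_guess;

/* Look at the first bytes of a document and guess the encoding family. */
enc_guess guess_encoding(const unsigned char *p, size_t len)
{
    if (len >= 4) {
        /* UCS-4 byte order marks */
        if (p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
            return ENC_GUESS_UCS4BE;
        if (p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
            return ENC_GUESS_UCS4LE;
        /* '<' stored on four bytes, no byte order mark */
        if (p[0] == 0x00 && p[1] == 0x00 && p[2] == 0x00 && p[3] == 0x3C)
            return ENC_GUESS_UCS4BE;
        if (p[0] == 0x3C && p[1] == 0x00 && p[2] == 0x00 && p[3] == 0x00)
            return ENC_GUESS_UCS4LE;
        /* "<?" stored as two 16-bit units, no byte order mark */
        if (p[0] == 0x00 && p[1] == 0x3C && p[2] == 0x00 && p[3] == 0x3F)
            return ENC_GUESS_UTF16BE;
        if (p[0] == 0x3C && p[1] == 0x00 && p[2] == 0x3F && p[3] == 0x00)
            return ENC_GUESS_UTF16LE;
    }
    if (len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return ENC_GUESS_UTF8_BOM;
    if (len >= 2) {
        /* UTF-16 byte order marks */
        if (p[0] == 0xFE && p[1] == 0xFF) return ENC_GUESS_UTF16BE;
        if (p[0] == 0xFF && p[1] == 0xFE) return ENC_GUESS_UTF16LE;
    }
    return ENC_GUESS_NONE;
}
```

When the result is ENC_GUESS_NONE the declaration can be read byte by byte as ASCII, which is what makes step 2 possible.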

              OK, then what happens when saving the document (assuming you
              collected/built an xmlDoc DOM-like structure)? It depends on the function
              called: xmlSaveFile() will just try to save in the original encoding, while
              xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given
              encoding:

              1. if no encoding is given, libxml2 will look for an encoding value
                 associated with the document and, if it exists, will try to save to
                 that encoding; otherwise everything is written in the internal form,
                 i.e. UTF-8
              2. so if an encoding was specified, either at the API level or on the
                 document, libxml2 will again canonicalize the encoding name and look up
                 a converter in the registered set or through iconv. If none is found,
                 the function will return an error code
              3. the converter is placed before the I/O buffer layer, as another kind of
                 buffer; libxml2 then simply pushes the UTF-8 serialization through
                 that buffer, whose content is progressively converted and pushed onto
                 the I/O layer.
              4. It is possible that the converter code fails on some input; for example,
                 trying to push a UTF-8 encoded Chinese character through the UTF-8 to
                 ISO-8859-1 converter won't work. Since the encoders are progressive, they
                 will just report the error and the number of bytes converted; at that
                 point libxml2 will decode the offending character, remove it from the
                 buffer, replace it with the associated charRef encoding &#123; and
                 resume the conversion. This guarantees that any document will be saved
                 without losses (except for markup names, where this is not legal; this
                 is a problem in the current version, so in practice avoid using
                 non-ASCII characters for tag or attribute names). A special "ascii"
                 encoding name can be used to save documents in a pure ASCII form when
                 portability is really crucial
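
The fallback described in the last step can be sketched as follows. This is a simplified, non-progressive illustration in plain C, not libxml2's implementation: the names utf8_decode and utf8_to_latin1_with_charrefs are hypothetical, the decoder assumes input that is already well-formed UTF-8 (as the internal form is), and decimal character references are used.

```c
#include <stddef.h>
#include <stdio.h>

/* Decode one UTF-8 sequence at `s` into *cp.
 * Returns the number of bytes consumed, or 0 on a malformed sequence.
 * Minimal decoder: it does not reject overlong forms. */
static size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x0F) << 12) |
              ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x07) << 18) |
              ((unsigned long)(s[1] & 0x3F) << 12) |
              ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    return 0;
}

/* Write 0-terminated UTF-8 text as ISO-8859-1, replacing every character
 * above U+00FF with a numeric character reference such as &#20013;.
 * Returns the number of bytes written (without the final 0 byte),
 * or -1 on malformed input or if `out` is too small. */
int utf8_to_latin1_with_charrefs(const unsigned char *in, char *out,
                                 size_t outsize)
{
    size_t n = 0;

    while (*in != 0) {
        unsigned long cp;
        size_t used = utf8_decode(in, &cp);
        if (used == 0) return -1;
        if (cp <= 0xFF) {
            if (n + 1 >= outsize) return -1;
            out[n++] = (char)(unsigned char)cp;
        } else {
            /* not representable in ISO-8859-1: emit a character reference */
            int w = snprintf(out + n, outsize - n, "&#%lu;", cp);
            if (w < 0 || (size_t)w >= outsize - n) return -1;
            n += (size_t)w;
        }
        in += used;
    }
    if (n >= outsize) return -1;
    out[n] = 0;
    return (int)n;
}
```

Such a replacement is only legal in content and attribute values, which is why the markup-name limitation mentioned above exists.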

                Here are a few examples based on the same test document and assuming a
                terminal using ISO-8859-1 as the text encoding:

                ~/XML -> ./xmllint isolat1 
                <?xml version="1.0" encoding="ISO-8859-1"?>
                <très>là</très>
                ~/XML -> ./xmllint --encode UTF-8 isolat1 
                <?xml version="1.0" encoding="UTF-8"?>
                <très>là  </très>
                ~/XML -> 

                The same processing is applied (and reuses most of the code) for HTML I18N
                processing. Looking up and modifying the content encoding is a bit more
                difficult since it is located in a <meta> tag under the <head>,
                so a couple of functions, htmlGetMetaEncoding() and htmlSetMetaEncoding(),
                have been provided. The parser also attempts to switch encoding on the fly
                when detecting such a tag on input. Except for that, the processing is the
                same (and again reuses the same code).

                Default supported encodings

                libxml2 has a set of default converters for the following encodings
                (located in encoding.c):

                1. UTF-8 is supported by default (null handlers)
                2. UTF-16, both little and big endian
                3. ISO-Latin-1 (ISO-8859-1) covering most western languages
                4. ASCII, useful mostly for saving
                5. HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML
                   predefined entities like &copy; for the Copyright sign.

                  Moreover, when compiled on a Unix platform with iconv support, the full
                  set of encodings supported by iconv can instantly be used by libxml. On a
                  Linux machine with glibc-2.1 the list of supported encodings and aliases
                  fills 3 full pages, and includes UCS-4, the full set of ISO-Latin
                  encodings, and the various Japanese ones.

                  To convert from the UTF-8 values returned from the API to another encoding,
                  it is possible to use the functions provided by the encoding module, like
                  UTF8Toisolat1, or to use the POSIX iconv() API directly.
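
For the iconv() route, a minimal sketch could look like the following. The wrapper name utf8_to_encoding is hypothetical, and the set of encoding names accepted by iconv_open() depends on the platform. The sketch converts a whole string in one call, which is enough for stateless encodings such as ISO-8859-1.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert a 0-terminated UTF-8 string to `to_enc` using POSIX iconv().
 * Returns the number of bytes written to `out` (a single 0 byte is appended
 * but not counted), or (size_t)-1 on failure. (Hypothetical helper.) */
size_t utf8_to_encoding(const char *to_enc, const char *in,
                        char *out, size_t outsize)
{
    iconv_t cd;
    char *inp = (char *)in;   /* iconv() takes a non-const char ** */
    char *outp = out;
    size_t inleft = strlen(in);
    size_t outleft;
    size_t rc;

    if (outsize == 0)
        return (size_t)-1;
    outleft = outsize - 1;    /* keep room for the terminating 0 byte */

    cd = iconv_open(to_enc, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;    /* encoding not known to this iconv */

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;    /* unconvertible input or output too small */

    *outp = 0;
    return (size_t)(outp - out);
}
```

For example, utf8_to_encoding("ISO-8859-1", text, buf, sizeof buf) turns the two-byte UTF-8 form of an accented letter back into a single byte.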

                  Encoding aliases

                  From 2.2.3, libxml2 has support for registering encoding name aliases. The
                  goal is to be able to parse documents whose encoding is supported but where
                  the name differs (for example from the default set of names accepted by
                  iconv). The following functions allow registering and handling new aliases
                  for existing encodings. Once registered, libxml2 will automatically look up
                  the aliases when handling a document:

                  • int xmlAddEncodingAlias(const char *name, const char *alias);
                  • int xmlDelEncodingAlias(const char *alias);
                  • const char * xmlGetEncodingAlias(const char *alias);
                  • void xmlCleanupEncodingAliases(void);

                    How to extend the existing support

                    Adding support for a new encoding, or overriding one of the encoders
                    (assuming it is buggy), should not be hard: just write input and output
                    conversion routines to/from UTF-8, and register them using
                    xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx), and they will be
                    called automatically if the parser(s) encounter such an encoding name
                    (register it in uppercase, this will help). The encoders,
                    their arguments and expected return values are described in the
                    encoding.h header.
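
As an illustration of what such an input routine might look like, here is a sketch for a made-up single-byte encoding called "X-DEMO". The encoding, the function name demoToUTF8 and the simplified return conventions are assumptions for this example; the (out, outlen, in, inlen) argument shape is the one used by the converters in encoding.h, which remains the reference for the exact contract in your libxml2 version.

```c
/* Input converter for a made-up single-byte encoding "X-DEMO":
 * bytes 0x00-0x7F are ASCII, byte 0x80 is the euro sign (U+20AC),
 * every other byte is invalid.
 *
 * On entry *outlen and *inlen hold the sizes of the two buffers; on return
 * they hold the number of bytes produced and consumed. The routine is
 * progressive: when `out` is full it stops and reports how far it got.
 * Returns the number of bytes written, or -2 on an invalid input byte. */
int demoToUTF8(unsigned char *out, int *outlen,
               const unsigned char *in, int *inlen)
{
    int i = 0;  /* bytes consumed from `in` */
    int o = 0;  /* bytes produced in `out` */

    while (i < *inlen) {
        unsigned char c = in[i];
        if (c < 0x80) {
            if (o + 1 > *outlen) break;
            out[o++] = c;
        } else if (c == 0x80) {
            if (o + 3 > *outlen) break;
            out[o++] = 0xE2;  /* U+20AC as UTF-8: E2 82 AC */
            out[o++] = 0x82;
            out[o++] = 0xAC;
        } else {
            *outlen = o;
            *inlen = i;
            return -2;
        }
        i++;
    }
    *outlen = o;
    *inlen = i;
    return o;
}
```

Together with a matching output routine, it would be registered under its uppercase name with xmlNewCharEncodingHandler() as described above.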

                    Daniel Veillard

                    </body></html>