<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><link rel="SHORTCUT ICON" href="/favicon.ico" /><style type="text/css">
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
</style><title>Encodings support</title></head><body bgcolor="#8b7765" text="#000000" link="#a06060" vlink="#000000">

The XML C parser and toolkit of Gnome

Encodings support


If you are not really familiar with internationalization (usually abbreviated I18N), Unicode, characters and glyphs, I suggest you read a presentation by Tim Bray on Unicode and why you should care about it.

If you don't understand why it does not make sense to have a string without knowing what encoding it uses, then, as Joel Spolsky said, please do not write another line of code until you finish reading that article. It is a prerequisite for understanding this page, and for avoiding a lot of problems with libxml2, XML or text processing in general.

Table of Contents:

  1. What does internationalization support mean?
  2. The internal encoding, how and why
  3. How is it implemented?
  4. Default supported encodings
  5. How to extend the existing support

What does internationalization support mean?

XML was designed from the start to support any character set by using Unicode. Any conformant XML parser has to support the UTF-8 and UTF-16 encodings, both of which can express the full Unicode range. UTF-8 is a variable-length encoding whose main strengths are that it encodes ASCII characters unchanged and saves space for Western text, but it is a bit more complex to handle in practice. UTF-16 uses 2 bytes per character (and sometimes combines two such 2-byte units for a single character); this makes implementation easier, but looks a bit like overkill for Western languages. Moreover, the XML specification allows a document to use other encodings, on the condition that they are clearly labeled as such. For example, the following is a well-formed XML document encoded in ISO-8859-1 and using the accented letters that we French like, for both markup and content:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <très>là </très>
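To make the "variable length" remark about UTF-8 a bit more concrete, here is a minimal sketch in C of how a single Unicode code point maps to 1 to 4 bytes. The function name utf8_encode is purely illustrative and is not part of the libxml2 API; the sketch only assumes the standard UTF-8 bit layout.

```c
#include <stddef.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a surrogate, or a value above U+10FFFF).
 * Hypothetical helper, not a libxml2 function. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* ASCII: stored unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* e.g. accented Latin letters */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* rest of the basic plane */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* supplementary planes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this sketch, the letter è (U+00E8), which is the single byte 0xE8 in ISO-8859-1, becomes the two bytes 0xC3 0xA8, while plain ASCII letters stay one byte each.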

    Having internationalization support in libxml2 means the following:

  • the document is properly parsed
  • information about its encoding is saved
  • it can be modified
  • it can be saved in its original encoding
  • it can also be saved in another encoding supported by libxml2 (for example straight UTF-8 or even an ASCII form)

Another very important point is that the whole libxml2 API, with the exception of a few routines to read with a specific encoding or save to a specific encoding, is completely agnostic about the original encoding of the document.

It should also be noted that the HTML parser embedded in libxml2 now obeys the same rules: the following document will be (as of 2.2.2) handled in an internationalized fashion by libxml2 too:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
                          "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html lang="fr">
    <head>
      <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
    </head>
    <body>
    <p>W3C crée des standards pour le Web.</body>
    </html>

      The internal encoding, how and why

One of the core decisions was to force all documents to be converted to a default internal encoding, and for that encoding to be UTF-8. Here is the rationale for those choices:

  • Keeping the native encoding in the internal form would force libxml2 users (or the associated code) to be fully aware of the encoding of the original document. For example, when adding a text node to a document, the content would have to be provided in the document encoding, i.e. the client code would have to check it beforehand, make sure it conforms to the encoding, etc. This is very hard in practice, though in some specific cases it may make sense.
  • The second decision was which encoding. From the XML spec, only UTF-8 and UTF-16 really make sense, as they are the only two encodings for which support is mandatory. UCS-4 (a 32-bit fixed-size encoding) could be considered an intelligent choice too, since it is a direct mapping of Unicode. I selected UTF-8 on the basis of efficiency and compatibility with surrounding software:
      • UTF-8, while a bit more complex to convert from/to (i.e. slightly more costly CPU-wise to import and export), is also far more compact than UTF-16 (and UCS-4) for a majority of the documents I see it used for right now (RPM RDF catalogs, advogato data, various configuration file formats, etc.), and the key point for today's computer architectures is efficient use of caches. If one nearly doubles the memory requirement to store the same amount of data, this will thrash the caches (main memory/external caches/internal caches), and my take is that this harms the system far more than the CPU cost of the conversion to UTF-8.
      • Most libxml2 version 1 users were using it with straight ASCII most of the time; switching to an internal encoding that required all their code to be rewritten was a serious show-stopper for using UTF-16 or UCS-4.
      • UTF-8 is being used as the de facto internal encoding standard for related code like pango, the upcoming Gnome text widget, and a lot of Unix code (yet another place where the Unix programmer base takes a different approach from Microsoft, which uses UTF-16).
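The compactness argument can be checked on any sample text with a few lines of C. The following is only a sketch: the enc_sizes structure and the measure() function are made-up names, and the input is assumed to be valid, NUL-terminated UTF-8 (no validation is done here).

```c
#include <stddef.h>

/* Storage needed for the same text in three encodings, in bytes. */
struct enc_sizes {
    size_t utf8;
    size_t utf16;
    size_t ucs4;
};

/* Walk a valid, NUL-terminated UTF-8 string and add up what each
 * character would cost in UTF-8, UTF-16 and UCS-4.
 * Hypothetical helper; assumes the input is well-formed UTF-8. */
struct enc_sizes measure(const unsigned char *s)
{
    struct enc_sizes r = { 0, 0, 0 };

    while (*s) {
        size_t len;

        if (*s < 0x80)
            len = 1;                      /* ASCII */
        else if ((*s & 0xE0) == 0xC0)
            len = 2;
        else if ((*s & 0xF0) == 0xE0)
            len = 3;
        else
            len = 4;                      /* outside the basic plane */

        r.utf8 += len;
        r.utf16 += (len == 4) ? 4 : 2;    /* 4-byte UTF-8 needs a surrogate pair */
        r.ucs4 += 4;                      /* always 32 bits */
        s += len;
    }
    return r;
}
```

For mostly-ASCII markup such as the small French document used on this page, the UTF-16 form comes out close to twice the size of the UTF-8 form, and the UCS-4 form close to four times.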

          What does this mean in practice for the libxml2 user:

  • xmlChar, the libxml2 data type, is a byte; those bytes must be assembled into valid UTF-8 strings. The proper way to terminate an xmlChar * string is simply to append a 0 byte, as usual.
  • One just needs to make sure that when using characters outside the ASCII set, the values have been properly converted to UTF-8.
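If you are not sure whether a byte string you are about to hand to the API really is UTF-8, a check along the following lines can help. This is a sketch only: utf8_is_valid is a hypothetical name, not a libxml2 routine, and the typedef simply mirrors the fact that xmlChar is an unsigned byte.

```c
#include <stddef.h>

typedef unsigned char xmlChar;  /* same underlying type as in libxml2 */

/* Return 1 if s is a well-formed, NUL-terminated UTF-8 string, else 0.
 * Hypothetical helper for illustration; not part of libxml2. */
int utf8_is_valid(const xmlChar *s)
{
    while (*s) {
        unsigned long cp;
        int extra, i;

        if (*s < 0x80) {                       /* plain ASCII byte */
            s++;
            continue;
        } else if ((*s & 0xE0) == 0xC0) {
            cp = *s & 0x1F; extra = 1;
        } else if ((*s & 0xF0) == 0xE0) {
            cp = *s & 0x0F; extra = 2;
        } else if ((*s & 0xF8) == 0xF0) {
            cp = *s & 0x07; extra = 3;
        } else {
            return 0;                          /* stray continuation or bad lead */
        }
        s++;
        for (i = 0; i < extra; i++, s++) {
            if ((*s & 0xC0) != 0x80)           /* also stops on the final 0 byte */
                return 0;
            cp = (cp << 6) | (*s & 0x3F);
        }
        /* Reject overlong forms, surrogates and out-of-range values. */
        if ((extra == 1 && cp < 0x80) ||
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000))
            return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        if (cp > 0x10FFFF)
            return 0;
    }
    return 1;
}
```

A string holding "très" as ISO-8859-1 bytes fails this check, because the byte 0xE8 announces a multi-byte sequence that the following ASCII letter does not continue.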

How is it implemented?

Let's describe how all this works within libxml2. Basically, the I18N (internationalization) support gets triggered only during I/O operations, i.e. when reading a document or saving one. Let's look first at the reading sequence:

1. When a document is processed, we usually don't know the encoding; a simple heuristic allows UTF-16 and UCS-4 to be distinguished from encodings where the ASCII range (0-0x7F) maps to ASCII.
2. The XML declaration, if available, is parsed, including the encoding declaration. At that point, if the autodetected encoding is different from the one declared, a call to xmlSwitchEncoding() is issued.
3. If there is no encoding declaration, then the input has to be in either UTF-8 or UTF-16. If it is not, then at some point when processing the input, the converter/checker of UTF-8 form will raise an encoding error. You may end up with a garbled document, or no document at all! Example:

   ~/XML -> ./xmllint err.xml
   err.xml:1: error: Input is not proper UTF-8, indicate encoding !
   <très>là </très>
      ^
   err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
   <très>là </très>
      ^

4. xmlSwitchEncoding() does an encoding name lookup, canonicalizes it, and then searches the default registered encoding converters for that encoding. If it's not within the default set and iconv() support has been compiled in, it will ask iconv for such an encoder. If this fails, then the parser will report an error and stop processing:

   ~/XML -> ./xmllint err2.xml
   err2.xml:1: error: Unsupported encoding UnsupportedEnc
   <?xml version="1.0" encoding="UnsupportedEnc"?>
                                                ^

5. From that point on, the encoder progressively processes the input for that entity (it is plugged in as a front-end to the I/O module). It captures the document to be parsed and converts it on the fly to UTF-8. The parser itself just does UTF-8 checking of this input and processes it transparently. The only difference is that the encoding information has been added to the parsing context (more precisely, to the input corresponding to this entity).
6. The result (when using DOM) is an internal form completely in UTF-8, with just the encoding information kept on the document node.
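The heuristic of the first step can be sketched as a look at the first few bytes of the input, in the spirit of the autodetection appendix of the XML specification. This is not the libxml2 code: the enum, its values and the function name guess_encoding are assumptions made for the illustration, and only the most common byte patterns are covered.

```c
#include <stddef.h>

typedef enum {
    ENC_UNKNOWN,       /* nothing recognizable, fall back to the default */
    ENC_ASCII_COMPAT,  /* "<?xm" seen as plain bytes: read the declaration */
    ENC_UTF8,
    ENC_UTF16LE,
    ENC_UTF16BE,
    ENC_UCS4LE,
    ENC_UCS4BE
} enc_guess;

/* Guess the encoding family from the first bytes of a document.
 * Hypothetical helper; the real parser handles more cases. */
enc_guess guess_encoding(const unsigned char *in, size_t len)
{
    if (len >= 4) {
        /* '<' stored in 32 bits, or a 32-bit byte order mark */
        if (in[0] == 0x00 && in[1] == 0x00 && in[2] == 0x00 && in[3] == 0x3C)
            return ENC_UCS4BE;
        if (in[0] == 0x3C && in[1] == 0x00 && in[2] == 0x00 && in[3] == 0x00)
            return ENC_UCS4LE;
        if (in[0] == 0x00 && in[1] == 0x00 && in[2] == 0xFE && in[3] == 0xFF)
            return ENC_UCS4BE;
        if (in[0] == 0xFF && in[1] == 0xFE && in[2] == 0x00 && in[3] == 0x00)
            return ENC_UCS4LE;
        /* "<?" stored in 16-bit units, without a byte order mark */
        if (in[0] == 0x00 && in[1] == 0x3C && in[2] == 0x00 && in[3] == 0x3F)
            return ENC_UTF16BE;
        if (in[0] == 0x3C && in[1] == 0x00 && in[2] == 0x3F && in[3] == 0x00)
            return ENC_UTF16LE;
        /* "<?xm" as single bytes: some ASCII-compatible encoding */
        if (in[0] == 0x3C && in[1] == 0x3F && in[2] == 0x78 && in[3] == 0x6D)
            return ENC_ASCII_COMPAT;
    }
    if (len >= 3 && in[0] == 0xEF && in[1] == 0xBB && in[2] == 0xBF)
        return ENC_UTF8;                  /* UTF-8 byte order mark */
    if (len >= 2) {
        if (in[0] == 0xFE && in[1] == 0xFF)
            return ENC_UTF16BE;           /* UTF-16 byte order marks */
        if (in[0] == 0xFF && in[1] == 0xFE)
            return ENC_UTF16LE;
    }
    return ENC_UNKNOWN;
}
```

In the ENC_ASCII_COMPAT case the guess is deliberately vague: the bytes only tell us that the XML declaration can be read as ASCII, and it is the encoding declaration itself that then decides, as described in the second step.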

OK, then what happens when saving the document (assuming you collected/built an xmlDoc DOM-like structure)? It depends on the function called: xmlSaveFile() will just try to save in the original encoding, while xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given encoding:

1. If no encoding is given, libxml2 will look for an encoding value associated with the document and, if it exists, will try to save to that encoding; otherwise everything is written in the internal form, i.e. UTF-8.
2. So if an encoding was specified, either at the API level or on the document, libxml2 will again canonicalize the encoding name and look up a converter in the registered set or through iconv. If none is found, the function will return an error code.
3. The converter is placed before the I/O buffer layer, as another kind of buffer; libxml2 will then simply push the UTF-8 serialization through that buffer, which will progressively be converted and pushed onto the I/O layer.
4. It is possible that the converter code fails on some input; for example, trying to push a UTF-8 encoded Chinese character through the UTF-8 to ISO-8859-1 converter won't work. Since the encoders are progressive, they will just report the error and the number of bytes converted. At that point libxml2 will decode the offending character, remove it from the buffer, replace it with the associated character reference (of the form &#123;) and resume the conversion. This guarantees that any document will be saved without losses (except for markup names, where this is not legal; this is a problem in the current version, so in practice avoid using non-ASCII characters for tag or attribute names). A special "ascii" encoding name can be used to save documents in a pure ASCII form when portability is really crucial.
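The fallback of the last step can be sketched in one pass: characters that fit in ISO-8859-1 are written as one byte, anything else becomes a numeric character reference. The function name utf8_to_latin1_charref is invented for this example, the input is assumed to be valid UTF-8, and the decimal form of the reference is a choice made here, so do not read this as the actual libxml2 implementation.

```c
#include <stdio.h>
#include <stddef.h>

/* Convert valid, NUL-terminated UTF-8 text to ISO-8859-1, replacing
 * characters above U+00FF with "&#NNN;" references.
 * Returns the number of bytes written (without the final 0), or -1
 * if out is too small. Hypothetical helper, not a libxml2 function. */
int utf8_to_latin1_charref(const unsigned char *in, char *out, size_t outsize)
{
    size_t o = 0;

    while (*in) {
        unsigned long cp;
        int extra;

        /* Decode one character (input assumed well-formed). */
        if (*in < 0x80)                { cp = *in;        extra = 0; }
        else if ((*in & 0xE0) == 0xC0) { cp = *in & 0x1F; extra = 1; }
        else if ((*in & 0xF0) == 0xE0) { cp = *in & 0x0F; extra = 2; }
        else                           { cp = *in & 0x07; extra = 3; }
        in++;
        while (extra-- > 0)
            cp = (cp << 6) | (*in++ & 0x3F);

        if (cp <= 0xFF) {
            /* Representable: one output byte, keep room for the 0. */
            if (o + 1 >= outsize)
                return -1;
            out[o++] = (char)(unsigned char)cp;
        } else {
            /* Not representable: emit a character reference instead. */
            int n = snprintf(out + o, outsize - o, "&#%lu;", cp);
            if (n < 0 || (size_t)n >= outsize - o)
                return -1;
            o += (size_t)n;
        }
    }
    if (o >= outsize)
        return -1;
    out[o] = '\0';
    return (int)o;
}
```

This only works for text content and attribute values; as noted above, a reference is not legal inside an element or attribute name, which is why non-ASCII markup names are the problematic case.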

Here are a few examples based on the same test document, assuming a terminal using ISO-8859-1 as the text encoding:

    ~/XML -> ./xmllint isolat1
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <très>là</très>
    ~/XML -> ./xmllint --encode UTF-8 isolat1
    <?xml version="1.0" encoding="UTF-8"?>
    <très>là  </très>
    ~/XML ->

The same processing is applied (reusing most of the code) for HTML I18N processing. Looking up and modifying the content encoding is a bit more difficult since it is located in a <meta> tag under the <head>, so a couple of functions, htmlGetMetaEncoding() and htmlSetMetaEncoding(), have been provided. The parser also attempts to switch encoding on the fly when detecting such a tag on input. Except for that, the processing is the same (and again reuses the same code).
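To give an idea of what looking up the encoding in such a tag involves, here is a small sketch in C that pulls the charset value out of the CONTENT attribute string of a Content-Type <meta> tag. The function content_type_charset is a made-up helper for this page; it is not how htmlGetMetaEncoding() is implemented and it ignores many corner cases of real-world HTML.

```c
#include <ctype.h>
#include <stddef.h>

/* Copy the value following "charset=" (matched case-insensitively)
 * from a Content-Type string into out.
 * Returns 1 if a non-empty value was found, 0 otherwise.
 * Hypothetical helper, not part of libxml2. */
int content_type_charset(const char *content, char *out, size_t outsize)
{
    static const char key[] = "charset=";
    const size_t klen = sizeof key - 1;
    const char *p;
    size_t i, n = 0;

    if (outsize == 0)
        return 0;

    /* Find the key; the inner loop stops at the terminating 0 byte. */
    for (p = content; *p; p++) {
        for (i = 0; i < klen; i++)
            if (tolower((unsigned char)p[i]) != key[i])
                break;
        if (i == klen)
            break;
    }
    if (!*p)
        return 0;
    p += klen;

    /* Skip optional quotes or blanks, then copy up to a delimiter. */
    while (*p == '"' || *p == '\'' || *p == ' ')
        p++;
    while (*p && *p != ';' && *p != '"' && *p != '\'' &&
           !isspace((unsigned char)*p) && n + 1 < outsize)
        out[n++] = *p++;
    out[n] = '\0';
    return n > 0;
}
```

Applied to the string "text/html; charset=ISO-8859-1" from the HTML example earlier on this page, the sketch yields "ISO-8859-1", which is the name that would then go through the same lookup as an XML encoding declaration.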

                Default supported encodings

libxml2 has a set of default converters for the following encodings (located in encoding.c):

  1. UTF-8 is supported by default (null handlers)
  2. UTF-16, both little and big endian
  3. ISO-Latin-1 (ISO-8859-1), covering most Western languages
  4. ASCII, useful mostly for saving
  5. HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML predefined entities like &copy; for the copyright sign.

Moreover, when compiled on a Unix platform with iconv support, the full set of encodings supported by iconv can instantly be used by libxml2. On a Linux machine with glibc-2.1, the list of supported encodings and aliases fills 3 full pages, and includes UCS-4, the full set of ISO-Latin encodings, and the various Japanese ones.

To convert the UTF-8 values returned by the API to another encoding, it is possible to use a function provided by the encoding module, like UTF8Toisolat1, or to use the POSIX iconv() API directly.

                  Encoding aliases

From 2.2.3, libxml2 has support for registering encoding name aliases. The goal is to be able to parse documents whose encoding is supported but where the name differs (for example, from the default set of names accepted by iconv). The following functions allow registering and handling new aliases for existing encodings. Once registered, libxml2 will automatically look up the aliases when handling a document:

  • int xmlAddEncodingAlias(const char *name, const char *alias);
  • int xmlDelEncodingAlias(const char *alias);
  • const char * xmlGetEncodingAlias(const char *alias);
  • void xmlCleanupEncodingAliases(void);
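Conceptually an alias registry is just a small table mapping one name to another, with a case-insensitive lookup. The sketch below shows that idea with a fixed-size table; add_alias and get_alias are hypothetical names chosen to echo the functions listed above, and the sizes, the uppercase normalization and the return codes are assumptions of this example rather than a description of the libxml2 internals.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_ALIASES 16
#define MAX_NAME    32

static struct {
    char alias[MAX_NAME];   /* stored in uppercase for the lookup */
    char name[MAX_NAME];    /* the encoding name the alias stands for */
} alias_tab[MAX_ALIASES];
static size_t alias_count;

/* Bounded copy of src into dst, converted to uppercase. */
static void upper_copy(char *dst, const char *src)
{
    size_t i;

    for (i = 0; i + 1 < MAX_NAME && src[i]; i++)
        dst[i] = (char)toupper((unsigned char)src[i]);
    dst[i] = '\0';
}

/* Register alias for name, replacing an earlier registration of the
 * same alias. Returns 0 on success, -1 if the table is full. */
int add_alias(const char *name, const char *alias)
{
    char key[MAX_NAME];
    size_t i;

    upper_copy(key, alias);
    for (i = 0; i < alias_count; i++)
        if (strcmp(alias_tab[i].alias, key) == 0)
            break;
    if (i == alias_count) {
        if (alias_count == MAX_ALIASES)
            return -1;
        alias_count++;
    }
    strcpy(alias_tab[i].alias, key);
    strncpy(alias_tab[i].name, name, MAX_NAME - 1);
    alias_tab[i].name[MAX_NAME - 1] = '\0';
    return 0;
}

/* Return the encoding name registered for alias, or NULL if none. */
const char *get_alias(const char *alias)
{
    char key[MAX_NAME];
    size_t i;

    upper_copy(key, alias);
    for (i = 0; i < alias_count; i++)
        if (strcmp(alias_tab[i].alias, key) == 0)
            return alias_tab[i].name;
    return NULL;
}
```

After add_alias("ISO-8859-1", "latin1"), a lookup of "Latin1" or "LATIN1" gives back "ISO-8859-1", so a document labeled with the alias can be routed to the converter registered under the canonical name.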

                    How to extend the existing support

Adding support for a new encoding, or overriding one of the encoders (assuming it is buggy), should not be hard: just write input and output conversion routines to/from UTF-8 and register them using xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx); they will then be called automatically if the parser(s) encounter such an encoding name (register it in uppercase, this will help). The encoders, their arguments and expected return values are described in the encoding.h header.
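As an illustration of what such an input routine can look like, here is a sketch of an ISO-8859-15 to UTF-8 converter. The calling convention used here (buffer sizes passed through outlen and inlen, both updated on return to the bytes produced and consumed) follows my reading of encoding.h, but treat it as an assumption and check the header before relying on it. The name latin9ToUTF8 and the helper latin9_to_ucs are invented for this example, and the sketch avoids the libxml2 headers so that it stands alone.

```c
/* ISO-8859-15 equals ISO-8859-1 except for eight positions. */
static unsigned long latin9_to_ucs(unsigned char c)
{
    switch (c) {
    case 0xA4: return 0x20AC;   /* euro sign */
    case 0xA6: return 0x0160;   /* S with caron */
    case 0xA8: return 0x0161;   /* s with caron */
    case 0xB4: return 0x017D;   /* Z with caron */
    case 0xB8: return 0x017E;   /* z with caron */
    case 0xBC: return 0x0152;   /* OE ligature */
    case 0xBD: return 0x0153;   /* oe ligature */
    case 0xBE: return 0x0178;   /* Y with diaeresis */
    default:   return c;
    }
}

/* Convert up to *inlen bytes of ISO-8859-15 from in to UTF-8 in out,
 * which has room for *outlen bytes. On return *inlen holds the bytes
 * consumed and *outlen the bytes produced; the return value is the
 * number of bytes written. Conversion stops early when out is full,
 * so the caller can resume later (a progressive converter).
 * Hypothetical routine, modelled on the convention assumed above. */
int latin9ToUTF8(unsigned char *out, int *outlen,
                 const unsigned char *in, int *inlen)
{
    int i = 0, o = 0;

    while (i < *inlen) {
        unsigned long cp = latin9_to_ucs(in[i]);
        int need = cp < 0x80 ? 1 : (cp < 0x800 ? 2 : 3);

        if (o + need > *outlen)
            break;                        /* no room for this character */
        if (need == 1) {
            out[o++] = (unsigned char)cp;
        } else if (need == 2) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
        i++;
    }
    *inlen = i;
    *outlen = o;
    return o;
}
```

The matching output routine would do the reverse mapping and needs a way to signal characters that have no ISO-8859-15 equivalent. Both routines would then be handed to xmlNewCharEncodingHandler() together with the uppercase encoding name, as described above.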

                    Daniel Veillard

                    </body></html>