Blame doc/tutorial/ar01s09.html

Packit 423ecb
<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>Encoding Conversion</title><meta name="generator" content="DocBook XSL Stylesheets V1.61.2"><link rel="home" href="index.html" title="Libxml Tutorial"><link rel="up" href="index.html" title="Libxml Tutorial"><link rel="previous" href="ar01s08.html" title="Retrieving Attributes"><link rel="next" href="apa.html" title="A. Compilation"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">

Encoding Conversion

Data encoding compatibility problems are among the most common
difficulties encountered by programmers new to XML in general and libxml in
particular. Thinking through the design of your application in light of this
issue will help avoid difficulties later. Internally, libxml stores and
manipulates data in UTF-8. Data your program holds in another encoding, such
as the commonly used ISO-8859-1, must be converted to UTF-8 before it is
passed to libxml functions. If you want your program's output in an encoding
other than UTF-8, you must convert that as well.
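
To make the conversion requirement concrete, here is a minimal sketch that converts an ISO-8859-1 string to UTF-8 by hand. It is an illustration only: latin1_to_utf8 is a hypothetical helper, not a libxml function, and it assumes the input really is NUL-terminated ISO-8859-1.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of libxml): converts a NUL-terminated
 * ISO-8859-1 string to a newly allocated UTF-8 string.
 * Returns NULL if allocation fails; the caller frees the result.
 */
char *latin1_to_utf8(const char *in)
{
    size_t len, i, j = 0;
    unsigned char *out;

    assert(in != NULL);
    len = strlen(in);

    /* Worst case: every input byte expands to two bytes, plus the NUL. */
    out = malloc(len * 2 + 1);
    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c < 0x80) {
            out[j++] = c;                                 /* ASCII is unchanged */
        } else {
            out[j++] = (unsigned char)(0xC0 | (c >> 6));   /* lead byte 110000xx */
            out[j++] = (unsigned char)(0x80 | (c & 0x3F)); /* continuation 10xxxxxx */
        }
    }
    out[j] = '\0';
    return (char *)out;
}
```

Every byte below 0x80 is copied unchanged, and every other byte becomes a two-byte sequence. The four-byte input "caf\xE9" therefore becomes the five-byte string "caf\xC3\xA9", which is the form libxml expects.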

Libxml uses iconv, if it is available, to convert data. Without iconv, only
UTF-8, UTF-16 and ISO-8859-1 can be used as external formats. With iconv, any
format can be used provided iconv is able to convert it to and from UTF-8.
Currently iconv supports about 150 different character formats and can
convert from any of them to any other. The exact set varies between
implementations, but the encodings in common use are almost always covered.
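
Whether a particular encoding is usable can be checked by asking iconv directly. The sketch below assumes a POSIX system with iconv.h; to_utf8_iconv is a hypothetical helper name, and the encoding names accepted depend on the local iconv implementation.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: converts a NUL-terminated string from the encoding
 * named by `from` to UTF-8 using iconv. Returns a malloc'd string, or NULL
 * if iconv does not know the encoding or the conversion fails.
 */
char *to_utf8_iconv(const char *in, const char *from)
{
    iconv_t cd;
    size_t in_left, out_left;
    char *out, *in_ptr, *out_ptr;

    assert(in != NULL && from != NULL);

    cd = iconv_open("UTF-8", from);   /* note the order: target first */
    if (cd == (iconv_t)-1)
        return NULL;                  /* this encoding is not supported */

    in_left = strlen(in);
    out_left = in_left * 4;           /* UTF-8 needs at most 4 bytes per character */
    out = malloc(out_left + 1);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    in_ptr = (char *)in;              /* iconv reads but does not modify the input */
    out_ptr = out;
    if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1) {
        free(out);
        iconv_close(cd);
        return NULL;
    }
    *out_ptr = '\0';
    iconv_close(cd);
    return out;
}
```

A NULL result from iconv_open is the practical test for the condition stated above: if iconv cannot open a converter for the encoding, libxml cannot use it as an external format either.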

Warning

A common mistake is to use different formats for the internal data in
different parts of one's code. The most common case is an application that
assumes ISO-8859-1 to be the internal data format, combined with libxml,
which assumes UTF-8 to be the internal data format. The result is an
application that treats internal data differently depending on which section
of code is executing. One part of the code or the other will then inevitably
misinterpret the data.
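
The mismatch is easy to demonstrate with a single string. The following sketch uses a hypothetical helper, utf8_char_count, which assumes its input is valid UTF-8 and counts characters by skipping continuation bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: counts the characters in a string that is assumed
 * to be valid UTF-8. Continuation bytes (10xxxxxx) are not counted.
 */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the UTF-8 string "caf\xC3\xA9", utf8_char_count reports 4 characters while strlen reports 5 bytes. Code that assumes ISO-8859-1 would treat the same buffer as five characters and display the last two bytes as two separate, wrong characters.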

This example constructs a simple document, then adds content provided at the
command line to the document's root element and outputs the results to
<tt class="filename">stdout</tt> in the proper encoding. For this example, we
use ISO-8859-1 encoding. The string supplied on the command line is converted
from ISO-8859-1 to UTF-8 before it is added to the document. Full code:
Appendix H, Code for Encoding Conversion Example

The conversion, encapsulated in the example code in the
<tt class="function">convert</tt> function, uses libxml's
<tt class="function">xmlFindCharEncodingHandler</tt> function:

    xmlCharEncodingHandlerPtr handler;              /* (1) */

    size = (int)strlen(in)+1;                       /* (2) */
    out_size = size*2-1;
    out = malloc((size_t)out_size);

    handler = xmlFindCharEncodingHandler(encoding); /* (3) */

    handler->input(out, &out_size, in, &temp);      /* (4) */

    xmlSaveFormatFileEnc("-", doc, encoding, 1);    /* (5) */

1

<tt class="varname">handler</tt> is declared as a pointer to an
<tt class="function">xmlCharEncodingHandler</tt> structure, which holds the
conversion functions for one encoding.

2

The conversion function reached through the
<tt class="function">xmlCharEncodingHandler</tt> needs to be given the sizes
of the input and output buffers, which are calculated here for
<tt class="varname">in</tt> and <tt class="varname">out</tt>. Each ISO-8859-1
byte becomes at most two bytes in UTF-8, so the output buffer is sized at
twice the input length plus one byte for the terminator.

3

<tt class="function">xmlFindCharEncodingHandler</tt> takes as its argument
the data's initial encoding and searches libxml's built-in set of conversion
handlers, returning a pointer to the matching handler, or NULL if none is
found.

4

The conversion function identified by <tt class="varname">handler</tt>
requires as its arguments pointers to the output and input buffers, along
with a pointer to the length of each. The lengths must be determined
separately by the application. On return, they hold the number of bytes
written and the number of bytes consumed.

5

To output in a specified encoding rather than UTF-8, we use
<tt class="function">xmlSaveFormatFileEnc</tt>, specifying the encoding. The
file name "-" directs the output to <tt class="filename">stdout</tt>.


</body></html>