<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>Encoding Conversion</title><meta name="generator" content="DocBook XSL Stylesheets V1.61.2"><link rel="home" href="index.html" title="Libxml Tutorial"><link rel="up" href="index.html" title="Libxml Tutorial"><link rel="previous" href="ar01s08.html" title="Retrieving Attributes"><link rel="next" href="apa.html" title="A. Compilation"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
Data encoding compatibility problems are among the most common difficulties encountered by programmers new to XML in general and libxml in particular. Thinking through the design of your application in light of this issue will help you avoid difficulties later. Internally, libxml stores and manipulates data in UTF-8. Data that your program holds in other encodings, such as the commonly used ISO-8859-1, must be converted to UTF-8 before it is passed to libxml functions. If you want your program's output in an encoding other than UTF-8, you must convert that as well.

Libxml uses iconv, if it is available, to convert data. Without iconv, only UTF-8, UTF-16 and ISO-8859-1 can be used as external formats. With iconv, any format can be used provided iconv is able to convert it to and from UTF-8. Currently iconv supports about 150 different character formats, with the ability to convert from any to any. While the actual number of supported formats varies between implementations, nearly every iconv implementation supports the encodings in common use.

Warning
A common mistake is to use different formats for the internal data in different parts of one's code. The most common case is an application that assumes ISO-8859-1 to be the internal data format, combined with libxml, which assumes UTF-8 to be the internal data format. The result is an application that treats internal data differently, depending on which code section is executing. One part of the code or the other will then, naturally, misinterpret the data.

This example constructs a simple document, then adds content provided at the command line to the document's root element and outputs the results to <tt class="filename">stdout</tt> in the proper encoding. For this example, we use ISO-8859-1 encoding. The string supplied at the command line is converted from ISO-8859-1 to UTF-8. Full code: Appendix H, Code for Encoding Conversion Example

The conversion, encapsulated in the example code in the <tt class="function">convert</tt> function, uses libxml's <tt class="function">xmlFindCharEncodingHandler</tt> function:

xmlCharEncodingHandlerPtr handler;
size = (int)strlen(in)+1;
out_size = size*2-1;
out = malloc((size_t)out_size);

…
handler = xmlFindCharEncodingHandler(encoding);
…
handler->input(out, &out_size, in, &temp);
…
xmlSaveFormatFileEnc("-", doc, encoding, 1);

<tt class="varname">handler</tt> is declared as a pointer to an <tt class="function">xmlCharEncodingHandler</tt> structure, which holds the conversion functions for one encoding. Those functions need to be given the sizes of the input and output strings, which are calculated here for strings <tt class="varname">in</tt> and <tt class="varname">out</tt>. Because each ISO-8859-1 byte expands to at most two bytes of UTF-8, the output buffer is sized at twice the input length plus a terminator. <tt class="function">xmlFindCharEncodingHandler</tt> takes as its argument the data's initial encoding and searches libxml's built-in set of conversion handlers, returning a pointer to the handler or NULL if none is found. The conversion function reached through <tt class="varname">handler</tt> takes as its arguments pointers to the input and output strings, along with pointers to the length of each. The lengths must be determined separately by the application. To output in a specified encoding rather than UTF-8, we use <tt class="function">xmlSaveFormatFileEnc</tt>, specifying the encoding.
</body></html>