<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 1 September 2005), see www.w3.org" />
<meta http-equiv="Content-Type" content=
"text/html; charset=us-ascii" />
<title>docbook2X: Character set conversion</title>
<link rel="stylesheet" href="docbook2X.css" type="text/css" />
<link rev="made" href="mailto:stevecheng@users.sourceforge.net" />
<meta name="generator" content="DocBook XSL Stylesheets V1.68.1" />
<meta name="description" content=
"Discussion on reproducing non-ASCII characters in the converted output" />
<link rel="start" href="docbook2X.html" title=
"docbook2X: Documentation Table of Contents" />
<link rel="up" href="docbook2X.html" title=
"docbook2X: Documentation Table of Contents" />
<link rel="prev" href="sgml2xml-isoent.html" title=
"docbook2X: sgml2xml-isoent" />
<link rel="next" href="utf8trans.html" title=
"docbook2X: utf8trans" />
</head>
<body>
<div class="navheader">
<table width="100%" summary="Navigation header">
<tr>
<th colspan="3" align="center">Character set conversion</th>
</tr>
<tr>
<td width="20%" align="left"><a accesskey="p" href=
"sgml2xml-isoent.html">&lt;&lt; Previous</a>&nbsp;</td>
<th width="60%" align="center">&nbsp;</th>
<td width="20%" align="right">&nbsp;<a accesskey="n" href=
"utf8trans.html">Next &gt;&gt;</a></td>
</tr>
</table>
<hr /></div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title"><a id="charsets" name="charsets"></a>Character
set conversion</h2>
</div>
</div>
</div>
<a id="id2537977" class="indexterm" name="id2537977"></a><a id=
"id2537983" class="indexterm" name="id2537983"></a><a id=
"id2537990" class="indexterm" name="id2537990"></a><a id=
"id2537997" class="indexterm" name="id2537997"></a><a id=
"id2538612" class="indexterm" name="id2538612"></a><a id=
"id2538619" class="indexterm" name="id2538619"></a><a id=
"id2538625" class="indexterm" name="id2538625"></a><a id=
"id2538632" class="indexterm" name="id2538632"></a><a id=
"id2538639" class="indexterm" name="id2538639"></a><a id=
"id2538649" class="indexterm" name="id2538649"></a><a id=
"id2538656" class="indexterm" name="id2538656"></a>
<p>When translating XML to legacy ASCII-based formats with poor
support for Unicode, such as man pages and Texinfo, the Unicode
characters in the source document must also be translated
somehow.</p>
<p>A straightforward character set conversion from Unicode does not
suffice, because the target character set, usually US-ASCII or ISO
Latin-1, does not contain common characters such as dashes and
directional quotation marks that are widely used in XML documents.
However, the document formatters (man and Texinfo) allow such
characters to be entered with a markup escape: for example, <span class=
"markup">\(lq</span> for the left directional quote <code class=
"literal">&ldquo;</code>. If no markup-level escape is
available, an ASCII transliteration may be used instead: for example,
the ASCII less-than sign <span class="markup">&lt;</span> for
the angle quotation mark <span class="markup">&lang;</span>.</p>
<p>So the Unicode character problem can be solved in two steps:</p>
<div class="orderedlist">
<ol type="1">
<li>
<p><a href="utf8trans.html"><span><strong class=
"command">utf8trans</strong></span></a>, a program included in
docbook2X, maps Unicode characters to markup-level escapes or
transliterations.</p>
<p>Since there is not necessarily a fixed, official mapping of
Unicode characters to such escapes, <span><strong class=
"command">utf8trans</strong></span>, unlike most character set
converters, can read user-modifiable character mappings from text
files and apply them.</p>
<p>The files <code class="filename">charmaps/man/roff.charmap</code> and
<code class="filename">charmaps/man/texi.charmap</code> contain
character maps that may be used for man-page and Texinfo
conversion. The programs <a href=
"db2x_manxml.html"><span><strong class=
"command">db2x_manxml</strong></span></a> and <a href=
"db2x_texixml.html"><span><strong class=
"command">db2x_texixml</strong></span></a> will apply these
character maps, or another character map specified by the user,
automatically.</p>
</li>
<li>
<p>The remaining Unicode text is converted to the target
character set (encoding). For example, a French document with
accented characters (such as <code class="literal">&eacute;</code>)
might be converted to ISO Latin-1.</p>
<p>This step is applied after <span><strong class=
"command">utf8trans</strong></span> character mapping, using the
<span class="citerefentry"><span class=
"refentrytitle"><span><strong class=
"command">iconv</strong></span></span></span> encoding conversion
tool. Both <a href="db2x_manxml.html"><span><strong class=
"command">db2x_manxml</strong></span></a> and <a href=
"db2x_texixml.html"><span><strong class=
"command">db2x_texixml</strong></span></a> can call <span class=
"citerefentry"><span class="refentrytitle"><span><strong class=
"command">iconv</strong></span></span></span> automatically when
producing their output.</p>
</li>
</ol>
</div>
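<p>The two steps can be illustrated with a minimal Python sketch. It
does not call utf8trans or iconv; it only imitates their roles with
Python built-ins. The three-entry character map and the function
names are hypothetical, chosen for illustration, and the real
character maps shipped with docbook2X are far larger.</p>
<pre class="programlisting">
```python
# Hypothetical miniature character map (an assumption for illustration).
# Keys are Unicode characters, values are roff markup escapes.
CHARMAP = {
    "\u201c": r"\(lq",  # left double quotation mark
    "\u201d": r"\(rq",  # right double quotation mark
    "\u2014": r"\(em",  # em dash
}


def apply_charmap(text, charmap):
    """Step 1: replace every mapped character with its markup escape."""
    return "".join(charmap.get(ch, ch) for ch in text)


def to_target_encoding(text, encoding="iso-8859-1"):
    """Step 2: encode what is left; unencodable characters become '?'."""
    return text.encode(encoding, errors="replace")


def convert(text):
    """Run step 1, then step 2, in the order described above."""
    return to_target_encoding(apply_charmap(text, CHARMAP))
```
</pre>
<p>In this sketch, the directional quotes around a word are replaced
by escapes in the first step, while an accented letter such as
<code class="literal">&eacute;</code> passes through untouched and is
encoded as a single ISO Latin-1 byte in the second step. The order
matters: if the encoding step ran first, the quotes would already be
lost before any escape could be substituted.</p>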
</div>
<div class="navfooter">
<hr />
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left"><a accesskey="p" href=
"sgml2xml-isoent.html">&lt;&lt; Previous</a>&nbsp;</td>
<td width="20%" align="center">&nbsp;</td>
<td width="40%" align="right">&nbsp;<a accesskey="n" href=
"utf8trans.html">Next &gt;&gt;</a></td>
</tr>
<tr>
<td width="40%" align="left" valign="top"><span><strong class=
"command">sgml2xml-isoent</strong></span>&nbsp;</td>
<td width="20%" align="center"><a accesskey="h" href=
"docbook2X.html">Table of Contents</a></td>
<td width="40%" align="right" valign="top">
&nbsp;<span><strong class="command">utf8trans</strong></span></td>
</tr>
</table>
</div>
<p class="footer-homepage"><a href=
"http://docbook2x.sourceforge.net/" title=
"docbook2X: Home page">docbook2X home page</a></p>
</body>
</html>