<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><link rel="SHORTCUT ICON" href="/favicon.ico" /><style type="text/css">
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
</style><title>Entities or no entities</title></head><body bgcolor="#8b7765" text="#000000" link="#a06060" vlink="#000000">

The XML C parser and toolkit of Gnome

Entities or no entities


Entities are in principle similar to simple C macros. An entity defines an
abbreviation for a given string that you can reuse many times throughout the
content of your document. Entities are especially useful when a given string
occurs frequently within a document, or when you want to confine future
changes to a restricted area: the internal subset at the beginning of the
document. Example:

1 <?xml version="1.0"?>
2 <!DOCTYPE EXAMPLE SYSTEM "example.dtd" [
3 <!ENTITY xml "Extensible Markup Language">
4 ]>
5 <EXAMPLE>
6    &xml;
7 </EXAMPLE>

Line 3 declares the xml entity. Line 6 uses the xml entity by prefixing
its name with '&' and following it with ';', without any spaces added. XML
defines 5 predefined entities, which libxml2 supports, allowing you to escape
characters that have a special meaning in some parts of the XML document
content: &lt; for the character '<', &gt; for the character '>', &apos; for
the character ''', &quot; for the character '"', and &amp; for the
character '&'.
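
As an illustration of the mapping above, here is a minimal sketch in plain C that replaces the five special characters with their predefined entity references. The function names `predefined_entity` and `escape_predefined` are hypothetical and are not part of libxml2, which performs its own escaping when saving a document.

```c
#include <stdlib.h>
#include <string.h>

/* Return the predefined entity reference for c, or NULL if c is not one
 * of the five special characters. (Hypothetical helper, not libxml2 API.) */
static const char *predefined_entity(char c)
{
    switch (c) {
    case '<':  return "&lt;";
    case '>':  return "&gt;";
    case '\'': return "&apos;";
    case '"':  return "&quot;";
    case '&':  return "&amp;";
    default:   return NULL;
    }
}

/* Return a newly allocated copy of text in which the five special
 * characters are replaced by predefined entity references.
 * The caller frees the result; NULL is returned if allocation fails. */
char *escape_predefined(const char *text)
{
    /* The longest replacements ("&quot;", "&apos;") are 6 bytes each. */
    size_t len = strlen(text);
    char *out = malloc(len * 6 + 1);
    if (out == NULL)
        return NULL;

    char *dst = out;
    for (const char *src = text; *src != '\0'; src++) {
        const char *ent = predefined_entity(*src);
        if (ent != NULL) {
            size_t n = strlen(ent);
            memcpy(dst, ent, n);
            dst += n;
        } else {
            *dst++ = *src;
        }
    }
    *dst = '\0';
    return out;
}
```

Escaping all five characters everywhere is a conservative choice for this sketch; which characters actually need escaping depends on where the text appears (element content or attribute value).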

One of the problems related to entities is that you may want the parser to
substitute an entity's content so that you can see the replacement text in
your application. Or you may prefer to keep entity references as such in the
content, so that you can save the document back without losing this usually
precious information (if the user went through the pain of explicitly
defining entities, they may have a rather negative attitude if you blindly
substitute them at saving time). The xmlSubstituteEntitiesDefault()
function allows you to check and change this behaviour; the default is to
not substitute entities.

Here is the DOM tree built by libxml2 for the previous document in the
default case:

/gnome/src/gnome-xml -> ./xmllint --debug test/ent1
DOCUMENT
version=1.0
   ELEMENT EXAMPLE
     TEXT
     content=
     ENTITY_REF
       INTERNAL_GENERAL_ENTITY xml
       content=Extensible Markup Language
     TEXT
     content=

And here is the result when substituting entities:

/gnome/src/gnome-xml -> ./xmllint --debug --noent test/ent1
DOCUMENT
version=1.0
   ELEMENT EXAMPLE
     TEXT
     content=     Extensible Markup Language

So, entities or no entities? Basically, it depends on your use case. I
suggest that you keep the non-substituting default behaviour and avoid using
entities in your XML document or data if you are not willing to handle the
entity reference nodes in the DOM tree.

Note that at save time libxml2 enforces the conversion of the predefined
entities where necessary to prevent well-formedness problems. At parse time
it also transparently replaces references to them with the corresponding
characters (i.e. it will not generate entity reference nodes in the DOM tree
or call the reference() SAX callback when finding them in the input).

WARNING: handling entities on top of the libxml2 SAX interface is difficult!
If you plan to use non-predefined entities in your documents, then the
learning curve to handle them using the SAX API may be long. If you plan to
use complex documents, I strongly suggest you consider using the DOM
interface instead and let libxml2 deal with the complexity rather than trying
to do it yourself.

Daniel Veillard

</body></html>