Blob Blame History Raw
Memory management
=================

There can be two types of nodes:

* those connected to an existing tree

* those unconnected. These may be the top node of a tree

Nodes consist of a C-level libxml2 node, Node for short, and
optionally a Python-level proxy node, Proxy. Zero, one or more Proxies can
exist for a single Node.

Proxies are garbage collected automatically by Python. Nodes are not
garbage collected at all. Instead, explicit mechanisms exist for
Nodes to clear them and the tree they may be the top of.

A Node can be safely freed when:

* no Proxy is connected to this Node

* no Proxy cannot be created for this Node

A Proxy cannot be created to a CNode when:

* no Proxy exist for nodes that are connected to that Node

This is the case when:

* the Node is in a tree that has no Proxy connected to any of the nodes.

This means that the whole tree in such a condition can be freed.

Detecting whether a Node is in a tree that has no Proxies connected to
it can be done by relying on Python's garbage collection
algorithm. Each Proxy can have a reference to the Proxy that points to
the top of the tree. In case of a document tree, this reference is to
the Document Proxy. When no more references exist in the system to the
top Proxy, this means no more Proxies exist that point to the Node
tree the top Proxy is the top of. If this Node tree is unconnected;
i.e. it is not a subtree, this means that tree can be safely garbage
collected.

A special case exists for document references. Each Proxy will always
have a reference to the Document Proxy, as any Node will have such a
reference to the Document Node. This means that a Document Node can
only be garbage collected when no more Proxies at all exist anymore
which refer to the Document. This is a separate system from the
top-Node references, even though the top-node in many cases will be
the Document. This because there is no way to get to a node that is
not connected to the Document tree from a Document Proxy.

This approach requires a system that can keep track of the top of the
tree in any case. Usually this is simple: when a Proxy gets connected,
the tree top becomes the tree top of whatever node it is connected
to. 

Sometimes this is more difficult: a Proxy may exist pointing to a node
in a subtree that just got connected. The top reference cannot be
updated. This is a problem in the following case:

    a
  b    c         h
d  e  f  g     i  j
              k

now imagine we have a proxy to k, K, and a proxy of i, I. They both
have a pointer to proxy H.

Now imagine i gets moved under g through proxy I. Proxy I will have an
updated pointer to proxy A. However, proxy K cannot be updated and still
points to H, from which it is now in fact disconnected.

proxy H cannot be removed now until proxy A is removed. In addition,
proxy A has a refcount that is too low because proxy K doesn't point
to it but should.

Another strategy involves having a reference count on the underlying
nodes, one per proxy. A node can only be freed if there is no
descendant-or-self that has the refcount higher than 0. A node, when
no more Python references to it exist, will check for refcounts first.
The drawback of this is potentially heavy tree-walking each time a proxy
can be removed.