<chapter id="clusters">
<sect1 id="clusters">
  <title>Clusters</title>
  <para>
    In shaping text, a <emphasis>cluster</emphasis> is a sequence of
    code points that needs to be treated as a single, indivisible unit.
  </para>
  <para>
    When you add text to a HarfBuzz buffer, each character is associated
    with a <emphasis>cluster value</emphasis>. This is an arbitrary
    number as far as HarfBuzz is concerned.
  </para>
  <para>
    Most clients will use UTF-8, UTF-16, or UTF-32 indices, but the
    actual number does not matter. Cluster values are not required to
    be monotonically increasing, and the code makes no such assumption,
    although nearly all of HarfBuzz's tests are performed on
    monotonically increasing cluster numbers. With that in mind, let's
    examine what happens with cluster values during shaping under each
    cluster level.
  </para>
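  <para>
    As an illustration of the index-based convention mentioned above,
    the following sketch computes one cluster value per code point for
    each of the three encoding forms. It is plain Python and does not
    call HarfBuzz; the helper name <literal>cluster_values</literal> is
    hypothetical and exists only for this example.
  </para>
  <programlisting language="python"><![CDATA[
```python
def cluster_values(text, encoding):
    """Return one cluster value per code point: its offset in code units."""
    # Bytes per code unit for each supported encoding form.
    unit_size = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[encoding]
    # The explicit little-endian codecs avoid emitting a byte-order mark.
    codec = encoding if encoding == "utf-8" else encoding + "-le"
    values = []
    offset = 0
    for code_point in text:
        values.append(offset)
        offset += len(code_point.encode(codec)) // unit_size
    return values


text = "a\u00e9\U0001F600b"  # a, e-acute, an emoji outside the BMP, b
utf8_values = cluster_values(text, "utf-8")
utf16_values = cluster_values(text, "utf-16")
utf32_values = cluster_values(text, "utf-32")
print(utf8_values)   # [0, 1, 3, 7]
print(utf16_values)  # [0, 1, 2, 4]
print(utf32_values)  # [0, 1, 2, 3]
```
]]></programlisting>
  <para>
    All three sequences identify the same four characters; only the
    numbering differs, which is why the actual numbers do not matter to
    the shaper.
  </para>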
  <para>
    HarfBuzz provides three <emphasis>levels</emphasis> of clustering
    support. Level 0 is the default behavior and reproduces the behavior
    of the old HarfBuzz library. Level 1 tweaks this behavior slightly
    to produce better results, so level 1 clustering is recommended for
    code that is not required to implement backward compatibility with
    the old HarfBuzz.
  </para>
  <para>
    Level 2 differs significantly in how it treats cluster values.
    Levels 0 and 1 both process ligatures and glyph decomposition by
    merging clusters; level 2 does not.
  </para>
  <para>
    The conceptual model for what the cluster values mean, in levels 0
    and 1, is this:
  </para>
  <itemizedlist spacing="compact">
    <listitem>
      <para>
        the sequence of cluster values will always remain monotone
      </para>
    </listitem>
    <listitem>
      <para>
        each value represents a single cluster
      </para>
    </listitem>
    <listitem>
      <para>
        each cluster contains one or more glyphs and one or more
        characters
      </para>
    </listitem>
  </itemizedlist>
  <para>
    If the initial cluster numbers were monotonically increasing and
    distinct, then all adjacent glyphs having the same cluster number
    belong to the same cluster, and each character belongs to the
    cluster that has the highest number not larger than its initial
    cluster number. This will become clearer with an example.
  </para>
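  <para>
    The character-to-cluster rule above can be sketched from the
    client's side as follows. This is a minimal model in plain Python,
    assuming monotonically increasing and distinct initial values; the
    function name <literal>characters_by_cluster</literal> is
    hypothetical and is not part of HarfBuzz.
  </para>
  <programlisting language="python"><![CDATA[
```python
from bisect import bisect_right


def characters_by_cluster(initial_values, glyph_clusters):
    """Group initial character cluster values under the output clusters."""
    clusters = sorted(set(glyph_clusters))
    groups = {cluster: [] for cluster in clusters}
    for value in initial_values:
        # Index of the highest output cluster number not larger than value.
        position = bisect_right(clusters, value) - 1
        if position == -1:
            raise ValueError("no cluster at or below %d" % value)
        groups[clusters[position]].append(value)
    return groups


# Five characters entered with values 0..4; shaping left four glyphs.
groups = characters_by_cluster([0, 1, 2, 3, 4], [0, 1, 3, 4])
print(groups)  # {0: [0], 1: [1, 2], 3: [3], 4: [4]}
```
]]></programlisting>
  <para>
    Here the characters that started with values 1 and 2 both fall into
    cluster 1, because 1 is the highest surviving cluster number that
    does not exceed either of them.
  </para>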
</sect1>
<sect1 id="a-clustering-example-for-levels-0-and-1">
  <title>A clustering example for levels 0 and 1</title>
  <para>
    Let's say we start with the following character sequence and cluster
    values:
  </para>
  <programlisting>
   A,B,C,D,E
   0,1,2,3,4
</programlisting>
  <para>
    We then map the characters to glyphs. For simplicity, let's assume
    that each character maps to the corresponding, identical-looking
    glyph:
  </para>
  <programlisting>
   A,B,C,D,E
   0,1,2,3,4
</programlisting>
  <para>
    Now if, for example, <literal>B</literal> and <literal>C</literal>
    ligate, then the clusters to which they belong "merge".
    This merged cluster takes as its cluster number the minimum of the
    cluster numbers of the clusters that were merged. In this case, we
    get:
  </para>
  <programlisting>
   A,BC,D,E
   0,1 ,3,4
</programlisting>
  <para>
    Now let's assume that the <literal>BC</literal> glyph decomposes
    into three components, and <literal>D</literal> also decomposes into
    two. The components each inherit the cluster value of their parent:
  </para>
  <programlisting>
   A,BC0,BC1,BC2,D0,D1,E
   0,1  ,1  ,1  ,3 ,3 ,4
</programlisting>
  <para>
    Now if <literal>BC2</literal> and <literal>D0</literal> ligate, then
    their clusters (numbers 1 and 3) merge into
    <literal>min(1,3) = 1</literal>:
  </para>
  <programlisting>
   A,BC0,BC1,BC2D0,D1,E
   0,1  ,1  ,1    ,1 ,4
</programlisting>
  <para>
    At this point, cluster 1 means: the character sequence
    <literal>BCD</literal> is represented by glyphs
    <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
    further.
  </para>
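  <para>
    The steps of this example can be replayed with a small model. The
    sketch below is plain Python, not HarfBuzz code: glyphs are
    <literal>(name, cluster)</literal> pairs, and the helpers
    <literal>merge_clusters</literal>, <literal>ligate</literal>, and
    <literal>decompose</literal> are hypothetical names chosen for this
    illustration. It assumes that a merge always covers whole clusters
    and takes the minimum value, as described above.
  </para>
  <programlisting language="python"><![CDATA[
```python
def merge_clusters(glyphs, start, end):
    """Give glyphs[start:end], widened to whole clusters, the minimum value."""
    while start > 0 and glyphs[start - 1][1] == glyphs[start][1]:
        start -= 1
    while end != len(glyphs) and glyphs[end][1] == glyphs[end - 1][1]:
        end += 1
    merged = min(cluster for _, cluster in glyphs[start:end])
    for k in range(start, end):
        glyphs[k] = (glyphs[k][0], merged)


def ligate(glyphs, start, end):
    """Replace glyphs[start:end] with one glyph after merging clusters."""
    merge_clusters(glyphs, start, end)
    name = "".join(part for part, _ in glyphs[start:end])
    glyphs[start:end] = [(name, glyphs[start][1])]


def decompose(glyphs, index, count):
    """Split one glyph into `count` parts that inherit its cluster value."""
    name, cluster = glyphs[index]
    glyphs[index:index + 1] = [(name + str(k), cluster) for k in range(count)]


glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
ligate(glyphs, 1, 3)     # B and C ligate
decompose(glyphs, 1, 3)  # BC becomes BC0, BC1, BC2
decompose(glyphs, 4, 2)  # D becomes D0, D1
ligate(glyphs, 3, 5)     # BC2 and D0 ligate
print(glyphs)
```
]]></programlisting>
  <para>
    Note that the final merge also changes <literal>D1</literal>, even
    though it took no part in the ligature, because it shared cluster 3
    with <literal>D0</literal>.
  </para>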
</sect1>
<sect1 id="reordering-in-levels-0-and-1">
  <title>Reordering in levels 0 and 1</title>
  <para>
    Another common operation in the more complex shapers is glyph
    reordering. In those cases, to keep cluster values monotone,
    HarfBuzz merges the clusters of everything in the reordered
    sequence. For example, let's again start with the character
    sequence:
  </para>
  <programlisting>
   A,B,C,D,E
   0,1,2,3,4
</programlisting>
  <para>
    If <literal>D</literal> is reordered before <literal>B</literal>,
    then the <literal>B</literal>, <literal>C</literal>, and
    <literal>D</literal> clusters merge, and we get:
  </para>
  <programlisting>
   A,D,B,C,E
   0,1,1,1,4
</programlisting>
  <para>
    This is clearly not ideal, but it is the only sensible way to
    maintain monotone indices and retain the true relationship between
    glyphs and characters.
  </para>
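  <para>
    A minimal sketch of this reordering rule, again as a plain Python
    model rather than HarfBuzz code. The function name
    <literal>reorder</literal> is hypothetical, and for brevity the
    sketch assumes every glyph in the moved span starts out as its own
    cluster, so no widening to whole clusters is needed.
  </para>
  <programlisting language="python"><![CDATA[
```python
def reorder(glyphs, source, target):
    """Move glyphs[source] to position target, merging the affected span."""
    start, end = min(source, target), max(source, target) + 1
    # Every glyph between the old and new position joins one cluster.
    merged = min(cluster for _, cluster in glyphs[start:end])
    glyphs.insert(target, glyphs.pop(source))
    for k in range(start, end):
        glyphs[k] = (glyphs[k][0], merged)


glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
reorder(glyphs, 3, 1)  # D moves before B
print(glyphs)
```
]]></programlisting>
  <para>
    The resulting cluster values, 0, 1, 1, 1, 4, are still monotone,
    at the cost of no longer distinguishing <literal>B</literal>,
    <literal>C</literal>, and <literal>D</literal>.
  </para>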
</sect1>
<sect1 id="the-distinction-between-levels-0-and-1">
  <title>The distinction between levels 0 and 1</title>
  <para>
    The above describes what cluster levels 0 and 1 do. The only
    difference between the two is this: in level 0, at the very
    beginning of the shaping process, we also merge clusters between
    base characters and all Unicode marks (combining or not) following
    them. For example:
  </para>
  <programlisting>
  A,acute,B
  0,1    ,2
</programlisting>
  <para>
    will become:
  </para>
  <programlisting>
  A,acute,B
  0,0    ,2
</programlisting>
  <para>
    This is the default behavior. We do it because Windows did it and
    old HarfBuzz did it, so this remained the default. But this behavior
    makes it impossible to color diacritic marks differently from their
    base characters. That's why in level 1 we do not perform this
    initial merging step.
  </para>
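  <para>
    The initial merging step can be sketched as below. This is a
    simplified Python model, not the HarfBuzz implementation: it
    assumes that a "mark" is any code point whose Unicode general
    category starts with <literal>M</literal>, and the function name
    <literal>initial_clusters</literal> is hypothetical.
  </para>
  <programlisting language="python"><![CDATA[
```python
import unicodedata


def initial_clusters(text, level):
    """Return starting cluster values, one per code point, for a level."""
    clusters = list(range(len(text)))
    if level == 0:
        for i in range(1, len(text)):
            # Assumption: "mark" means Unicode general category Mn, Mc, or Me.
            if unicodedata.category(text[i]).startswith("M"):
                clusters[i] = clusters[i - 1]
    return clusters


text = "A\u0301B"  # A, combining acute accent, B
level0 = initial_clusters(text, 0)
level1 = initial_clusters(text, 1)
print(level0)  # [0, 0, 2]
print(level1)  # [0, 1, 2]
```
]]></programlisting>
  <para>
    Under level 1 the accent keeps its own value, so a client can still
    tell it apart from its base, for example to color it differently.
  </para>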
  <para>
    For clients, level 0 is more convenient if they rely on HarfBuzz
    clusters for cursor positioning. But that's wrong anyway: cursor
    positions should be determined based on Unicode grapheme boundaries,
    NOT shaping clusters. As such, level 1 clusters are preferred.
  </para>
  <para>
    One last note about levels 0 and 1. We currently don't allow a
    <literal>MultipleSubst</literal> lookup to replace a glyph with zero
    glyphs (i.e., to delete a glyph). But in some other situations,
    glyphs can be deleted. In those cases, if the glyph being deleted is
    the last glyph of its cluster, we make sure to merge the cluster
    with a neighboring cluster.
  </para>
  <para>
    This is, primarily, to make sure that the starting cluster of the
    text always has the cluster index pointing to the start of the text
    for the run; more than one client currently relies on this
    guarantee.
  </para>
  <para>
    Incidentally, Apple's CoreText does something else to maintain the
    same promise: it inserts a glyph with id 65535 at the beginning of
    the glyph string if the glyph corresponding to the first character
    in the run was deleted. HarfBuzz might do something similar in the
    future.
  </para>
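  <para>
    The deletion rule for levels 0 and 1 described above can be modeled
    roughly as follows. This is a sketch in plain Python under stated
    assumptions, not the HarfBuzz implementation: the text only says
    that the cluster merges with "a neighboring cluster", so the choice
    made here (prefer the previous cluster, fall back to the next one
    when the deleted glyph was first) is an assumption, and
    <literal>delete_glyph</literal> is a hypothetical name.
  </para>
  <programlisting language="python"><![CDATA[
```python
def delete_glyph(glyphs, index):
    """Return a new glyph list with glyphs[index] removed."""
    _, cluster = glyphs[index]
    neighbors = [k for k in (index - 1, index + 1) if k in range(len(glyphs))]
    remaining = glyphs[:index] + glyphs[index + 1:]
    if any(glyphs[k][1] == cluster for k in neighbors) or not remaining:
        return remaining  # nothing to merge
    # Assumption: prefer the previous cluster; use the next one only when
    # the deleted glyph was first in the buffer.
    neighbor = index - 1 if index != 0 else 0
    target = remaining[neighbor][1]
    merged = min(cluster, target)
    return [(name, merged if value == target else value)
            for name, value in remaining]


glyphs = [("A", 0), ("B", 1), ("C", 2)]
after_first = delete_glyph(glyphs, 0)
after_middle = delete_glyph(glyphs, 1)
print(after_first)   # [('B', 0), ('C', 2)]
print(after_middle)  # [('A', 0), ('C', 2)]
```
]]></programlisting>
  <para>
    In the first call the run still begins with cluster value 0 even
    though the glyph for its first character is gone, which is the
    guarantee that clients rely on.
  </para>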
</sect1>
<sect1 id="level-2">
  <title>Level 2</title>
  <para>
    Level 2 is a different beast from levels 0 and 1. It is simple to
    describe, but hard to make sense of. It simply doesn't do any
    cluster merging whatsoever. When glyphs ligate, or multiple glyphs
    otherwise turn into one, the cluster value of the first glyph is
    retained.
  </para>
  <para>
    Here are a few examples of why processing cluster values produced at
    this level might be tricky:
  </para>
  <sect2 id="ligatures-with-combining-marks">
    <title>Ligatures with combining marks</title>
    <para>
      Imagine capital letters are bases and lower case letters are
      combining marks. With an input sequence like this:
    </para>
    <programlisting>
  A,a,B,b,C,c
  0,1,2,3,4,5
</programlisting>
    <para>
      if <literal>A,B,C</literal> ligate, then here are the cluster
      values one would get under the various levels:
    </para>
    <para>
      level 0:
    </para>
    <programlisting>
  ABC,a,b,c
  0  ,0,0,0
</programlisting>
    <para>
      level 1:
    </para>
    <programlisting>
  ABC,a,b,c
  0  ,0,0,5
</programlisting>
    <para>
      level 2:
    </para>
    <programlisting>
  ABC,a,b,c
  0  ,1,3,5
</programlisting>
    <para>
      Making sense of the last example is the hardest for a client,
      because there is nothing in the cluster values to suggest that
      <literal>B</literal> and <literal>C</literal> ligated with
      <literal>A</literal>.
    </para>
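    <para>
      The three results above can be reproduced with a small model.
      The sketch below is plain Python and only mirrors the rules as
      this chapter describes them; it is not HarfBuzz code, and
      <literal>ligate_bases</literal> is a hypothetical helper. As in
      the example, uppercase names stand for bases and lowercase names
      for combining marks.
    </para>
    <programlisting language="python"><![CDATA[
```python
def ligate_bases(names, level):
    """Ligate all uppercase (base) glyphs; lowercase glyphs are marks."""
    clusters = list(range(len(names)))
    if level == 0:
        # Level 0 first merges each mark into the cluster before it.
        for i in range(1, len(names)):
            if names[i].islower():
                clusters[i] = clusters[i - 1]
    bases = [i for i, name in enumerate(names) if name.isupper()]
    first, end = bases[0], bases[-1] + 1
    if level in (0, 1):
        # Merge everything from the first to the last component, widened
        # to the end of the last component's cluster.
        while end != len(names) and clusters[end] == clusters[end - 1]:
            end += 1
        clusters[first:end] = [min(clusters[first:end])] * (end - first)
    ligature = "".join(names[i] for i in bases)
    marks = [(names[i], clusters[i]) for i in range(len(names))
             if names[i].islower()]
    # At level 2 nothing was merged, so this is the first glyph's value.
    return [(ligature, clusters[first])] + marks


results = {level: ligate_bases(list("AaBbCc"), level) for level in (0, 1, 2)}
for level, glyphs in results.items():
    print(level, glyphs)
```
]]></programlisting>
    <para>
      The only difference between levels 0 and 1 in this model is the
      initial mark merging, which is what pulls <literal>c</literal>
      into cluster 0 under level 0.
    </para>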
  </sect2>
  <sect2 id="reordering">
    <title>Reordering</title>
    <para>
      Another tricky case is reordering. Under level 2, suppose we
      start with:
    </para>
    <programlisting>
  A,B,C,D,E
  0,1,2,3,4
</programlisting>
    <para>
      Now imagine <literal>D</literal> moves before
      <literal>B</literal>:
    </para>
    <programlisting>
  A,D,B,C,E
  0,3,1,2,4
</programlisting>
    <para>
      Now, if <literal>D</literal> ligates with <literal>B</literal>, we
      get:
    </para>
    <programlisting>
  A,DB,C,E
  0,3 ,2,4
</programlisting>
    <para>
      In a different scenario, <literal>A</literal> and
      <literal>B</literal> could have ligated
      <emphasis>before</emphasis> <literal>D</literal> reordered; that
      would have resulted in:
    </para>
    <programlisting>
  AB,D,C,E
  0 ,3,2,4
</programlisting>
    <para>
      There's no way to differentiate between these two scenarios based
      on the cluster numbers alone.
    </para>
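    <para>
      The ambiguity can be checked with a short model of the two
      scenarios. This is a plain Python sketch of the level 2 rules as
      described in this section (no merging; a ligature keeps its first
      component's value), not HarfBuzz code, and the helper names are
      hypothetical.
    </para>
    <programlisting language="python"><![CDATA[
```python
def ligate(glyphs, start, end):
    """Level 2 style: the ligature keeps the first component's cluster."""
    name = "".join(part for part, _ in glyphs[start:end])
    glyphs[start:end] = [(name, glyphs[start][1])]


def reorder(glyphs, source, target):
    """Level 2 style: move a glyph without touching any cluster value."""
    glyphs.insert(target, glyphs.pop(source))


def start_sequence():
    return [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]


first = start_sequence()
reorder(first, 3, 1)   # D moves before B
ligate(first, 1, 3)    # D and B ligate

second = start_sequence()
ligate(second, 0, 2)   # A and B ligate first
reorder(second, 2, 1)  # then D moves before C

first_clusters = [cluster for _, cluster in first]
second_clusters = [cluster for _, cluster in second]
print(first, first_clusters)
print(second, second_clusters)
```
]]></programlisting>
    <para>
      Both runs end with the cluster values 0, 3, 2, 4 even though the
      glyphs differ, so a client that sees only the numbers cannot tell
      which characters went into which ligature.
    </para>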
    <para>
      Another problem happens with ligatures under level 2 if the
      direction of the text is forced to the opposite of its natural
      direction (e.g. left-to-right Arabic). But that's too much of a
      corner case to worry about.
    </para>
  </sect2>
</sect1>
</chapter>