Blame docs/diff-internals.md

Packit ae9e2a
Diff is broken into four phases:

1. Building a list of things that have changed.  These changes are called
   deltas (`git_diff_delta` objects) and are grouped into a `git_diff_list`.
2. Applying file similarity measurement for rename and copy detection (and
   to potentially split files that have changed radically).  This step is
   optional.
3. Computing the textual diff for each delta.  Not all deltas have a
   meaningful textual diff.  For those that do, the textual diff can
   either be generated on the fly and passed to output callbacks or can be
   turned into a `git_diff_patch` object.
4. Formatting the diff and/or patch into standard text formats (such as
   patches, raw lists, etc.).

In the source code, step 1 is implemented in `src/diff.c`, step 2 in
`src/diff_tform.c`, step 3 in `src/diff_patch.c`, and step 4 in
`src/diff_print.c`.  Additionally, all access to file content goes
through diff drivers, which are implemented in `src/diff_driver.c`.

External Objects
----------------

* `git_diff_options` represents user choices about how a diff should be
  performed and is passed to most diff-generating functions.
* `git_diff_file` represents an item on one side of a possible delta.
* `git_diff_delta` represents a pair of items that have changed in some
  way; it contains two `git_diff_file` objects plus a status and other
  metadata.
* `git_diff_list` is a list of deltas along with information about how
  those particular deltas were found.
* `git_diff_patch` represents the actual diff between a pair of items.  A
  delta may not have a corresponding patch, for example if the objects
  are binary.  The content of a patch is a set of hunks and lines.
* A `hunk` is a range of lines described by a `git_diff_range` (e.g. "lines
  10-20 in the old file became lines 12-23 in the new").  It has a
  header that compactly represents that information, and it contains a
  number of lines of context surrounding added and deleted lines.
* A `line` is simply a line of data along with a `git_diff_line_t` value
  that tells how the data should be interpreted (e.g. context or added).
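To make the relationship between ranges, hunk headers, and lines concrete, here is a minimal, self-contained sketch.  The `sketch_range` struct, the `sketch_line_t` enum, and the `sketch_*` functions are hypothetical stand-ins, not libgit2's declarations; the field names (`old_start`, `old_lines`, `new_start`, `new_lines`) are an assumption about how a range such as `git_diff_range` might describe a hunk, and the header layout is the usual unified diff form.

```c
#include <stdio.h>

/* Hypothetical stand-in for a range: where a hunk starts and how many
 * lines it spans on each side. */
typedef struct {
    int old_start, old_lines;
    int new_start, new_lines;
} sketch_range;

/* Hypothetical line kinds, using the markers of unified diff output. */
typedef enum {
    SKETCH_LINE_CONTEXT  = ' ',
    SKETCH_LINE_ADDITION = '+',
    SKETCH_LINE_DELETION = '-'
} sketch_line_t;

/* Build a range from inclusive first/last line numbers on each side. */
sketch_range sketch_range_from_lines(
    int old_first, int old_last, int new_first, int new_last)
{
    sketch_range r;
    r.old_start = old_first;
    r.old_lines = old_last - old_first + 1;
    r.new_start = new_first;
    r.new_lines = new_last - new_first + 1;
    return r;
}

/* Write a unified-diff style hunk header into buf; returns the
 * snprintf result. */
int sketch_format_hunk_header(char *buf, size_t size, const sketch_range *r)
{
    return snprintf(buf, size, "@@ -%d,%d +%d,%d @@",
        r->old_start, r->old_lines, r->new_start, r->new_lines);
}

/* Write one diff line as its marker character followed by its content. */
int sketch_format_line(
    char *buf, size_t size, sketch_line_t origin, const char *content)
{
    return snprintf(buf, size, "%c%s", (char)origin, content);
}
```

For the example in the list above, `sketch_range_from_lines(10, 20, 12, 23)` gives 11 old lines and 12 new lines, so the formatted header reads `@@ -10,11 +12,12 @@`.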
 
Internal Objects
----------------

* `git_diff_file_content` is an internal structure that represents the
  data on one side of an item to be diffed; it is an augmented
  `git_diff_file` with more flags and the actual file data.

    * it is created from a repository plus a) a `git_diff_file`, b) a
      `git_blob`, or c) raw data and size
    * there are three main operations on `git_diff_file_content`:

        * _initialization_ sets up the data structure and does as much as
          it can without loading or examining the actual data
        * _loading_ loads the data, preprocesses it (i.e. applies filters) and
          potentially analyzes it (to decide if it is binary)
        * _free_ releases loaded data and frees any allocated memory

* The internal structure of a `git_diff_patch` stores the actual diff
  between a pair of `git_diff_file_content` items

    * it may be "unset" if the items are not diffable
    * it may be "empty" if the items are the same
    * otherwise it will consist of a set of hunks, each of which covers some
      number of lines of context, additions and deletions
    * a patch is created from two `git_diff_file_content` items
    * a patch is fully instantiated in three phases:

        * initial creation and initialization
        * loading of data and preliminary data examination
        * diffing of data and optional storage of diffs
    * (TBD) if a patch is asked to store the diffs and the size of the diff
      is significantly smaller than the raw data of the two sides, then the
      patch may be flattened using a pool of string data

* `git_diff_output` is an internal structure that represents an output
  target for a `git_diff_patch`
    * it consists of file, hunk, and line callbacks, plus a payload
    * there is a standard flattened output that can be used for plain text
      output
    * typically we use a `git_xdiff_output`, which drives the callbacks via
      the xdiff code taken from core Git

* `git_diff_driver` is an internal structure that encapsulates the logic
  for a given type of file
    * a driver is looked up based on the name and mode of a file
    * the driver can then be used to:
        * determine if a file is binary (by attributes, by `git_diff_options`
          settings, or by examining the content)
        * give you a function pointer that is used to evaluate function context
          for hunk headers
    * at some point, the logic for getting a filtered version of file content
      or calculating the OID of a file may be moved into the driver
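
The initialization, loading, and free operations described for `git_diff_file_content`, together with a driver's fallback of examining the content to decide whether a file is binary, can be sketched roughly as follows.  Every name here (`sketch_content`, `sketch_content_init`, `sketch_looks_binary`, and so on) is hypothetical and only models the lifecycle.  The binary check (a NUL byte within the first 8000 bytes) is an assumed heuristic chosen for illustration, not necessarily the one libgit2 applies, and filters are left out.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_FLAG_LOADED (1u << 0)
#define SKETCH_FLAG_BINARY (1u << 1)
#define SKETCH_SCAN_LIMIT  8000  /* assumed cap on bytes examined */

/* Hypothetical stand-in for one side of a delta: a description of the
 * item plus flags and, once loaded, the data itself. */
typedef struct {
    const char *path;
    const void *src;      /* raw buffer the data will be read from */
    size_t src_len;
    unsigned int flags;
    char *data;           /* owned copy, valid only while LOADED is set */
    size_t len;
} sketch_content;

/* Initialization: record what is known without touching the data. */
void sketch_content_init(
    sketch_content *fc, const char *path, const void *raw, size_t raw_len)
{
    memset(fc, 0, sizeof(*fc));
    fc->path = path;
    fc->src = raw;
    fc->src_len = raw_len;
}

/* Content-based check: treat data containing a NUL byte as binary. */
int sketch_looks_binary(const char *data, size_t len)
{
    size_t scan = len < SKETCH_SCAN_LIMIT ? len : SKETCH_SCAN_LIMIT;
    return memchr(data, '\0', scan) != NULL;
}

/* Loading: copy the data in (a real loader would also apply filters),
 * then analyze it.  Returns 0 on success, -1 on allocation failure. */
int sketch_content_load(sketch_content *fc)
{
    if (fc->flags & SKETCH_FLAG_LOADED)
        return 0;
    fc->data = malloc(fc->src_len ? fc->src_len : 1);
    if (fc->data == NULL)
        return -1;
    if (fc->src_len)
        memcpy(fc->data, fc->src, fc->src_len);
    fc->len = fc->src_len;
    fc->flags |= SKETCH_FLAG_LOADED;
    if (sketch_looks_binary(fc->data, fc->len))
        fc->flags |= SKETCH_FLAG_BINARY;
    return 0;
}

/* Free: release loaded data and clear the loaded state. */
void sketch_content_free(sketch_content *fc)
{
    free(fc->data);
    fc->data = NULL;
    fc->len = 0;
    fc->flags &= ~SKETCH_FLAG_LOADED;
}
```

Keeping initialization separate from loading means a caller can stop before reading any data when the answer is already known, for instance when attributes or options have marked the file as binary.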