Blame src/psl-make-dafsa

Packit Service bdd612
#!/usr/libexec/platform-python
Packit Service dcb6c2
# Copyright 2014 The Chromium Authors. All rights reserved.
Packit Service dcb6c2
# Use of this source code is governed by a BSD-style license that can be
Packit Service dcb6c2
# found in the LICENSE.chromium file.
Packit Service dcb6c2
Packit Service dcb6c2
"""
Packit Service dcb6c2
A Deterministic acyclic finite state automaton (DAFSA) is a compact
Packit Service dcb6c2
representation of an unordered word list (dictionary).
Packit Service dcb6c2
Packit Service dcb6c2
https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
Packit Service dcb6c2
Packit Service dcb6c2
This python program converts a list of strings to a byte array in C++.
Packit Service dcb6c2
This python program fetches strings and return values from a gperf file
Packit Service dcb6c2
and generates a C++ file with a byte array representing graph that can be
Packit Service dcb6c2
used as a memory efficient replacement for the perfect hash table.
Packit Service dcb6c2
Packit Service dcb6c2
The input strings must consist of printable 7-bit ASCII characters or UTF-8
Packit Service dcb6c2
multibyte sequences. Control characters in the range [0x00-0x1F] are not
Packit Service dcb6c2
allowed. The return values must be one digit integers. .
Packit Service dcb6c2
Packit Service dcb6c2
In this program a DAFSA is a diamond shaped graph starting at a common
Packit Service dcb6c2
source node and ending at a common sink node. All internal nodes contain
Packit Service dcb6c2
a label and each word is represented by the labels in one path from
Packit Service dcb6c2
the source node to the sink node.
Packit Service dcb6c2
Packit Service dcb6c2
The following python represention is used for nodes:
Packit Service dcb6c2
Packit Service dcb6c2
  Source node: [ children ]
Packit Service dcb6c2
  Internal node: (label, [ children ])
Packit Service dcb6c2
  Sink node: None
Packit Service dcb6c2
Packit Service dcb6c2
The graph is first compressed by prefixes like a trie. In the next step
Packit Service dcb6c2
suffixes are compressed so that the graph gets diamond shaped. Finally
Packit Service dcb6c2
one to one linked nodes are replaced by nodes with the labels joined.
Packit Service dcb6c2
Packit Service dcb6c2
The order of the operations is crucial since lookups will be performed
Packit Service dcb6c2
starting from the source with no backtracking. Thus a node must have at
Packit Service dcb6c2
most one child with a label starting by the same character. The output
Packit Service dcb6c2
is also arranged so that all jumps are to increasing addresses, thus forward
Packit Service dcb6c2
in memory.
Packit Service dcb6c2
Packit Service dcb6c2
The generated output has suffix free decoding so that the sign of leading
Packit Service dcb6c2
bits in a link (a reference to a child node) indicate if it has a size of one,
Packit Service dcb6c2
two or three bytes and if it is the last outgoing link from the actual node.
Packit Service dcb6c2
A node label is terminated by a byte with the leading bit set.
Packit Service dcb6c2
Packit Service dcb6c2
The generated byte array can described by the following BNF:
Packit Service dcb6c2
Packit Service dcb6c2
<byte> ::= < 8-bit value in range [0x00-0xFF] >
Packit Service dcb6c2
Packit Service dcb6c2
<char> ::= < byte in range [0x1F-0x7F] >
Packit Service dcb6c2
<end_char> ::= < char + 0x80, byte in range [0x9F-0xFF] >
Packit Service dcb6c2
<return value> ::= < value + 0x80, byte in range [0x80-0x8F] >
Packit Service dcb6c2
Packit Service dcb6c2
<offset1> ::= < byte in range [0x00-0x3F] >
Packit Service dcb6c2
<offset2> ::= < byte in range [0x40-0x5F] >
Packit Service dcb6c2
<offset3> ::= < byte in range [0x60-0x7F] >
Packit Service dcb6c2
Packit Service dcb6c2
<end_offset1> ::= < byte in range [0x80-0xBF] >
Packit Service dcb6c2
<end_offset2> ::= < byte in range [0xC0-0xDF] >
Packit Service dcb6c2
<end_offset3> ::= < byte in range [0xE0-0xFF] >
Packit Service dcb6c2
Packit Service dcb6c2
<prefix> ::= <char>
Packit Service dcb6c2
Packit Service dcb6c2
<label> ::= <end_char>
Packit Service dcb6c2
          | <char> <label>
Packit Service dcb6c2
Packit Service dcb6c2
<end_label> ::= <return_value>
Packit Service dcb6c2
          | <char> <end_label>
Packit Service dcb6c2
Packit Service dcb6c2
<offset> ::= <offset1>
Packit Service dcb6c2
           | <offset2> <byte>
Packit Service dcb6c2
           | <offset3> <byte> <byte>
Packit Service dcb6c2
Packit Service dcb6c2
<end_offset> ::= <end_offset1>
Packit Service dcb6c2
               | <end_offset2> <byte>
Packit Service dcb6c2
               | <end_offset3> <byte> <byte>
Packit Service dcb6c2
Packit Service dcb6c2
<offsets> ::= <end_offset>
Packit Service dcb6c2
            | <offset> <offsets>
Packit Service dcb6c2
Packit Service dcb6c2
<source> ::= <offsets>
Packit Service dcb6c2
Packit Service dcb6c2
<node> ::= <label> <offsets>
Packit Service dcb6c2
         | <prefix> <node>
Packit Service dcb6c2
         | <end_label>
Packit Service dcb6c2
Packit Service dcb6c2
<graph> ::= <graph>
Packit Service dcb6c2
          | <graph> <node>
Packit Service dcb6c2
Packit Service dcb6c2
<version> ::= <empty>            # The DAFSA was generated in ASCII mode.
Packit Service dcb6c2
          | < byte value 0x01 >  # The DAFSA was generated in UTF-8 mode.
Packit Service dcb6c2
Packit Service dcb6c2
<dafsa> ::= <graph> <version>
Packit Service dcb6c2
Packit Service dcb6c2
Decoding:
Packit Service dcb6c2
Packit Service dcb6c2
<char> -> character
Packit Service dcb6c2
<end_char> & 0x7F -> character
Packit Service dcb6c2
<return value> & 0x0F -> integer
Packit Service dcb6c2
<offset1 & 0x3F> -> integer
Packit Service dcb6c2
((<offset2> & 0x1F>) << 8) + <byte> -> integer
Packit Service dcb6c2
((<offset3> & 0x1F>) << 16) + (<byte> << 8) + <byte> -> integer
Packit Service dcb6c2
Packit Service dcb6c2
end_offset1, end_offset2 and and_offset3 are decoded same as offset1,
Packit Service dcb6c2
offset2 and offset3 respectively.
Packit Service dcb6c2
Packit Service dcb6c2
The first offset in a list of offsets is the distance in bytes between the
Packit Service dcb6c2
offset itself and the first child node. Subsequent offsets are the distance
Packit Service dcb6c2
between previous child node and next child node. Thus each offset links a node
Packit Service dcb6c2
to a child node. The distance is always counted between start addresses, i.e.
Packit Service dcb6c2
first byte in decoded offset or first byte in child node.
Packit Service dcb6c2
Packit Service dcb6c2
Transcoding of UTF-8 multibyte sequences:
Packit Service dcb6c2
Packit Service dcb6c2
The original DAFSA format was limited to 7-bit printable ASCII characters in
Packit Service dcb6c2
range [0x20-0xFF], but has been extended to allow UTF-8 multibyte sequences.
Packit Service dcb6c2
By transcoding of such characters the new format preserves compatibility with
Packit Service dcb6c2
old parsers, so that a DAFSA in the extended format can be used by an old
Packit Service dcb6c2
parser without false positives, although strings containing transcoded
Packit Service dcb6c2
characters will never match. Since the format is extended rather than being
Packit Service dcb6c2
changed, a parser supporting the new format will automatically support data
Packit Service dcb6c2
generated in the old format.
Packit Service dcb6c2
Packit Service dcb6c2
Transcoding is performed by insertion of a start byte with the special value
Packit Service dcb6c2
0x1F, followed by 2-4 bytes shifted into the range [0x40-0x7F], thus inside
Packit Service dcb6c2
the range of printable ASCII.
Packit Service dcb6c2
Packit Service dcb6c2
2-byte: 110nnnnn, 10nnnnnn -> 00011111, 010nnnnn, 01nnnnnn
Packit Service dcb6c2
Packit Service dcb6c2
3-byte: 1110nnnn, 10nnnnnn, 10nnnnnn -> 00011111, 0110nnnn, 01nnnnnn, 01nnnnnn
Packit Service dcb6c2
Packit Service dcb6c2
4-byte: 11110nnn, 10nnnnnn, 10nnnnnn, 10nnnnnn ->
Packit Service dcb6c2
                00011111, 01110nnn, 01nnnnnn, 01nnnnnn, 01nnnnnn
Packit Service dcb6c2
Packit Service dcb6c2
Example 1:
Packit Service dcb6c2
Packit Service dcb6c2
%%
Packit Service dcb6c2
aa, 1
Packit Service dcb6c2
a, 2
Packit Service dcb6c2
%%
Packit Service dcb6c2
Packit Service dcb6c2
The input is first parsed to a list of words:
Packit Service dcb6c2
["aa1", "a2"]
Packit Service dcb6c2
Packit Service dcb6c2
A fully expanded graph is created from the words:
Packit Service dcb6c2
source = [node1, node4]
Packit Service dcb6c2
node1 = ("a", [node2])
Packit Service dcb6c2
node2 = ("a", [node3])
Packit Service dcb6c2
node3 = ("\x01", [sink])
Packit Service dcb6c2
node4 = ("a", [node5])
Packit Service dcb6c2
node5 = ("\x02", [sink])
Packit Service dcb6c2
sink = None
Packit Service dcb6c2
Packit Service dcb6c2
Compression results in the following graph:
Packit Service dcb6c2
source = [node1]
Packit Service dcb6c2
node1 = ("a", [node2, node3])
Packit Service dcb6c2
node2 = ("\x02", [sink])
Packit Service dcb6c2
node3 = ("a\x01", [sink])
Packit Service dcb6c2
sink = None
Packit Service dcb6c2
Packit Service dcb6c2
A C++ representation of the compressed graph is generated:
Packit Service dcb6c2
Packit Service dcb6c2
const unsigned char dafsa[7] = {
Packit Service dcb6c2
  0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81,
Packit Service dcb6c2
};
Packit Service dcb6c2
Packit Service dcb6c2
The bytes in the generated array has the following meaning:
Packit Service dcb6c2
Packit Service dcb6c2
 0: 0x81 <end_offset1>  child at position 0 + (0x81 & 0x3F) -> jump to 1
Packit Service dcb6c2
Packit Service dcb6c2
 1: 0xE1 <end_char>     label character (0xE1 & 0x7F) -> match "a"
Packit Service dcb6c2
 2: 0x02 <offset1>      child at position 2 + (0x02 & 0x3F) -> jump to 4
Packit Service dcb6c2
Packit Service dcb6c2
 3: 0x81 <end_offset1>  child at position 4 + (0x81 & 0x3F) -> jump to 5
Packit Service dcb6c2
 4: 0x82 <return_value> 0x82 & 0x0F -> return 2
Packit Service dcb6c2
Packit Service dcb6c2
 5: 0x61 <char>         label character 0x61 -> match "a"
Packit Service dcb6c2
 6: 0x81 <return_value> 0x81 & 0x0F -> return 1
Packit Service dcb6c2
Packit Service dcb6c2
Example 2:
Packit Service dcb6c2
Packit Service dcb6c2
%%
Packit Service dcb6c2
aa, 1
Packit Service dcb6c2
bbb, 2
Packit Service dcb6c2
baa, 1
Packit Service dcb6c2
%%
Packit Service dcb6c2
Packit Service dcb6c2
The input is first parsed to a list of words:
Packit Service dcb6c2
["aa1", "bbb2", "baa1"]
Packit Service dcb6c2
Packit Service dcb6c2
Compression results in the following graph:
Packit Service dcb6c2
source = [node1, node2]
Packit Service dcb6c2
node1 = ("b", [node2, node3])
Packit Service dcb6c2
node2 = ("aa\x01", [sink])
Packit Service dcb6c2
node3 = ("bb\x02", [sink])
Packit Service dcb6c2
sink = None
Packit Service dcb6c2
Packit Service dcb6c2
A C++ representation of the compressed graph is generated:
Packit Service dcb6c2
Packit Service dcb6c2
const unsigned char dafsa[11] = {
Packit Service dcb6c2
  0x02, 0x83, 0xE2, 0x02, 0x83, 0x61, 0x61, 0x81, 0x62, 0x62, 0x82,
Packit Service dcb6c2
};
Packit Service dcb6c2
Packit Service dcb6c2
The bytes in the generated array has the following meaning:
Packit Service dcb6c2
Packit Service dcb6c2
 0: 0x02 <offset1>      child at position 0 + (0x02 & 0x3F) -> jump to 2
Packit Service dcb6c2
 1: 0x83 <end_offset1>  child at position 2 + (0x83 & 0x3F) -> jump to 5
Packit Service dcb6c2
Packit Service dcb6c2
 2: 0xE2 <end_char>     label character (0xE2 & 0x7F) -> match "b"
Packit Service dcb6c2
 3: 0x02 <offset1>      child at position 3 + (0x02 & 0x3F) -> jump to 5
Packit Service dcb6c2
 4: 0x83 <end_offset1>  child at position 5 + (0x83 & 0x3F) -> jump to 8
Packit Service dcb6c2
Packit Service dcb6c2
 5: 0x61 <char>         label character 0x61 -> match "a"
Packit Service dcb6c2
 6: 0x61 <char>         label character 0x61 -> match "a"
Packit Service dcb6c2
 7: 0x81 <return_value> 0x81 & 0x0F -> return 1
Packit Service dcb6c2
Packit Service dcb6c2
 8: 0x62 <char>         label character 0x62 -> match "b"
Packit Service dcb6c2
 9: 0x62 <char>         label character 0x62 -> match "b"
Packit Service dcb6c2
10: 0x82 <return_value> 0x82 & 0x0F -> return 2
Packit Service dcb6c2
"""
Packit Service dcb6c2
Packit Service dcb6c2
import sys
Packit Service dcb6c2
import os.path
Packit Service dcb6c2
import time
Packit Service dcb6c2
import hashlib
Packit Service dcb6c2
Packit Service dcb6c2
class InputError(Exception):
Packit Service dcb6c2
  """Exception raised for errors in the input file."""
Packit Service dcb6c2
Packit Service dcb6c2
# Length of a character starting at a given byte.
Packit Service dcb6c2
char_length_table = ( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x00-0x0F
Packit Service dcb6c2
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x10-0x1F
Packit Service dcb6c2
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x20-0x2F
Packit Service dcb6c2
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x30-x03F
Packit Service dcb6c2
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x40-0x4F
Packit Service dcb6c2
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x50-x05F
Packit Service dcb6c2
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x60-0x6F
Packit Service dcb6c2
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x70-x07F
Packit Service dcb6c2
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x80-0x8F
Packit Service dcb6c2
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x90-0x9F
Packit Service dcb6c2
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0xA0-0xAF
Packit Service dcb6c2
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0xB0-0xBF
Packit Service dcb6c2
                      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  # 0xC0-0xCF
Packit Service dcb6c2
                      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  # 0xD0-0xDF
Packit Service dcb6c2
                      3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,  # 0xE0-0xEF
Packit Service dcb6c2
                      4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0 ) # 0xF0-0xFF
Packit Service dcb6c2
Packit Service dcb6c2
def to_bytes(n):
Packit Service dcb6c2
  """Converts an integer value to a bytes object."""
Packit Service dcb6c2
  return bytes(bytearray((n,)))
Packit Service dcb6c2
Packit Service dcb6c2
def to_dafsa(words, utf_mode):
Packit Service dcb6c2
  """Generates a DAFSA from a word list and returns the source node.
Packit Service dcb6c2
Packit Service dcb6c2
  Each word is split into characters so that each character is represented by
Packit Service dcb6c2
  a unique node. It is assumed the word list is not empty.
Packit Service dcb6c2
  """
Packit Service dcb6c2
  if not words:
Packit Service dcb6c2
    raise InputError('The domain list must not be empty')
Packit Service dcb6c2
  def to_nodes(word, multibyte_length):
Packit Service dcb6c2
    """Split words into characters"""
Packit Service dcb6c2
    byte = ord(word[:1])
Packit Service dcb6c2
    if multibyte_length:
Packit Service dcb6c2
      # Consume next byte in multibyte sequence.
Packit Service dcb6c2
      if byte & 0xC0 != 0x80:
Packit Service dcb6c2
        raise InputError('Invalid UTF-8 multibyte sequence')
Packit Service dcb6c2
      return to_bytes(byte ^ 0xC0), [to_nodes(word[1:], multibyte_length - 1)]
Packit Service dcb6c2
    char_length = char_length_table[byte]
Packit Service dcb6c2
    if char_length == 1:
Packit Service dcb6c2
      # 7-bit printable ASCII.
Packit Service dcb6c2
      if len(word) == 1:
Packit Service dcb6c2
        return to_bytes(int(word[:1], 16) & 0x0F), [None]
Packit Service dcb6c2
      return word[:1], [to_nodes(word[1:], 0)]
Packit Service dcb6c2
    elif char_length > 1:
Packit Service dcb6c2
      # Leading byte in multibyte sequence.
Packit Service dcb6c2
      if not utf_mode:
Packit Service dcb6c2
        raise InputError('UTF-8 encoded characters are not allowed in ASCII mode')
Packit Service dcb6c2
      if len(word) <= char_length:
Packit Service dcb6c2
        raise InputError('Unterminated UTF-8 multibyte sequence')
Packit Service dcb6c2
      return to_bytes(0x1F), [(to_bytes(byte ^ 0x80), [to_nodes(word[1:], char_length - 1)])]
Packit Service dcb6c2
    # Unexpected character.
Packit Service dcb6c2
    raise InputError('Domain names must be printable ASCII or UTF-8')
Packit Service dcb6c2
Packit Service dcb6c2
  return [to_nodes(word, 0) for word in words]
Packit Service dcb6c2
Packit Service dcb6c2
def to_words(node):
Packit Service dcb6c2
  """Generates a word list from all paths starting from an internal node."""
Packit Service dcb6c2
  if not node:
Packit Service dcb6c2
    return [b'']
Packit Service dcb6c2
  return [(node[0] + word) for child in node[1] for word in to_words(child)]
Packit Service dcb6c2
Packit Service dcb6c2
Packit Service dcb6c2
def reverse(dafsa):
Packit Service dcb6c2
  """Generates a new DAFSA that is reversed, so that the old sink node becomes
Packit Service dcb6c2
  the new source node.
Packit Service dcb6c2
  """
Packit Service dcb6c2
  sink = []
Packit Service dcb6c2
  nodemap = {}
Packit Service dcb6c2
Packit Service dcb6c2
  def dfs(node, parent):
Packit Service dcb6c2
    """Creates reverse nodes.
Packit Service dcb6c2
Packit Service dcb6c2
    A new reverse node will be created for each old node. The new node will
Packit Service dcb6c2
    get a reversed label and the parents of the old node as children.
Packit Service dcb6c2
    """
Packit Service dcb6c2
    if not node:
Packit Service dcb6c2
      sink.append(parent)
Packit Service dcb6c2
    elif id(node) not in nodemap:
Packit Service dcb6c2
      nodemap[id(node)] = (node[0][::-1], [parent])
Packit Service dcb6c2
      for child in node[1]:
Packit Service dcb6c2
        dfs(child, nodemap[id(node)])
Packit Service dcb6c2
    else:
Packit Service dcb6c2
      nodemap[id(node)][1].append(parent)
Packit Service dcb6c2
Packit Service dcb6c2
  for node in dafsa:
Packit Service dcb6c2
    dfs(node, None)
Packit Service dcb6c2
  return sink
Packit Service dcb6c2
Packit Service dcb6c2
Packit Service dcb6c2
def join_labels(dafsa):
Packit Service dcb6c2
  """Generates a new DAFSA where internal nodes are merged if there is a one to
Packit Service dcb6c2
  one connection.
Packit Service dcb6c2
  """
Packit Service dcb6c2
  parentcount = {id(None): 2}
Packit Service dcb6c2
  nodemap = {id(None): None}
Packit Service dcb6c2
Packit Service dcb6c2
  def count_parents(node):
Packit Service dcb6c2
    """Count incoming references"""
Packit Service dcb6c2
    if id(node) in parentcount:
Packit Service dcb6c2
      parentcount[id(node)] += 1
Packit Service dcb6c2
    else:
Packit Service dcb6c2
      parentcount[id(node)] = 1
Packit Service dcb6c2
      for child in node[1]:
Packit Service dcb6c2
        count_parents(child)
Packit Service dcb6c2
Packit Service dcb6c2
  def join(node):
Packit Service dcb6c2
    """Create new nodes"""
Packit Service dcb6c2
    if id(node) not in nodemap:
Packit Service dcb6c2
      children = [join(child) for child in node[1]]
Packit Service dcb6c2
      if len(children) == 1 and parentcount[id(node[1][0])] == 1:
Packit Service dcb6c2
        child = children[0]
Packit Service dcb6c2
        nodemap[id(node)] = (node[0] + child[0], child[1])
Packit Service dcb6c2
      else:
Packit Service dcb6c2
        nodemap[id(node)] = (node[0], children)
Packit Service dcb6c2
    return nodemap[id(node)]
Packit Service dcb6c2
Packit Service dcb6c2
  for node in dafsa:
Packit Service dcb6c2
    count_parents(node)
Packit Service dcb6c2
  return [join(node) for node in dafsa]
Packit Service dcb6c2
Packit Service dcb6c2
Packit Service dcb6c2
def join_suffixes(dafsa):
Packit Service dcb6c2
  """Generates a new DAFSA where nodes that represent the same word lists
Packit Service dcb6c2
  towards the sink are merged.
Packit Service dcb6c2
  """
Packit Service dcb6c2
  nodemap = {frozenset((b'',)): None}
Packit Service dcb6c2
Packit Service dcb6c2
  def join(node):
Packit Service dcb6c2
    """Returns a macthing node. A new node is created if no matching node
Packit Service dcb6c2
    exists. The graph is accessed in dfs order.
Packit Service dcb6c2
    """
Packit Service dcb6c2
    suffixes = frozenset(to_words(node))
Packit Service dcb6c2
    if suffixes not in nodemap:
Packit Service dcb6c2
      nodemap[suffixes] = (node[0], [join(child) for child in node[1]])
Packit Service dcb6c2
    return nodemap[suffixes]
Packit Service dcb6c2
Packit Service dcb6c2
  return [join(node) for node in dafsa]
Packit Service dcb6c2
Packit Service dcb6c2
Packit Service dcb6c2
def top_sort(dafsa):
Packit Service dcb6c2
  """Generates list of nodes in topological sort order."""
Packit Service dcb6c2
  incoming = {}
Packit Service dcb6c2
Packit Service dcb6c2
  def count_incoming(node):
Packit Service dcb6c2
    """Counts incoming references."""
Packit Service dcb6c2
    if node:
Packit Service dcb6c2
      if id(node) not in incoming:
Packit Service dcb6c2
        incoming[id(node)] = 1
Packit Service dcb6c2
        for child in node[1]:
Packit Service dcb6c2
          count_incoming(child)
Packit Service dcb6c2
      else:
Packit Service dcb6c2
        incoming[id(node)] += 1
Packit Service dcb6c2
Packit Service dcb6c2
  for node in dafsa:
Packit Service dcb6c2
    count_incoming(node)
Packit Service dcb6c2
Packit Service dcb6c2
  for node in dafsa:
Packit Service dcb6c2
    incoming[id(node)] -= 1
Packit Service dcb6c2
Packit Service dcb6c2
  waiting = [node for node in dafsa if incoming[id(node)] == 0]
Packit Service dcb6c2
  nodes = []
Packit Service dcb6c2
Packit Service dcb6c2
  while waiting:
Packit Service dcb6c2
    node = waiting.pop()
Packit Service dcb6c2
    assert incoming[id(node)] == 0
Packit Service dcb6c2
    nodes.append(node)
Packit Service dcb6c2
    for child in node[1]:
Packit Service dcb6c2
      if child:
Packit Service dcb6c2
        incoming[id(child)] -= 1
Packit Service dcb6c2
        if incoming[id(child)] == 0:
Packit Service dcb6c2
          waiting.append(child)
Packit Service dcb6c2
  return nodes
Packit Service dcb6c2
Packit Service dcb6c2
Packit Service dcb6c2
def encode_links(children, offsets, current):
Packit Service dcb6c2
  """Encodes a list of children as one, two or three byte offsets."""
Packit Service dcb6c2
  if not children[0]:
Packit Service dcb6c2
    # This is an <end_label> node and no links follow such nodes
Packit Service dcb6c2
    assert len(children) == 1
Packit Service dcb6c2
    return []
Packit Service dcb6c2
  guess = 3 * len(children)
Packit Service dcb6c2
  assert children
Packit Service dcb6c2
  children = sorted(children, key=lambda x: -offsets[id(x)])
Packit Service dcb6c2
  while True:
Packit Service dcb6c2
    offset = current + guess
Packit Service dcb6c2
    buf = []
Packit Service dcb6c2
    for child in children:
Packit Service dcb6c2
      last = len(buf)
Packit Service dcb6c2
      distance = offset - offsets[id(child)]
Packit Service dcb6c2
      assert distance > 0 and distance < (1 << 21)
Packit Service dcb6c2
Packit Service dcb6c2
      if distance < (1 << 6):
Packit Service dcb6c2
        # A 6-bit offset: "s0xxxxxx"
Packit Service dcb6c2
        buf.append(distance)
Packit Service dcb6c2
      elif distance < (1 << 13):
Packit Service dcb6c2
        # A 13-bit offset: "s10xxxxxxxxxxxxx"
Packit Service dcb6c2
        buf.append(0x40 | (distance >> 8))
Packit Service dcb6c2
        buf.append(distance & 0xFF)
Packit Service dcb6c2
      else:
Packit Service dcb6c2
        # A 21-bit offset: "s11xxxxxxxxxxxxxxxxxxxxx"
Packit Service dcb6c2
        buf.append(0x60 | (distance >> 16))
Packit Service dcb6c2
        buf.append((distance >> 8) & 0xFF)
Packit Service dcb6c2
        buf.append(distance & 0xFF)
Packit Service dcb6c2
      # Distance in first link is relative to following record.
Packit Service dcb6c2
      # Distance in other links are relative to previous link.
Packit Service dcb6c2
      offset -= distance
Packit Service dcb6c2
    if len(buf) == guess:
Packit Service dcb6c2
      break
Packit Service dcb6c2
    guess = len(buf)
Packit Service dcb6c2
  # Set most significant bit to mark end of links in this node.
Packit Service dcb6c2
  buf[last] |= (1 << 7)
Packit Service dcb6c2
  buf.reverse()
Packit Service dcb6c2
  return buf
Packit Service dcb6c2
Packit Service dcb6c2
Packit Service dcb6c2
def encode_prefix(label):
Packit Service dcb6c2
  """Encodes a node label as a list of bytes without a trailing high byte.
Packit Service dcb6c2
Packit Service dcb6c2
  This method encodes a node if there is exactly one child  and the
Packit Service dcb6c2
  child follows immediately after so that no jump is needed. This label
Packit Service dcb6c2
  will then be a prefix to the label in the child node.
Packit Service dcb6c2
  """
Packit Service dcb6c2
  assert label
Packit Service dcb6c2
  return [c for c in bytearray(reversed(label))]
Packit Service dcb6c2
Packit Service dcb6c2
Packit Service dcb6c2
def encode_label(label):
Packit Service dcb6c2
  """Encodes a node label as a list of bytes with a trailing high byte >0x80.
Packit Service dcb6c2
  """
Packit Service dcb6c2
  buf = encode_prefix(label)
Packit Service dcb6c2
  # Set most significant bit to mark end of label in this node.
Packit Service dcb6c2
  buf[0] |= (1 << 7)
Packit Service dcb6c2
  return buf
Packit Service dcb6c2
Packit Service dcb6c2
Packit Service dcb6c2
def encode(dafsa, utf_mode):
Packit Service dcb6c2
  """Encodes a DAFSA to a list of bytes"""
Packit Service dcb6c2
  output = []
Packit Service dcb6c2
  offsets = {}
Packit Service dcb6c2
Packit Service dcb6c2
  for node in reversed(top_sort(dafsa)):
Packit Service dcb6c2
    if (len(node[1]) == 1 and node[1][0] and
Packit Service dcb6c2
        (offsets[id(node[1][0])] == len(output))):
Packit Service dcb6c2
      output.extend(encode_prefix(node[0]))
Packit Service dcb6c2
    else:
Packit Service dcb6c2
      output.extend(encode_links(node[1], offsets, len(output)))
Packit Service dcb6c2
      output.extend(encode_label(node[0]))
Packit Service dcb6c2
    offsets[id(node)] = len(output)
Packit Service dcb6c2
Packit Service dcb6c2
  output.extend(encode_links(dafsa, offsets, len(output)))
Packit Service dcb6c2
  output.reverse()
Packit Service dcb6c2
  if utf_mode:
Packit Service dcb6c2
    output.append(0x01)
Packit Service dcb6c2
  return output
Packit Service dcb6c2
Packit Service dcb6c2
Packit Service dcb6c2
def to_cxx(data, codecs):
Packit Service dcb6c2
  """Generates C++ code from a list of encoded bytes."""
Packit Service dcb6c2
  text = b'/* This file has been generated by psl-make-dafsa. DO NOT EDIT!\n\n'
Packit Service dcb6c2
  text += b'The byte array encodes effective tld names. See psl-make-dafsa source for'
Packit Service dcb6c2
  text += b' documentation.'
Packit Service dcb6c2
  text += b'*/\n\n'
Packit Service dcb6c2
  text += b'static const unsigned char kDafsa['
Packit Service dcb6c2
  text += bytes(str(len(data)), **codecs)
Packit Service dcb6c2
  text += b'] = {\n'
Packit Service dcb6c2
  for i in range(0, len(data), 12):
Packit Service dcb6c2
    text += b'  '
Packit Service dcb6c2
    text += bytes(', '.join('0x%02x' % byte for byte in data[i:i + 12]), **codecs)
Packit Service dcb6c2
    text += b',\n'
Packit Service dcb6c2
  text += b'};\n'
Packit Service dcb6c2
  return text
Packit Service dcb6c2
Packit Service dcb6c2
def sha1_file(name):
Packit Service dcb6c2
  sha1 = hashlib.sha1()
Packit Service dcb6c2
  with open(name, 'rb') as f:
Packit Service dcb6c2
    while True:
Packit Service dcb6c2
        data = f.read(65536)
Packit Service dcb6c2
        if not data:
Packit Service dcb6c2
            break
Packit Service dcb6c2
        sha1.update(data)
Packit Service dcb6c2
  return sha1.hexdigest()
Packit Service dcb6c2
Packit Service dcb6c2
def to_cxx_plus(data, codecs):
Packit Service dcb6c2
  """Generates C++ code from a word list plus some variable assignments as needed by libpsl"""
Packit Service dcb6c2
  text = to_cxx(data, codecs)
Packit Service dcb6c2
  text += b'static time_t _psl_file_time = %d;\n' % os.stat(psl_input_file).st_mtime
Packit Service dcb6c2
  text += b'static int _psl_nsuffixes = %d;\n' % psl_nsuffixes
Packit Service dcb6c2
  text += b'static int _psl_nexceptions = %d;\n' % psl_nexceptions
Packit Service dcb6c2
  text += b'static int _psl_nwildcards = %d;\n' % psl_nwildcards
Packit Service dcb6c2
  text += b'static const char _psl_sha1_checksum[] = "%s";\n' % bytes(sha1_file(psl_input_file), **codecs)
Packit Service dcb6c2
  text += b'static const char _psl_filename[] = "%s";\n' % bytes(psl_input_file, **codecs)
Packit Service dcb6c2
  return text
Packit Service dcb6c2
Packit Service dcb6c2
def words_to_whatever(words, converter, utf_mode, codecs):
Packit Service dcb6c2
  """Generates C++ code from a word list"""
Packit Service dcb6c2
  dafsa = to_dafsa(words, utf_mode)
Packit Service dcb6c2
  for fun in (reverse, join_suffixes, reverse, join_suffixes, join_labels):
Packit Service dcb6c2
    dafsa = fun(dafsa)
Packit Service dcb6c2
  return converter(encode(dafsa, utf_mode), codecs)
Packit Service dcb6c2
Packit Service dcb6c2
Packit Service dcb6c2
def words_to_cxx(words, utf_mode, codecs):
Packit Service dcb6c2
  """Generates C++ code from a word list"""
Packit Service dcb6c2
  return words_to_whatever(words, to_cxx, utf_mode, codecs)
Packit Service dcb6c2
Packit Service dcb6c2
def words_to_cxx_plus(words, utf_mode, codecs):
Packit Service dcb6c2
  """Generates C++ code from a word list plus some variable assignments as needed by libpsl"""
Packit Service dcb6c2
  return words_to_whatever(words, to_cxx_plus, utf_mode, codecs)
Packit Service dcb6c2
Packit Service dcb6c2
def words_to_binary(words, utf_mode, codecs):
Packit Service dcb6c2
  """Generates C++ code from a word list"""
Packit Service dcb6c2
  return b'.DAFSA@PSL_0   \n' + words_to_whatever(words, lambda x, _: bytearray(x), utf_mode, codecs)
Packit Service dcb6c2
Packit Service dcb6c2
Packit Service dcb6c2
def parse_psl(infile, utf_mode, codecs):
Packit Service dcb6c2
  """Parses PSL file and extract strings and return code"""
Packit Service dcb6c2
  PSL_FLAG_EXCEPTION = (1<<0)
Packit Service dcb6c2
  PSL_FLAG_WILDCARD = (1<<1)
Packit Service dcb6c2
  PSL_FLAG_ICANN = (1<<2) # entry of ICANN section
Packit Service dcb6c2
  PSL_FLAG_PRIVATE = (1<<3) # entry of PRIVATE section
Packit Service dcb6c2
  PSL_FLAG_PLAIN = (1<<4) #just used for PSL syntax checking
Packit Service dcb6c2
Packit Service dcb6c2
  global psl_nsuffixes, psl_nexceptions, psl_nwildcards
Packit Service dcb6c2
Packit Service dcb6c2
  psl = {}
Packit Service dcb6c2
  section = 0
Packit Service dcb6c2
Packit Service dcb6c2
  for line in infile:
Packit Service dcb6c2
    line = bytes(line.strip(), **codecs)
Packit Service dcb6c2
    if not line:
Packit Service dcb6c2
      continue
Packit Service dcb6c2
Packit Service dcb6c2
    if line.startswith(b'//'):
Packit Service dcb6c2
      if section == 0:
Packit Service dcb6c2
        if b'===BEGIN ICANN DOMAINS===' in line:
Packit Service dcb6c2
          section = PSL_FLAG_ICANN
Packit Service dcb6c2
        elif section == 0 and b'===BEGIN PRIVATE DOMAINS===' in line:
Packit Service dcb6c2
          section = PSL_FLAG_PRIVATE
Packit Service dcb6c2
      elif section == PSL_FLAG_ICANN and b'===END ICANN DOMAINS===' in line:
Packit Service dcb6c2
        section = 0
Packit Service dcb6c2
      elif section == PSL_FLAG_PRIVATE and b'===END PRIVATE DOMAINS===' in line:
Packit Service dcb6c2
        section = 0
Packit Service dcb6c2
      continue # skip comments
Packit Service dcb6c2
Packit Service dcb6c2
    if line[:1] == b'!':
Packit Service dcb6c2
      psl_nexceptions += 1
Packit Service dcb6c2
      flags = PSL_FLAG_EXCEPTION | section
Packit Service dcb6c2
      line = line[1:]
Packit Service dcb6c2
    elif line[:1] == b'*':
Packit Service dcb6c2
      if line[1:2] != b'.':
Packit Service dcb6c2
        print('Unsupported kind of rule (ignored): %s' % line)
Packit Service dcb6c2
        continue
Packit Service dcb6c2
      psl_nwildcards += 1
Packit Service dcb6c2
      psl_nsuffixes += 1
Packit Service dcb6c2
      flags = PSL_FLAG_WILDCARD | PSL_FLAG_PLAIN | section
Packit Service dcb6c2
      line = line[2:]
Packit Service dcb6c2
    else:
Packit Service dcb6c2
      psl_nsuffixes += 1
Packit Service dcb6c2
      flags = PSL_FLAG_PLAIN | section
Packit Service dcb6c2
Packit Service dcb6c2
    punycode = line.decode('utf-8').encode('idna')
Packit Service dcb6c2
Packit Service dcb6c2
    if punycode in psl:
Packit Service dcb6c2
      """Found existing entry:
Packit Service dcb6c2
         Combination of exception and plain rule is ambiguous
Packit Service dcb6c2
           !foo.bar
Packit Service dcb6c2
            foo.bar
Packit Service dcb6c2
Packit Service dcb6c2
         Allowed:
Packit Service dcb6c2
           !foo.bar + *.foo.bar
Packit Service dcb6c2
            foo.bar + *.foo.bar
Packit Service dcb6c2
      """
Packit Service dcb6c2
      print('Found %s/%X (now %X)' % punycode, psl[punycode], flags)
Packit Service dcb6c2
      continue
Packit Service dcb6c2
Packit Service dcb6c2
    if utf_mode:
Packit Service dcb6c2
      psl[line] = flags
Packit Service dcb6c2
    psl[punycode] = flags
Packit Service dcb6c2
Packit Service dcb6c2
#  with open("psl.out", 'w') as outfile:
Packit Service dcb6c2
#    for (domain, flags) in sorted(psl.iteritems()):
Packit Service dcb6c2
#      outfile.write(domain + "%X" % (flags & 0x0F) + "\n")
Packit Service dcb6c2
Packit Service dcb6c2
  return [domain + bytes('%X' % (flags & 0x0F), **codecs) for (domain, flags) in sorted(psl.items())]
Packit Service dcb6c2
Packit Service dcb6c2
Packit Service dcb6c2
def usage():
Packit Service dcb6c2
  """Prints the usage"""
Packit Service dcb6c2
  print('usage: %s [options] infile outfile' % sys.argv[0])
Packit Service dcb6c2
  print('  --output-format=cxx     Write DAFSA as C/C++ code (default)')
Packit Service dcb6c2
  print('  --output-format=cxx+    Write DAFSA as C/C++ code plus statistical assignments')
Packit Service dcb6c2
  print('  --output-format=binary  Write DAFSA binary data')
Packit Service dcb6c2
  print('  --encoding=ascii        7-bit ASCII mode')
Packit Service dcb6c2
  print('  --encoding=utf-8        UTF-8 mode (default)')
Packit Service dcb6c2
  exit(1)
Packit Service dcb6c2
Packit Service dcb6c2
Packit Service dcb6c2
def main():
Packit Service dcb6c2
  """Convert PSL file into C or binary DAFSA file"""
Packit Service dcb6c2
  if len(sys.argv) < 3:
Packit Service dcb6c2
    usage()
Packit Service dcb6c2
Packit Service dcb6c2
  converter = words_to_cxx
Packit Service dcb6c2
  parser = parse_psl
Packit Service dcb6c2
  utf_mode = True
Packit Service dcb6c2
Packit Service dcb6c2
  codecs = dict()
Packit Service dcb6c2
  if sys.version_info.major > 2:
Packit Service dcb6c2
    codecs['encoding'] = 'utf-8'
Packit Service dcb6c2
Packit Service dcb6c2
  for arg in sys.argv[1:-2]:
Packit Service dcb6c2
    # Check --input-format for backward compatibility
Packit Service dcb6c2
    if arg.startswith('--input-format='):
Packit Service dcb6c2
      value = arg[15:].lower()
Packit Service dcb6c2
      if value == 'psl':
Packit Service dcb6c2
        parser = parse_psl
Packit Service dcb6c2
      else:
Packit Service dcb6c2
        print("Unknown input format '%s'" % value)
Packit Service dcb6c2
        return 1
Packit Service dcb6c2
    elif arg.startswith('--output-format='):
Packit Service dcb6c2
      value = arg[16:].lower()
Packit Service dcb6c2
      if value == 'binary':
Packit Service dcb6c2
        converter = words_to_binary
Packit Service dcb6c2
      elif value == 'cxx':
Packit Service dcb6c2
        converter = words_to_cxx
Packit Service dcb6c2
      elif value == 'cxx+':
Packit Service dcb6c2
        converter = words_to_cxx_plus
Packit Service dcb6c2
      else:
Packit Service dcb6c2
        print("Unknown output format '%s'" % value)
Packit Service dcb6c2
        return 1
Packit Service dcb6c2
    elif arg.startswith('--encoding='):
Packit Service dcb6c2
      value = arg[11:].lower()
Packit Service dcb6c2
      if value == 'ascii':
Packit Service dcb6c2
        utf_mode = False
Packit Service dcb6c2
      elif value == 'utf-8':
Packit Service dcb6c2
        utf_mode = True
Packit Service dcb6c2
      else:
Packit Service dcb6c2
        print("Unknown encoding '%s'" % value)
Packit Service dcb6c2
        return 1
Packit Service dcb6c2
    else:
Packit Service dcb6c2
      usage()
Packit Service dcb6c2
Packit Service dcb6c2
  if sys.argv[-2] == '-':
Packit Service dcb6c2
    with open(sys.argv[-1], 'wb') as outfile:
Packit Service dcb6c2
      outfile.write(converter(parser(sys.stdin, utf_mode, codecs), utf_mode, codecs))
Packit Service dcb6c2
  else:
Packit Service dcb6c2
    """Some statistical data for --output-format=cxx+"""
Packit Service dcb6c2
    global psl_input_file, psl_nsuffixes, psl_nexceptions, psl_nwildcards
Packit Service dcb6c2
Packit Service dcb6c2
    psl_input_file = sys.argv[-2]
Packit Service dcb6c2
    psl_nsuffixes = 0
Packit Service dcb6c2
    psl_nexceptions = 0
Packit Service dcb6c2
    psl_nwildcards = 0
Packit Service dcb6c2
Packit Service dcb6c2
    with open(sys.argv[-2], 'r', **codecs) as infile, open(sys.argv[-1], 'wb') as outfile:
Packit Service dcb6c2
      outfile.write(converter(parser(infile, utf_mode, codecs), utf_mode, codecs))
Packit Service dcb6c2
Packit Service dcb6c2
  return 0
Packit Service dcb6c2
Packit Service dcb6c2
Packit Service dcb6c2
if __name__ == '__main__':
Packit Service dcb6c2
  sys.exit(main())