/* Copyright 2005-2009 Nadav Har'El and Dan Kenigsberg */

/* specfilter.c - word prefix-specifier
 * The way prefixes currently work in Hspell is that each word has an 8-bit
 * (only 6 are used) "specifier", and each prefix has a bit mask, and the
 * word+prefix combination is accepted if one of the bits is 1 in both the
 * word specifier and prefix mask.
 * The idea in this is that each bit corresponds to a prefix feature that
 * words need, and certain prefixes supply. If a word has two or more meanings,
 * the word's specifier might get two or more 1 bits.
 *
 * Now, it turns out (see bug 74) that while the word specifiers can take
 * on many values (currently, 32), some specifiers are actually equivalent,
 * in the sense that they end up allowing or disallowing exactly the same
 * set of prefixes. For example, a word with specifier 2 (PS_L) and a word
 * with specifier 3 (PS_L | PS_B) accept the same prefixes because in our
 * existing prefix set (see genprefixes.pl), every prefix which supplies
 * PS_B also supplies PS_L. As another example, a word with specifier 24
 * (PS_IMPER | PS_NONDEF) can get the same prefixes as a word with just a
 * specifier of 8 (PS_NONDEF).
 *
 * The goal of this program is to find which sets of word specifiers are
 * equivalent in the above sense, and given an input stream of word
 * specifiers (e.g., uncompressed hebrew.wgz.prefixes) it replaces all
 * different specifiers in one equivalence class with the same member
 * (the list of values left after this process is known in Mathematics
 * as a quotient set).
 *
 * The purpose of all this is to reduce the number of different word
 * specifiers present in hebrew.wgz.prefixes. For example, in Hspell 0.9
 * this brought the number from 32 down to 9. This allows slightly (10%)
 * better compression of hebrew.wgz.prefixes, but more importantly,
 * allows us to generate a much smaller affix file for aspell (see
 * mk_he_affix.c), because it will contain just 9 prefix sets instead of 32.
 *
 * CAVEAT:
 *
 * I'm still not sure whether it is wise or not to use this process on
 * the final hebrew.wgz.prefixes used by Hspell. On one hand it will make
 * it slightly smaller, but on the other hand it makes use of information
 * previously available only in the code (prefixes.c - the list of prefixes
 * and their masks) inside the word list. Meaning that if the prefix list
 * code changes, the word data will need to be changed as well - a situation
 * we didn't have previously.
 * Moreover, it will also mean that we will not be able to have run-time
 * options that choose among different prefix sets. Luckily, the "-h" option
 * is fine in this respect (see comment below on why), but it is conceivable
 * (but not likely) that in the future we might want to use completely
 * different sets of prefixes that behave differently. (?? what actually
 * matters is just the bag of different masks in the masks[] array)
 */

#include "prefixes.c"
#include <stdio.h>
#include <stdlib.h>

/* NOTE: currently, the equivalence of two word specifiers does not depend
   on whether He Hashe'ela is allowed or not. This is because He Hashe'ela
   only adds another prefix like shin hashimush - so whatever specifiers
   that can be distinguished by He Hashe'ela can already be distinguished
   by the shin.
   It is important to remember that this fact may not remain true if we
   add more options of prefix bitmask arrays, so this decision might need
   to be revisited.
*/
#define MASKS masks_noH

#define SPECBITS 6 /* maximum number of bits in PrefixBits.pl */
#define NMASKS (sizeof(MASKS)/sizeof(MASKS[0])-1)
#define NSPECS (1<<SPECBITS)

int main(void){
  int i,j;
  int num;
  genequiv();
#if 0
  fprintf(stderr, "Prefix specifier equivalences:\n");
  for(i=0;i<NSPECS;i++)
    fprintf(stderr, "%d %d\n", i, equivalent[i]);
#endif
  /* Replace each input specifier with its equivalence class representative. */
  while((i=getchar())!=EOF){
    if(i>= NSPECS){
      fprintf(stderr, "value %d in hebrew.wgz.prefixes out of bound\n", i);
      exit(1);
    }
    putchar(equivalent[i]);
  }
  return 0;
}
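
/* The equivalence described in the header comment can be sketched as shown
 * here. This is only an illustrative sketch under stated assumptions, not
 * the genequiv() used by this program: sketch_same_prefixes() and
 * sketch_genequiv() are hypothetical names, and the masks are passed in as
 * a plain int array instead of being taken from prefixes.c. Two specifiers
 * are treated as equivalent when every mask accepts both or rejects both,
 * and each specifier is mapped to the smallest specifier in its class.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if specifiers a and b are accepted by exactly the same masks.
 * A specifier is accepted by a mask when they share at least one 1 bit. */
static int sketch_same_prefixes(int a, int b, const int *masks, size_t nmasks)
{
    size_t k;
    for (k = 0; k < nmasks; k++) {
        int a_ok = (a & masks[k]) != 0;
        int b_ok = (b & masks[k]) != 0;
        if (a_ok != b_ok)
            return 0;
    }
    return 1;
}

/* Fill equiv[s], for 0 <= s < nspecs, with the smallest specifier that is
 * equivalent to s (hypothetical choice of class representative). */
static void sketch_genequiv(const int *masks, size_t nmasks,
                            int *equiv, int nspecs)
{
    int s, t;
    assert(masks != NULL && equiv != NULL && nspecs > 0);
    for (s = 0; s < nspecs; s++) {
        equiv[s] = s;               /* a new class unless a match is found */
        for (t = 0; t < s; t++) {
            if (sketch_same_prefixes(s, t, masks, nmasks)) {
                equiv[s] = t;       /* join the class of the earlier value */
                break;
            }
        }
    }
}
```

/* With the two masks {2, 3} (every mask that has bit 1 also has bit 2, as
 * in the PS_B / PS_L example above) and nspecs = 4, specifier 3 is mapped
 * to 2, while 0, 1 and 2 remain their own representatives.
 */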