libexttextcat is an N-Gram-Based Text Categorization library primarily intended
for language guessing.

Fundamentally this is an adaptation of the wise-guys.nl libtextcat, extended to
be UTF-8 aware. See README.libtextcat for details on the original libtextcat.

Building:

 * ./configure
 * make
 * make check

The tests can be run under valgrind's memcheck by setting VALGRIND=memcheck in
the environment, e.g.

 * export VALGRIND=memcheck
 * make check

Quickstart: language guesser
  
Assuming that you have successfully compiled the library, you need some
language models to start guessing languages. A collection of over 150 language
models, mostly derived by running the included "createfp" utility on UDHR
translations, is bundled with a matching configuration file in the langclass
directory:

  * cd langclass/LM
  * ../../src/testtextcat ../fpdb.conf

Paste some text onto the command line, and watch it get classified.
     
Using the API:
  
Classifying the language of a text buffer can be as easy as:

 #include "textcat.h"
 ...
 void *h = textcat_Init( "fpdb.conf" );
 ...
 printf( "Language: %s\n", textcat_Classify(h, buffer, 400) );
 ...
 textcat_Done(h);
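The string returned by textcat_Classify usually needs a little unpacking
before you can act on it. The sketch below assumes the convention described
for the original libtextcat: candidates come back as bracketed names such as
"[en--utf8][nl--utf8]", and a plain word such as "SHORT" or "UNKNOWN" means
there was no usable match. That format is an assumption here, so check
textcat.h and README.libtextcat before relying on it. The helper name
result_entry is hypothetical and not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the idx-th bracketed name from a result string such as
 * "[en--utf8][nl--utf8]" into out (NUL-terminated).
 * Returns 1 on success, 0 if there is no such entry or it does not fit.
 * The bracketed result format is an assumption, see the text above.
 */
int result_entry(const char *result, size_t idx, char *out, size_t outsize)
{
    assert(result != NULL && out != NULL && outsize > 0);

    const char *p = result;
    for (size_t i = 0; ; i++) {
        const char *open = strchr(p, '[');
        if (open == NULL)
            return 0;               /* no (further) bracketed entry */
        const char *close = strchr(open + 1, ']');
        if (close == NULL)
            return 0;               /* malformed: '[' without ']' */
        if (i == idx) {
            size_t len = (size_t)(close - (open + 1));
            if (len >= outsize)
                return 0;           /* name does not fit in out */
            memcpy(out, open + 1, len);
            out[len] = '\0';
            return 1;
        }
        p = close + 1;              /* continue after this entry */
    }
}
```

Called with index 0 this yields the best candidate. A return value of 0 for
index 0 covers the case where the classifier reported no match at all.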
      
Creating your own fingerprints:
  
The createfp program allows you to easily create your own document
fingerprints. Just feed it an example document on standard input and store the
standard output, as in the worked example below.

Put the names of your fingerprints in a configuration file, add some IDs, and
you're ready to classify.

Here's a worked example. The UN Universal Declaration of Human Rights (UDHR) is
available in a massive pile of translations[4], and unicode.org makes many of
these available as plain text[5], so...

% cd langclass/ShortTexts/
% wget http://unicode.org/udhr/d/udhr_abk.txt
% tail -n+7 udhr_abk.txt > ab.txt # skip the English header; ab is the BCP-47 tag
% cd ../LM
% ../../src/createfp < ../ShortTexts/ab.txt > ab.lm
% echo "ab.lm       ab--utf8" >> ../fpdb.conf

Eventually we'll drop fpdb.conf and assume the name of the fingerprint .lm file
is the correct BCP-47 tag for the language it detects.
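The line appended to fpdb.conf in the worked example has two
whitespace-separated fields: the fingerprint file and the ID reported for a
match. If you want to inspect or validate such lines in your own tooling,
something like the sketch below may do. The struct and function names are
hypothetical, and this is not the parser the library itself uses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record type for one configuration line. */
struct fp_entry {
    char file[256];  /* fingerprint file, e.g. "ab.lm" */
    char id[64];     /* ID reported for a match, e.g. "ab--utf8" */
};

/*
 * Split one line into its two whitespace-separated fields.
 * Returns 1 when both fields were found, 0 otherwise (e.g. a blank line).
 */
int parse_conf_line(const char *line, struct fp_entry *entry)
{
    assert(line != NULL && entry != NULL);

    memset(entry, 0, sizeof *entry);
    /* Field widths are one less than the buffer sizes to leave room for NUL. */
    return sscanf(line, "%255s %63s", entry->file, entry->id) == 2;
}
```

Feeding it the line from the example fills file with "ab.lm" and id with
"ab--utf8". Any amount of spacing between the two fields is accepted.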
    
Performance tuning:

This library was made with efficiency in mind. There are a couple of
parameters you may wish to tweak if you intend to use it for tasks other
than language guessing.

The most important thing is buffer size. For reliable language
guessing the classifier only needs a few hundred bytes at most.
So don't feed it 100KB of text unless you are creating a fingerprint.
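When you cap the buffer size for UTF-8 input, it seems sensible not to cut in
the middle of a multi-byte character. This README does not say whether the
library tolerates a truncated sequence, so the following is only a
precautionary sketch. The function name utf8_safe_prefix is hypothetical, and
the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the largest length <= limit at which buf can be cut without
 * splitting a multi-byte UTF-8 sequence. Assumes buf holds valid UTF-8.
 */
size_t utf8_safe_prefix(const char *buf, size_t len, size_t limit)
{
    assert(buf != NULL);

    if (len <= limit)
        return len;                 /* nothing to trim */

    size_t cut = limit;
    /* Continuation bytes look like 10xxxxxx; back up to a character start. */
    while (cut > 0 && ((unsigned char)buf[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

The result can be passed as the size argument, for example
textcat_Classify(h, buffer, utf8_safe_prefix(buffer, len, 400)), where 400
simply mirrors the value used in the API example.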

If you insist on feeding the classifier lots of text, try fiddling
with TABLEPOW, which determines the size of the hash table that is
used to store the n-grams. Making it too small results in many hash
table collisions; making it too large causes erratic memory
behaviour. Both are bad for performance.
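To get a feel for what TABLEPOW controls, here is a small sketch that assumes
it is the base-2 exponent of the table size, as the name suggests. The value
13, the TABLESIZE and TABLEMASK macros, and the hash function are illustrative
stand-ins rather than the library's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the library's compile-time settings. */
#define TABLEPOW  13
#define TABLESIZE ((size_t)1 << TABLEPOW)   /* number of buckets */
#define TABLEMASK (TABLESIZE - 1)

/* A simple string hash (djb2), used here only for illustration. */
static uint32_t hash_ngram(const char *s)
{
    uint32_t h = 5381;
    while (*s != '\0')
        h = h * 33u + (unsigned char)*s++;
    return h;
}

/* Map an n-gram to a bucket; masking works because TABLESIZE is a power of two. */
size_t bucket_for(const char *ngram)
{
    assert(ngram != NULL);
    return (size_t)(hash_ngram(ngram) & TABLEMASK);
}
```

Under this assumption every increment of TABLEPOW doubles the number of
buckets, which halves the expected collisions per bucket but also doubles the
memory the table touches.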

Putting the most probable models at the top of the list in your config
file improves performance, because this will raise the threshold for
likely candidates more quickly.
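Why model order matters is easier to see in code. The sketch below computes
an "out-of-place" distance between two rank-ordered n-gram lists, in the
spirit of [1], and gives up as soon as the running sum passes a cutoff. If a
good model has already produced a low score, later models hit the cutoff
sooner. The function name, the penalty of 400 and the linear search are
assumptions made for illustration. This is not the library's comparison
routine.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MAX_OUT_OF_PLACE 400  /* hypothetical penalty for a missing n-gram */

/*
 * Out-of-place distance between two rank-ordered n-gram lists.
 * Stops early and returns INT_MAX once the running sum exceeds cutoff.
 */
int profile_distance(const char *const *model, size_t nmodel,
                     const char *const *doc, size_t ndoc, int cutoff)
{
    assert(model != NULL && doc != NULL);

    int sum = 0;
    for (size_t i = 0; i < ndoc; i++) {
        int penalty = MAX_OUT_OF_PLACE;
        /* Linear search keeps the sketch short; a real table would hash. */
        for (size_t j = 0; j < nmodel; j++) {
            if (strcmp(doc[i], model[j]) == 0) {
                penalty = (i > j) ? (int)(i - j) : (int)(j - i);
                break;
            }
        }
        sum += penalty;
        if (sum > cutoff)
            return INT_MAX;  /* this model can no longer beat the cutoff */
    }
    return sum;
}
```

A caller would typically start with a cutoff of INT_MAX and tighten it after
each model based on the best score so far, so a strong early match lets the
remaining comparisons bail out after a few n-grams.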

Since classification time grows roughly linearly with the number of
models, you should consider how many models you really need. In the
case of language guessing: do you really want to recognize every
language ever invented?

Acknowledgements

UTF-8 conversion and adaptation for OpenOffice.org, Jocelyn Merand.
Original libTextCat, Frank Scheelen & Rob de Wit at wise-guys.nl.
Original language models, copyright Gertjan van Noord.

References:

[1] The document that started it all, "N-Gram-Based Text Categorization",
can be downloaded from John M. Trenkle's site:

http://www.novodynamics.com/trenkle/papers/sdair-94-bc.ps.gz

[2] The Perl implementation by Gertjan van Noord (code + language
models) can be downloaded from his website:

http://odur.let.rug.nl/~vannoord/TextCat/

[3] The original libtextcat implementation:

http://software.wise-guys.nl/libtextcat/

[4] http://www.ohchr.org/EN/UDHR/Pages/SearchByLang.aspx

[5] http://unicode.org/udhr/index_by_name.html

Contact:

Questions or patches can be directed to libreoffice@lists.freedesktop.org.
Bugs can be reported at https://bugs.freedesktop.org