libexttextcat is an N-Gram-Based Text Categorization library primarily intended
for language guessing.

Fundamentally this is an adaptation of the wise-guys libtextcat, extended to be
UTF-8 aware. See README.libtextcat for details on the original libtextcat.

Building:

* ./configure
* make
* make check

The tests can be run under valgrind's memcheck with export VALGRIND=memcheck,
e.g.

* export VALGRIND=memcheck
* make check

Quickstart: language guesser

Assuming that you have successfully compiled the library, you need some
language models to start guessing languages. A collection of over 150 language
models, mostly derived from using the included "createfp" utility on UDHR
translations, is bundled, with a matching configuration file, in the langclass
directory:

* cd langclass/LM
* ../../src/testtextcat ../fpdb.conf

Paste some text onto the command line, and watch it get classified.

Using the API:

Classifying the language of a text buffer can be as easy as:

#include <stdio.h>
#include "textcat.h"
...
void *h = textcat_Init( "fpdb.conf" );  /* load the models listed in fpdb.conf */
...
/* classify the first 400 bytes of buffer */
printf( "Language: %s\n", textcat_Classify(h, buffer, 400) );
...
textcat_Done(h);  /* release the handle */

Creating your own fingerprints:

The createfp program allows you to easily create your own document
fingerprints. Just feed it an example document on standard input, and store the
standard output.

Put the names of your fingerprints in a configuration file, add some IDs, and
you're ready to classify.

Here's a worked example. The Universal Declaration of Human Rights (UDHR) is
available in a massive pile of translations[4], and unicode.org makes many of
these available as plain text[5], so...

% cd langclass/ShortTexts/
% wget http://unicode.org/udhr/d/udhr_abk.txt
% tail -n+7 udhr_abk.txt > ab.txt  # skip the English header; the name is a BCP-47 tag
% cd ../LM
% ../../src/createfp < ../ShortTexts/ab.txt > ab.lm
% echo "ab.lm ab--utf8" >> ../fpdb.conf

Eventually we'll drop fpdb.conf and assume the name of the fingerprint .lm file
is the correct BCP-47 tag for the language it detects.

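For intuition about what a fingerprint holds, here is a minimal sketch of the
ranked n-gram idea described in [1]: count the n-grams in a sample and list the
most frequent first. It is not createfp's actual code. It only counts byte
bigrams, ignores UTF-8 character boundaries, and the names count_bigrams,
struct gram and MAXGRAMS are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define MAXGRAMS 256   /* hypothetical cap on distinct bigrams kept */

struct gram {
    char text[3];      /* two bytes plus terminating NUL */
    unsigned count;
};

/* qsort comparator: higher counts first, ties broken alphabetically */
static int by_count_desc(const void *a, const void *b)
{
    const struct gram *x = a, *y = b;
    if (x->count != y->count)
        return (x->count < y->count) ? 1 : -1;
    return strcmp(x->text, y->text);
}

/* Count byte bigrams in s, store them ranked in out[], return how many. */
static size_t count_bigrams(const char *s, struct gram *out)
{
    size_t n = 0, len = strlen(s);
    for (size_t i = 0; i + 1 < len; i++) {
        size_t j;
        for (j = 0; j < n; j++)            /* linear search: fine for a sketch */
            if (out[j].text[0] == s[i] && out[j].text[1] == s[i + 1])
                break;
        if (j == n) {
            if (n == MAXGRAMS)
                continue;                  /* table full: drop new bigrams */
            out[n].text[0] = s[i];
            out[n].text[1] = s[i + 1];
            out[n].text[2] = '\0';
            out[n].count = 0;
            n++;
        }
        out[j].count++;
    }
    qsort(out, n, sizeof *out, by_count_desc);
    return n;
}
```

Calling count_bigrams("banana bandana", grams) with a struct gram
grams[MAXGRAMS] array leaves the most frequent bigram in grams[0]. A real
fingerprint covers several n-gram lengths and far more text, but the ranking
principle is the same.
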
Performance tuning:

This library was made with efficiency in mind. There are a couple of
parameters you may wish to tweak if you intend to use it for other
tasks than language guessing.

The most important thing is buffer size. For reliable language
guessing the classifier only needs a few hundred bytes at most.
So don't feed it 100KB of text unless you are creating a fingerprint.

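Because the input is UTF-8, a caller that caps the buffer may want to avoid
cutting a multi-byte character in half. The following is a minimal sketch,
assuming the caller does the truncation itself; utf8_prefix_len is a
hypothetical helper, not part of textcat.h, and whether the library needs this
care is not stated here.

```c
#include <stddef.h>

/*
 * Return the largest length <= max_bytes at which buf (len bytes of
 * UTF-8) can be cut without splitting a multi-byte sequence.
 * Hypothetical helper, not part of textcat.h.
 */
static size_t utf8_prefix_len(const char *buf, size_t len, size_t max_bytes)
{
    if (len <= max_bytes)
        return len;                     /* nothing to cut */

    size_t cut = max_bytes;
    /* UTF-8 continuation bytes look like 10xxxxxx; back up until the
       byte at the cut position starts a new character. */
    while (cut > 0 && ((unsigned char)buf[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

The result can be passed as the size argument, for example
textcat_Classify(h, buffer, utf8_prefix_len(buffer, strlen(buffer), 400)).
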
If you insist on feeding the classifier lots of text, try fiddling
with TABLEPOW, which determines the size of the hash table that is
used to store the n-grams. Making it too small will result in many
hash collisions; making it too large will cause erratic memory
behaviour. Both are bad for performance.

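The trade-off can be made concrete with a sketch of a power-of-two table. It
assumes, as the name suggests but the text above does not state, that the table
has 2^TABLEPOW slots. The value 13 and the FNV-1a hash are placeholders chosen
for illustration, not the library's own.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLEPOW  13                       /* assumed value, for illustration */
#define TABLESIZE ((size_t)1 << TABLEPOW)  /* 8192 slots */
#define TABLEMASK (TABLESIZE - 1)

/* 32-bit FNV-1a over the n-gram bytes; a stand-in hash function. */
static uint32_t fnv1a(const char *s, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Map an n-gram to a slot; masking works because TABLESIZE is a power of two. */
static size_t slot_for(const char *ngram, size_t len)
{
    return fnv1a(ngram, len) & TABLEMASK;
}
```

Under this assumption, each extra bit in TABLEPOW doubles the number of slots:
fewer collisions, but more memory to allocate and touch.
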
Putting the most probable models at the top of the list in your configuration
file improves performance, because this will raise the threshold for
likely candidates more quickly.

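One way to read this: if a classifier abandons a candidate as soon as its
running distance exceeds the best score seen so far, a good early match lets it
cut later comparisons short. The sketch below illustrates that idea with the
out-of-place rank distance from [1]. The data layout and the names
distance_with_cutoff and MAX_PENALTY are assumptions, not the library's
internals.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENALTY 400   /* assumed penalty for an n-gram missing from the model */

/*
 * Out-of-place distance between a document profile and a model profile,
 * both given as arrays of n-gram strings ordered by rank.
 * Gives up and returns INT_MAX once the running sum exceeds cutoff.
 */
static int distance_with_cutoff(const char *const *doc, size_t ndoc,
                                const char *const *model, size_t nmodel,
                                int cutoff)
{
    int sum = 0;
    for (size_t i = 0; i < ndoc; i++) {
        int penalty = MAX_PENALTY;
        for (size_t j = 0; j < nmodel; j++) {
            if (strcmp(doc[i], model[j]) == 0) {
                penalty = (i > j) ? (int)(i - j) : (int)(j - i);
                break;
            }
        }
        sum += penalty;
        if (sum > cutoff)
            return INT_MAX;   /* cannot beat the best candidate so far */
    }
    return sum;
}
```

In this sketch a caller would pass the lowest distance found so far as cutoff,
so a likely model tried first makes the remaining comparisons stop earlier.
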
Since the speed of the classifier is roughly linear with respect to
the number of models, you should consider how many models you really
need. In the case of language guessing: do you really want to recognize
every language ever invented?

Acknowledgements:

UTF-8 conversion and adaptation for OpenOffice.org, Jocelyn Merand.
Original libTextCat, Frank Scheelen & Rob de Wit at wise-guys.nl.
Original language models, copyright Gertjan van Noord.

References:

[1] The document that started it all can be downloaded at John M.
    Trenkle's site: N-Gram-Based Text Categorization

    http://www.novodynamics.com/trenkle/papers/sdair-94-bc.ps.gz

[2] The Perl implementation by Gertjan van Noord (code + language
    models): downloadable from his website

    http://odur.let.rug.nl/~vannoord/TextCat/

[3] Original libtextcat implementation at

    http://software.wise-guys.nl/libtextcat/

[4] http://www.ohchr.org/EN/UDHR/Pages/SearchByLang.aspx

[5] http://unicode.org/udhr/index_by_name.html

Contact:

Questions or patches can be directed to libreoffice@lists.freedesktop.org.
Bugs can be directed to https://bugs.freedesktop.org