Tree - source-git/libexttextcat

Blame README

Packit	73e109	`libexttextcat is an N-Gram-Based Text Categorization library primarily intended`
Packit	73e109	`for language guessing.`
Packit	73e109
Packit	73e109	`Fundamentally this is an adaptation of WiseGuys' libtextcat, extended to be UTF-8`
Packit	73e109	`aware. See README.libtextcat for details on the original libtextcat.`
Packit	73e109
Packit	73e109	`Building:`
Packit	73e109
Packit	73e109	`* ./configure`
Packit	73e109	`* make`
Packit	73e109	`* make check`
Packit	73e109
Packit	73e109	`The tests can be run under valgrind's memcheck by exporting VALGRIND=memcheck,`
Packit	73e109	`e.g.`
Packit	73e109
Packit	73e109	`* export VALGRIND=memcheck`
Packit	73e109	`* make check`
Packit	73e109
Packit	73e109	`Quickstart: language guesser`
Packit	73e109
Packit	73e109	`Assuming that you have successfully compiled the library, you need some`
Packit	73e109	`language models to start guessing languages. A collection of over 150 language`
Packit	73e109	`models, mostly derived from using the included "createfp" utility on UDHR`
Packit	73e109	`translations, is bundled, with a matching configuration file, in the langclass`
Packit	73e109	`directory:`
Packit	73e109
Packit	73e109	`* cd langclass/LM`
Packit	73e109	`* ../../src/testtextcat ../fpdb.conf`
Packit	73e109
Packit	73e109	`Paste some text onto the command line and watch it get classified.`
Packit	73e109
Packit	73e109	`Using the API:`
Packit	73e109
Packit	73e109	`Classifying the language of a text buffer can be as easy as:`
Packit	73e109
Packit	73e109	`#include "textcat.h"`
Packit	73e109	`...`
Packit	73e109	`void *h = textcat_Init( "fpdb.conf" );`
Packit	73e109	`...`
Packit	73e109	`printf( "Language: %s\n", textcat_Classify(h, buffer, 400) );`
Packit	73e109	`...`
Packit	73e109	`textcat_Done(h);`
Packit	73e109
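The snippet above prints whatever textcat_Classify returns. As a minimal
sketch, assuming the result is a string of bracketed ids such as
"[en--utf8][fr--utf8]" and a bare status word when no guess is possible (check
textcat.h for the exact contract), the first candidate could be pulled out like
this. first_candidate is a hypothetical helper, not part of the library:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the library: copy the first bracketed
 * id from a result such as "[en--utf8][fr--utf8]" into out.
 * Returns 1 on success, 0 if the string holds no "[...]" group
 * (for instance a bare status word). The bracketed format is an
 * assumption; verify it against textcat.h. */
static int first_candidate(const char *result, char *out, size_t outsz)
{
    const char *open, *close;
    size_t len;

    assert(result != NULL && out != NULL && outsz > 0);

    open = strchr(result, '[');
    if (open == NULL)
        return 0;
    close = strchr(open + 1, ']');
    if (close == NULL)
        return 0;

    len = (size_t)(close - (open + 1));
    if (len >= outsz)
        len = outsz - 1;          /* truncate rather than overflow */
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 1;
}
```

A caller would pass the string returned by textcat_Classify together with a
small buffer, and treat a return value of 0 as "no usable guess".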
Packit	73e109	`Creating your own fingerprints:`
Packit	73e109
Packit	73e109	`The createfp program allows you to easily create your own document`
Packit	73e109	`fingerprints. Just feed it an example document on standard input and store the`
Packit	73e109	`standard output in a file, as in the worked example below.`
Packit	73e109
Packit	73e109	`Put the names of your fingerprints in a configuration file, add some ids, and`
Packit	73e109	`you're ready to classify.`
Packit	73e109
Packit	73e109	`Here's a worked example. The Universal Declaration of Human Rights (UDHR) is`
Packit	73e109	`available in a massive pile of translations[4], and unicode.org makes many of`
Packit	73e109	`these available as plain text[5], so...`
Packit	73e109
Packit	73e109	`% cd langclass/ShortTexts/`
Packit	73e109	`% wget http://unicode.org/udhr/d/udhr_abk.txt`
Packit	73e109	`% tail -n+7 udhr_abk.txt > ab.txt # skip the English header; the file name is the BCP-47 tag`
Packit	73e109	`% cd ../LM`
Packit	73e109	`% ../../src/createfp < ../ShortTexts/ab.txt > ab.lm`
Packit	73e109	`% echo "ab.lm ab--utf8" >> ../fpdb.conf`
Packit	73e109
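The line appended to fpdb.conf above pairs a fingerprint file with an id,
separated by whitespace. A rough sketch of splitting such a line follows.
parse_conf_line, next_field and FIELD_MAX are hypothetical names used for
illustration only; the library's own configuration parser may accept more than
this (comments, for example):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FIELD_MAX 256   /* illustrative limit, not taken from the library */

/* Copy the next whitespace-delimited token from *p into out, which must
 * hold FIELD_MAX bytes. Advances *p past the token.
 * Returns 1 if a token was found, 0 if none or if it is too long. */
static int next_field(const char **p, char *out)
{
    size_t len;

    *p += strspn(*p, " \t\r\n");      /* skip leading whitespace */
    len = strcspn(*p, " \t\r\n");     /* length of the token itself */
    if (len == 0 || len >= FIELD_MAX)
        return 0;
    memcpy(out, *p, len);
    out[len] = '\0';
    *p += len;
    return 1;
}

/* Split a line of the form "<fingerprint file> <id>" into its two fields.
 * Returns 1 if both fields are present, 0 otherwise. */
static int parse_conf_line(const char *line, char *file, char *id)
{
    assert(line != NULL && file != NULL && id != NULL);
    return next_field(&line, file) && next_field(&line, id);
}
```

For the entry written in the example, this would yield "ab.lm" as the file and
"ab--utf8" as the id.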
Packit	73e109	`Eventually we'll drop fpdb.conf and assume the name of the fingerprint .lm file`
Packit	73e109	`is the correct BCP-47 tag for the language it detects.`
Packit	73e109
Packit	73e109	`Performance tuning:`
Packit	73e109
Packit	73e109	`This library was made with efficiency in mind. There are a couple of`
Packit	73e109	`parameters you may wish to tweak if you intend to use it for tasks`
Packit	73e109	`other than language guessing.`
Packit	73e109
Packit	73e109	`The most important thing is buffer size. For reliable language`
Packit	73e109	`guessing the classifier needs only a few hundred bytes at most.`
Packit	73e109	`So don't feed it 100KB of text unless you are creating a fingerprint.`
Packit	73e109
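Since the library is UTF-8 aware, a caller that caps the buffer at a few
hundred bytes may want to avoid cutting a multi-byte character in half. A
minimal sketch, assuming the input is valid UTF-8; utf8_clamp is a hypothetical
helper and not part of the library:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the largest length <= max_bytes that does
 * not end in the middle of a UTF-8 sequence.
 * Assumes buf holds len bytes of valid UTF-8. */
static size_t utf8_clamp(const char *buf, size_t len, size_t max_bytes)
{
    size_t n;

    assert(buf != NULL);
    if (len <= max_bytes)
        return len;               /* nothing to trim */

    n = max_bytes;
    /* buf[n] is the first byte that would be dropped. If it is a
     * continuation byte (10xxxxxx), the cut falls inside a character,
     * so back up to that character's lead byte and drop it whole. */
    while (n > 0 && ((unsigned char)buf[n] & 0xC0) == 0x80)
        n--;
    return n;
}
```

The returned length could then be passed as the size argument of
textcat_Classify in place of a fixed value such as the 400 used earlier.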
Packit	73e109	`If you insist on feeding the classifier lots of text, try fiddling`
Packit	73e109	`with TABLEPOW, which determines the size of the hash table that is`
Packit	73e109	`used to store the n-grams. Making it too small results in many`
Packit	73e109	`hash table clashes; making it too large leads to erratic memory`
Packit	73e109	`behaviour. Both are bad for performance.`
Packit	73e109
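To make the trade-off concrete, here is a small sketch. It assumes that
TABLEPOW is the exponent of a power-of-two table size (2^TABLEPOW slots), which
the name suggests but which should be checked in the source. The FNV-1a hash is
a stand-in chosen for brevity, and bucket and count_clashes are hypothetical
helpers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, used here only as a stand-in hash; the library's own hash
 * function may differ. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Bucket index for a table of 2^tablepow slots. Because the size is a
 * power of two, masking with (size - 1) replaces a modulo. */
static uint32_t bucket(const char *ngram, unsigned tablepow)
{
    uint32_t size;

    assert(tablepow > 0 && tablepow < 32);
    size = (uint32_t)1 << tablepow;
    return fnv1a(ngram) & (size - 1);
}

/* Count how many of the n n-grams land in an already occupied bucket.
 * This sketch uses a fixed scratch array, so tablepow must be <= 10. */
static unsigned count_clashes(const char *const *ngrams, size_t n,
                              unsigned tablepow)
{
    unsigned char used[1u << 10] = {0};
    unsigned clashes = 0;
    size_t i;

    assert(tablepow <= 10);
    for (i = 0; i < n; i++) {
        uint32_t b = bucket(ngrams[i], tablepow);
        if (used[b])
            clashes++;
        else
            used[b] = 1;
    }
    return clashes;
}
```

With this masking scheme a larger exponent can never produce more clashes than
a smaller one for the same set of n-grams, but every extra bit doubles the
table that has to be allocated and walked.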
Packit	73e109	`Putting the most probable models at the top of the list in your config`
Packit	73e109	`file improves performance, because this will raise the threshold for`
Packit	73e109	`likely candidates more quickly.`
Packit	73e109
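One way to picture why ordering matters is the "out-of-place" distance from the
Cavnar and Trenkle paper [1], combined with an early exit once a model is
already worse than the best score seen so far. The sketch below is an
illustration under those assumptions, not the library's actual comparison
routine; out_of_place and MAX_PENALTY are hypothetical, and the linear search
is kept only for brevity:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENALTY 400  /* illustrative cost for an n-gram missing from the model */

/* Out-of-place distance between a document profile and a model profile,
 * both given as arrays of n-grams sorted from most to least frequent.
 * Returns as soon as the running sum exceeds cutoff, so the value may be
 * a partial sum in that case. */
static long out_of_place(const char *const *doc, size_t ndoc,
                         const char *const *model, size_t nmodel,
                         long cutoff)
{
    long sum = 0;
    size_t i, j;

    assert(doc != NULL && model != NULL);
    for (i = 0; i < ndoc; i++) {
        long penalty = MAX_PENALTY;

        /* Find the n-gram's rank in the model; the penalty is how far
         * that rank is from its rank in the document. */
        for (j = 0; j < nmodel; j++) {
            if (strcmp(doc[i], model[j]) == 0) {
                penalty = (i > j) ? (long)(i - j) : (long)(j - i);
                break;
            }
        }
        sum += penalty;
        if (sum > cutoff)
            return sum;   /* already worse than the best candidate so far */
    }
    return sum;
}
```

If a likely model is compared first, its low score becomes the cutoff for all
later comparisons, so unlikely models are abandoned after only a few n-grams.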
Packit	73e109	`Since the speed of the classifier is roughly linear with respect to`
Packit	73e109	`the number of models, you should consider how many models you really`
Packit	73e109	`need. In case of language guessing: do you really want to recognize`
Packit	73e109	`every language ever invented?`
Packit	73e109
Packit	73e109	`Acknowledgements:`
Packit	73e109
Packit	73e109	`UTF-8 conversion and adaptation for OpenOffice.org, Jocelyn Merand.`
Packit	73e109	`Original libTextCat, Frank Scheelen & Rob de Wit at wise-guys.nl.`
Packit	73e109	`Original language models, copyright Gertjan van Noord.`
Packit	73e109
Packit	73e109	`References:`
Packit	73e109
Packit	73e109	`[1] The document that started it all can be downloaded at John M.`
Packit	73e109	`Trenkle's site: N-Gram-Based Text Categorization`
Packit	73e109
Packit	73e109	`http://www.novodynamics.com/trenkle/papers/sdair-94-bc.ps.gz`
Packit	73e109
Packit	73e109	`[2] The Perl implementation by Gertjan van Noord (code + language`
Packit	73e109	`models): downloadable from his website`
Packit	73e109
Packit	73e109	`http://odur.let.rug.nl/~vannoord/TextCat/`
Packit	73e109
Packit	73e109	`[3] Original libtextcat implementation at`
Packit	73e109
Packit	73e109	`http://software.wise-guys.nl/libtextcat/`
Packit	73e109
Packit	73e109	`[4] http://www.ohchr.org/EN/UDHR/Pages/SearchByLang.aspx`
Packit	73e109
Packit	73e109	`[5] http://unicode.org/udhr/index_by_name.html`
Packit	73e109
Packit	73e109	`Contact:`
Packit	73e109
Packit	73e109	`Questions or patches can be directed to libreoffice@lists.freedesktop.org.`
Packit	73e109	`Bugs can be directed to https://bugs.freedesktop.org`
