Blame data_layout.txt

Packit 1184b9
Description of the Structure of the Data needed by MyThes
--------------------------------------------------------

MyThes is very simple.  Almost all of the "smarts" are really
in the thesaurus data file itself.

The format for this file is as follows:

- no binary data

- each line ends with a newline '\n', not a carriage return/linefeed pair

- Line 1 is a character string that describes the encoding
used for the file.  It is up to the calling program to convert
to and from this encoding if necessary.

     ISO8859-1 is used by the th_en_US_new.dat file.

     Strings currently recognized by OpenOffice.org are:

     UTF-8
     ISO8859-1
     ISO8859-2
     ISO8859-3
     ISO8859-4
     ISO8859-5
     ISO8859-6
     ISO8859-7
     ISO8859-8
     ISO8859-9
     ISO8859-10
     KOI8-R
     CP-1251
     ISO8859-14
     ISCII-DEVANAGARI


- All of the remaining lines of the file follow this structure:

entry|num_mean
pos|syn1_mean|syn2|...
.
.
.
pos|syn1_mean|syn2|...


where:

   entry      - all lowercase version of the word or phrase being described
   num_mean   - number of meanings for this entry

   There is one meaning per line and each meaning is composed of

   pos        - part of speech or other meaning-specific description
   syn1_mean  - synonym 1, also used to describe the meaning itself
   syn2       - synonym 2 for that meaning, etc.
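
The rules above are mechanical enough to check in code.  What follows
is a minimal Python sketch and not part of MyThes: the function name
check_layout and its error messages are made up for illustration, and
the set of encoding names is simply copied from the list above.  It
assumes the whole file fits in memory.

```python
# Encoding names copied from the list of recognized strings above.
RECOGNIZED_ENCODINGS = {
    "UTF-8", "KOI8-R", "CP-1251", "ISO8859-14", "ISCII-DEVANAGARI",
} | {f"ISO8859-{n}" for n in range(1, 11)}


def check_layout(data: bytes) -> str:
    """Check the basic layout rules and return the encoding name on line 1.

    Raises ValueError on a carriage return, an unrecognized encoding
    name, or an entry whose meaning lines do not match its num_mean.
    """
    if b"\r" in data:
        raise ValueError("carriage return found; lines must end with '\\n'")

    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final newline
    if not lines:
        raise ValueError("empty file")

    encoding = lines[0].decode("ascii")
    if encoding not in RECOGNIZED_ENCODINGS:
        raise ValueError(f"unrecognized encoding string: {encoding!r}")

    i = 1
    while i < len(lines):
        # Entry line: entry|num_mean (split on the last '|').
        entry, sep, count = lines[i].rpartition(b"|")
        if not sep or not count.isdigit():
            raise ValueError(f"line {i + 1}: expected entry|num_mean")
        num_mean = int(count)

        # The next num_mean lines are the meanings: pos|syn1_mean|...
        block = lines[i + 1 : i + 1 + num_mean]
        if len(block) != num_mean:
            raise ValueError(f"line {i + 1}: file ends before all meanings")
        for line in block:
            if b"|" not in line:
                raise ValueError(f"line {i + 1}: meaning without a synonym")
        i += 1 + num_mean

    return encoding
```

A check like this works on raw bytes on purpose, so it does not need
to know how to decode the encoding named on line 1.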


To make this even clearer, here is actual data for the
entry "simple".

simple|9
(adj)|simple |elemental|ultimate|oversimplified|simplistic|simplex|simplified|unanalyzable|undecomposable|uncomplicated|unsophisticated|easy|plain|unsubdivided
(adj)|elementary|uncomplicated|unproblematic|easy
(adj)|bare|mere|plain
(adj)|childlike|wide-eyed|dewy-eyed|naive |naif
(adj)|dim-witted|half-witted|simple-minded|retarded
(adj)|simple |unsubdivided|unlobed|smooth
(adj)|plain
(noun)|herb|herbaceous plant
(noun)|simpleton|person|individual|someone|somebody|mortal|human|soul


It says that "simple" has 9 different meanings, and each
meaning has its part of speech and at least 1 synonym,
with any others following on the same line.
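
As a rough illustration of how such a block can be read, here is a
small Python sketch.  It is not the MyThes implementation, and
parse_entry is a hypothetical helper name.  It splits the entry line
on the last '|' to get the meaning count, then splits each meaning
line on '|' into its part of speech and its synonyms.  Fields are kept
verbatim, including any trailing spaces.

```python
def parse_entry(lines):
    """Parse one entry block into (entry, meanings).

    `lines` holds the "entry|num_mean" line followed by its meaning
    lines.  Each meaning becomes a (pos, synonyms) tuple.
    """
    # Split on the last '|' so the count is always the final field.
    entry, _, count = lines[0].rpartition("|")
    num_mean = int(count)

    meaning_lines = lines[1 : 1 + num_mean]
    if len(meaning_lines) != num_mean:
        raise ValueError(f"{entry!r}: expected {num_mean} meaning lines")

    meanings = []
    for line in meaning_lines:
        # First field is the part of speech; the rest are synonyms.
        pos, *synonyms = line.split("|")
        meanings.append((pos, synonyms))
    return entry, meanings


# The "simple" entry from above; the first meaning is a single line.
block = [
    "simple|9",
    "(adj)|simple |elemental|ultimate|oversimplified|simplistic|simplex|"
    "simplified|unanalyzable|undecomposable|uncomplicated|unsophisticated|"
    "easy|plain|unsubdivided",
    "(adj)|elementary|uncomplicated|unproblematic|easy",
    "(adj)|bare|mere|plain",
    "(adj)|childlike|wide-eyed|dewy-eyed|naive |naif",
    "(adj)|dim-witted|half-witted|simple-minded|retarded",
    "(adj)|simple |unsubdivided|unlobed|smooth",
    "(adj)|plain",
    "(noun)|herb|herbaceous plant",
    "(noun)|simpleton|person|individual|someone|somebody|mortal|human|soul",
]

entry, meanings = parse_entry(block)
print(entry, "has", len(meanings), "meanings")
for pos, synonyms in meanings:
    print(pos, synonyms)
```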



Once you have created your own structured text file you can use
the perl program "th_gen_idx.pl", which can be found in this
directory, to create an index file that is used by the MyThes code
to seek into your data file.

The correct way to run the perl program is as follows:

cat th_en_US_new.dat | ./th_gen_idx.pl > th_en_US_new.idx



Then if you run head on the resulting index file you should see the
following:

ISO8859-1
142689
'hood|10
's gravenhage|88
'tween|173
'tween decks|196
.22|231
.22 caliber|319
.22 calibre|365
.38 caliber|411
.38 calibre|457
.45 caliber|503
.45 calibre|549
0|595
1|666
1 chronicles|6283
1 esdras|6336


Line 1 is the same encoding string taken from the
structured thesaurus data file.

Line 2 is a count of the total number of entries
in your thesaurus.

All of the remaining lines are of the form

entry|byte_offset_into_data_file_where_entry_is_found
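
To show how the byte offsets tie the two files together, here is a
short Python sketch.  It is not a port of th_gen_idx.pl and not the
MyThes lookup code: build_index and lookup are hypothetical names, the
byte-wise sort is an assumption based on the sorted sample above, and
the linear scan is only there to keep the example short.  Note that
the first offset in the sample, 10, is exactly the length of
"ISO8859-1" plus its newline.

```python
import io


def build_index(data: bytes) -> bytes:
    """Build index-file contents for thesaurus data held in memory."""
    lines = data.split(b"\n")
    encoding = lines[0]
    offset = len(encoding) + 1  # first entry starts after line 1
    entries = []

    i = 1
    while i < len(lines) and lines[i]:
        entry, _, count = lines[i].rpartition(b"|")
        entries.append((entry, offset))
        # Advance past the entry line and its meaning lines.
        for line in lines[i : i + 1 + int(count)]:
            offset += len(line) + 1  # +1 for the newline
        i += 1 + int(count)

    entries.sort()  # assumption: plain byte-wise ordering
    out = [encoding, str(len(entries)).encode("ascii")]
    out += [e + b"|" + str(off).encode("ascii") for e, off in entries]
    return b"\n".join(out) + b"\n"


def lookup(index: bytes, data_file, word: bytes):
    """Return the raw lines of the block for `word`, or None."""
    for line in index.split(b"\n")[2:]:  # skip encoding and count
        if not line:
            continue
        entry, _, off = line.rpartition(b"|")
        if entry == word:
            data_file.seek(int(off))
            header = data_file.readline()
            count = int(header.rstrip(b"\n").rpartition(b"|")[2])
            return [header] + [data_file.readline() for _ in range(count)]
    return None


# Toy data in the documented layout, shortened for the demo;
# it is not real thesaurus content.
toy = (
    b"ISO8859-1\n"
    b"simple|2\n"
    b"(adj)|bare|mere|plain\n"
    b"(adj)|plain\n"
    b"plain|1\n"
    b"(adj)|simple|bare\n"
)
toy_index = build_index(toy)
print(toy_index.decode("ascii"))
print(lookup(toy_index, io.BytesIO(toy), b"plain"))
```

Because the offsets count bytes rather than characters, the data file
should be opened in binary mode when seeking.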


That's all there is to it.


Kevin
kevin.hendricks@sympatico.ca