Description of the Structure of the Data needed by MyThes
--------------------------------------------------------

MyThes is very simple. Almost all of the "smarts" are really
in the thesaurus data file itself.

The format for this file is as follows:

- no binary data

- each line ends with a newline '\n', not a carriage return/linefeed pair

- Line 1 is a character string that describes the encoding
  used for the file. It is up to the calling program to convert
  to and from this encoding if necessary.

  ISO8859-1 is used by the th_en_US_new.dat file.

  Strings currently recognized by OpenOffice.org are:

      UTF-8
      ISO8859-1
      ISO8859-2
      ISO8859-3
      ISO8859-4
      ISO8859-5
      ISO8859-6
      ISO8859-7
      ISO8859-8
      ISO8859-9
      ISO8859-10
      KOI8-R
      CP-1251
      ISO8859-14
      ISCII-DEVANAGARI


- All of the remaining lines of the file follow this structure

    entry|num_mean
    pos|syn1_mean|syn2|...
    .
    .
    .
    pos|syn1_mean|syn2|...


  where:

  entry - all lowercase version of the word or phrase being described
  num_mean - number of meanings for this entry

  There is one meaning per line and each meaning consists of

  pos - part of speech or other meaning specific description
  syn1_mean - synonym 1 also used to describe the meaning itself
  syn2 - synonym 2 for that meaning etc.


To make this even clearer, here is actual data for the
entry "simple".

simple|9
(adj)|simple |elemental|ultimate|oversimplified|simplistic|simplex|simplified|unanalyzable|
undecomposable|uncomplicated|unsophisticated|easy|plain|unsubdivided
(adj)|elementary|uncomplicated|unproblematic|easy
(adj)|bare|mere|plain
(adj)|childlike|wide-eyed|dewy-eyed|naive |naif
(adj)|dim-witted|half-witted|simple-minded|retarded
(adj)|simple |unsubdivided|unlobed|smooth
(adj)|plain
(noun)|herb|herbaceous plant
(noun)|simpleton|person|individual|someone|somebody|mortal|human|soul


It says that "simple" has 9 different meanings, and each
meaning has its part of speech and at least 1 synonym,
with others, if present, following on the same line. (The first
meaning is wrapped onto two lines here only to fit the page;
in the data file it is a single line.)
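
As an illustration only, the layout described so far can be read with a
few lines of Python. This is a minimal sketch, not the MyThes code: the
names Meaning and parse_thesaurus are hypothetical, it assumes the whole
file fits in memory, and it assumes Python has a codec registered under
the string found on line 1 (which is not guaranteed for every string
listed earlier). Stripping stray spaces around synonyms is a choice made
here, not something the format requires.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Meaning:
    pos: str             # part of speech, e.g. "(adj)"
    synonyms: List[str]  # the first synonym also describes the meaning


def parse_thesaurus(raw: bytes) -> Dict[str, List[Meaning]]:
    """Parse the bytes of a data file (hypothetical helper)."""
    lines = raw.split(b"\n")
    # Line 1 names the encoding. Assumption: Python knows this name.
    encoding = lines[0].decode("ascii").strip()
    entries: Dict[str, List[Meaning]] = {}
    i = 1
    while i < len(lines):
        if not lines[i].strip():
            i += 1           # tolerate the empty piece after the last '\n'
            continue
        entry, num_mean = lines[i].decode(encoding).split("|")
        count = int(num_mean)
        meanings = []
        # The next num_mean lines each hold one meaning: pos|syn1|syn2|...
        for line in lines[i + 1 : i + 1 + count]:
            pos, *syns = line.decode(encoding).split("|")
            meanings.append(Meaning(pos, [s.strip() for s in syns]))
        entries[entry] = meanings
        i += 1 + count
    return entries
```

Calling parse_thesaurus on the bytes of a data file returns a dictionary
keyed by entry, with one Meaning per meaning line.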

Once you have created your own structured text file you can use
the perl program "th_gen_idx.pl" which can be found in this
directory to create an index file that is used to seek into
your data file by the MyThes code.

The correct way to run the perl program is as follows:

    cat th_en_US_new.dat | ./th_gen_idx.pl > th_en_US_new.idx


Then, if you run head on the resulting index file, you should see the
following:

ISO8859-1
142689
'hood|10
's gravenhage|88
'tween|173
'tween decks|196
.22|231
.22 caliber|319
.22 calibre|365
.38 caliber|411
.38 calibre|457
.45 caliber|503
.45 calibre|549
0|595
1|666
1 chronicles|6283
1 esdras|6336


Line 1 is the same encoding string taken from the
structured thesaurus data file.

Line 2 is a count of the total number of entries
in your thesaurus.

All of the remaining lines are of the form

    entry|byte_offset_into_data_file_where_entry_is_found
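
To make the role of the byte offsets concrete, here is a minimal Python
sketch of building such an index and then using it to seek into the
data. It is not a port of th_gen_idx.pl, and build_index, parse_index
and read_entry are hypothetical names. It assumes the data ends with a
newline, that offsets are counted in bytes from the start of the file
(consistent with the sample, where the first entry sits at offset 10,
just past the 10 bytes of "ISO8859-1\n"), and that index entries are
sorted by byte value, which the sample listing suggests but this
document does not state.

```python
import io
from typing import BinaryIO, Dict, List, Tuple


def build_index(data: bytes) -> bytes:
    """Build index-file bytes from data-file bytes (illustrative only)."""
    pos = data.index(b"\n") + 1          # first byte after the encoding line
    header = data[:pos]
    entries: List[Tuple[bytes, int]] = []
    while pos < len(data):
        end = data.index(b"\n", pos)
        entry, num_mean = data[pos:end].split(b"|")
        entries.append((entry, pos))     # offset of the "entry|num_mean" line
        pos = end + 1
        for _ in range(int(num_mean)):   # skip this entry's meaning lines
            pos = data.index(b"\n", pos) + 1
    entries.sort()                       # assumption: index is kept sorted
    body = b"".join(b"%s|%d\n" % (e, off) for e, off in entries)
    return header + b"%d\n" % len(entries) + body


def parse_index(index: bytes) -> Tuple[str, Dict[bytes, int]]:
    """Return the encoding string and an entry -> byte offset mapping."""
    lines = index.split(b"\n")
    encoding = lines[0].decode("ascii")  # line 1: encoding string
    count = int(lines[1])                # line 2: number of entries
    offsets: Dict[bytes, int] = {}
    for line in lines[2 : 2 + count]:
        entry, off = line.rsplit(b"|", 1)
        offsets[entry] = int(off)
    return encoding, offsets


def read_entry(data_file: BinaryIO, offset: int, encoding: str) -> List[str]:
    """Seek to an entry and return its meaning lines as text."""
    data_file.seek(offset)
    first = data_file.readline().decode(encoding).rstrip("\n")
    _entry, num_mean = first.split("|")
    return [data_file.readline().decode(encoding).rstrip("\n")
            for _ in range(int(num_mean))]


if __name__ == "__main__":
    sample = b"ISO8859-1\nsimple|1\n(adj)|plain\n"
    enc, offsets = parse_index(build_index(sample))
    print(read_entry(io.BytesIO(sample), offsets[b"simple"], enc))
```

The point of the design is that a lookup only needs the small index in
memory; the meanings are read from the data file on demand with a
single seek.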

That's all there is to it.


Kevin
kevin.hendricks@sympatico.ca