2009 - 2015 Harri Pitkänen <hatapitk@iki.fi>                    GPLv2+

 Information provided from attribute maps of Analysis objects
 ============================================================

This file contains documentation for implementors of morphological
analyzers (Analyzer) in libvoikko. An Analyzer is an object that returns
a set of Analysis objects for a given input string. Each Analysis object
consists of key-value pairs where keys are ordinary C strings and
values are wchar_t strings. The set of required and optional attributes
in the analysis results may differ between languages and between
implementations of the same language.

Analyzers are required to be case insensitive. Information about
requirements for character case should be provided within the analysis
results, not by excluding results depending on the character case of
the input string. This design makes it easier to implement efficient
analysis operations for applications where character case does not
matter or matters only in some specific situations.

To improve reusability of code and to allow external applications to
take advantage of this feature, it is best to use the same set of keys
and the same structure for values where possible. This
document describes the guidelines that implementors of Analyzer
components are recommended (but not required) to follow when they
implement this feature.

Some analyzers may produce attributes that are expensive to create
and are not necessary for the linguistic operations performed within
libvoikko. Analyzers are advised to produce these attributes only
when the parameter "fullMorphology" is set to "true".
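
The two conventions above (case insensitive lookup, and expensive
attributes only on request) can be sketched as follows. This is a toy
illustration in Python, not libvoikko code: the function name analyze,
the options dictionary and the single hard-coded word are assumptions
made for the example, and the attribute values are illustrative.

```python
def analyze(word, options=None):
    """Toy analyzer: return a list of attribute maps for one known word.

    Hypothetical sketch; the real Analyzer interface in libvoikko differs.
    """
    options = options or {}
    full = options.get("fullMorphology") == "true"
    results = []
    # Case insensitive lookup: the input is never rejected because of its
    # character case. Case requirements are reported in STRUCTURE instead.
    if word.lower() == "kissalla":
        analysis = {"STRUCTURE": "=pppppppp", "NUMBER": "singular"}
        if full:
            # Attribute produced only on request.
            analysis["BASEFORM"] = "kissa"
        results.append(analysis)
    return results


print(analyze("KISSALLA"))
print(analyze("kissalla", {"fullMorphology": "true"}))
```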

 Attributes common to all languages
 ==================================

The following attributes should be provided by all implementations of
Analyzer for all languages:

 STRUCTURE
 =========
  This attribute describes morpheme boundaries, character case and
  hyphenation restrictions for the word. The following characters
  are used in the values of this attribute:

  = Start of a new morpheme. This must also be present at the start
    of a word.

  - Hyphen. The word can be split by text processors after this
    character without inserting an extra hyphen. If the hyphen is at a
    morpheme boundary, the boundary symbol = must be placed after the
    hyphen.

  p Letter that is written in lower case in the standard form.

  q Letter that is written in lower case in the standard form.
    Hyphenation is forbidden before this letter.

  i Letter that is written in upper case in the standard form.

  j Letter that is written in upper case in the standard form.
    Hyphenation is forbidden before this letter.

  Examples:
   Word: Matti-niminen -> STRUCTURE: =ipppp-=ppppppp
   Word: DNA-näyte ->     STRUCTURE: =jjj-=ppppp
   Word: autokauppa ->    STRUCTURE: =pppp=pppppp
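
  As an illustration of how a client might consume STRUCTURE values, the
  following Python sketch restores the standard character case of a word
  and splits it at morpheme boundaries. The function names are
  hypothetical, and the sketch assumes that every symbol other than =
  corresponds to exactly one character of the word, as in the examples
  above.

```python
def apply_structure_case(word, structure):
    """Return word with the character case described by structure."""
    result = []
    letters = iter(word)
    for symbol in structure:
        if symbol == "=":
            continue  # morpheme boundary, consumes no character
        ch = next(letters, None)
        if ch is None:
            break  # structure is longer than the word
        if symbol in "pq":
            result.append(ch.lower())
        elif symbol in "ij":
            result.append(ch.upper())
        else:
            result.append(ch)  # hyphen or other character, kept as is
    result.extend(letters)  # keep any characters not covered by structure
    return "".join(result)


def split_morphemes(word, structure):
    """Split word into morphemes at the = symbols of structure."""
    parts = []
    letters = iter(word)
    for symbol in structure:
        if symbol == "=":
            parts.append("")
        else:
            if not parts:
                parts.append("")
            parts[-1] += next(letters, "")
    return parts


print(apply_structure_case("dna-näyte", "=jjj-=ppppp"))
print(split_morphemes("autokauppa", "=pppp=pppppp"))
```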

The following attributes are not currently used within libvoikko but
they may be useful for external applications:

 FSTOUTPUT (fullMorphology only)
 ===============================
  Analyzers that are implemented using finite state transducers can provide
  the raw transducer output using this attribute.

  Examples:
   Word: kissalla -> FSTOUTPUT: [Ln][Xp]kissa[X][Xs]505527[X]kissa[Sade][Ny]lla
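
  The raw output can be tokenized without interpreting it. The Python
  sketch below assumes, based only on the example above, that tags are
  written in square brackets and everything else is plain text. The
  function name is hypothetical and the meaning of individual tags is
  not interpreted.

```python
import re

# A tag in square brackets, or a run of text up to the next tag.
TOKEN = re.compile(r"\[([^\]]*)\]|([^\[]+)")


def tokenize_fst_output(value):
    """Split an FSTOUTPUT value into ("tag", x) and ("text", x) pairs."""
    tokens = []
    for match in TOKEN.finditer(value):
        if match.group(1) is not None:
            tokens.append(("tag", match.group(1)))
        else:
            tokens.append(("text", match.group(2)))
    return tokens


for token in tokenize_fst_output("[Ln][Xp]kissa[X][Xs]505527[X]kissa[Sade][Ny]lla"):
    print(token)
```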

 BASEFORM (fullMorphology only)
 ==============================
  Base form of the given word.

  Examples:
   Word: kissalla -> BASEFORM: kissa

 NUMBER
 ======
  Grammatical number of the word. Suggested values for this attribute
  are "singular", "dual", "trial" and "plural".

  Examples:
   Word: kissa -> NUMBER: singular
   Word: kissat -> NUMBER: plural

 PERSON
 ======
  For verbs in active voice this attribute represents the person
  (first, second or third). The passive voice can be considered
  the fourth person if appropriate for the language.
  Suggested values for this attribute are "1", "2", "3" and "4".

  Examples:
   Word: juoksen -> PERSON: 1
   Word: juokset -> PERSON: 2

 MOOD
 ====
  Mood of a verb. Suggested values for this attribute are
  "indicative", "conditional", "imperative" and "potential".

  Examples:
   Word: juoksen -> MOOD: indicative
   Word: juoksisin -> MOOD: conditional

 TENSE
 =====
  Tense and aspect of a verb. Suggested values for this attribute are
  "past_imperfective", "present_simple", (add more as needed).
  
  Examples:
   Word: juoksen -> TENSE: present_simple
   Word: juoksin -> TENSE: past_imperfective

 NEGATIVE
 ========
  For all verbs this attribute indicates whether the verb is in
  a connegative form. Suggested values for this attribute are
  "false", "true" and "both".
  
  Examples:
   Word: sallitaan -> NEGATIVE: false
   Word: sallita (as in "ei sallita") -> NEGATIVE: true
   Word: maalaa (also "ei maalaa") -> NEGATIVE: both
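
  A client could use this attribute to check whether a verb form is
  acceptable after a negation word. A minimal Python sketch, assuming an
  analysis is available as a dictionary of attribute names to values
  (the function name is hypothetical):

```python
def can_follow_negation(analysis):
    """True if the verb form may be used as a connegative ("ei ...")."""
    return analysis.get("NEGATIVE") in ("true", "both")


print(can_follow_negation({"NEGATIVE": "both"}))   # maalaa
print(can_follow_negation({"NEGATIVE": "false"}))  # sallitaan
```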
  
 PARTICIPLE
 ==========
  Word is a participle of some sort. Suggested values for this attribute
  are "present_active", "present_passive", "past_active", "past_passive",
  "agent" and "negation" (add more as needed).
  
  Examples:
   Word: juokseva    -> PARTICIPLE: present_active
   Word: juostava    -> PARTICIPLE: present_passive
   Word: juossut     -> PARTICIPLE: past_active
   Word: juostu      -> PARTICIPLE: past_passive
   Word: juoksema    -> PARTICIPLE: agent
   Word: juoksematon -> PARTICIPLE: negation

 POSSESSIVE
 ==========
  The word contains information about the possessor. For now this is
  used to indicate the use of a possessive suffix in Finnish nouns.

  Examples:
   Word: kissani  -> POSSESSIVE: 1s
   Word: kissasi  -> POSSESSIVE: 2s
   Word: kissamme -> POSSESSIVE: 1p
   Word: kissanne -> POSSESSIVE: 2p
   Word: kissansa -> POSSESSIVE: 3
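
  The values in the examples can be read as a person digit followed by
  an optional number marker. The Python sketch below decodes them under
  that reading. Interpreting "s" as singular and "p" as plural is
  inferred from the examples, and the function name is hypothetical.

```python
def decode_possessive(value):
    """Decode a POSSESSIVE value into (person, number).

    number is None when the value carries no number marker, as in "3".
    """
    person = int(value[0])
    number = {"s": "singular", "p": "plural"}.get(value[1:])
    return person, number


print(decode_possessive("1s"))
print(decode_possessive("3"))
```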

 COMPARISON
 ==========
  Word is comparable (adjective). Suggested values for this attribute are
  "positive", "comparative" and "superlative".
  
  Examples:
   Word: sininen   -> COMPARISON: positive
   Word: sinisempi -> COMPARISON: comparative
   Word: sinisin   -> COMPARISON: superlative

 Attributes for the Finnish language
 ===================================

The following attributes are specific to the Finnish language. Some are
currently used within libvoikko for improved language support, others
are provided only as information for external applications.

 Extensions to MOOD
 ==================
  Mainly due to the structure of Suomi-malaga, MOOD is also used to
  describe some non-finite verb forms. For that purpose the following
  additional attribute values are used:
  
   - A-infinitive (as in "juosta")
   - E-infinitive (as in "juostessa")
   - MA-infinitive (as in "juoksemassa", "juoksemasta", "juoksemaan" etc.)
   - MINEN-infinitive (as in "juokseminen")
   - MAINEN-infinitive (as in "juoksemaisillaan")

 CLASS
 =====
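  Word class (part of speech) of the word. This attribute is used within
  libvoikko. The possible values of the attribute are as follows:
   - nimisana (common noun)
   - laatusana (adjective)
   - nimisana_laatusana (same as separate analyses as nimisana and
     laatusana)
   - teonsana (verb)
   - seikkasana (adverb)
   - asemosana (pronoun)
   - suhdesana (preposition or postposition)
   - huudahdussana (interjection)
   - sidesana (conjunction)
   - etunimi (first name)
   - sukunimi (surname)
   - paikannimi (place name)
   - nimi (proper noun other than a first name, surname or place name)
   - kieltosana (negation word)
   - lyhenne (abbreviation)
   - lukusana (numeral)
   - etuliite (prefix)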

 SIJAMUOTO
 =========
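  Grammatical case of a nominal. This attribute is used within
  libvoikko. The possible values of the attribute are as follows:
   - nimento (nominative)
   - omanto (genitive)
   - osanto (partitive)
   - olento (essive)
   - tulento (translative)
   - kohdanto (accusative)
   - sisaolento (inessive)
   - sisaeronto (elative)
   - sisatulento (illative)
   - ulkoolento (adessive)
   - ulkoeronto (ablative)
   - ulkotulento (allative)
   - vajanto (abessive)
   - seuranto (comitative)
   - keinonto (instructive)
   - kerrontosti (e.g. "nopeasti")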
  
 KYSYMYSLIITE
 ============
  The word has the interrogative suffix -ko or -kö attached. The only
  allowed value of this attribute is "true". If the word has no
  interrogative suffix, the attribute is absent.

 FOCUS
 =====
  The word has the focus particle -kin or -kAAn attached.

  Examples:
   Word: kissakin  -> FOCUS: kin
   Word: kissakaan -> FOCUS: kaan

 WORDBASES (fullMorphology only)
 ===============================
  Base forms of the parts of the word. This attribute is not used within
  libvoikko. The value of the attribute is the base form of the word, in
  which the parts of a compound word and the suffixes are separated from
  each other with the + character. In addition, the base form of each
  compound part follows the part in parentheses. If the compound parts
  can themselves be divided into parts, those parts may be separated
  with the characters = or | in the base form inside the parentheses.

  Examples:
   Word: köydenvetoa ->     WORDBASES: +köyde(köysi)+n+veto(veto)
   Word: Alkio-opistossa -> WORDBASES: +alkio(Alkio)+-+opisto(opisto)
                                       +alkio(alkio)+-+opisto(opisto)

  Base forms of derivational suffixes are given in parentheses with
  a + character in front of the suffix:
   Word: kansalliseepos ->  WORDBASES: +kansa(kansa)+llis(+llinen)+eepos(eepos)
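
  A WORDBASES value can be split into its parts with a short regular
  expression. The Python sketch below is inferred from the examples
  above only: it assumes that a part never contains + or ( outside the
  parentheses, and it leaves any = or | separators inside the
  parenthesized base form untouched. The function name is hypothetical.

```python
import re

# "+", then the part as written, then an optional "(base form)".
PART = re.compile(r"\+([^+(]+)(?:\(([^)]*)\))?")


def parse_wordbases(value):
    """Return a list of (part, base_form) pairs; base_form may be None."""
    return [(m.group(1), m.group(2)) for m in PART.finditer(value)]


for part in parse_wordbases("+kansa(kansa)+llis(+llinen)+eepos(eepos)"):
    print(part)
```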

 WORDIDS (fullMorphology only)
 =============================
  References to the parts of the word in Joukahainen. This attribute is
  not used within libvoikko. The value of the attribute is the base form
  of the word, in which the parts of a compound word and the suffixes
  are separated from each other with the + character. In addition, the
  record id of each compound part follows the part in parentheses.

  Examples:
   Word: köydenvetoa ->     WORDIDS: +köyde(w506953)+n+veto(w517284)
                                     +köyde(w506953)+n+veto(w523540)
                                     +köyde(w506953)+n+veto(w525160)
   Word: Alkio-opistossa -> WORDIDS: +alkio(w518215)+-+opisto(w510148)
                                     +alkio(w500068)+-+opisto(w510148)