2009 - 2015 Harri Pitkänen
GPLv2+

Information provided from attribute maps of Analysis objects
============================================================

This file contains documentation for implementors of morphological analyzers
(Analyzer) in libvoikko. Analyzer is an object that returns a set of Analysis
objects given an input string. Analysis objects consist of key-value pairs
where keys are ordinary C strings and values are wchar_t strings. The set of
required and optional strings in the analysis results may be different for
different languages or different implementations of the same language.

Analyzers are required to be case insensitive. Information about requirements
for character case should be provided within the analysis results, not by
excluding results depending on the character case of the input string. This
design makes it easier to implement efficient analysis operations for
applications where character case does not matter or matters only in some
specific situations.

In order to improve reusability of code and to allow external applications to
take advantage of this feature it is best to use the same set of keys and same
structure for values where possible. This document describes the guidelines
that implementors of Analyzer components are recommended (but not required) to
follow when they implement this feature.

Some analyzers may produce attributes that are expensive to create and are not
necessary for linguistic operations performed within libvoikko. The analyzers
are advised to produce these attributes only when parameter "fullMorphology"
is set to "true".


Attributes common to all languages
==================================

The following attributes should be provided by all implementations of
Analyzer for all languages:

STRUCTURE
=========

This attribute describes morpheme boundaries, character case and hyphenation
restrictions for the word. The following characters are used in the values of
this attribute:

=   Start of a new morpheme.
    This must also be present at the start of a word.
-   Hyphen. Word can be split in text processors after this character
    without inserting an extra hyphen. If the hyphen is at a morpheme
    boundary, the boundary symbol = must be placed after the hyphen.
p   Letter that is written in lower case in the standard form.
q   Letter that is written in lower case in the standard form. Hyphenation
    is forbidden before this letter.
i   Letter that is written in upper case in the standard form.
j   Letter that is written in upper case in the standard form. Hyphenation
    is forbidden before this letter.

Examples:
  Word: Matti-niminen -> STRUCTURE: =ipppp-=ppppppp
  Word: DNA-näyte -> STRUCTURE: =jjj-=ppppp
  Word: autokauppa -> STRUCTURE: =pppp=pppppp

The following attributes are not currently used within libvoikko but they may
be useful for external applications:

FSTOUTPUT (fullMorphology only)
===============================

Analyzers that are implemented using finite state transducers can provide the
raw transducer output using this attribute.

Examples:
  Word: kissalla -> FSTOUTPUT: [Ln][Xp]kissa[X][Xs]505527[X]kissa[Sade][Ny]lla

BASEFORM (fullMorphology only)
==============================

Base form of the given word.

Examples:
  Word: kissalla -> BASEFORM: kissa

NUMBER
======

Grammatical number of the word. Suggested values for this attribute are
"singular", "dual", "trial" and "plural".

Examples:
  Word: kissa -> NUMBER: singular
  Word: kissat -> NUMBER: plural

PERSON
======

For verbs in active voice this attribute represents the person (first, second
or third). The passive voice can be treated as the fourth person if that is
appropriate for the language. Suggested values for this attribute are "1",
"2", "3" and "4".

Examples:
  Word: juoksen -> PERSON: 1
  Word: juokset -> PERSON: 2

MOOD
====

Mood of a verb. Suggested values for this attribute are "indicative",
"conditional", "imperative" and "potential".
Examples:
  Word: juoksen -> MOOD: indicative
  Word: juoksisin -> MOOD: conditional

TENSE
=====

Tense and aspect of a verb. Suggested values for this attribute are
"past_imperfective" and "present_simple" (add more as needed).

Examples:
  Word: juoksen -> TENSE: present_simple
  Word: juoksin -> TENSE: past_imperfective

NEGATIVE
========

For all verbs this attribute indicates whether the verb is in a connegative
form. Suggested values for this attribute are "false", "true" and "both".

Examples:
  Word: sallitaan -> NEGATIVE: false
  Word: sallita (as in "ei sallita") -> NEGATIVE: true
  Word: maalaa (also "ei maalaa") -> NEGATIVE: both

PARTICIPLE
==========

Word is a participle of some sort. Suggested values for this attribute are
"present_active", "present_passive", "past_active", "past_passive", "agent"
and "negation" (add more as needed).

Examples:
  Word: juokseva -> PARTICIPLE: present_active
  Word: juostava -> PARTICIPLE: present_passive
  Word: juossut -> PARTICIPLE: past_active
  Word: juostu -> PARTICIPLE: past_passive
  Word: juoksema -> PARTICIPLE: agent
  Word: juoksematon -> PARTICIPLE: negation

POSSESSIVE
==========

Word contains information about the possessor. For now this is used to
indicate the use of a possessive suffix in Finnish nouns.

Examples:
  Word: kissani -> POSSESSIVE: 1s
  Word: kissasi -> POSSESSIVE: 2s
  Word: kissamme -> POSSESSIVE: 1p
  Word: kissanne -> POSSESSIVE: 2p
  Word: kissansa -> POSSESSIVE: 3

COMPARISON
==========

Word is comparable (adjective). Suggested values for this attribute are
"positive", "comparative" and "superlative".

Examples:
  Word: sininen -> COMPARISON: positive
  Word: sinisempi -> COMPARISON: comparative
  Word: sinisin -> COMPARISON: superlative


Attributes for Finnish language
===============================

The following attributes are specific to Finnish language. Some are currently
used within libvoikko for improved language support, others are provided only
as information for external applications.
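
Before the individual Finnish attributes, the following sketch illustrates how
an external application might interpret the common STRUCTURE attribute
described earlier: it restores the standard character case of a word and
splits it into morphemes. The function name apply_structure is hypothetical
and is not part of libvoikko. The sketch assumes that every STRUCTURE symbol
other than = corresponds to exactly one character of the word, which holds for
the examples in this document but is not guaranteed here for every analyzer.

```python
def apply_structure(word, structure):
    """Return (standard_form, morphemes) for word, given its STRUCTURE value.

    Assumption: each symbol other than "=" maps to exactly one character
    of the input word, as in the examples of this document.
    """
    if len(structure.replace("=", "")) != len(word):
        raise ValueError("STRUCTURE does not match the length of the word")

    chars = iter(word)
    standard = []    # characters of the standard (correctly cased) form
    morphemes = []   # completed morphemes, hyphens excluded
    current = []     # characters of the morpheme being collected

    for symbol in structure:
        if symbol == "=":
            # Morpheme boundary: close the previous morpheme, if any.
            if current:
                morphemes.append("".join(current))
                current = []
            continue
        ch = next(chars)
        if symbol == "-":
            # The hyphen is kept in the word but belongs to no morpheme.
            standard.append(ch)
            continue
        if symbol in "ij":
            ch = ch.upper()      # upper case in the standard form
        elif symbol in "pq":
            ch = ch.lower()      # lower case in the standard form
        standard.append(ch)
        current.append(ch)

    if current:
        morphemes.append("".join(current))
    return "".join(standard), morphemes


print(apply_structure("dna-näyte", "=jjj-=ppppp"))
# ('DNA-näyte', ['DNA', 'näyte'])
```

Because analyzers are case insensitive, the input word can be given in any
case; the standard form is derived from STRUCTURE alone. The hyphenation
restrictions carried by q and j are ignored in this sketch.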

Extensions to MOOD
==================

Mainly due to the structure of Suomi-malaga, MOOD is also used to describe
some non-finite verb forms. For that purpose the following additional
attribute values are used:
- A-infinitive (as in "juosta")
- E-infinitive (as in "juostessa")
- MA-infinitive (as in "juoksemassa", "juoksemasta", "juoksemaan" etc.)
- MINEN-infinitive (as in "juokseminen")
- MAINEN-infinitive (as in "juoksemaisillaan")

CLASS
=====

Word class (part of speech) of the word. This attribute is used within
libvoikko. The possible values of the attribute are the following:
- nimisana (common noun)
- laatusana (adjective)
- nimisana_laatusana (same as separate analyses as nimisana and laatusana)
- teonsana (verb)
- seikkasana (adverb)
- asemosana (pronoun)
- suhdesana (adposition)
- huudahdussana (interjection)
- sidesana (conjunction)
- etunimi (first name)
- sukunimi (surname)
- paikannimi (place name)
- nimi (proper noun other than a first name, surname or place name)
- kieltosana (negation word)
- lyhenne (abbreviation)
- lukusana (numeral)
- etuliite (prefix)

SIJAMUOTO
=========

Grammatical case of a nominal. This attribute is used within libvoikko. The
possible values of the attribute are the following:
- nimento (nominative)
- omanto (genitive)
- osanto (partitive)
- olento (essive)
- tulento (translative)
- kohdanto (accusative)
- sisaolento (inessive)
- sisaeronto (elative)
- sisatulento (illative)
- ulkoolento (adessive)
- ulkoeronto (ablative)
- ulkotulento (allative)
- vajanto (abessive)
- seuranto (comitative)
- keinonto (instructive)
- kerrontosti (adverbial form in -sti, e.g. "nopeasti")

KYSYMYSLIITE
============

The word has the interrogative clitic -ko or -kö attached. The only allowed
value of the attribute is "true". If the word has no interrogative clitic,
the attribute is absent.

FOCUS
=====

The word has the focus particle -kin or -kAAn attached.

Examples:
  Word: kissakin -> FOCUS: kin
  Word: kissakaan -> FOCUS: kaan

WORDBASES (fullMorphology only)
===============================

Base forms of the parts of the word. This attribute is not used within
libvoikko. The value of the attribute is the base form of the word, in which
the parts of a compound and the suffixes are separated from each other by the
+ character. In addition, the base form of each compound part is given in
parentheses after the part. If the parts of a compound can themselves be
divided into parts, those parts may be separated by the characters = or | in
the base form inside the parentheses.

Examples:
  Word: köydenvetoa -> WORDBASES: +köyde(köysi)+n+veto(veto)
  Word: Alkio-opistossa -> WORDBASES: +alkio(Alkio)+-+opisto(opisto)
                                      +alkio(alkio)+-+opisto(opisto)

Base forms of derivational suffixes are given in parentheses with a +
character in front of the suffix:
  Word: kansalliseepos -> WORDBASES: +kansa(kansa)+llis(+llinen)+eepos(eepos)

WORDIDS (fullMorphology only)
=============================

References to the parts of the word in Joukahainen. This attribute is not
used within libvoikko. The value of the attribute is the base form of the
word, in which the parts of a compound and the suffixes are separated from
each other by the + character. In addition, the record id of each compound
part is given in parentheses after the part.

Examples:
  Word: köydenvetoa -> WORDIDS: +köyde(w506953)+n+veto(w517284)
                                +köyde(w506953)+n+veto(w523540)
                                +köyde(w506953)+n+veto(w525160)
  Word: Alkio-opistossa -> WORDIDS: +alkio(w518215)+-+opisto(w510148)
                                    +alkio(w500068)+-+opisto(w510148)
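
The WORDBASES and WORDIDS values shown above share the same shape, so an
external application could split them with one small routine. The following
is a minimal sketch only: the function name parse_parts is hypothetical and is
not part of libvoikko, and the grammar it accepts (a + followed by a part,
optionally followed by a parenthesised base form or record id) is inferred
from the examples in this document rather than taken from a formal
specification. Each alternative analysis is assumed to be passed as a separate
string.

```python
import re

# One part: "+", the surface part, and an optional "(...)" annotation.
# The annotation may itself start with "+", as in "+llis(+llinen)".
_PART = re.compile(r"\+([^+(]+)(?:\(([^)]*)\))?")


def parse_parts(value):
    """Split a WORDBASES or WORDIDS value into (part, annotation) pairs.

    The annotation is the text inside the parentheses (a base form or a
    record id), or None when the part has no parentheses.
    """
    parts = []
    pos = 0
    for match in _PART.finditer(value):
        if match.start() != pos:
            # Something between two parts did not fit the assumed grammar.
            raise ValueError("unexpected text at offset %d" % pos)
        parts.append((match.group(1), match.group(2)))
        pos = match.end()
    if pos != len(value):
        raise ValueError("unexpected text at offset %d" % pos)
    return parts


print(parse_parts("+kansa(kansa)+llis(+llinen)+eepos(eepos)"))
# [('kansa', 'kansa'), ('llis', '+llinen'), ('eepos', 'eepos')]
```

Parts without parentheses, such as the linking n in köydenvetoa or the hyphen
in Alkio-opistossa, are returned with None as their annotation, so a caller
can tell compound parts apart from suffixes and separators.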