25.08.2013 Views

PDF (Online Text) - EURAC

PDF (Online Text) - EURAC

PDF (Online Text) - EURAC

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

LULCL 2005<br />

Proceedings of the Lesser Used Languages and<br />

Computer Linguistics Conference<br />

Bolzano, 27 th -28 th October 2005<br />

Isabella Ties (Ed.)


LULCL 2005<br />

Proceedings of the Lesser Used Languages and Computer Linguistics Conference<br />

Isabella Ties (Ed.)<br />

2006<br />

The proceedings are co-financed by the European Union through the Interreg IIIA Italy-<br />

Switzerland Programme


Bestellungen bei:<br />

Europäische Akademie Bozen<br />

Viale Druso, 1<br />

39100 Bozen - Italien<br />

Tel. +39 0471 055055<br />

Fax +39 0471 055099<br />

E-mail: press@eurac.edu<br />

Nachdruck und fotomechanische<br />

Wiedergabe – auch auszugsweise – nur<br />

unter Angabe der Quelle<br />

(Herausgeber und Titel) gestattet.<br />

Verantwortlicher Direktor: Stephan Ortner<br />

Redaktion: : Isabella Ties<br />

Koordination: : Isabella Ties<br />

Graphik und Umschlag: Marco Polenta<br />

Druck: Fotolito Longo<br />

Per ordinazioni:<br />

Accademia Europea Bolzano<br />

Drususallee, 1<br />

39100 Bolzano - Italia<br />

Tel. +39 0471 055055<br />

Fax +39 0471 055099<br />

E-mail: press@eurac.edu<br />

Riproduzione parziale o totale del<br />

contenuto autorizzata soltanto con la<br />

citazione della fonte<br />

(titolo ed edizione).<br />

ISBN 88-88906-24-X<br />

Direttore responsabile: Stephan Ortner<br />

Redazione: Isabella Ties<br />

Coordinazione: Isabella Ties<br />

Grafica e copertina: Marco Polenta<br />

Stampa: Fotolito Longo


5<br />

Index<br />

Preface ............................................................................................ 7<br />

Spracherneuerung im Rätoromanischen: Linguistische, soziale und<br />

politische Aspekte ..............................................................................11<br />

Clau Solèr<br />

Implementing NLP-Projects for Small Languages:<br />

Instructions for Funding Bodies, Strategies for Developers ..............................29<br />

Oliver Streiter<br />

Un corpus per il sardo: problemi e perspettive ............................................45<br />

Nicoletta Puddu<br />

The Relevance of Lesser-Used Languages for Theoretical Linguitics:<br />

The Case of Cimbrian and the Suport of the TITUS Corpus ...............................77<br />

Ermenegildo Bidese, Cecilia Poletto and Alessandra Tomaselli<br />

Creating Word Class Tagged Corpora for Northern Sotho<br />

by Linguistically Informed Bootstrapping ...................................................97<br />

Danie J. Prinsloo and Ulrich Heid<br />

A Comparison of Approaches to Word Class Tagging:<br />

Distinctively Versus Conjunctively Written Bantu Languages .......................... 117<br />

Elsabé Taljard and Sonja E. Bosch<br />

Grammar-based Language Technology for the Sámi Languages ....................... 133<br />

Trond Trosterud<br />

The Welsh National <strong>Online</strong> Terminology Database ....................................... 149<br />

Dewi Bryn Jones and Delyth Prys<br />

SpeechCluster: A Speech Data Multitool................................................... 171<br />

Ivan A. Uemlianin<br />

XNLRDF: The Open Source Framework for Multilingual Computing ................... 189<br />

Oliver Streiter and Mathias Stuflesser<br />

Speech-to-Speech Translation for Catalan ................................................ 209<br />

Victoria Arranz, Elisabet Comelles and David Farwell


Computing Non-Concatenative Morphology: The Case of Georgian ................... 225<br />

Olga Gurevich<br />

The Igbo Language and Computer Linguistics: Problems and Prospects .............. 247<br />

Chinedu Uchechukwu<br />

Annotation of Documents for Electronic Editing of Judeo-Spanish <strong>Text</strong>s:<br />

Problems and Solutions ...................................................................... 265<br />

Soufiane Roussi and Ana Stulic<br />

Il ladino fra polinomia e standardizzazione:<br />

l’apporto della linguistica computazionale ............................................... 281<br />

Evelyn Bortolotti, Sabrina Rasom<br />

Il progetto “Zimbarbort” per il recupero del patrimonio linguistico cimbro ........ 297<br />

Luca Panieri<br />

Stealth Learning with an <strong>Online</strong> Dog (Web-based Word Games for Welsh) .......... 307<br />

Gruffudd Prys and Ambrose Choy<br />

Alphabetical list of authors & titles with keywords ..................................... 329<br />

Alphabetical list of contributors & contact addresses .................................. 335<br />

6


7<br />

Preface<br />

On behalf of the programme committee for the ‘Lesser Used Languages and<br />

Computer Linguistics’ conference (LULCL 2005), we are pleased to present the<br />

proceedings, which contain the papers accepted for presentation at the Bolzano<br />

meeting on 27th-28th October 2005. The contributions published in this volume deal<br />

with the main aspects of lesser used languages and their support through computer<br />

linguistics, ranging from lexicography to terminology for lesser used languages, and<br />

from computational linguistic applications in general to more specific resources such<br />

as corpora. Some papers deal specifically with Translation Memory Systems, online<br />

dictionaries, Internet Computer Assisted Language Learning (CALL) or Language for<br />

Specific Purposes (LSP).<br />

The choice of the conference theme was strongly influenced by the ambition<br />

to give lesser used languages an opportunity for visibility without taking into<br />

consideration the official number of speakers, but rather the range of technological<br />

resources available for each language. Even though some languages do indeed count a<br />

considerable number of speakers, technology support may be almost nonexistent. It is<br />

therefore remarkable how much has been done in the last decades for languages with<br />

few speakers. The Zimbar speakers are the smallest community represented at the<br />

conference, which counts about 2230 speakers living in Luserna, Roana, Mezzaselva<br />

and Giazza. Despite the small number of native speakers, there are major projects<br />

running on this Germanic language. The first project described here (cf. Bidese et al.)<br />

foresees the storage of Cimbrian textual material in the TITUS Corpus (‘Thesaurus<br />

of Indo-European <strong>Text</strong>ual and Linguistic Materials’), while the second one provides<br />

the guidelines for the Zimbarbort. The latter is a new project on the preservation of<br />

the Zimbar language, during which a database of lexical entries will be created (cf.<br />

Panieri). Both projects represent a substantial contribution to the preservation of<br />

language: through the recovery and storage of textual data they enable researchers<br />

to carry out linguistic analyses from several points of view.<br />

Sparseness of data is one of the main characteristics that many lesser used languages<br />

share with Zimbar. This influences both the choice of methodology and, of course, the<br />

results. Clau Solèrs keynote contribution reflects very well what happens when, for<br />

example, specialised terminology has to be elaborated within a small language. The<br />

lack of native terminology for many LSPs and the influence of bigger official languages,


such as German in this special case, are just some of the problems Rumantsch and<br />

other small languages have to face in order to propose acceptable terminology and<br />

preserve language at the same time. The project on the ‘Welsh National Terminology<br />

Database’ reflects the need to find a means between accepted terminology standards<br />

used for bigger languages (ISO 704 and ISO 860 norms) and language preservation. This<br />

project takes advantage of the similarities between terminology and lexicography, as<br />

existing lexicographical resources and applications are used to enrich the terminology<br />

database.<br />

Another central topic that lesser used languages have in common is the usability<br />

of available data. On the one hand we find the contribution on Judeo-Spanish, where<br />

Roussi & Stulic describe how to transliterate and annotate texts written in Hebrew<br />

characters and, at the same time, allow users to add their own interpretation and<br />

comments. On the other hand, Uchechukwu explains in his contribution the problems<br />

related to appropriate font programmes and software compatibility. On the basis of<br />

the Igbo language he describes what happens when the amount of data is considerable<br />

but not usable (due to the obstacle of accepted format).<br />

Issues of data sparseness and usability determine linguistic research, especially<br />

during the phases of data pre-processing, and the amount of time linguists must<br />

invest in dealing with linguistic research questions. Uemlianin proposes to use<br />

SpeechCluster in order to ensure that linguists can concentrate on linguistic analyses<br />

rather than disperse their efforts with formatting or any other time-consuming manual<br />

processing.<br />

Trosterud emphasises on the importance of open-source technology for projects<br />

on lesser used languages, so as to avoid waste in terms of time and technology, which<br />

must be reinvented every single time for every small language. The same point of<br />

view is stated by Stuflesser and Streiter as they present their intention to use XNLRDF,<br />

a free software package for NLP. Their contribution introduces the existing prototype<br />

and outlines future strategies.<br />

A similar aim is pursued by the invited key-note speaker Oliver Streiter, who focuses<br />

on this topic, providing a detailed overview on available resources and underlining the<br />

importance of mutual support within the research community through data sharing<br />

in standard formats, so as to make it usable and accessible to everybody. One of the<br />

instruments cited and used most often for data sharing is the Internet, as it allows<br />

online storage of data such as dictionaries, language games or terminology data bases<br />

(Jones & Prys). This medium is used by Canolfan Bedwyr to publish the web-based<br />

word games for Welsh, as well as by the Ladin institutions to disseminate their online<br />

8


dictionaries (cfr. Bortolotti & Rasom) meant to improve the language skills of native<br />

speakers.<br />

Several authors file contributions on tools for the elaboration and storage of language<br />

for text analysis and processing of text material with a view to the development of<br />

corpora. Puddu points out the importance of corpora for supporting the development<br />

of lesser used languages and the main problems connected to corpus design, text<br />

collection, storage and annotation for a lesser used language like Sardo (cf. Puddu).<br />

Sardinian, like any other lesser used language, has to cope with problems related to<br />

retrieval of written text, and in this specific case, also with a second problem: the<br />

absence of a standard orthography. The application of a homogeneous tag system, as<br />

well as the use of standards on storage, such as the rules elaborated by the EAGLES<br />

group (XCES), is suggested.<br />

Prinsloo and Heid describe methodology such as the bootstrapping of resources<br />

in order to elaborate language documentation and annotation. They describe the<br />

development of different tools to bootstrap tagging resources for Northern Sotho, and<br />

resources used to identify verbs and nouns for the disambiguation of closed class items.<br />

The Bantu languages and their characteristics are also discussed in the contribution by<br />

Taljard and Bosh, who present the problems encountered when dealing with languages<br />

with different writing systems — in this special case, Northern Sotho and Zulu. The<br />

authors describe the distinct approaches for class tagging according to the different<br />

writing systems.<br />

Examples of knowledge extraction and knowledge engineering are discussed in<br />

the paper on the FAME project, an Interlingual Speech-to-Speech Machine Translation<br />

System for Catalan, English and Spanish developed to assist users in making hotel<br />

reservations. The project includes tools for the documentation of data and elaboration<br />

of the standard Interchange Format (IF).<br />

It is clear from these contributions that nowadays, a variety of approaches and<br />

scientific methodologies are adopted in research on lesser used languages, showing<br />

the vitality of research in this specific area.<br />

Thanks to authors who cover a large variety of projects and technologies, an<br />

overview of the state of the art in research on lesser used languages can be provided,<br />

especially as regards projects on lesser used languages involving computational<br />

linguistics in Europe and the world. Central to the conference are both methodological<br />

issues, prompted by the described strategies for an efficient support of lesser used<br />

languages, and the problems encountered with theoretical approaches developed for<br />

major languages but applied to lesser used languages.<br />

9


The contributions underline the significance of computational linguistics, the<br />

methodologies and strategies followed, and their application on lesser used languages.<br />

It becomes evident how important decisions on international standards are and which<br />

consequences they imply for the standardisation of tools.<br />

This conference would not have been possible without the energy of many people.<br />

First of all we would like to thank the authors, who have provided superb contributes.<br />

Our gratitude goes also to the reviewers and to the scientific committee for their<br />

detailed and inspiring reviews.<br />

10<br />

Isabella Ties


Spracherneuerung im Rätoromanischen:<br />

Linguistische, soziale und politische Aspekte<br />

11<br />

Clau Solèr<br />

In Graubünden, the minority language Romansch has to assert itself in an environment<br />

of bilingualism with German on the one hand, while constantly keeping pace with<br />

the changing needs of its speakers on the other. To fulfill this task, terminological<br />

precision must continually come to terms with both the spoken language and the<br />

existing syntax. Romansch must be able to express a frame of mind that is influenced<br />

by the Germanic element, and neologisms must also adapt to the regional varieties<br />

for the speakers to be able to identify with them.<br />

Due to the limited political and economic importance of the language, as well as<br />

instruction that partly takes place in German only, Romansch is currently lacking the<br />

necessary channels for an efficient diffusion of neologisms.<br />

1. Einleitung<br />

Jede Sprache dient im Alltag als Werkzeug und passt sich ihrer Sprachgemeinschaft<br />

an; dies im Unterschied zu nur historischen oder kultischen Sprachen. Dabei darf<br />

sie sich aber nicht veräußern, nur um modern oder aktuell zu sein. Neben einer<br />

spontanen, gelegentlich unerwünschten Erneuerung – die übliche sich langfristig<br />

ablaufende Sprachentwicklung steht hier nicht zur Sprache – unterliegt die Sprache<br />

Eingriffen aus unterschiedlichsten Richtungen und Kräften und aus verschiedenen<br />

Gründen. Wie geschieht das, was entsteht daraus, wem nützt das und wird sie besser<br />

oder schlechter? Diese Fragen möchte ich besonders aus der praktischen Erfahrung<br />

zu beantworten versuchen und einige Überlegungen dazu anstellen. Vorerst muss<br />

ich das Rätoromanische in Graubünden und dessen Stellung im Hinblick auf die<br />

Sprachanpassung kurz umreißen. Ich wähle bewusst den Ausdruck Anpassung, um<br />

keine Wertung wie Erneuerung, Modernisierung, Einschränkung und Uminterpretation<br />

vorweg-zunehmen.<br />

Das Bündnerromanische ist eine eigenständige, neolateinische Sprache auf<br />

vorrömischer Grundlage. Seit über 1000 Jahren ist es im vielfältigen Kontakt mit dem<br />

Deutschen und während mehreren Jahrhunderten auch mit dem Italienischen (im<br />

Engadin besonders wirtschaftlich und in den katholischen Gegenden religiös bedingt).<br />

Nach dem Anschluss an die schweizerische Eidgenossenschaft 1803 ist die gelebte und<br />

relativ ausgeglichene Dreisprachigkeit der drei Bünde durch das Deutsche als fast


unumschränkte Verkehrs- und Verwaltungssprache ersetzt worden. Die rätoromanischen<br />

Ortsdialekte, die in fünf regionalen Schriftidiomen in geografisch und konfessionell<br />

mehr oder weniger getrennten Gebieten mit einer ursprünglich traditionellen, heute<br />

mehrheitlich touristischen Wirtschaft verwendet werden, haben sich unterdessen<br />

zu einer primär gesprochenen Varietät gewandelt. 35.095 Personen nannten bei der<br />

Volkszählung 2000 (RÄTOROMANISCH 2004:24) das Romanische als ihre Hauptsprache<br />

und insgesamt 60.816 verwenden es im Alltag oder bei der Arbeit, wobei nur zwei Drittel<br />

davon in Graubünden leben und der Rest in der schweizerischen Diaspora. Gemäß<br />

EUROMOSAIC (1996:34) braucht eine Sprachgruppe mindestens 300.000 Mitglieder für<br />

ihre Selbständigkeit und so sind die Aussichten des Romanischen eher düster. In den<br />

Gemeinden mit mehr als 20% Romanischsprechern besuchen zwei Drittel der Schüler<br />

eine maximal vierjährige romanische Grundschule mit anschließender Einführung ins<br />

Deutsche, das in den drei letzten Jahren Unterrichtssprache wird, neben immerhin bis<br />

zu 6 Stunden Romanisch als Fach. Die Mittelschulen in Chur und Samedan ermöglichen<br />

einen zweisprachigen Maturitätsabschluss. In der Pädagogischen Fachhochschule, die<br />

im Unterschied zum bisherigen Lehrerseminar nicht mehr sprachbezogen ist, fehlt eine<br />

entsprechende Unterstützung, wie es in den nur deutschen beruflichen Fachschulen<br />

auch der Fall ist.<br />

Es fehlt noch der linguistische Zustandsbericht. Die traditionelle Einsprachigkeit<br />

mit wenigstens einer Fremdsprache gibt es nur noch bei wenigen, älteren Personen in<br />

entlegenen Ortschaften mit geringer Zuwanderung. Sonst leben die Bündnerromanen in<br />

einer funktionalen, domänenorientierten und personengesteuerten Mehrsprachigkeit<br />

mit jeweils unterschiedlichen Kodes: romanische Ortsmundart gesprochen,<br />

teilweise gelesen, aber selten geschrieben, Schweizerdeutsch gesprochen sowie<br />

Standarddeutsch als Schriftsprache und teilweise gehört. Man wählt die Sprache<br />

relativ wertfrei, und die Phase als Rätoromanisch stigmatisiert war und man daher am<br />

Minderwertigkeitskomplex litt, ist heute mehrheitlich überwunden, und zwar in erster<br />

Linie wegen der hohen Deutschkompetenz der Romanischsprecher, ihrer besseren<br />

Integration in der deutschsprachigen Gesellschaft und letztlich auch wegen der vielen<br />

Zuzügler mit noch selteneren Sprachen.<br />

2. Terminologische Anpassung<br />

Ich wähle bewusst den Begriff Terminologie, der die Neologie und die Uminterpretation<br />

vorhandener Begriffe einschließt. Dabei hat man sich weniger umfangreiche und weit<br />

abgestützte Prozesse vorzustellen, sondern eher zufällige und gelegentlich chaotische<br />

Vorschläge, die nach Möglichkeit gesammelt und verbreitet werden. Viele Einträge<br />

des Pledari Grond in Rumantsch grischun stammen aus allerlei Übersetzungen und<br />

12


Anfragen von Sprachverwendern, der Rest stammt aus den Regionalwörterbüchern<br />

und aus der systematischen Neologie.<br />

3. Das Bedürfnis nach terminologischer Anpassung<br />

Es ist zwar unbestritten, dass keine Sprache von sich aus eine Anpassung braucht,<br />

denn Sprachen handeln nun einmal nicht. Trotzdem hat man seit dem Ende des<br />

19. Jh. immer wieder das Romanische als klagende, leidende oder verschupfte<br />

Sprache anthropomorphologisiert, damit den Rätoromanen ins Gewissen geredet und<br />

zugegebenermaßen einiges erreicht. Vieles ist aber auch verdorben worden (Coray<br />

1993). Es ist die Sprachgemeinschaft mit ihren Anwendern, die eine Sprache den<br />

Bedürfnissen nach gesicherter und rascher Kommunikation anpasst. Als Sprachverwender<br />

gelten grundsätzlich die Sprechenden in ihrem sozialen, wirtschaftlichen und<br />

geistigen Umfeld, solange sie sich nicht ausschließlich als Parteivertreter der Sprache<br />

verhalten. Puristische oder spracherhaltende Gründe sind politisch und gesellschaftlich<br />

begründet und von den Sprachverwendern nur beschränkt getragen. Sie lehnen diese<br />

von der Sprachverwaltung vertretene künstliche Erhaltung ab, wie ihr Verhalten u. A.<br />

gegenüber dem Rumantsch grischun zeigt.<br />

Eine wirkliche Alternative sich sprachlich anzupassen besteht für Sprachverwender<br />

von Minderheitensprachen mit einem asymmetrischen Bilinguismus in einem<br />

Sprachwechsel, der meistens mehrstufig verläuft, auch wenn dieser Sprachwechsel<br />

gerne verschwiegen wird (Solèr 1986:299). Das Englische in bestimmten<br />

Wirtschaftsbereichen gilt heute als direkter Weg, wenn es nicht aus Ermangelung<br />

einer gemeinsamen Sprache gewählt wird.<br />

4. Methoden der terminologischen Anpassung<br />

In der Vergangenheit hat sich das Romanische den Bedürfnissen mehr schlecht<br />

als recht angepasst und ist auch deshalb minorisiert worden. Erst im Zuge der<br />

spätromantischen Nationalbewegung, also seit mehr als hundert Jahren, bemüht<br />

man sich bewusst und systematisch um eine lexikalische Erneuerung. Heute ist das<br />

Romanische terminologisch sehr stark ausgebaut, verglichen mit dem Zustand vor<br />

150 Jahren als „tausenderlei Gegenstände und Thätigkeiten der gebildeten Welt<br />

unbekannt oder doch fremd geblieben [waren] (CARISCH 1848:X). Auch die Syntax<br />

hat sich erneuert und ist eigenständig(er) geworden. In dieser Zeit veränderten sich<br />

die Gesellschaft und Wirtschaft grundlegend. Die obligatorische Volksschule erreichte<br />

erstmals eine ganze Bevölkerungsschicht und konnte die Sprache direkt beeinflussen,<br />

indem alte örtliche Formen verschwanden, wie Jaberg/Jud (1928) bedauerten.<br />

13


Nun gilt es zu erklären, wie die Sprache terminologisch angepasst wird. Neben der<br />

Verwendung einer phonetisch und morphologisch integrierten fremden Bezeichnung<br />

oder gleichzeitig dazu liefert die eigensprachliche Um- und Beschreibung (Periphrase)<br />

die wichtigste spontane terminologische Anpassung. Dieses Vorgehen passt auch<br />

stilistisch fremde oder unverständliche Terminologie an die Umgangssprache an und<br />

steht logischerweise im Widerspruch zur Systematisierung der Fachsprache. Weiterhin<br />

gilt die typisch analytische Parataxe einer Volkssprache, wie es das Romanische im<br />

Grunde genommen ist. Der Vorteil der hohen Verständlichkeit muss mit der Variabilität<br />

erkauft werden. Hierhin gehören auch die zufälligen, spielerischen Volksbildungen<br />

mit üblicherweise nur regionaler und kurzzeitiger Gültigkeit; erwähnt seien vallader:<br />

chasperets für ‘Scheibenwischer’, eigentlich ‘Kasperlefigur’ oder sursilvan: cutgna für<br />

‘Surfbrett, Snowboard’, eigentlich ‘Schwarte (vom Holz oder vom Speck)’.<br />

Systemkonform ist auch die professionelle Terminologie oft periphrastisch<br />

anstatt derivativ und daraus entstehen, je nach dem Definitionsgrad, linguistische<br />

Ungetüme wie ovs da giaglinas allevadas a terra für ganz gewöhnliche ‘Eier (von<br />

Hühnern) aus Bodenhaltung’ oder chapisch da la rullera d’alver da manvella für<br />

‘Kurbelwellenlagerdeckel’, das freilich auch nicht verständlicher ist und als einzelner<br />

Baustein noch kompliziertere Sätze bilden muss.<br />

Ein typischer und traditioneller Terminologieprozess ist die Analogie. Heute weicht<br />

diese endolinguale zugunsten der exolingualen, sich am Deutsch lehnende Bildung<br />

zurück wegen ihrer Nähe zur Denkstruktur der Romanischsprecher. Sie verspricht<br />

mehr Erfolg als eine Herleitung aus dem Französischen als kaum mehr unterrichtete<br />

Fremdsprache oder aus dem Italienischen, das zwar (noch) einen festen Platz in den<br />

Bündner Schulen hat, aber nur eine geringe Bedeutung im Alltag genießt.<br />

Analogien zu romanischen Sprachen liegen in den folgenden Beispielen vor 1 , (vgl. auch<br />

Decurtins 1993, 235-254 passim): Als Alltagsbegriff gilt schambun (oit, frz.) ‘Schinken’.<br />

Der Begriff vl: levatriza ‘Hebamme’ scheiterte als undurchsichtige Bezeichnung für<br />

eine einsetzende Professionalisierung und wurde deshalb periphrastisch zu vl: duonna<br />

da part ‘Geburtsfrau’ rg: spendrera eigentlich ‘Rettende’, dunna da part. Auch<br />

purtantina ‘Tragbahre’ ist kaum verständlich und konnte bara trotz der Homonymie zu<br />

‘Leiche’ nicht ersetzen. Die Ausdrücke guid ‘(Reise-)Führer’ und guidar ‘führen’ sind<br />

seltener als manader ‘Führer, Lenker’ und manar ‘lenken, leiten, führen’. Einsichtig<br />

sind giraplattas ‘Plattenspieler’ und modernisiert giradiscs ‘Diskettenlaufwerk’, das<br />

1 Die romanischen Beispiele sind in Rumantsch grischun (rg); die Regionalformen werden bezeichnet als<br />

sr = sursilvan, st = sutsilvan, sm = surmiran, pt = puter, vl = vallader; Französisch = frz., Italienisch = it,<br />

Oberitalienisch = oit., Rätoromanisch = rtr.<br />

14


aber schon durch ‘CD-Player’ internationalisiert wurde. 2 Die Bezeichnung telefonin<br />

für ‘Funktelefon’ konnte sich gegen natel als Produktname und besonders handy nicht<br />

durchsetzen und das westschweizerische portable ist geographisch und mental schon<br />

zu weit entfernt.<br />

Besonders die ersten grundlegenden Wörterbücher in der ersten Hälfte des 20. Jh.<br />

wählten die Analogie. Ein Teil ihrer Vorschläge konnte sich dank der Verbreitung in<br />

der damals sprachprägenden Schule sowie dem hohen Ansehen des Französischen und<br />

Italienischen durchsetzen und viele Germanismen ersetzen (Solèr 2005).<br />

Wohl immer beeinflusste der Purismus sowohl außer- wie auch innersprachlich die<br />

terminologische Anpassung. Zu Beginn des 20. Jh. fielen besonders im Engadin wegen des<br />

Irredentismus viele Italianismen trotz ihrer linguistischen Nähe zum Rätoromanischen.<br />

Andererseits besteht das Dilemma zwischen neolateinischen Begriffen wie aspiratur<br />

‘Staubsauger’, mochetta ‘Spannteppich’, die aber weniger transparent sind als die<br />

transkodischen tschitschapulvra ‘Staubsauger’ und tarpun stendì eig. ‘gespannter<br />

Teppich’. Und genau diese Nähe schafft viele neue Begriffe, die erst rückübersetzt,<br />

also deutsch gedacht, verstanden werden: maisa da mezdi wörtlich ‘Mittagstisch’ für<br />

‘gemeinsames Mittagessen für ältere Personen’ anstatt gentar cuminaivel.<br />

Zu den produktiven endolingualen Prozessen gehört die Morphemableitung für<br />

die verschiedenen Kategorien. Trotz ihrer grundlegenden Systematik erkennt man<br />

zeittypische Vorlieben. Deverbale Agensbegriffe auf -ader, -atur, -adur sind häufig,<br />

während Formen auf –ari, z.B. sr: attentari ‘Attentäter’, teilweise mit lateinischem<br />

–ARIU-Formen zusammenfallen; splanari ‘Hobelbank’ ist insofern eine Falschbildung,<br />

weil es kein Agens ist und auch nicht zu –ARIU gehört. Die –ist-Formen wie schurnalist<br />

‘Journalist’ sind nur dann erfolgreich, wenn die Variante -cher nicht durch ein<br />

deutsches Analogon gestützt wird. Sonst gilt -ist als puristisch und High-Variante wie<br />

musicist ‘Musiker’, das mit musicher eine Low-Variante erzeugt.<br />

Die Prozesse und teilweise deren Resultat werden gebildet mit -ziun wie furniziun<br />

‘Lieferung’, allontanaziun ‘Entfernung’ und exemziun ‘Befreiung, Entbindung’ auf<br />

ganz unterschiedlicher romanischer Basis oder mit -ada wie zavrada ‘Schafscheide,<br />

Aussonderung’, scuntrada ‘Treffen, Zusammenkunft’ und, ziemlich heterogen, auzada<br />

‘Stockwerk’, genauer ‘Anhebung’.<br />

Auch andere Suffixe sind mehrwertig, so –al in fossal ‘Baugrube, Stadtgraben’, plazzal<br />

‘Baustelle’, aber auch runal ‘Schlepplift’ ohne die –ALIS-Adjektive zu berücksichtigen.<br />

Allgemein bevorzugt das Romanische Periphrasen anstatt der stilistisch markierten<br />

2 Als Abkürzung gilt mehrheitlich „CD“ m/f während disc cumpact im romanischen Radio recht geläufig<br />

ist.<br />

15


Adjektive auf -abel, -ibel, -aivel, -ar, -ari, -ic und unterscheidet sich damit stark vom<br />

Französischen und Italienischen. 3<br />

Mehrdeutig ist auch das Morphem –et als Verkleinerung vegliet ‘(kleiner) Alter’, als<br />

Spezifikation furtget ‘Gabler’, rg: buffet ‘Blasebalg’, sr: suflet analog frz. „souflet“,<br />

it. „soffietto“, sr: stizzet ‘Löschhorn’, rg: durmigliet ‘Siebenschläfer’ als Lehnbildung<br />

bzw. Calque für die kaum verständlichen Formen sr: glis, vl: glira aus lat. GLIS.<br />

Interessant sind die Bildungen auf –era. Während die vom Verb abgeleiteten<br />

durchsichtig und verständlich sind, wie ardera 4 ‘Verbrennungsanlage’, mulschera<br />

‘Melkstand’, cuera ‘Brutkasten’, erweisen sich die vom Nomen gebildeten sehr<br />

undurchsichtig wie balestrera ‘Schießscharte’, das primär mit sr: ballistrar ‘zappeln,<br />

störrisch sein, hapern’ assoziiert wird, oder sie wirken ambivalent wie sutgera<br />

‘Sesselbahn’, bobera ‘Bobbahn’, cruschera sr: ‘Drehkreuz, Kreuz Kreuzworträtsel’,<br />

rg: ‘Fadenkreuz’ und nicht beispielsweise ‘Kreuzung’, das cruschada heißt und<br />

homonymisch ist mit ‘Kreuzzug’. Hier erzeugten die verschiedenen Idiome trotz<br />

der sogenannten avischinaziun miaivla ‘sanften Annäherung’ der 60er Jahre<br />

unterschiedliche Formen, die man zwar gegenseitig verstand, aber nicht zu einer<br />

einheitlichen Sprachform beigetragen haben.<br />

Grundsätzlich kann jede Entlehnung als Basiselement dienen, wobei sie mehr an<br />

psychologische als an linguistische Grenzen stößt. Anstatt neue Verben direkt mit dem<br />

Morphem –ar an fremde, meistens deutsche Stämme zu binden wie bremsar ‘bremsen’,<br />

spizzar ‘mit dem Spitzeisen ausschlagen’, cliccar ‘klicken’, checkar ‘merken’ (über<br />

Deutsch aus dem Englischen), chiffar ‘kiffen’, die die früheren Verben auf –egiar/-<br />

iar ersetzen, bevorzugt man das analytische far il + deutscher Infinitiv far il clichen<br />

‘(den) Klick machen’.<br />

Asyndetische Bildungen sind durchsichtig und treffend wie tirastruvas<br />

‘Schraubenzieher’ und muntastgala ‘Treppenaufzug’, während tilavent ‘Düse’ in<br />

Richtung Wetter weist. ‘Mutterkuh’ vatga-mamma drückt auch in der veränderten<br />

Abfolge von Bestimmtes-Bestimmendes (Determinat-Determinant) das undefinierte<br />

Verhältnis aus. Obgleich analog zu biomassa ‘Biomasse’, ist biopur ‘Biobauer’<br />

gewöhnungsbedürftig aber nötig, weil pur da bio wie pur da latg ‘Milchbauer’ zuerst<br />

auf das Material oder die Herkunft verweist. Regelmäßige Bildungen wie telecumandar,<br />

microcirquit bleiben elitär.<br />

3 Auf –ebel lautet einzig debel „schwach“. Formen wie frz. „grippe aviaire“ und it. „influenza aviaria“<br />

für ‘Vogelgrippe’ sind im Romanischen fast unmöglich und uaulic ‘den Wald betreffend’, selvus ‘waldig’<br />

wirken exotisch.<br />

4 Dazu „muss“ das Pledari Grond eine Periphrase implant per arder rument liefern; die Idiome verwenden<br />

zudem sr: barschera, vl: bruschaduoir.<br />

16


Auch eine Aktualisierung durch Um- und Neudefinition bestehender, nicht mehr<br />

gebrauchter Begriffe ist möglich, trotz der unsicheren Übergangszeit mit Homonymie:<br />

Noch heute wird zavrar nur auf ‘Schafe scheiden’ beschränkt, trotz zavrader<br />

‘Sortwort’ und zavrar ‘sortieren’; man verwendet sortar oder das ungenaue separar<br />

‘trennen’. Eine wirkliche „Herausforderung“ bedeutet eben dessen Bezeichnung: Das<br />

surselvische provocaziun ist vermutlich zu nahe an die deutsche „Provokation“, so dass<br />

man heute vermehrt sfida, eine italienische Entlehnung im Engadinischen, verwendet,<br />

obwohl sfida, und ganz besonders sfidar in Rheinischbünden näher an sfidar, disfidar<br />

‘misstrauen’ liegt. Eine Erweiterung erfuhr das Verb sunar ‘musizieren’, im Engadin<br />

noch ‘Glocken läuten’, durch die Unterdifferenzierung von ‘spielen’ und ‘abspielen’<br />

sunar ina platta, in(a) CD anstatt tadlar, far ir ina platta, in(a) CD. Dem Biologiebegriff<br />

tessì ‘Gewebe’ fehlt das typische Fadenmuster eines gewobenen Tuches, und er ist<br />

deshalb nicht alltagstauglich; stattdessen verwendet man konkret pel ‘Haut’, charn<br />

‘Fleisch’ bis zu material ‘Material’. Auch der Fachausdruck petroli für ‘Erdöl’ wird<br />

nur in der engeren Bedeutung von Lampenbrennstoff ‘Petrol’ wahr genommen und<br />

erfordert infolgedessen ein Calque ieli (mineral) ‘Mineralöl’.<br />

Die gesamtromanische Standardisierung, angestrebt in Rumantsch grischun, zeigt<br />

im Alltag ihre Grenzen wegen einer hohen Heteronymie. Entweder verwendet man<br />

beide Ausdrücke wie taglia/imposta ‘Steuer’, buis/schluppet ‘Gewehr’, entschaiver/<br />

cumenzar ‘beginnen’ oder man vereinfacht unzulässig, indem man glisch für ‘Lampe’<br />

anstatt cazzola in der Surselva verwendet, wo glisch nur ‘Licht’ bedeutet, weil<br />

lampa aus puristischen Gründen ausfällt. Manchmal wird der ursprüngliche Begriff<br />

missverstanden und das Resultat ist unbrauchbar wie plimatsch ‘Kissen’ in bischla da<br />

plimatsch ‘Lagerhülse’ als Umdeutung eines horizontal beweglichen Holzes auf dem<br />

Wagen für eine rotierende Drehbewegung, der rullera ‘Rolllager’, cullanera ‘Kugellager’<br />

entsprechen. Heute schmunzelt man über die Pionierbezeichnung sr: tabla spurteglia<br />

(Gadola 1956:79) für eine ‘elektrische Schalt(er)tafel’ mit Unterdifferenzierung von<br />

‘elektrischer Schalter’ und ‘Schalterfenster’, das inzwischen zu tavla da distribuziun,<br />

cumond berichtigt wurde. Der ganze Bereich der Elektrizität mit ‘Strom’, ‘Spannung’,<br />

‘Hochspannung’ und ‘Starkstrom’ usw. wurde erst nach 1990 für das Pledari Grond<br />

terminologisch aufgearbeitet; umgesetzt ist es kaum, schließlich ist es ziemlich<br />

abstrakt. 5<br />

5. Auswirkungen der terminologischen Anpassung<br />

Außer in offiziellen Bereichen mit einem vorgeschriebenen Sprachgebrauch wie<br />

die dreisprachige Kantonsverwaltung, die Gesetzgebung und die Herstellung von<br />

Schulbüchern, ist die terminologische Anpassung ein Zusammenspiel von glücklichen,<br />

5 Die ersten Fachvorschläge wurden 1917 im Chalender ladin veröffentlicht: Davart l’electricited. Terms<br />

romauntschs per l’electricited, acceptos dalla Commissiun Linguistica, 70-71.<br />

17


überzeugenden Vorschlägen auf der einen Seite und einer erfolgreichen Vermarktung<br />

auf der anderen Seite. Zuerst zur linguistischen Komponente:<br />

6. Linguistische Identifizierung<br />

Seit der Einführung des Rumantsch grischun 1982 bedeutet Terminologie nicht<br />

nur eine lexikalische Erweiterung, teilweise in einer Diglossie, sondern auch einen<br />

Paradigmawechsel hin zum Einheitsstandard. Neben psychologischen und politischen<br />

Hindernissen bestehen auch syntaktisch-semantische Unterschiede. Außer bei<br />

Gesprächen in sektoriellen Sprachen zwischen Fachleuten, sind die betroffenen<br />

Endanwender Laien, die Romanisch praktisch nur sprechen, und deshalb muss die<br />

Fachterminologie folgendes beachten:<br />

• der Begriff muss durchsichtig, transparent sein sowohl<br />

elementar (wörtlich) als auch in der Bedeutung (inhaltlich):<br />

sufflafain ‘Heugebläse’, tirastapun ‘Zapfenzieher’, pendiculara ‘Seilbahn’,<br />

autpledader ‘Lautsprecher’, portasperanza ‘Hoffnungsträger’; problematisch ist<br />

sr: sclausanetg, rg: strasarsa und, trotz des Calques, cirquit curt ‘Kurzschluss’;<br />

camiun-tractor erkennt die ländliche Bevölkerung als ‘Ackertraktor’ und nicht als<br />

modernen ‘Sattelschlepper’, ‘LKW’;<br />

• er muss sich regional und idiomatisch anpassen:<br />

rg: tilastruvas, vl tirascrauvs, sr: tilastrubas konnte in der Lumnezia zu<br />

tre(r)strubas angepasst werden. Schnell verliert sich aber der Grundbegriff,<br />

so für ‘Scheibenwischer’ mit der Vermischung von ‘wischen’, ‘waschen’ und<br />

‘trocknen’ rg: fruschavaiders, fruschets, sr: schubregiaveiders, furschets,<br />

st: furbaveders, sm: siaintaveders, pt: süjaintavaiders, terdschins, vl:<br />

süaintavaiders, terdschins und die schon erwähnten chasperets;<br />

rg: biancaria ‘Weißwäsche’ ist unverständlich im Romanischen mit nur alv als<br />

Benennung für ‘weiß’; üblich sind konkrete Begriffe wie sr: resti da letg, vl: linzöls<br />

‘Bettwäsche’, sr: resti suten ‘Unterwäsche’;<br />

• er sollte weder zur Homonymie noch zu Heteronymie führen:<br />

schluppet/buis ‘Gewehr’ sind regional so verankert, dass keiner<br />

davon sich durchsetzen kann; die ungenügende Unterscheidung<br />

von fittar ‘mieten’ und affittar ‘mieten, vermieten’ erfordert eine<br />

Periphrase prender/dar a fit ‘in Pacht nehmen/geben’;<br />

rg: taglia ‘Steuer’ ist bevorzugt worden, obwohl imposta produktiver<br />

wäre: *impostabel, *impostar, das im PG nur als Part. Perf. contribuziun<br />

imposta ‘auferlegte Leistung’ steht und kaum von rg: impostar<br />

‘aufgeben, einfächern’ als Buchwörter bedrängt würde;<br />

18


vl: cumischiun sindicatoria ‘Geschäftsprüfungskommission’ ist neben sr, rg:<br />

cumissiun da gestiun lediglich eine Scheinopposition, weil gestiun überall ‘Geschäft’<br />

bedeutet, aber trotzdem identifiziert sich die Bevölkerung zunehmend mit solchen<br />

Schibboleths als Gegenreaktion zu einer drohenden Vereinheitlichung;<br />

• darf im Romanischen dem deutschen Diskurs und Geist 6 nicht zuwiderlaufen.<br />

Begriffe wie denticulara ‘Zahnradbahn’ sind offenbar zu wenig einsichtig und<br />

brauchen eine Periphrase viafier a roda dentada, um sich von dentera ‘Zahnspange’,<br />

dentadira ‘Gebiss, Zahnung’ und dentigliun ‘Bartenplatte (beim Wal)’ abzusetzen,<br />

weil viele Morpheme zu schwach und deshalb nicht produktiv sind. Trotzdem<br />

vermochten sich auch eigenständige Begriffe durchzusetzen: runal ‘Schlepplift’,<br />

sutgera ‘Sessellift’; rentier ‘Rentner’ ist umstritten wegen des deutschen Synonyms<br />

‘Ren, Rentier’ und vl: golier vermag den üblichen goli ‘Goali, Torhüter’ kaum zu<br />

vertreiben.<br />

Fast unüberwindliche Hindernisse für eine Standardisierung stellen die idiomatisch<br />

ausgeprägten Bereiche der Speisen und der häuslichen Tierwelt dar. Der Ersatz für<br />

sr: tier ‘Tier’ animal wird besonders im Engadin pejorativ als ‘Viech’ verstanden;<br />

dessen bes-cha wird wiederum mit sr: bestga und besonders bestia ‘Raubtier, Bestie’<br />

gleichgesetzt, denn biestga gilt dort nur für ‘Vieh, Großvieh’ und entspricht nicht vl:<br />

besch ‘Schaf’. Die exemplarische Vielfalt belegt die Bezeichnung der Körperteile beim<br />

Menschen (PLED RUMANTSCH 3 1984).<br />

Bei bilingualen Sprechern mit einer stark interferierten Sprache betrifft die<br />

linguistische Identifizierung nicht nur das definierte Romanisch als postulierte<br />

reine Sprache, sondern das gesamte Repertoire (Deutsch und andere Sprachen).<br />

Die romanische Form wirkt oft puristisch mit entsprechendem Registerwechsel und<br />

verletzt den oft einzigen verfügbaren tieferen Stil des Sprachverwenders; es entsteht<br />

ein neues Register. In der geläufigen Jugendsprache wirken magiel ‘Glas’ und gervosa<br />

‘Bier’ stilistisch fremd neben glas und pier, und sa participar ‘sich beteiligen’<br />

entspricht aus sozialkommunikativen Gründen nicht far cun ‘mitmachen’, das man<br />

ersetzen will (Solèr 2002:261).<br />

7. Linguistische Bereicherung und Unsicherheit<br />

Die Terminologie will eine umfassendere Verwendbarkeit der Sprache mit neuen<br />

Domänen erreichen, aber sie soll auch die linguistische Ausdrucksmöglichkeit<br />

erweitern und so das Romanische als Fachsprache fördern. Wohl sind die derivativen<br />

Prozesse linguistisch geeigneter als die analytischen, aber diese werden wegen der<br />

6 Gemäss Ascoli (1880-83:407) „materia romana in spirito tedesco“ und Solèr (2002:261) „mentale<br />

Symbiose“.<br />

19


höheren Transparenz und der Nähe zum Deutschen bevorzugt; auch psychologische<br />

Gründe scheinen eine Hürde darzustellen. Spontane und spielerische Bildungen sind<br />

Einzelfälle ohne Wirkung, so idear ‘die Idee haben’, impulsar ‘den Impuls geben’ oder<br />

praulistic ‘märchenhaft’. Besonders die Zeitungsleute des 19. Jh. mussten mehr oder<br />

weniger eine neue Sprache für die sich stark veränderte Umwelt erschaffen, weil<br />

bis anhin nur eine religiöse und juristische Fachsprache bestand und Deutsch keine<br />

Alternative war. Noch heute sind die Medien Pioniere, denken wir an ‘Seebeben’,<br />

‘SARS’, ‘Herdenschutzhund’ und ‘Vogelgrippe’, aber gelegentlich verhindern<br />

notdürftige abstrakte Stelzenbegriffe eine genaue und kohärente Terminologie: chaun<br />

da protecziun ‘Herdenschutzhund’ anstatt chaun-pastur, chaun pertgirader; forzas<br />

da segirezza ‘Sicherheitskräfte’ anstatt eines konkreten Begriffs armada, polizia;<br />

bains da provediment ‘Versorgungsgüter’ für victualias, provediment oder effectiv,<br />

populaziun da peschs ‘Fischbestand, -population’, das romanisch als ‘Fischbevölkerung’<br />

verstanden wird anstatt (ils) peschs als Kollektiv.<br />

Einzelelemente lassen sich problemlos austauschen, während mehrgliedrige Begriffe<br />

die bestehende Syntax überfordern, indem sie sie verändern oder eine systemfremde<br />

Syntax übernehmen:<br />

• Verben mit Präposition im abstrakten Sinn:<br />

metter enturn ideas ‘Ideen umsetzen’ verstanden als ‘Ideen umlegen, töten’<br />

anstatt realisar, concretisar ideas; sr: fatg en lavur cumina priu ora la lavur da<br />

professiun ‘in Gemeinwerk gemacht ausgenommen die Facharbeit’ anstatt auter<br />

che, cun excepziun da, danor ‘anders als, mit Ausnahme von, außer’;<br />

• Nominalisierung und Nominalketten:<br />

sm: La discussiun cun la donna ò savia neir exequeida sainza grond disturbi da<br />

canera ‘das Gespräch mit der Frau konnte ohne größere Belästigung durch Lärm<br />

durchgeführt werden’ anstatt ins ò savia discorrer cun la donna senza neir disturbo<br />

‘man konnte mit der Frau sprechen, ohne gestört zu werden’;<br />

• Stelzensätze und Leerformulierungen:<br />

far adiever dals meds publics da transport en Engiadina Bassa ‘Gebrauch machen<br />

von den öffentlichen Verkehrsmitteln im Unterengadin’ für ir cun il tren ed auto<br />

da posta; exequir lavurs da surfatscha ‘Oberflächenarbeiten ausführen’ anstatt far<br />

la cuvrida ‘die Abdeckung (der Straße) machen’.<br />

Mit diesen transkodischen Bildungen könnte man sich linguistisch allenfalls<br />

abfinden, wenn das Romanische damit nicht noch die Identität verlieren würde.<br />

Komplexe Begriffe widersprechen zwar der Sprachgewohnheit, der Tradition der<br />

Romanischsprecher, aber die abstrakte, sperrige, styroporartige Syntax hat sich<br />

20


vom traditionellen Romanisch so weit entfernt, dass man es nur über das Deutsche,<br />

versteht, also aus der Rückübersetzung.<br />

8. Sozialpsychologische Aspekte<br />

Das Romanische wird in dörflichen Sprachgemeinschaften und teilweise in<br />

den Regionalzentren verwendet; es schafft dort eine lokale Identifikation unter<br />

den Romanischsprechern, ganz besonders den Einheimischen, und steht für das<br />

Überschaubare gegenüber dem Fremden. Wenn man aber beruflich oder mit einem<br />

nichtromanischen Partner eine andere Sprache verwendet, so tut man das emotionslos.<br />

Und wenn manche wegen ihrer fehlenden romanischen Fachkompetenz Deutsch<br />

verwenden, dann ist das eher ein Reflex der Sprachpolitik, als dass man sich schämt. Es<br />

ist zudem eher selten, dass man bewusst neue romanische Ausdrücke sucht, denn allzu<br />

oft vergessen besonders die Sprachverwalter und Sprachpfleger, dass das Romanische<br />

oft nur informell gesprochen wird und endgültig eine Ko-Sprache des Deutschen ist.<br />

9. Politisch-wirtschaftliche Aspekte<br />

Jede Sprache kann zwar materiell (Terminologie, Neologie) erfolgreich, sozusagen<br />

im Labor erneuert werden, aber deren Verbreitung, Implementierung, kann nur die<br />

Anwendungsseite (Produkte, Sprachträger usw.) bewirken. Beim Romanischen hingegen<br />

sind die anwendungsorientierten Bedingungen überhaupt nicht oder nur schwach<br />

erfüllt und auch der technisch-linguistische Bereich ist nicht eindeutig bestimmt<br />

(Entscheidungskompetenz, Verbindlichkeit, Verbreitung). Die enge und fast intime<br />

Sprachgemeinschaft fordert vom linguistischen Bearbeiter, der zugleich selber betroffen<br />

ist, eine technisch-linguistische Spracherneuerung, die einerseits systematisch ist und<br />

andererseits auch eine sichere Triviallösung liefert. Diese Individualisierung beeinflusst<br />

trotzdem die Spracherneuerung weniger als andere Rahmenbedingungen, nämlich das<br />

sprachliche Umfeld, die Nützlichkeit und die Kleinräumigkeit.<br />

Im gemeinsamen Wirtschafts-, Verkehrs-, Ausbildungs- und Kommunikationsraum<br />

mit der deutschen Schweiz fehlt dem Romanischen die konkrete, durchgehende<br />

Anwendung, die Kommerzialisierung der Sprache, außer in den gesteuerten Bereichen<br />

der Verwaltung und Volksschule in denen sie ohne direkte Konkurrenz ist.<br />

10. Verbreitung und Nachhaltigkeit<br />

Im ganzen Anpassungsprozess erweist sich – bei einer minorisierten Sprache<br />

nicht unerwartet – ausgerechnet die wichtigste Phase, nämlich die Verbreitung und<br />

systematische Anwendung, als schwächstes Glied. Die Anpassung dringt nicht direkt<br />

zum Anwender im Berufsalltag, sondern er muss sie bewusst holen und auch bereit sein,<br />

21


sie zu verwenden; gezwungen wird er kaum und wenn, dann nur in einzelnen Bereichen<br />

von befohlener Mehrsprachigkeit. Zudem verstreicht häufig so viel Zeit zwischen dem<br />

Vorschlag und der Anwendung beim Endverbraucher, dass der Begriff im technischen<br />

Bereich entweder schon veraltet ist oder dass die entlehnte Erstbezeichnung oder<br />

ein Trivialbegriff sich eingebürgert hat. Häufig überrumpelt die Entwicklung aber die<br />

Sprache regelrecht, so z. B. im Informatikbereich.<br />

Beim Start des Rumantsch grischun 1982 war auch die Informatik ein relativ neues<br />

und unbekanntes Werkzeug, so dass in dieser Phase auch die romanischen Begriffe<br />

dafür geschaffen werden konnten. In der anschließenden rasanten Verbreitung<br />

der Informatik sind diese aber durch die internationalen bedrängt oder verdrängt<br />

worden, so: ordinatur ‘Rechner, Computer’ > computer; platta fixa ‘Festplatte’ ><br />

HD; arcunar ‘speichern’ > segirar ‘sichern’; datas ‘Daten’, datoteca, ‘Datenfile’ ><br />

file; actualisaziun, cumplettaziun ‘Update’ > update, palpader ‘Scanner’ > scanner.<br />

‘Laptop’ hat man direkt übernommen ohne portabel vorzuschlagen.<br />

Die Zeitungsredaktoren des 19 Jh. konnten ihre neuen, wenig systematischen<br />

Begriffe unmittelbar den Lesern konkurrenzlos vermitteln; aber sie verliefen oft im<br />

Sand, weil sie nicht systematisch gesammelt und weiter verbreitet wurden. Diese<br />

Schwächen versuchte die für das Engadin 1919 begonnene themenorientierte Reihe<br />

„S-CHET RUMANTSCH“ in der Zeitung und später in Buchform zu überwinden. Ich<br />

möchte es nicht unterlassen, einige phantasievolle Verbreitungsarten wenigstens zu<br />

erwähnen:<br />

• mit Metzgereibegriffen bedruckte Papiertüten<br />

• Beschriftungen der Produkte in den Auslagen<br />

• Sportterminologie auf Tafeln in den Turnhallen<br />

• zweisprachige Rechnungsformulare für Autowerkstätten<br />

• Beschreibung und Gebrauchsanweisung auf Produktepackungen7 Diese direkten Anwendungen wurden durch sekundäre Listen ergänzt und noch<br />

heute veröffentlichen einzelne Zeitungen regelmäßig kleine Wortlisten.<br />

Die wohl erfolgreichste Verbreitung brachte die Schule bis zu den großen<br />

Strukturänderungen der 70er Jahren des letzten Jahrhunderts, die eine noch<br />

mehrheitlich nur-romanische und ländliche Bevölkerung in eine neue Welt inhaltlich<br />

und sprachlich einführte. Diese Periode dauerte so lange, dass eine Schulbuchreihe noch<br />

über zwei Schulgenerationen reichte und die gelernten Neuerungen fast lebenslänglich<br />

galten. In dieser Zeit fallen auch die ersten systematischen Wörterbücher.<br />

7 Eines der wenigen Beispiele ist die „Lataria Engiadinaisa SA, CH-7502 Bever“; in den 90er Jahren waren<br />

einige Tierarzneien romanisch beschriftet; die Anschrift Adina Coca Cola blieb ein Werbegag der 90er<br />

Jahre.<br />

22


Wörterbücher und Lexikographie sind unverzichtbare Hilfsmittel für jede Sprache,<br />

aber wenig wirkungsvoll für den Sprachverwender, weil er bewusst und außerhalb<br />

der Gesprächs, sozusagen metakommunikativ auf sie greifen muss. Sie sind im<br />

Romanischen zudem nur referenziell und liefern eine Ersatzbezeichnung für schon<br />

bekannte – zwar deutsche – Ausdrücke, aber trotzdem verfügbar im Zeicheninventar<br />

der Romanischsprecher (Reiter 1984:289).<br />

Während die hochspezialisierte Terminologie in keiner Sprache zum allgemeinen<br />

Wortschatz gehört, sollten die neuen Begriffe des modernen Alltags wie Verkehr,<br />

Kommunikation, Unterhaltung, Lifestyle, aber auch der neueren Verwaltung<br />

umgesetzt werden. Für eine umfangreichere Durchsetzung, Implementierung, fehlt<br />

das romanische Umfeld sowie der Terminologiediskurs. Die bestehenden Medien<br />

erfüllen lediglich eine lokale und emotionale Rolle gegenüber einem umfassenden<br />

deutschsprachigen Angebot, und so entwickelt sich auch kaum eine Sprachnorm.<br />

Die Verwendung des Romanischen allgemein, und einer offiziellen Sprachform<br />

im besonderen, anstatt des Deutschen oder des Englischen ist nur ausnahmsweise<br />

bei Kulturtouristen und Heimwehromanen ein kommerzieller Faktor; sonst kann es<br />

sogar hinderlich sein, wie die Reaktionen der Bevölkerung auf jegliche Anpassung<br />

eindrücklich belegen. Das Romanische besitzt kein geschütztes Sprachgebiet und seine<br />

Verwendung kann gesetzlich kaum oder nicht durchgesetzt werden wie beispielsweise<br />

in Frankreich.<br />

Die kantonale Verwaltung verwendet die drei offiziellen Kantonsprachen in den<br />

Veröffentlichungen und im Internet (Erklärungen, Berichte, Anleitungen, Hinweise,<br />

Abstimmungen usw.). In der Verwaltungstätigkeit hingegen ist das Romanische<br />

gegenüber dem Deutschen besonders im Fachbereich eingeschränkt: Die romanische<br />

Steuererklärung gibt es nicht digital, verschiedene amtliche Formulare können online<br />

nur deutsch und gelegentlich italienisch ausgefüllt werden. Offensichtlich trifft<br />

folgendes für die regionalen Organisationen, die direkt mit der Bevölkerung arbeiten<br />

zu: nur Romanisch selten, zweisprachig ist häufiger, eher plakativ, und mehrheitlich<br />

Deutsch. Das ist auch eine Folge des ‘polykephalen’, sprich regionalisierten<br />

Romanisch als Teil einer deutschen Umwelt, und es verunmöglicht eine einheitliche<br />

Fachterminologie und ihre einheitliche umgangssprachliche Umsetzung. So bestätigt<br />

sich die Feststellung von Haarmann (1993:108) „Hier liegt ein prinzipielles Problem<br />

des Minderheitenschutzes. Eine indominante Sprache hat zwar grundsätzlich bessere<br />

Chancen zu überleben, wenn ihre Verwendung in Bereichen des öffentlichen Lebens<br />

garantiert wird, es besteht aber keine automatische Wechselbeziehung zwischen<br />

23


einer sprachpolitischen Förderung und der Erhaltung dieses Kommunikationsmediums<br />

als Mutter- und Heimsprache“.<br />

Die gewinnorientierte Wirtschaft wählt dementsprechend die beste Sprache.<br />

Romanisch verwendet sie identifikatorisch und emotional in den rtr. Regionen, aber nicht<br />

als durchgehende Plattform (Banken, Versicherungen). „Unique Selling Proposition“ ist<br />

ein Schlagwort und wird bestenfalls im Mäzenatentum eingelöst. Ohne die operative<br />

Bedeutung passt sich keine Fachsprache an, oder sie wird nicht systematisch und<br />

einheitlich verwendet, sondern als lokale und stilistische (diglossische) Variante,<br />

banalisiert als Trivialterminologie. Dann ist auch die domänenspezifische Verwendung<br />

des Romanischen und deren Aktualisierung weitgehend illusorisch, und auch die<br />

bescheidene berufliche Aus- und Weiterbildung dient bestenfalls für romanische<br />

Infrastrukturbetriebe (Lia Rumantscha, Radio, Fernsehen und die Schulunterstufe).<br />

Das Romanische passt sich zwar den neuen Erfordernissen dauernd an, aber weil<br />

diese Entwicklung eher spontan als geordnet erfolgt, und weil sie eher die gesprochene<br />

Sprache mit einer Trivialterminologie betrifft, fördert sie die zweisprachige<br />

Diglossie mit dem Schriftdeutschen in allen Außenbeziehungen und sogar unter<br />

Romanischverwendern.<br />

11. Ausblick – aber kaum die Lösung<br />

Das klingt nach einer Bankrotterklärung. Das ist es nicht, aber man muss sich<br />

auf die Grundlagen rückbesinnen und in erster Linie die Randbedingungen, die<br />

soziolinguistischen, politischen und wirtschaftlichen Voraussetzungen ernst(-er)<br />

nehmen.<br />

Zum ersten die Terminologie; anstatt der akademischen und direkt kopierten,<br />

sterilen Erneuerungen muss man sich um assoziative – und überschreite sie auch die<br />

Einzelsprache – einsichtige oder sogar spielerische, aber praxistaugliche Benennungen<br />

bemühen, die lebensnah sind und genaue Inhalte sprachlich sinnvoll und kulturell<br />

verträglich umsetzen können.<br />

Die Hauptschwierigkeit ist und bleibt die Verbreitung. Wenn eine Sprache wie<br />

das Romanische mehr kulturell, ideell und politisch, als wirtschaftlich begründet ist,<br />

erweist sich deren Anpassung (Modernisierung und Standardisierung) umso weniger<br />

durchsetzbar.<br />

Psychologischer Druck oder die Drohung eines Sprachniedergangs wirken vielleicht<br />

kurzfristig, erwecken Hoffnungen, aber sie wirken niemals nachhaltig.<br />

Dass sich der Riesenaufwand für die Romanisierung des ganzen MS-Office mit<br />

der Orthografiekontrolle (Spell-Checker) nicht lohnt, ist leicht vorauszusagen; das<br />

24


Produkt spricht eine zu kleine Gruppe an und das Bedürfnis nach romanischen <strong>Text</strong>en<br />

kann man nicht künstlich erzeugen. 8 Mit Sicherheit hilfreich und seit bald 15 Jahren<br />

nützlich erwies sich die Terminologiearbeit im Pledari Grond der Lia Rumantscha; es<br />

ist zwar bescheidener, dafür praxisbezogen und dient zudem als eine Hilfsbrücke zu<br />

den Idiomen und sollte somit Spannungen abbauen.<br />

Für eine isolierte Kleinsprache ist es aber unabdingbar, die Sprachverwender<br />

schnell, unkompliziert und konkret zu unterstützten. Die privaten und kollektiven<br />

Sprachverwalter wie die Lia Rumantscha und der Kanton mit seiner umfassenden<br />

Tätigkeit können die Sprachverwender am ehesten überzeugen mit gebrauchfertigen<br />

Vorlagen, schnellen Übersetzungen, gefälligen <strong>Text</strong>en und mit einem umsichtigen,<br />

engen Coaching bei der Sprachverwendung und so wären auch die Empfänger eingebunden.<br />

Für diese Aufgaben braucht es Terminologiearbeit. Das ist ein guter Anfang und ist<br />

auch zu bewältigen. Die folgenden ebenso notwendigen Schritte müssen zuallererst<br />

die Sprachverwender tun.<br />

8 Versuche der LR um 1990 digitales Material für Handwerksbetriebe herzustellen und zu vertreiben<br />

scheiterte an den einzelbetrieblichen „Branchenlösungen“, die miteinander unverträglich sind, an der<br />

Einheitsform Rumantsch grischun, an der gewohnten deutschen Berufssprache sowie der Einstellung<br />

gegenüber der deutschsprachigen Kundschaft.<br />

25


26<br />

Bibliographie<br />

Ascoli, G.I. (1880-1883). “Annotazioni sistematiche al Barlaam e Giosafat soprasilvano.”<br />

Archivio glottologico italiano. Roma: Loescher, 7:365-612.<br />

Carisch, O. (1848). Taschen-Wörterbuch der Rhætoromanischen Sprache in<br />

Graubünden. Chur: Wassali.<br />

Coray, R. (1993). “La mumma romontscha: in mitos.” ISCHI 77, 4, 146-151.<br />

Decurtins, A. (1993). “Wortschatz und Wortbildung – Beobachtungen im Lichte der<br />

bündnerromanischen Zeitungssprache des 19./20. Jahrhunderts.” Rätoromanisch,<br />

Aufsätze zur Sprach-, Kulturgeschichte und zur Kulturpolitik. Romanica Rætica 8,<br />

Chur: Società Retorumantscha, 235-254.<br />

EUROMOSAIC (1996). Produktion und Reproduktion der Minderheitensprachge-<br />

meinschaften in der Europäischen Union. Brüssel/Luxemburg: Amt für amtliche<br />

Veröffentlichungen der EG.<br />

Gadola, G. (1956). “Contribuziun alla sligiaziun dil problem puril muntagnard.” Igl

Ischi, 42, 33-93.<br />

Haarmann, H. (1993). Die Sprachenwelt Europas. Geschichte und Zukunft der<br />

Sprachnationen zwischen Atlantik und Ural. Frankfurt: Campus.<br />

Jaberg, K. & Jud, J. (1928). Der Sprachatlas als Forschungsinstrument. Halle:<br />

Niemeyer.<br />

Pledari Grond (2003) deutsch-rumantsch, rumantsch-deutsch, cun conjugaziuns dals<br />

verbs rumantschs. Cuira: Lia rumantscha [CD-ROM].<br />

PLED RUMANTSCH/PLAID ROMONTSCH 3 (1984). Biologia. Cuira: Lia rumantscha.


RÄTOROMANISCH (2004). Facts & Figures. Cuira: Lia rumantscha.<br />

Reiter, N. (1984). Gruppe, Sprache, Nation. Wiesbaden: Harrassowitz.<br />

S-CHET RUMANTSCH (1917-1963). Fögls per cumbatter la caricatura nella lingua<br />

ladina. Scuol: Uniun dals Grischs.<br />

Solèr, C. (1986). “Ist das Domleschg zweisprachig?” Bündner Monatsblatt, 11/12, 283-<br />

300.<br />

Solèr, C. (2002). “Spracherhaltung – trotz oder wegen des Purismus. Etappen des<br />

Rätoromanischen.” Bündner Monatsblatt, 4, 251-264.<br />

Solèr, C. (2005). “Co e cura che la scrittira emprenda rumantsch. Cudeschs da scola per<br />

la Surselva.” Annalas da la Societad Retorumantscha. Cuira: Societad retorumantscha,<br />

7-32.<br />



Implementing NLP-Projects for Small<br />

Languages: Instructions for Funding Bodies,<br />

Strategies for Developers<br />


Oliver Streiter<br />

This research starts from the assumption that the conditions under which ‘Small<br />

Language’ Projects (SLPs) and ‘Big Language’ Projects (BLPs) are conducted are<br />

different. These differences have far-reaching consequences that go beyond the<br />

material conditions of projects. We will therefore try to identify strategies or<br />

techniques that aim to handle these problems. A central idea we put forward is<br />

pooling the resources to be developed with other similar Open Source resources. We<br />

will elaborate the expected advantages of this approach, and suggest that it is of such<br />

crucial importance that funding organisations should put it as condicio sine qua non<br />

into the project contract.<br />

1. Introduction: Small Language & ‘Big Language’ Projects - An Analysis of<br />

their Differences<br />

Implementing NLP-projects for Small Languages: Is this an issue that requires<br />

special attention? Are Small Language Projects (SLPs) different from ‘Big Language’<br />

Projects (BLPs)? What might happen if SLPs are handled in the same way as BLPs? What<br />

are the risks? How can they be reduced? Can we formulate general guidelines so that<br />

such projects might be conducted more safely? Although the processing of minority<br />

languages and other Small Languages has been the subject of a series of workshops, this

subject has been barely tackled as such. While most contributions discuss specific<br />

achievements (e.g. an implementation or transfer of a technique from Big Languages<br />

to Small Languages), only a few articles rise to a more general level of reflection on

how Small Language Projects might be conducted in general.<br />

In this contribution, we will compare SLPs and BLPs at the abstract schematic<br />

level. This comparison reveals differences that affect, among other things, the status<br />

of the researcher, the research paradigm to be chosen, the attractiveness of the<br />

research for young researchers, as well as the persistence and availability of the<br />

elaborated data - all to the disadvantage of Small Languages. We will advance one far-<br />

reaching solution that overcomes some of these problems inherent to SLPs, that is,<br />

to pool the developed resources with other similar Open Source resources and make


them freely available. We will discuss, step by step, the possible advantages of this<br />

strategy, and suggest that this strategy is so promising and so crucial for the survival<br />

of the elaborated data that funding organisations should put it as condicio sine qua<br />

non into their project contract.<br />

Let us start with the comparison of BLPs and SLPs.<br />

• Competition in Big Languages: Big Languages are processed in more than<br />

one research centre. Within one research centre more than one group might work on<br />

different aspects of this single language. The different centres or groups compete<br />

for funding, and thus strive for scientific reputation (publications, membership in<br />

exclusive circles, membership in decision-making bodies) and try to influence the

decision-making processes of funding bodies.<br />

• Niches for Small Languages: Small Languages are studied by individual<br />

persons, small research centres or cultural organisations. Small Languages create a<br />

niche that protects the research and the researcher. Direct competition is unusual.<br />

This, without doubt, is positive. On the negative side, however, we notice that<br />

methodological decisions, approaches and evaluations are not challenged by<br />

competitive research. This might lead to a self-protecting attitude that ignores<br />

inspiration coming from successful comparable language projects.<br />

• Big Languages Promise Money: There is commercial demand for BLPs<br />

as can be seen from the funding that companies like Google or Microsoft provide<br />

for NLP projects. As these companies try to obtain a relative advantage over their<br />

competitors, language data, algorithms, and so forth are kept secret.<br />

• There is No Money in Small Languages: Those organisations that<br />

fund BLPs are not interested in SLPs. If a Small Language wants to integrate its<br />

spellchecker in Microsoft Word, the SLP has to provide the linguistic data with little or no remuneration from Microsoft.

• Big Languages Hide Data: Language resources for Big Languages are and<br />

have been produced many times in different variants before they find their way<br />

into an application, or before they are publicly released. Since research centres<br />

for Big Languages compete for funding, recognition and commercialisation, every<br />

centre hopes to obtain a relative advantage over their competitors by keeping<br />

developed resources inaccessible to others. 1<br />

1 That this secretiveness might have any advantages at all can be called into question. Compare, for<br />

example, the respective advantages Netscape or Sun had from handing over their resources to the Open<br />

Source Community. Netscape-based browsers by far outperform their previous competitors such as<br />

Internet Explorer or Opera and the data handling in Open Office is going to be copied by the competitor<br />

Microsoft Office. As for the scientific reputation, people cited frequently are those who make available<br />

their resources including dictionaries and corpora (e.g. Eric Brill, Henry Kucera, W. Nelson Francis,<br />



• Small Languages Shouldn't Do So: For Small Languages, such a waste of<br />

time and energy is unreasonable. Resources that have been built once should be<br />

made freely available so that new, related projects can build on top of them, even<br />

if they are conducted elsewhere. Without direct competition, a research centre<br />

should have no disadvantage by making its resources publicly available. The reluctance to distribute the developed resources is most likely due to the misconception that sharing the data amounts to losing the copyright on it.

However, under the General Public License (a license that might be used in SLPs),<br />

the distribution of resources requires that copies must contain the appropriate<br />

copyright notice (so that the rights remain with the author of the resources). In<br />

addition, it has to contain the disclaimer of warranty, so that the author is not liable<br />

for any problems others have with the data or programs. Somebody modifying the<br />

data or programs cannot sell this modification unless the source code is made freely<br />

available, so that everybody, including the author, can take over the resources for<br />

further improvements.<br />
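To make the notice requirement concrete, here is a minimal, purely hypothetical sketch of the kind of header a GPL-distributed resource module might carry (the file name, year and author are invented; the license wording is the standard GNU notice, abridged):

# wordlist_tools.py -- hypothetical example of the notice shipped with a resource
#
# Copyright (C) 2005  N.N. (the author of the resource)
#
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation; either version 2 of the License, or (at your
# option) any later version.  It is distributed in the hope that it will
# be useful, but WITHOUT ANY WARRANTY; without even the implied warranty
# of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

def load_wordlist(path):
    """Return the word list (one entry per line) as a sorted Python list."""
    with open(path, encoding="utf-8") as f:
        return sorted({line.strip() for line in f if line.strip()})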

The consequence of not sharing the data (i.e., keeping the data on the researcher’s<br />

hard disk) is that the data will be definitely lost within ten years after its last<br />

modification. 2<br />

• BLPs Overlap in time and create a research continuum. In this research<br />

continuum, researchers and resources can develop and adapt to new paradigms<br />

(defined as “exemplary instances of scientific research”, Kuhn 1996/1962) or new<br />

research guidelines. In fact, a large part of a project is concerned with tying the<br />

knots between past and future projects. Data is re-worked, re-modelled and thus<br />

kept in shape for the future.<br />

• SLPs are Discontinuous. There might be temporal gaps between one SLP<br />

and the next one. This threatens the continuity of the research, forces researchers<br />

to leave the research body, or might endanger the persistence of the elaborated<br />

Huang Chu-ren, Chen Keh-jiann, George A. Miller, Christiane Fellbaum, Thorsten Brants, and many

others).<br />

2 Reasons for the physical loss of data are: Personal mobility (e.g. after the retirement of a collaborator,<br />

nobody knows that the data exists, or how it can be accessed or used). Changes in software formats<br />

(e.g. the format of backup programs, or changes in the SCSI controller make the data unreadable).<br />

Changes in the physical nature of external memories (punch card, soft floppy disk, hard floppy disk,<br />

micro floppy, CD-ROM, magnetic tape, external hard disk, or USB-stick) and the devices that can read<br />

them. Hard disk failure (caused by firmware corruption, electronic failure, mechanical failure, logical<br />

failure, or bad sectors). The lifetime of storage devices is limited: tapes (2 years), magnetic media (5-10 years) and optical media (10-30 years). This depends very much on the conditions of usage and storage

(temperature, light and humidity).<br />



data. The data is unlikely to be re-worked, or ported to new platforms or formats,<br />

and thus it risks becoming obsolete or unreadable.<br />

• BLPs Rely on Specialists: The bigger the group in a BLP, the more<br />

specialists in programming languages, databases, linguistic theories, parsing, and<br />

so forth it will integrate. Specialists make the BLP autonomous, since specific<br />

solutions can be fabricated when needed.<br />

• All-rounders at Work: Specialisation is less likely to be found in SLPs,<br />

where one person has to cover a wider range of activities, theories, tools, and<br />

so forth, in addition to administrative tasks. Thus, SLPs cannot operate

autonomously. They largely depend on toolkits, integrated software packages, and<br />

so forth. Choosing the right toolkit is not an easy task. It not only decides the<br />

success or failure of the project, but will also influence the course of the research<br />

more than the genius of the researcher. If a standard program is simply chosen<br />

because the research group is acquainted with it, a rapid project start might be<br />

bought at the price of a troublesome project history, data that is difficult to port<br />

or upgrade, or data that does not match the linguistic reality it should describe.<br />

• BLPs Play with Research Paradigms: BLPs can freely choose their<br />

research paradigm and therefore frequently follow the most recent trends in<br />

research. Although different research paradigms offer different solutions and have<br />

different constraints, BLPs are not so sensitive to these constraints and can cope<br />

successfully with any of them. BLPs must not only be capable of handling new research paradigms (otherwise, the new research paradigms could not survive); they are even expected to explore new research paradigms, as they are the only

ones having the gross scientific product that can cope with fruitless attempts and<br />

time-consuming explorations. Indeed, we observe that BLPs frequently turn to the<br />

latest research paradigm to gain visibility and reputation. Shifts in the research<br />

paradigm might make it necessary to recreate language resources in another<br />

format or another logical structure.<br />

• SLPs Depend on the Right Research Paradigm: SLPs do not have at their disposal

rich and manifold resources (dictionaries, tagged corpora, grammars, tag-sets, and<br />

taggers) in the same way as BLPs do. The research paradigm should thus be chosen<br />

according to the nature and quality of the available resources, and not according<br />

to the latest fashion in research. This might imply the usage of a) example-<br />

based methods, since they require less annotated data (cf. Streiter & de Luca<br />

[2003]), b) unsupervised learning algorithms, if no annotations are available, or c)<br />

hybrid bootstrapping methods (e.g. D. Prinsloo & U. Heid 2006), which are almost<br />



impossible to evaluate. Young researchers may experience these restrictions as a<br />

conflict. On one hand, they have to promote their research, ideally in the most<br />

fashionable research paradigm; on the other hand, they have to find approaches<br />

compatible with the available resources. Fortunately, after the dust of a new<br />

research trend has settled, 3 new research trends are looked at in a less mystified

light, and it is perfectly acceptable for SLPs to stick to an older research paradigm,<br />

if it conforms to the overall requirements. 4<br />

• Model Research in Big Languages: Research on Big Languages is frequently<br />

presented as research on that respective language and, in addition, as research on<br />

Language in general. The same piece of research might thus be sold twice. From<br />

this, BLPs derive not only a greater reputation and better project funding, but also<br />

an increased attractiveness of this research for young researchers. Big Languages,<br />

as a consequence, are those languages for which, in virtue of general linguistic<br />

accounts, documentary and pedagogic resources are developed. Students are<br />

trained in and with these languages in the most fashionable methods, which they<br />

learn to consider as superior.<br />

• SLPs Represent Applied Research - at best!: SLPs are less likely to sell<br />

their research as research on Language in general. In fact, little else but research<br />

on English counts as research on Language, and is considered research on a specific<br />

language at best. 5 The less general the scope of research, the less likely it is to be<br />

3 I have taken this term from Harold Somers (1998).

4 Although Big Language research centres are free to choose their research paradigm, they more often<br />

than not are committed to a specific research paradigm (i.e., the one they have been following for years or in which they play a leading role). This specialization of research centres to a research paradigm

is partially desirable, as only specialists can advance the respective paradigm. However, when they do<br />

research on Small Languages, either to extend the scope of the paradigm or to access alternative<br />

funding, striking mismatches can be observed between paradigm and resources. Such mismatches are<br />

of no concern to the Big Language research centre, which, after all, is doing an academic exercise,<br />

but they should be closely watched in SLPs, where such mismatches will cause the complete failure of<br />

the project. For example, Machine Translation knows two approaches: rule-based approaches, where<br />

linguists write the translation rules; and corpus-based approaches, where the computer derives the<br />

translation rules from parallel corpora. Corpus-based approaches can be statistics-based or example-based.

Recently, RWTH Aachen University, known for its cutting-edge research in statistical Machine<br />

Translation, proposed a statistical approach to sign language translation (Bungeroth & Ney 2004).<br />

One year later Morrissey & Way (2005) from Dublin City University, a leading agent in Example-based<br />

Machine Translation, proposed “An Example-Based Approach to Translating Sign Languages.” The fact,<br />

however, that parallel corpora involving at least one sign language are extremely rare and extremely<br />

small is brushed aside in both papers, as if it did not affect the research. In other words, the research

builds on a type of resource that does not actually exist, just to please the paradigm.<br />

5 In a Round Table discussion at the 1st SIGHAN Workshop on Chinese Language Processing, hosted by<br />

ACL in Hong Kong, 2000, a leading researcher in Computational Linguistics vehemently expressed his<br />

dissatisfaction in being considered only a specialist in Chinese Language Processing, while his colleagues<br />

working in English are considered specialists in Language Processing. Nobody would call Chomsky a<br />

specialist in American English! Working on a Small Language thus offers a niche at the price of a stigma that prevents researchers from ascending to the Olympus of science.



taught at university. Students then implicitly learn what valuable research is, that<br />

is, research on Big Languages and recent research paradigms.<br />

To sum up, we observed that BLPs are conducted in a competitive and sometimes<br />

commercialised environment. Competition is a main factor that shapes the way in<br />

which BLPs are conducted. In such an environment, it is quite natural for research<br />

to overlap and to repeatedly produce similar resources. Not sharing the developed<br />

resources is seen as enhancing the competitiveness of the research centre. It is not<br />

considered to be an obstacle to the overall advance of the research field: similar<br />

resources are available elsewhere in any case. Different research paradigms can be<br />

freely explored in BLPs, with an obvious preference for the latest research paradigm,<br />

or for the one to which the research centre is committed. Gaining visibility, funding<br />

and eternal fame are not subordinated to the goal of producing working language<br />

resources.<br />

The situation of SLPs is much more critical. SLPs have to account for the persistence<br />

and portability of their data beyond the lifespan of the project, beyond the involvement<br />

of a specific researcher, and beyond the lifespan of a format or specific memory<br />

device. As Small Languages are not that much involved in the transition of paradigms,<br />

data cannot be reworked, especially if research is discontinuous. The refunding of a<br />

project due to a shift in research paradigms or lost or unreadable data is unthinkable.<br />

With few or no external competitors, most inspiration for SLPs comes from BLPs.<br />

However, the motivation for BLPs to choose a research paradigm and their capacity<br />

to handle research paradigms (given by definition) is not that of an SLP. For talented

young researchers, SLPs are not attractive. As students, they have been trained in<br />

BLPs and share with the research community a system of values according to which<br />

other languages and other research paradigms are preferred.<br />

2. Improving the Situation: Free Software Pools<br />

Although most readers might agree with the obvious description of SLPs I have

given above, few have turned to the solutions I am about to sketch below. The main<br />

reason for this might be possible misconceptions or unsubstantiated fears. Let us<br />

start with what seems to be the most puzzling question; that is: how can projects and<br />

researchers guarantee the existence of their data beyond the direct influence of the<br />

researcher him/herself? To give a hypothetical example: you develop a spellchecker<br />

for a language of two hundred speakers, all above the age of seventy (including<br />

yourself), and none of them having a computer (except for you). How can you ensure<br />

that the data survives? The answer is: Pool your data with other data of the same<br />

form and function and let the community take care of YOUR data. If you make your<br />



research results available as free software, other people will take care of your data

and upgrade it into new formats, whenever needed. ‘But,’ you might wonder now,<br />

‘why should someone take care of my data on an unimportant and probably dying<br />

language?’ The answer lies in the pool: even if those people do not care about your data<br />

per se, they care about the pool, and once they transform resources for new versions<br />

they transform all resources of the pool, well knowing that the attractiveness of the<br />

pool comes from the number of different language modules within it. In addition,<br />

all language modules have the same format and function and if one module can be<br />

transformed automatically, all others can be automatically transformed as well. 6<br />
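A minimal sketch of why this matters in practice, under invented assumptions about the pool layout (a directory pool-v1/ with one word-list file per language, upgraded into pool-v2/): a single short script applies the same transformation to every language module, so a small module is carried along at no extra cost.

import glob, os

def upgrade_module(old_path, new_dir):
    """Hypothetical format change: lower-case and deduplicate the entries of
    a word list and write it into the new pool layout."""
    lang = os.path.splitext(os.path.basename(old_path))[0]   # e.g. 'sc' or 'rm'
    with open(old_path, encoding="utf-8") as src:
        words = sorted({line.strip().lower() for line in src if line.strip()})
    os.makedirs(new_dir, exist_ok=True)
    with open(os.path.join(new_dir, lang + ".wl"), "w", encoding="utf-8") as dst:
        dst.write("\n".join(words) + "\n")

# the very same transformation is applied to every module of the pool
for module in glob.glob("pool-v1/*.txt"):
    upgrade_module(module, "pool-v2")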

But which pools exist that could possibly integrate and maintain your data? Below,<br />

you find an overview of some popular and useful pools. This list might also be read<br />

as a suggestion for possible and interesting language projects, or as a check-list of<br />

components of your language that still need to be developed to be on a par with other

languages. Frequently, the same linguistic resources are made available to different<br />

pools (e.g. in ISPELL, ASPELL and MYSPELL). This enlarges the range of applications for<br />

a language resource, increases the visibility, and supports data persistence.<br />

6 I do not know how much of an unmotivated over-generalisation this is. In the Fink project (http://fink.sourceforge.net), for example, there is one maintainer for each resource and not for each pool and, as a consequence, not all ispell modules are available. In DEBIAN (http://www.debian.org) we find again one maintainer for each resource, but orphaned packages, that is, packages without maintainer, are taken over by the DEBIAN QA group.

• Spelling, Office, Etc:<br />

ISPELL (lgs. > 50); spelling dictionary + rules:<br />

A spellchecker, used standalone or integrated into smaller applications.<br />

(AbiWord, flyspell, WBOSS). (http://www.gnu.org/software/ispell/)<br />

ASPELL (lgs. > 70); spelling dictionary + rules:<br />

An advanced spellchecker, used standalone or integrated into smaller<br />

applications. (emacs, AbiWord, WBOSS)(http://aspell.sourceforge.net/)<br />

MYSPELL (lgs. > 40); spelling dictionaries + rules:<br />

A spellchecker for Open Office. (http://lingucomponent.openoffice.org/)<br />

OpenOffice Grammar Checking (lgs. > 5); syntax checker:<br />

A heterogeneous set of grammar checkers for Open Office.<br />

OpenOffice Hyphenation (lgs. > 30); hyphenation dictionary:
A hyphenation dictionary for use with Open Office, but used also in LaTeX, GNU Troff, Scribus, and Apache FOP.

OpenOffice Thesaurus (lgs. > 12); thesaurus:<br />

A thesaurus for use with Open Office.<br />

(http://lingucomponent.openoffice.org/)<br />

STYLE and DICTION (lgs. = 2); style checking:<br />

Help to improve wording and readability.<br />

(http://www.gnu.org/software/diction/diction.html)<br />

HUNSPELL (lgs. > 10); spelling dictionary + rules:<br />

An advanced spellchecker for morphologically rich languages that can be<br />

turned into a morphological analyser. (http://hunspell.sourceforge.net/).<br />

• Dictionaries:<br />

FREEDICT (lgs. > 50); translation dictionary:<br />

Simple, bilingual translation dictionaries, optionally with definitions and API as<br />

binary and in XML. (http://sourceforge.net/projects/freedict/).<br />

Papillon (lgs. > 8); multilingual dictionaries:<br />

Multilingual dictionaries structured according to Mel’čuk’s Meaning-Text Theory. (http://www.papillon-dictionary.org/Home.po)

JMDict (lgs. > 5); multilingual dictionaries:<br />

Multilingual translation dictionaries in XML, based on word senses.<br />

(http://www.csse.monash.edu.au/~jwb/j_jmdict.html)<br />

• Corpora:<br />

Universal Declaration of Human Rights (lgs. > 300); parallel corpus:<br />

The Universal Declaration of Human Rights has been translated into many languages and can be easily aligned with other languages. (http://www.unhchr.ch/udhr/navigate/alpha.htm)

Multext-East (lgs. > 9); corpora and morpho-syntactic dictionaries:



Parallel corpora of Orwell’s 1984 annotated in CES with morpho-syntactic information in ten Central and Eastern European languages. (http://nl.ijs.si/ME/V2/)

• Analysis:<br />

Delphin (lgs. > 5); HPSG-grammars:<br />

HPSG grammars for NLP applications, together with various tools for running and developing HPSG resources. (http://www.delph-in.net/)

AGFL (lgs.> 4); parser and grammars:<br />

A description of Natural Languages with context-free grammars. (http://www.cs.ru.nl/agfl/)

• Generation:<br />

KPML (lgs.> 10); text generation system:<br />

Systemic-functional grammars for natural language generation.<br />

(http://purl.org/net/kpml)<br />

• Machine Translation:<br />

OpenLogos (lgs. > 4); Machine Translation software and data:<br />

An Open Source version of the Logos Machine Translation System, to which new language pairs can be added. (http://logos-os.dfki.de/).

3. Strategies and Recommendations for Developers<br />

If there is no pool of free software data that matches your own data, you should<br />

try the following: 1) Convert your data into free software so that you have a greater<br />

chance that others will copy and take care of it; and, 2) Modify your data so that it<br />

can be pooled with other data. This might imply only a minor change in the format of<br />

the data that can be done automatically by a script. Alternatively, create a community<br />

that will, in the long term, create a pool. In general, this implies that you separate the

procedural components (tagger, spelling checker, parser, etc.) from the static linguistic<br />

data; make the procedural components freely available; and, describe the format of<br />

the static linguistic data. An example might well be Kevin Scannell’s CRUBADAN, a<br />

web-crawler for the construction of word lists for ISPELL. The author succeeded in<br />

creating a community around his tool that develops spellcheckers for more than thirty<br />

Small Languages (cf. http://borel.slu.edu/crubadan). Through this split of declarative<br />

(linguistic) components on the one hand, and procedural components (programs) on<br />

the other, many pools come with adequate tools to create and maintain the data.<br />
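As a sketch of such a conversion, assuming (purely for illustration) a project-specific tab-separated lexicon with the word form in the first column, the declarative data can be extracted into the plain word-list format that the spelling pools expect:

import csv

def lexicon_to_wordlist(lexicon_path, wordlist_path):
    """Extract the declarative data (the word forms) from a richer,
    project-specific lexicon and write them as a plain word list."""
    words = set()
    with open(lexicon_path, encoding="utf-8", newline="") as src:
        for row in csv.reader(src, delimiter="\t"):
            if row:                      # columns: form, PoS, gloss, ...
                words.add(row[0].strip())
    with open(wordlist_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(sorted(words)) + "\n")

lexicon_to_wordlist("my_lexicon.tsv", "my_language.wl")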



Pooling of corpora is not as frequent as, for example, the pooling of dictionaries.<br />

The main reason for this may be that corpora are very specific, and document a<br />

cultural heritage. Pooling them with corpora of different languages, subject areas,<br />

registers, and so forth is only of limited use. Nevertheless, there are some computer-<br />

linguistic pools that integrate corpora for computational purposes, and that may,<br />

therefore, integrate your corpora and maintain them for you. A description of these<br />

mostly very complex pools is beyond the scope of this paper, but the interested reader<br />

might check the following projects:<br />

• GATE (http://gate.ac.uk);<br />

• Natural Language Toolkit (http://nltk.sourceforge.net); and,<br />

• XNLRDF (http://140.127.211.213/xnlrdf).<br />

Projects targeting language documentation may also host your corpora (e.g. the TITUS Project [http://titus.uni-frankfurt.de/]). In addition, LDC (http://www.ldc.upenn.edu) and ELRA (http://www.elra.info) host and distribute corpora

(and dictionaries) so that your institute might profit financially from sold copies of the<br />

corpus you created.<br />

Once you decide to create your own free software (including corpora, dictionaries,<br />

etc.), you have to think about the license and the format of the data. From the great<br />

number of possible licenses you might use for your project (cf. http://www.gnu.org/philosophy/license-list.html for a commented list of licenses), you should consider the

GNU General Public License, as this license, through the notion of Copyleft, doesn’t<br />

give a general advantage to someone who is copying and modifying your software.<br />

Copyleft refers to the obligation that:<br />

…anyone who redistributes the software, with or without changes,<br />

must pass along the freedom to further copy and change it. (...)<br />

Copyleft also provides an incentive for other programmers to add to<br />

free software.<br />

(http://www.gnu.org/copyleft/copyleft.htm)<br />

With Copyleft, modifications have to be made freely available under the same<br />

conditions as you originally distributed your data, and if the modifications are of<br />

general concern, you can integrate them back into your software. The quality of<br />

your resources improves, as others can find and point out mistakes or shortcomings<br />

in the resources. They will report to you as long as you remain the distributor. In<br />

addition, you may ask people to cite your publication on the resource whenever using<br />

the resource for one of their publications. Without Copyleft, important language<br />



data would already have been lost (e.g. the CEDICT dictionary, after the developer<br />

disappeared from the Internet).<br />

After putting your resources with the chosen license onto a webpage, you should<br />

try to integrate your resource into larger distributions such as DEBIAN (http://www.debian.org) so that, in the long term, these organisations will manage your data. To

do this, your resources have to conform to some formal requirements that, although<br />

seeming tedious, will certainly contribute to the quality and maintainability of<br />

your resources (cf. http://www.debian.org/doc/debian-policy, for an example of<br />

the requirements of integration in DEBIAN). From DEBIAN, your resources might be<br />

migrated, without further work on your part, into other distributions (REDHAT, SUSE, etc.) and into

other modules, perhaps embedded into nice GUIs.<br />

4. Instructions for Funding Organisations<br />

A sponsoring organisation that is not interested in sponsoring a specific researcher<br />

or research institute, but which tries to promote a Small Language in electronic<br />

applications, should insist on putting the resources to be developed under Copyleft,<br />

and make this an explicit condition in the contract. Only this will guarantee that<br />

the resources will be continually developed even after the lifetime of the project.<br />

‘Copylefting’ thus allows for a sustainable development of language resources from<br />

one-off or discontinuous research activities. Only Copylefting guarantees that the

most advanced version is available to everybody who might need it. In fact, a funding<br />

organisation that does not care about the way data can be accessed and distributed<br />

after the project’s end is, in my eyes, guilty of gross negligence. Too many

resources have been created in the past only to be lost on old computers, tapes, or<br />

simply forgotten. Adding resources to this invisible pile is of no use.<br />

In addition, the funding organisations may require the sources to be bundled with<br />

a pool of Free software resources in order to guarantee the physical preservation of<br />

the data and its widest accessibility. Copylefting alone only provides the legal grounds<br />

for the survival of the data; handing over the resources to a pool will make them<br />

available in many copies worldwide and independent from the survival of one or the<br />

other hard disk. Copylefting without providing free data access is like eating without<br />

swallowing.<br />

5. Free Software for SLPs: Benefits and Unsolved Problems<br />

Admittedly, it would be naive to assume that releasing project results as free<br />

software would solve all problems inherent in SLPs. This step might solve the most<br />

important problems of data maintenance and storage, and embed the project into a<br />



scientific context. But can it have more positive effects than this? Which problems<br />

remain? Let us return to our original list of critical points in SLPs to see how they are<br />

affected by such a step.<br />

• Open Source pools create a platform for research and data maintenance that allows SLPs to keep their niche without having to handle situations of

competition;<br />

• Data is made freely available for future modifications and improvements.<br />

If the data is useful it will be handed over from generation to generation;<br />

• The physical storage of the data is possible in many of the listed pools,<br />

and does not depend on the survival of the researcher’s hard disk;

• The pools frequently provide specific, sophisticated tools for the<br />

production of resources. These tools are a cornerstone of a successful project;<br />

• In addition, through working with these tools, researchers acquire<br />

knowledge and skills that are relevant for the entire area of NLP;<br />

• Working with these tools will lead to ideas for improvement. Suggesting<br />

such improvements will not only help the SLP to leave the niche, but will finally<br />

lead to better tools. For young researchers, this allows them to work on their Small<br />

Language, and, at the same time, to be connected with a wider community for<br />

which their research might be relevant; and,<br />

• Through the generality of the tools (i.e., their usage for many languages), the content

of SLPs might become more appropriate for university curricula in computational<br />

linguistics, terminology, corpus linguistics, and so forth. Some problems, however,<br />

remain, for which other solutions have to be found.<br />

These are:<br />

• Discontinuous research if research depends on project acquisition;<br />

• Dependence on research paradigm. Corpus-based approaches can be<br />

used only when corpora are available, rule-based approaches when formally<br />

trained linguists participate in the project. To overcome most of these limitations,<br />

research centres and funding bodies should continuously work on the improvement<br />

of the necessary infrastructure for language technology (cf. Sarasola 2000); and,<br />

• Attracting and retaining researchers. As the success of a project depends to a large extent on the researchers’ engagement and skills, attracting and retaining

researchers is a sensitive topic, for which soccer clubs provide an illustrative<br />

model. Can a SLP attract top players, or is an SLP just a playground for a talented<br />

young researcher who will sooner or later transfer to a BLP? Or can the SLP count<br />



on local players only? A policy for building a home for researchers is thus another<br />

sensitive issue for which research centres and funding bodies should try to find a<br />

solution.<br />

6. Conclusions<br />

Although the ideas outlined in this paper are very much based on ‘sofa-research’ and<br />

intuition, very schematic and simplistic thinking, informal personal communications,

and my personal experience, I hope to have provided clear and convincing evidence that<br />

Small Language Projects profited, profit and will profit from joining the Open Source<br />

community. For those who want to follow this direction, the first and fundamental<br />

step is to study possible licenses (http://www.gnu.org/philosophy/license-list.html)<br />

and to understand their implications for the problems of SLPs, such as the storage<br />

and survival of data, their improvement through a large community, and so forth. This<br />

article lists some problems against which the licenses can be checked.<br />

Emotional reactions like “I do not want others to fumble in my data,” or “I do not<br />

want others to make money with my work” should be openly pronounced and discussed.<br />

What are the advantages of others having my data? What are the disadvantages? How<br />

can people make money with Open Source data? As said before, misconceptions, and<br />

thus unsubstantiated fears, rather than well-founded arguments, often lead to a rejection of the Open Source idea. This is how humans function, but not how we advance Small

Languages.<br />

7. Acknowledgments<br />

This paper would not have been written if I had not met with people like Mathias,<br />

Isabella, Christian, Judith, Daniel and Kevin. As a consequence of these encounters,<br />

the paper is much more a systematic summary than my original thinking. For mistakes,<br />

gaps and flawed arguments, however, the author alone is responsible.<br />




References<br />

Bungeroth, J. & Ney, H. (2004). “Statistical Sign Language Translation.” Streiter,<br />

O. & Vettori, C. (eds) (2004). Proceedings of the Workshop on Representation and<br />

Processing of Sign Languages, 4th International Conference on Language Resources<br />

and Evaluation, Lisbon, Portugal, May 2004, 105-108.<br />

Kuhn, T.S. (1996/1962). The Structure of Scientific Revolutions. Chicago: University<br />

of Chicago Press, 3rd edition.<br />

Morrissey, S. & Way, A. (2005). “An Example-Based Approach to Translating Sign<br />

Languages.” Way, A. & Carl, M. (eds) (2005). Proceedings of the “Workshop on<br />

Example-Based Machine Translation”, MT-Summit X, Phuket, Thailand, September

2005, 109-116.<br />

Prinsloo, D. & Heid, U. (this volume). “Creating Word Class Tagged Corpora for Northern<br />

Sotho by Linguistically Informed Bootstrapping”, 97-115.<br />

Sarasola, K. (2000). “Language Engineering Resources for Minority Languages”<br />

Proceedings of the Workshop “Developing Language Resources for Minority Languages:<br />

Re-usability and Strategic Priorities.” Second International Conference on Language<br />

Resources and Evaluation, Athens, Greece, May 2000.<br />

Scannell, K. (2003). “Automatic Thesaurus Generation for Minority Languages: an<br />

Irish Example.” Proceedings of the Workshop “Traitement automatique des langues<br />

minoritaires et des petites langues”. 10ème conférence TALN, Batz-sur-Mer, France,

June 2003, 203-212.<br />

Somers, H. (1998). “New paradigms.” MT: The State of Play now that the Dust has<br />

Settled. Proceedings of the “Workshop on Machine Translation”, 10th European

Summer School in Logic, Language and Information, Saarbrücken, August 1998, 22-<br />

33.<br />

Streiter, O. & De Luca, E.W. (2003). “Example-based NLP for Minority Languages:<br />

Tasks, Resources and Tools.” Streiter, O. (ed) (2003) Proceedings of the Workshop


“Traitement automatique des langues minoritaires et des petites langues”, 10ème<br />

conférence TALN, Batz-sur-Mer, France, June 2003, 233-242.<br />



Un corpus per il sardo:<br />

problemi e perspettive<br />


Nicoletta Puddu<br />

Creating a corpus for minority languages has provided an interesting tool to both study<br />

and preserve these languages (see, for example, the DoBeS project at MPI Nijmegen).<br />

Sardinian, as an endangered language, could certainly profit from a well-designed<br />

corpus. The first digital collection of Sardinian texts was the Sardinian Text Database;

however, it cannot be considered as a corpus: it is not normalized and the user can<br />

only search for exact matches. In this paper, I discuss the main problems in designing<br />

and developing a corpus for Sardinian.<br />

Kennedy (1998) identifies three main stages in compiling a corpus: (1) corpus

design; (2) text collection and capture; and, (3) text encoding or mark-up. As for<br />

the first stage, I propose that a Sardinian corpus should be mixed, monolingual,<br />

synchronic, balanced, and annotated, and I discuss the reasons for these choices<br />

throughout the paper. Text collection seems to be a minor problem in the case of

Sardinian: both written and spoken texts are available and the number of speakers<br />

is still significant enough to collect a sufficient amount of data. The major problems<br />

arise at the third stage. Sardinian is fragmented into different varieties and does not have a standard variety (not even a standard orthography). Recently, several proposals for

standardization have been made, but without success (see the discussion in Calaresu<br />

2002; Puddu 2003). First of all, I suggest using a standard orthography that allows us<br />

to group Sardinian dialects into macro varieties. Then, it will be possible to articulate<br />

the corpus into sub-corpora that are representative of each variety. The creation of<br />

an adequate morphological tag system will be fundamental. As a matter of fact, with<br />

a homogeneous tag system, it will be possible to perform searches throughout the<br />

corpus and study linguistic phenomena both in the single macro variety and in the<br />

language as a whole.<br />

Finally, I propose a morphological tag system and a first tagged pilot corpus of Sardinian<br />

based on written samples according to EAGLES and XCES standards.<br />

1. Perché creare corpora per le lingue minoritarie<br />

La corpus linguistics o linguistica dei corpora (da qui LC) risulta di particolare<br />

interesse, soprattutto per chi adotti un approccio funzionalista, in quanto “studia la


lingua nel modo in cui essa viene effettivamente utilizzata, da parlanti concreti in<br />

reali situazioni comunicative” (Spina 2001:53). L’utilizzo dei corpora, come noto, può<br />

essere molteplice: dagli studi sul lessico (creazione di lessici e dizionari di frequenza)<br />

a quelli sulla sintassi, fino alla didattica delle lingue e alla traduzione. Per le lingue<br />

standardizzate, l’utilizzo di corpora è in grande sviluppo. Tuttavia, anche per le lingue<br />

in pericolo di estinzione, la creazione di corpora si può rivelare particolarmente utile.<br />

Oltre alle motivazioni comuni alle lingue standardizzate, creare un corpus può essere<br />

un valido metodo per conservare la testimonianza della lingua, nella malaugurata<br />

ipotesi che essa si estingua. Se un corpus viene infatti definito come “una raccolta<br />

strutturata di testi in formato elettronico, che si assumono rappresentativi di una data<br />

lingua o di un suo sottoinsieme, mirata ad analisi di tipo linguistico” (Spina 2001:65),<br />

è evidente che esso può fungere anche da “specchio” di una lingua in un determinato<br />

stato. In questo senso il corpus può porsi come strumento complementare ad atlanti<br />

linguistici e indagini mirate, fotografando un campione rappresentativo della lingua.<br />

La presenza di corpora facilita di molto l’analisi di fenomeni linguistici. La presenza<br />

di un corpus non elimina certamente i metodi tradizionali di raccolta dati, ma<br />

fornisce un valido strumento per testare la validità di una ipotesi anche per studiosi<br />

che non possano accedere direttamente ai parlanti. Inoltre, su un corpus è possibile<br />

compiere degli studi sulla frequenza, evidentemente molto difficili da realizzare con<br />

gli strumenti tradizionali.<br />

Mostrata quindi l’utilità dei corpora anche per le lingue minoritarie, è necessario<br />

porre in evidenza le particolari questioni che la linguistica dei corpora si trova ad<br />

affrontare nel caso di lingue non standardizzate. In questo studio prenderemo ad<br />

esempio il caso del sardo, per evidenziare i possibili problemi (e le eventuali soluzioni<br />

da adottare).<br />

Nel caso del sardo, come in molte altre lingue minoritarie in via di estinzione che<br />

si aprono solo ora alla linguistica dei corpora, ci troviamo davanti a due questioni<br />

fondamentali.<br />

Da un lato è necessario creare quanto prima un corpus per cercare di preservare<br />

lo stato di lingua attuale. Nel caso del sardo, sottoposto a massiccia influenza da<br />

parte dell’italiano, alcune varietà rischiano una rapida estinzione e sarebbe quanto<br />

mai auspicabile raccogliere il prima possibile, con criteri omogenei, dati della lingua<br />

parlata che possano essere inseriti in un corpus.<br />

Dall’altro lato, oltre che per la pianificazione del corpus e la raccolta dei dati, è<br />

necessario stabilire tutti gli standard di codifica e annotazione e ciò, come vedremo,<br />



crea non pochi problemi nel caso di lingue non standardizzate e frammentarie come<br />

il sardo.<br />

2. Un progetto sperimentale: il Sardinian Digital Corpus<br />

Il sardo è, come ben noto, suddiviso in diverse varietà e non standardizzato,<br />

nonché in costante regresso. Esistono diverse grammatiche e dizionari e, su internet,<br />

è disponibile il Sardinian Text Database (http://www.lingrom.fu-berlin.de/sardu/textos.html), una raccolta di testi in sardo curata dall’Università di Colonia. Si tratta

di un’interessante iniziativa, che però non risponde ai criteri di rappresentatività,<br />

campionamento e bilanciamento. I testi vengono infatti inseriti dai vari autori e non<br />

vi è uniformità nella codifica.<br />

2.1 La pianificazione<br />

Come noto, la fase di pianificazione è fondamentale in quanto, proprio in questa<br />

fase, si prendono decisioni che determinano la fisionomia del corpus e che, da un<br />

certo momento in poi, non possono più essere modificate.<br />

Nel descrivere le fasi della progettazione del SDC, seguiremo la fondamentale<br />

tassonomia di Ball (anno:pag.).<br />

Una prima distinzione è quella per mezzo. Nel progettare un corpus di una lingua<br />

in pericolo di estinzione, ovviamente sarebbe da privilegiare la presenza di campioni<br />

di lingua orale. Tuttavia, quando esista una tradizione scritta, la scelta potrebbe<br />

ricadere anche su una tipologia di corpus mista, in modo da avere una visione quanto<br />

più possibile globale della lingua. Nel caso del sardo, ad esempio, un corpus misto<br />

sarebbe possibile, in quanto esiste una certa produzione scritta sia “tradizionale”<br />

(romanzi, racconti, testi poetici, articoli di giornale), sia “elettronica” (mailing lists<br />

e siti in sardo).<br />

Il SDC dovrebbe essere un corpus monolingue, ma rappresentativo delle diverse<br />

varietà. La creazione di corpora multilingui o paralleli potrebbe essere un passo<br />

successivo, particolarmente interessante sia dal punto di vista della linguistica<br />

comparativa che della glottodidattica.<br />

Per le stesse ragioni enumerate sopra, il SDC si propone come un corpus sincronico,<br />

dato che abbiamo mostrato come sia urgente documentare lo stato di lingua attuale.<br />

Il confronto con stati di lingua passati e quindi la creazione di un corpus diacronico<br />

dovrà essere necessariamente successiva.<br />



Il SDC, data la totale assenza di altri corpora per il sardo, dovrebbe quindi essere

un corpus di riferimento, basato rigorosamente sui due criteri di campionamento e<br />

bilanciamento.<br />

Il corpus si propone, almeno in una fase iniziale, come un corpus aperto,<br />

continuamente aggiornabile con nuove acquisizioni, sempre e comunque coerenti con<br />

la pianificazione iniziale.<br />

Infine, il corpus dovrà essere annotato. A questo proposito, proponiamo qui anche<br />

una prima ipotesi di annotazione del SDC secondo gli standard internazionali.<br />

2.2 Acquisizione dei dati<br />

Per quanto riguarda la raccolta dati, non vi sono particolari differenze rispetto<br />

a lingue ufficiali e standardizzate. Bisogna pertanto adottare gli accorgimenti tipici<br />

della ricerca sul campo.<br />

Problemi ben più importanti si pongono invece per quanto riguarda la codifica<br />

dei dati. Bisogna infatti arrivare a una normalizzazione grafica che evidentemente<br />

comporta, per i testi non standardizzati, una scelta da parte di chi codifica.<br />

Il sardo rappresenta, sotto questo punto di vista, un caso emblematico. Le differenze<br />

tra le diverse varietà sono, infatti, numerose, soprattutto sul piano fonetico. In quale<br />

varietà devono essere inseriti i testi nel corpus? E sino a che punto è possibile ridurre<br />

le diverse varietà del sardo a una unica macrovarietà?<br />

La questione della standardizzazione della lingua sarda è stata oggetto, negli ultimi<br />

anni, di una robusta polemica. Nel 2001 l’Assessorato alla Pubblica Istruzione della<br />

Regione Sardegna ha pubblicato una prima proposta di standardizzazione denominata<br />

Limba sarda unificada (LSU). Tale proposta è stata elaborata da un’apposita commissione<br />

e si tratta in sostanza di una varietà che, per ammissione della stessa commissione,<br />

per quanto si ponga come obiettivo la mediazione tra le diverse varietà presenti<br />

nell’isola, è “rappresentativa di quelle varietà più vicine alle origini storico-evolutive<br />

della lingua sarda” (LSU: 5). Di fatto, i tratti scelti per la LSU sono perlopiù logudoresi<br />

e sono stati percepiti dai parlanti come tratti “locali” piuttosto che conservativi. Ciò<br />

ha portato a una netta opposizione allo standard soprattutto da parte dei parlanti<br />

campidanesi: le motivazioni di questa reazione sono analizzate dal punto di vista<br />

sociolinguistico in Calaresu (2002) e Puddu (2003, 2005).<br />

Di recente, la Regione Sardegna ha incaricato una nuova commissione di elaborare<br />

una lingua standard per usi solo burocratici-amministrativi e, nel contempo, di creare<br />



delle norme ortografiche sarde “per tutte le varietà linguistiche in uso nel territorio<br />

regionale” 1 .<br />

La soluzione migliore, a mio parere, è quindi di inserire i testi nelle diverse<br />

macrovarietà del sardo con una standardizzazione solo ortografica in base alle proposte<br />

della commissione. In sostanza, si tratterebbe di operare solo una normalizzazione<br />

grafica sui vari testi riconducendoli a macrovarietà e annotando le eventuali<br />

differenziazioni fonetiche a parte. Facciamo un esempio: la nasale intervocalica<br />

originaria del latino subisce, nelle diverse varietà del sardo campidanese, trattamenti<br />

differenti. In alcuni casi viene mantenuta, in alcuni viene raddoppiata, in altri è ridotta<br />

a nasalizzazione della vocale precedente, mentre in altri ancora è sostituita dal colpo<br />

di glottide. Pertanto, la parola per ‘luna’ può essere pronunciata come [ˈluna], [ˈlunːa], [ˈlũa] o ancora [ˈluʔa]. La mia proposta è quindi di trascrivere la forma originaria

luna, annotando poi la eventuale trascrizione fonetica in un file separato.<br />
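A puro titolo illustrativo (gli identificatori dei token e il nome del file sono ipotetici e non fanno parte della proposta), la trascrizione fonetica potrebbe essere conservata in un semplice file separato, collegato al testo normalizzato tramite gli id dei token:

# abbozzo puramente indicativo: la trascrizione fonetica resta fuori dal testo base
varianti_fonetiche = {
    "t12": "ˈlunːa",   # es. varietà con raddoppiamento della nasale
    "t47": "ˈlũa",     # es. varietà con nasalizzazione della vocale
}

with open("sdc_demo.phon.tsv", "w", encoding="utf-8") as out:
    out.write("token_id\ttrascrizione\n")
    for token_id, ipa in sorted(varianti_fonetiche.items()):
        out.write(f"{token_id}\t{ipa}\n")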

Vi è inoltre il problema di inserire nel corpus anche alcune varietà come il<br />

gallurese e il sassarese che non sono unanimemente riconosciute come dialetti del<br />

sardo. Si potrebbe però operare una distinzione tra ‘nuclear Sardinian’, costituito<br />

da campidanese e logudorese e ‘core Sardinian’, con il gallurese e sassarese. La<br />

struttura del SDC dovrebbe, in definitiva, essere rappresentata come nella figura 1 in<br />

appendice.<br />

2.3 Codifica primaria<br />

La nostra proposta è, come detto, che il SDC sia annotato. Il corpus sarà codificato<br />

usando XML e lo standard sarà quello stabilito da XCES (Corpus Encoding Standard<br />

for XML). Le linee guida CES raccomandano in particolar modo che l’annotazione sia<br />

di tipo stand-off, vale a dire che l’annotazione non sia unita al testo base, ma sia<br />

contenuta in altri files XML a esso collegati. Nel caso del SDC l’annotazione stand-off<br />

è particolarmente importante, dato che è pensato come un corpus aperto.<br />

Nella codifica dei corpora, le guidelines CES riconoscono tre principali categorie di informazioni rilevanti:
• la documentazione, che contiene informazioni globali sul testo, il suo contenuto e la sua codifica;
• i dati primari, che comprendono sia la “gross structure”, ovverosia il testo al quale vengono aggiunte informazioni sulla struttura generale (paragrafi, capitoli, titoli, note a piè di pagina, tabelle, figure ecc.), sia elementi che compaiono al livello del sottoparagrafo;

1 “La Commissione dovrà inoltre definire norme ortografiche comuni per tutte le varietà linguistiche in<br />

uso nel territorio regionale. In questo modo sarà possibile promuovere la creazione di word processor,<br />

correttori ortografici, oltre all’utilizzo e alla diffusione di strumenti elettronici per favorire l’uso<br />

corretto della lingua sarda” (http://www.regionesardegna.it/j/v/25?s=3661&v=2&c=220&t=1).<br />



• l’annotazione linguistica, che può essere di tipo morfologico, sintattico,<br />

prosodico ecc.<br />

A titolo esemplificativo, mostrerò di seguito come possa essere codificato un testo<br />

preso da un articolo di giornale. Supponiamo di dover codificare il testo seguente<br />

pubblicato su un giornale locale.<br />

Su Casteddu hat giogau ariseru in su stadiu Sant’Elia bincendi po quatru a zero<br />

contras a sa Juventus. Is festegiamentus funti sighius fintzas a mengianeddu.<br />

Cust’annu parit chi ddoi siant bonas possibilidadis de binci su scudetu.<br />

‘Il Cagliari ha giocato ieri allo stadio Sant’Elia vincendo per quattro a zero<br />

contro la Juventus. I festeggiamenti sono continuati sino all’alba. Quest’anno<br />

pare che ci siano buone possibilità di vincere lo scudetto’.<br />

Le informazioni relative alla documentazione saranno contenute in una header<br />

conforme alle guidelines CES. La header minima per questo documento sarà la

seguente.<br />

sdc_header.xml

<?xml version="1.0" encoding="UTF-8"?>
<cesHeader version="1.0">
 <fileDesc>
  <titleStmt>
   <h.title>Sardinian Digital Corpus, demo</h.title>
  </titleStmt>
  <publicationStmt>
   <distributor>University of…</distributor>
   <pubAddress>via…</pubAddress>
   <availability>Free</availability>
   <pubDate>2005</pubDate>
  </publicationStmt>
  <sourceDesc>
   <biblStruct>
    <monogr>
     <h.title>La Gazzetta di Sardegna</h.title>
     <imprint>
      <pubPlace>…</pubPlace>
      <publisher>…</publisher>
      <pubDate>…</pubDate>
     </imprint>
    </monogr>
    <analytic>
     <h.title>Su scudetu in Casteddu</h.title>
     <h.author>Porru</h.author>
    </analytic>
   </biblStruct>
  </sourceDesc>
 </fileDesc>
 <encodingDesc>
  <projectDesc>Il Sardinian Digital Corpus…</projectDesc>
  <editorialDecl>
   <conformance>CES Level 1</conformance>
   <quotation>Rendition attribute values on Q and QUOTE tags are adapted from ISOpub and ISOnum standard entity set names</quotation>
   <segmentation>Marked up to the level of sentence</segmentation>
  </editorialDecl>
 </encodingDesc>
 <profileDesc>
  <langUsage>
   <language>Campidanese Sardinian</language>
  </langUsage>
 </profileDesc>
</cesHeader>

La header sarà esterna al CesDoc. Le varie headers saranno salvate in uno headerbase<br />

esterno al documento e vi si farà riferimento attraverso un’espressione Xpointer nel<br />

CesDoc. Il CesDoc, che contiene il secondo tipo di informazioni, si presenterà come

segue:<br />

sdc_demo.xml

<?xml version="1.0" encoding="UTF-8"?>
<cesDoc version="1.0">
 <!-- la header, esterna, è richiamata tramite un'espressione Xpointer -->
 <text>
  <body>
   <p id="p1">
    <s id="s1">Su Casteddu hat giogau ariseru in su stadiu Sant&apos;Elia bincendi po quatru a zero contras a sa Juventus.</s>
    <s id="s2">Is festegiamentus funti sighius fintzas a mengianeddu.</s>
    <s id="s3">Cust&apos;annu parit chi ddoi siant bonas possibilidadis de binci su scudetu.</s>
   </p>
  </body>
 </text>
</cesDoc>

52


2.4 Annotazione<br />

Come detto, il SDC si propone come un corpus annotato. Al XCESDoc faranno<br />

quindi riferimento, tramite Xpointer, le annotazioni ai vari livelli. Il più diffuso<br />

tipo di annotazione è quello per parti del discorso. Nel caso del sardo un corpus<br />

annotato per parti del discorso potrebbe rivelarsi particolarmente utile per la ricerca<br />

linguistica, soprattutto nella creazione di una grammatica descrittiva corpus-based.<br />

L’annotazione per parti del discorso sarà logicamente la prima in ordine di tempo, ma,<br />

dato il carattere aperto del corpus, potranno in seguito essere aggiunte annotazioni<br />

su altri livelli.<br />

2.5 Il tagset<br />

Nel caso del sardo è necessario creare un apposito tagset, che può essere in parte<br />

mutuato dai tagset per l’italiano e lo spagnolo creati all’interno del progetto MULTEXT<br />

secondo le raccomandazioni EAGLES. Le diverse varietà del sardo non differiscono<br />

particolarmente dal punto di vista morfosintattico: questo significa che è possibile<br />

definire un unico tagset per tutte le varietà. Gli esempi in questo articolo sono in<br />

campidanese, ma le etichette si potranno applicare praticamente senza variazioni<br />

anche alle altre varietà del sardo.<br />

L’annotazione grammaticale del nostro corpus, compatibile con la CesAna DTD,<br />

consisterà di tre livelli:<br />

• la forma base ();<br />

• una descrizione morfosintattica secondo le linee guida EAGLES ();<br />

• un corpus tag ().<br />

In accordo con quanto proposto da EAGLES, abbiamo una descrizione a due livelli:<br />

• la prima, a grana più fine, contiene la descrizione quanto più accurata<br />

possibile del token (descrizione lessicale );<br />

• la seconda invece, “a grana più grossa”, è una versione sottodeterminata<br />

della prima descrizione (corpus tag ).<br />

La distinzione a due livelli si mostra particolarmente utile quando si voglia utilizzare<br />

un sistema di etichettatura automatica. Alcune categorie sono infatti piuttosto difficili<br />

da disambiguare automaticamente ed è pertanto opportuno avere un sistema di<br />

etichettatura a grana più grossa. Nel caso del sardo, la creazione o l’implementazione<br />

di tagger automatici può essere un passo successivo, ma mi sembra utile, sin d’ora,<br />

definire un sistema di etichettatura adatto anche per un futuro utilizzo automatico.<br />

• Il tag si compone di una stringa di caratteri strutturata nel modo<br />

seguente: in posizione 0 il simbolo che codifica la parte del discorso;<br />

53


• nelle posizioni da 1 a n i valori degli attributi relativi a persona, genere,<br />

numero, caso ecc.;<br />

• se un attributo non si può applicare è sostituito da un trattino.<br />

Il tagset qui proposto è simile, come detto, a quello proposto per l’italiano e lo<br />

spagnolo all’interno del progetto MULTEXT (Calzolai & Monachini 1996). Manteniamo<br />

innanzi tutto la classificazione in parti del discorso proposta all’interno di MULTEXT<br />

(tab.1 in appendice).<br />

Analizziamo quindi in breve le etichette create per il sardo. Ci soffermeremo solo<br />

nel caso in cui vi siano differenze notevoli tra il sardo e le altre due lingue romanze,<br />

e nei casi in cui siano state operate scelte differenti.<br />

Nome<br />

Per quanto riguarda la categoria “nome”, i tratti presi in considerazione sono<br />

esemplificati nella tabella 2 e corrispondono a quelli considerati per l’italiano in<br />

Calzolari e Monachesi. La tabella 3 mostra le possibili combinazioni e la traduzione<br />

in ctags.<br />

Verbo<br />

Nella scelta dei valori del verbo, vi sono alcune modifiche sia rispetto all’italiano<br />

che allo spagnolo (tab. 4). Per quanto riguarda il modo, non è inserito tra i codici il<br />

condizionale. In sardo, infatti, esso è una forma perifrastica formata tramite un verbo<br />

ausiliare (camp. hai ‘avere’, log. depi ‘dovere’) più l’infinito del verbo. Pertanto,<br />

in conformità con quanto fatto nelle altre lingue per le forme perifrastiche, le due<br />

forme saranno etichettate autonomamente.<br />

Il medesimo discorso vale per il futuro, formato nelle diverse varietà dal verbo<br />

‘avere’ coniugato, la preposizione a e l’infinito del verbo.<br />

Il sardo non possiede inoltre forme di passato remoto, ma il passato non durativo<br />

è espresso dalla forma perifrastica formata da ’avere’ più il participio passato del<br />

verbo.<br />

La situazione è piuttosto complessa per quanto riguarda i clitici. Mentre nel caso<br />

dell’italiano è previsto un unico codice E per tutti i tipi di clitico, nel caso dello<br />

spagnolo ogni tipo di clitico e le possibili combinazioni vengono specificati con diversi<br />

codici anche nel ctag. Lo spagnolo non presenta però i cosiddetti “clitici avverbiali”<br />

(italiano ne e ci), che in sardo si possono aggiungere al verbo e combinarsi con altri<br />

54


clitici. Abbiamo quindi mantenuto la convenzione dello spagnolo, aggiungendo però i<br />

codici per i clitici avverbiali.<br />

La tabella 5 mostra tutte le possibili combinazioni. Si noti che in sardo non esiste<br />

il participio presente e che è possibile aggiungere forme clitiche del pronome solo al<br />

gerundio e all’imperativo.<br />

Aggettivo<br />

L’aggettivo in sardo (tabb. 6 e 7) non presenta particolari differenze rispetto<br />

all’italiano e allo spagnolo. Il comparativo è normalmente formato con prus ‘più’<br />

seguito dall’aggettivo.<br />

Pronome<br />

Per quanto riguarda i pronomi, in analogia con quanto fatto per lo spagnolo, è<br />

stato preso in considerazione anche l’attributo caso (tabb. 8 e 9).<br />

Determinante<br />

Cfr. tabb.10 e 11 in appendice.<br />

Articolo<br />

Cfr. tabb. 12 e 13 in appendice.<br />

Avverbio<br />

Cfr. tabb. 14 e 15 in appendice.<br />

Determinante<br />

Cfr. tabb.10 e 11 in appendice.<br />

Articolo<br />

Cfr. tabb. 12 e 13 in appendice.<br />

Avverbio<br />

Cfr. tabb. 14 e 15 in appendice.<br />

Punteggiatura<br />

55


Cfr. tabella 24 in appendice.<br />

2.4 L’esempio annotato<br />

A questo punto siamo in grado di annotare l’esempio secondo le convenzioni XCES.<br />

Per questioni di brevità forniamo l’annotazione solo di una parte del nostro testo.<br />

sdc_annotation.xml<br />

<br />

<br />

<br />

<br />

<br />

<br />

<br />

<br />

su<br />

<br />

su<br />

Tdms-<br />

RMS<br />

<br />

<br />

<br />

Casteddu<br />

<br />

Casteddu<br />

Np..-<br />

56


NP<br />

<br />

<br />

<br />

hat<br />

<br />

hai<br />

Vaip3s-<br />

VAS3IP<br />

<br />

<br />

<br />

giogau<br />

<br />

giogai<br />

Vmp--sm-<br />

VMPSM<br />

<br />

<br />

<br />

ariseru<br />

<br />

<br />

<br />

ariseru<br />

R-p-<br />

B<br />

<br />

in<br />

<br />

in<br />

Sp<br />

57


SP<br />

<br />

<br />

<br />

su<br />

<br />

su<br />

Sp-<br />

SP<br />

<br />

<br />

<br />

stadiu<br />

<br />

stadiu<br />

Ncms-<br />

NMS<br />

<br />

<br />

<br />

Sant&apos<br />

<br />

Santu<br />

A-pms-<br />

AMS<br />

<br />

<br />

<br />

Elia<br />

<br />

Elia<br />

Np..-<br />

NP<br />

58


.....<br />

<br />

<br />

<br />

<br />

<br />

<br />

<br />

is<br />

<br />

is<br />

Tdmp-<br />

RMP<br />

<br />

<br />

<br />

festegiamentus<br />

<br />

festegiamentu<br />

Ncmp-<br />

NMP<br />

<br />

<br />

<br />

<br />

<br />

59


3. Conclusioni<br />

In questo articolo abbiamo mostrato, attraverso un case study, le problematiche<br />

relative all’applicazione della linguistica dei corpora a lingue minoritarie e non<br />

standardizzate. Mi pare quindi che si possano trarre alcune considerazioni più generali<br />

applicabili a gran parte delle lingue minoritarie.<br />

Innanzi tutto, il lavoro da fare, in casi come quello del sardo, è a più livelli dalla<br />

progettazione del corpus, alla raccolta, all’etichettatura dei dati. L’imponenza<br />

dell’opera si accompagna alla necessità di accelerare i tempi nel caso di varietà<br />

particolarmente a rischio di estinzione.<br />

In secondo luogo, in caso di varietà non standardizzate, si pone, come abbiamo<br />

visto, il problema della codifica dei dati. La scelta qui operata, vale a dire di un corpus<br />

articolato in sotto-macro-varietà, mi sembra possa essere un buon compromesso tra<br />

la necessità di standardizzazione da un lato e il mantenimento delle differenziazioni<br />

dall’altro. Ciò che è assolutamente necessario è invece il raggiungimento di una<br />

standardizzazione dal punto di vista ortografico.<br />

Infine, l’annotazione per parti del discorso può essere particolarmente utile per<br />

la creazione di grammatiche basate sull’uso. Nel caso del sardo, ad esempio, la<br />

presenza di una lingua dominante come l’italiano, con una ricca tradizione letteraria<br />

e grammaticale, può influenzare i giudizi di grammaticalità dei parlanti, inficiando in<br />

alcuni casi i dati raccolti.<br />

Mi pare quindi che, in base a tutte queste riflessioni, la progettazione e lo sviluppo<br />

di corpora per le lingue minoritarie debbano assumere un ruolo prioritario in progetti<br />

di salvaguardia e politica linguistica.<br />

Ringraziamenti<br />

Ringrazio Andrea Sansò per aver letto con attenzione il manoscritto.<br />

60


61<br />

Bibliografia<br />

Ball, C. (1994). Concordances and corpora for classroom and research. <strong>Online</strong> at<br />

http://www.georgetown.edu/cball/corpora/tutorial.html.<br />

Bel, N. & Aguilar A. (1994). Proposal for Morphosyntactic encoding: Application to<br />

Spanish, Barcelona.<br />

Blasco Ferrer, E. (1986). La lingua sarda contemporanea. Cagliari: Edizioni della<br />

Torre.<br />

Calaresu, E. (2002). “Alcune riflessioni sulla LSU (Limba Sarda Unificada).” Orioles,<br />

V. (a cura di), La legislazione nazionale sulle minoranze linguistiche. Problemi,<br />

applicazioni, prospettive. Udine: Forum, 247-266.<br />

Calzolari, N. & Monachini, M. (1996). Multext. Common Specification and notations<br />

for Lexicon Encoding, Pisa: Istituto di Linguistica Computazionale.<br />

EAGLES (1996) Recommendations for the morphosyntactic annotation of corpora.<br />

EAG-TCWG-MAC/R, Pisa: Istituto di Linguistica Computazionale.<br />

Ide, N. (1998) “Corpus Encoding Standard. SGML Guidelines for Encoding Linguistic<br />

Corpora.” Proceedings of the First International Language Resources and Evaluation<br />

Conference, Paris: European Language Resources Association, 463-70.<br />

Ide, N., Bonhomme, P. & Romary, L. (2000). “XCES: An XML-based Standard for<br />

Linguistic Corpora.” Proceedings of the Second Language Resources and Evaluation<br />

Conference (LREC), Athens, 825-30.


Ide, N. (2004). “Preparation and Analysis of Linguistic Corpora.” Schreibman, S.,<br />

Siemens, R. & Unsworth, J. (a cura di), A Companion to Digital Humanities. London:<br />

Blackwell.<br />

Kennedy, G. (1998). An introduction to Corpus Linguistics. London: Longman.<br />

Leech, G. & Wilson, A. (1996). EAGLES recommendations for the morphosyntactic<br />

annotation of corpora. Pisa: Istituto di Linguistica Computazionale.<br />

Regione Autonoma della Sardegna (2001). “Limba sarda unificada. Sintesi delle norme<br />

di base: ortografia, fonetica, morfologia e lessico”, Cagliari.<br />

McEnery, T. & Wilson, A. (1996). Corpus Linguistics. Edinburgh: Edinburgh University<br />

Press.<br />

Mensching, G. & Grimaldi, L. (2000). Sardinian <strong>Text</strong> Database, http://www.lingrom.<br />

fu-berlin.de/sardu/textos.html.<br />

Puddu, N. (2003). “In search of the “real Sardinian”: truth and representation.”<br />

Brincat, J., Boeder, W., Stolz, T. (a cura di), Purism in minor languages, endangered<br />

languages, regional languages, mixed languages. Bochum: Universitätsverlag Dr. N.<br />

Brockmeyer, 27-42.<br />

Puddu, N. (2005). “La nozione di purismo nel progetto di standardizzazione della<br />

lingua sarda.“ Carli, A., Calaresu, E. & Guardiano, C. (a cura di), Lingue, istituzioni,<br />

territori. Riflessioni teoriche, proposte metodologiche ed esperienze di politica<br />

linguistica. Bulzoni: Roma, 257-278.<br />

Spina, S. (2001). Fare i conti con le parole. Perugia: Guerra Edizioni.<br />

Una commissione tecnico-scientifica per un’indagine socio-linguistica sullo stato<br />

della lingua sarda. <strong>Online</strong> at www.regione.sardegna.it.<br />

62


Figura 1: La struttura del Sardinian Digital Corpus<br />

63<br />

Appendice


Tabella 1: Codici per le parti del discorso<br />

Parte del discorso Codice<br />

Nome N<br />

Verbo V<br />

Aggettivo A<br />

Pronome P<br />

Determinante D<br />

Articolo T<br />

Avverbio R<br />

Apposizione S<br />

Congiunzioni C<br />

Numerali M<br />

Interiezione I<br />

Unico U<br />

Residuale X<br />

Abbreviazione Y<br />

Tabella 2: Coppie attributo-valore per la categoria “Nome” in sardo<br />

Attributo Valore Esempio Codice<br />

Tipo comune libru c<br />

proprio Giuanni p<br />

Genere maschile omini m<br />

femminile femina f<br />

comune meri c<br />

Numero singolare omini s<br />

plurale feminas p<br />

Caso /// /// ///<br />

Tabella 3: e per la categoria “Nome” in sardo<br />

msd ctag esempio<br />

Ncms- NMS liburu<br />

Ncmp- NMP liburus<br />

Ncmn- NN lunis (su/is)<br />

Ncfs- NFS domu<br />

Ncfp- NFP domus<br />

Nccs- NNS meri (su/sa)<br />

Nccp- NNP meris (is f.m.),<br />

Np- NP Mariu, Maria, Puddu<br />

64


Tabella 4: Coppie attributo-valore per la categoria “Verbo” in sardo<br />

Attributo Valore Esempio Codice<br />

Status lessicale papai m<br />

ausiliare hai/essi a<br />

modale podi o<br />

Modo indicativo papat I<br />

congiuntivo papit s<br />

imperative papa m<br />

infinito papai n<br />

participio papau p<br />

gerundio papendi g<br />

Tempo presente papu p<br />

imperfetto papasta i<br />

persona prima seu 1<br />

seconda ses 2<br />

terza est 3<br />

numero singolare papat s<br />

plurale papant p<br />

genere maschile papau m<br />

femninile papada f<br />

clitico accusativo donaddu a<br />

beninci r<br />

dativo donaddi d<br />

avverbiale donandi<br />

acc+dat donasiddu t<br />

dat+avv donasindi u<br />

avv+dat donandeddi v<br />

dat+avv+acc mandasinceddu z<br />

Tabella 5: e per la categoria “Verbo” in sardo<br />

msd ctag esempio<br />

Vaip1s- VAS1IP hapu/seu<br />

Vaip2s- VAS2IP has/ses<br />

Vaip3s- VAS3IP hat/est<br />

Vaip1p- VAP1ICP eus/seus<br />

Vaip2p- VAP2IP eus/seis<br />

Vaip3p- VAP3IP hant/funt<br />

Vaii1s- VAS1II hia, femu<br />

Vaii2s- VAS2II hiast, fiast<br />

65


Vaii3s- VAS3II hiat, fiat<br />

Vaii1p- VAP1II emus, femus<br />

Vaii2p- VAP2II eis, festis<br />

Vaii3p- VAP3II iant, fiant<br />

Vasp1s- VASXCP apa, sia<br />

Vasp2s- VASXCP apas, sias<br />

Vasp3s- VASXCP apas, siat<br />

Vasp1p- VAP1ICP apaus, siaus<br />

Vasp2p- VAP2CMP apais, siais<br />

Vasp3p- VAP3CP apant, siant<br />

Vasi1s- VAS3CI hemu, fessi<br />

Vasi2s- VAS3CI essist, fessis<br />

Vasi3s- VAS3CI essit, fessit<br />

Vasi1p- VAP1CI essimus, festus<br />

Vasi2s- VAP2ICR essidis, festis<br />

Vasi3p- VAP3CI essint, fessint<br />

Vanp--- VAF hai, essi<br />

Vaps-sm VAMSPR apiu, stetiu<br />

Vaps-pm VAMPPR stetius<br />

Vaps-sf VAFSPR stetia<br />

Vaps-pf VAFPPR stetias<br />

Vmip1s- VMIP1S papu<br />

Vmip2s- VMIP2S papas<br />

Vmip3s- VMIP3S papat<br />

Vmip1p- VMIP1P papaus<br />

Vmip2p- VMIP2P papais<br />

Vmip3p- VMIP3P papant<br />

Vmsp1s- VMSP1S papi<br />

Vmsp2s- VMSP2S papis<br />

Vmsp3s- VMSP3S papi<br />

Vmsp1p- VMSP1P papeus<br />

Vmsp2p- VMSP2P papeis<br />

Vmsp3p- VMSP3P papint<br />

Vmii1s- VMII1S papemu<br />

Vmii2s- VMII2S papást<br />

Vmii3s- VMII3S papát<br />

Vmii1p- VMII1P papemus<br />

Vmii2p- VMII2P papestis<br />

Vmii3p- VMII3P papánt<br />

66


Vmsi1s- VMSI1S tenessi<br />

Vmsi2s- VMSI2S tenessis<br />

Vmsi3s- VMSI3S tenessit<br />

Vmis1s- VMIS1S tenessimus<br />

Vmsi2p- VMSI2P tenestis<br />

Vmsi3p- VMSI3P tenessint<br />

Vmp--pf- VMPPF tentas<br />

Vmp--sf- VMPSF tenta<br />

Vmp--pm- VMPPM tentus<br />

Vmp--sm- VMPSM tentu<br />

Vmg----t VMGT tzerrienimiddas<br />

Vmg----t VMGT tzerriandimiddas<br />

Vmg----t VMGT tzerriendimidda<br />

Vmg----t VMGT tzerriendimiddus<br />

Vmg----t VMGT tzerriendimiddu<br />

Vmg----d VMGD tzerriendimì<br />

Vmg----t VMGT tzerriendididdas<br />

Vmg----t VMGT tzerriendididdas<br />

Vmg----t VMGT tzerriendididda<br />

Vmg----t VMGT tzerriendididdus<br />

Vmg----t VMGT tzerriendididdu<br />

Vmg----d VMGD tzerriendidì<br />

Vmg----t VMGT tzerriendisiddas<br />

Vmg----t VMGT tzerriendisidda<br />

Vmg----t VMGT tzerriendisiddus<br />

Vmg----t VMGT tzerriendisiddu<br />

Vmg----d VMGD tzerriendisì<br />

Vmg----t VMGT tzerreindisiddas<br />

Vmg----t VMGT tzerriendisidda<br />

Vmg----t VMGT tzerriendisiddus<br />

Vmg----t VMGT tzerriendisiddu<br />

Vmg----d VMGD tzerriendisì<br />

Vmg----t VMGT tzerriendisiddas<br />

Vmg----t VMGT tzerriendisidda<br />

Vmg----t VMGT tzerriendisiddus<br />

Vmg----t VMGT tzerriendisiddu<br />

Vmg----d VMGD tzerriendisì<br />

Vmg----a VMGA tzerriendimì<br />

Vmg----a VMGA tzerrienditì<br />

Vmg----a VMGA tzerriendiddas<br />

67


Vmg----a VMGA tzerriendidda<br />

Vmg----a VMGA tzerriendiddus<br />

Vmg----a VMGA tzerriendiddu<br />

Vmg----a VMGA tzerriendisì<br />

Vmg----a VMGA tzerriendisì<br />

Vmg----u VMGU mandendisindi<br />

Vmg----z VMGZ mandendisinceddu<br />

Vmg----- VMG tzerriendi<br />

Vmmp2sa VMM2SA mandaddu<br />

Vmmp2sd VMM2SD mandadì<br />

Vmmp2st VMM2ST mandadiddu<br />

Vmmp2su VMM2SU mandadindi<br />

Vmmp2sv VMM2SV mandandeddi<br />

Vmmp2sz VMM2SZ mandasinceddu<br />

Vmmp2pa VMM2PA mandaiddu<br />

Vmmp2pd VMM2PD mandaisì<br />

Vmmp2pt VMM2PT mandaisiddu<br />

Vmmp2pu VMM2PU mandaisndi<br />

Vmmp2pz VMM2PZ mandaisinceddu<br />

Vmmp2s- VMM2S manda<br />

Vmmp2p- VMM2P mandai<br />

Vmmp2pv VMM2PV mandaindeddi<br />

Tabella 6: Coppie attributo-valore per la categoria “Aggettivo” in sardo<br />

Attributo Valore Esempio Codice<br />

Tipo // // //<br />

Grado positivo bonu p<br />

comparativo mellus c<br />

superlativo mellus s<br />

Genere maschile bonu m<br />

femminile bona f<br />

l-spec druci c<br />

Numero singolare bonu s<br />

plurale bonus p<br />

Caso // // //<br />

68


Tabella 7: e per la categoria “Aggettivo” in sardo<br />

msd ctag esempio<br />

A-pms- AMS bonu<br />

A-pmp- AMP bonus<br />

A-pfs- AFS bella<br />

A-pfp- AFP bellas<br />

A-pcs- ANS druci<br />

A-pcp- ANP drucis<br />

A-sms- AMS bellissimu<br />

A-smp- AMP bellissimus<br />

A-sfs- AFS bellissima<br />

A-sfp- AFP bellissimas<br />

A-sfs- AFS bellissima<br />

A-sfp- AFP bellissimas<br />

Tabella 8: Coppie attributo-valore per la categoria “Pronome” in sardo<br />

Attributo Valore Esempio Codice<br />

Tipo personale deu p<br />

dimostrativo cuddu d<br />

indefinito calincunu i<br />

possessivo miu m<br />

interrogativo chini t<br />

relativo chi r<br />

esclamativo cantu ! e<br />

riflessivo si x<br />

Persona prima deu 1<br />

seconda tui 2<br />

terza issu 3<br />

Genere maschile issu m<br />

femminile issa f<br />

L-spec comune deu c<br />

Numero singolare custu s<br />

plurale custus p<br />

L-spec invariante chini n<br />

Caso nominativo deu n<br />

dativo ddi d<br />

accusativo ddu a<br />

obliquo mei o<br />

69


Tabella 9: e per la categoria “Pronome” in sardo<br />

msd ctag esempio<br />

Pd-ms-- PDMS cussu<br />

Pd-mp-- PDMP cuddus<br />

Pd-fs-- <strong>PDF</strong>S cudda<br />

Pd-fp-- <strong>PDF</strong>P cuddas<br />

Pi-ms-- PIMS dognunu<br />

Pi-mp-- PIMP calincunus<br />

Pi-fs-- PIFS dognuna<br />

Pi-fp-- PIFP calincunas<br />

Pi-cs-- PINS chinechisiat<br />

Ps1ms-- PPMS miu, nostru<br />

Ps1mp-- PPMP mius<br />

Ps1fs-- PPFS mia<br />

Ps1fp-- PPFP mias<br />

Ps2ms-- PPMS tuu<br />

Ps2mp-- PPMP tuus<br />

Ps2fs-- PPFS tua<br />

Ps2fp-- PPFP tuas<br />

Ps3ms-- PPMS suu<br />

Ps3mp-- PPMP suus<br />

Ps3fs-- PPFS sua<br />

Ps3fp-- PPFP suas<br />

Ps3cp-- PPNP insoru<br />

Pt-cs-- PWNS chini?<br />

Pt-cn-- PWNN ita?<br />

Pt-cs-- PWMS cantu?<br />

Pt-cp-- PWMP cantus?<br />

Pr-cs-- PWMS cantu<br />

Pr-cp-- PWNP cantus<br />

Pr-cs-- PWNS chini<br />

Pr-cp-- PWNP calis<br />

Pe-cs-- PWNS cantu!<br />

Pe-cp-- PWNP cactus!<br />

Pe-cn-- PWNN ita!<br />

Pp1csn- PP1SN deu<br />

Pp2cs-n PP2SN tui<br />

Pp3ms[no] PP3MS issu<br />

Pp3fs[no] PP3FS issa<br />

Pp1cp[no] PP1PN nosus<br />

Pp2cp[no] PP2PN bosatrus<br />

Pp3mp[no] PP3MP issus<br />

Pp3fp[no] PP3FP issas<br />

Pp1cso- PP1SO mei<br />

Pp2-so- PP2SO ti<br />

P[px]1cs[ad]- P1S mi<br />

P[px]2cs[ad]- P2S ti<br />

70


P[px]3cs[ad]- P3 si<br />

Pp3.pd- PP3PD ddis<br />

Pp3.sd- PP3SD ddi<br />

Pp3fpa- PP3FPA ddas<br />

Pp3fsa- PP3FSA dda<br />

Pp3mpa- PP3MPA ddus<br />

Pp3msa- PP3MSA ddu<br />

P[px]1cp[ad]- P1P si<br />

P[px]2cp[ad]- P2P si<br />

P..fp--- PFP mias, custas,<br />

P..fs--- PFS mia, custa, canta etc.<br />

P..mp--- PMP mius, custus, cantas etc.<br />

P..ms--- PMS miu, custu, cantu etc.<br />

Tabella 10: Coppie attributo-valore per la categoria “Determinante” in sardo<br />

Attributo Valore Esempio Codice<br />

Tipo dimostrativo cuddu d<br />

indefinito dogna i<br />

possessivo miu m<br />

interrogativo chini t<br />

relativo chi r<br />

esclamativo cantu ! e<br />

Persona prima mia 1<br />

seconda tua 2<br />

terza sua 3<br />

Genere maschile custu m<br />

femminile custa f<br />

L-spec comune dogna c<br />

Numero singolare custu s<br />

plurale custus p<br />

L-spec invariante chini n<br />

nominativo deu n<br />

Possessore singolare miu s<br />

plurale nostru p<br />

71


Tabella 11: e per la categoria “Determinante” in sardo<br />

msd ctag esempio<br />

Dd-ms-- DDMS cuddu<br />

Dd-mp-- DDMP cuddus<br />

Dd-fs-- DDFS cudda<br />

Dd-fp-- DDFP cuddas<br />

Di-ms-- DIMS nisciunu<br />

Di-mp-- DIMP unus cantu<br />

Di-fs-- DIFS nisciunas<br />

Di-fp-- DIFP unas cantu<br />

Di-cs-- DINS chinechisiat<br />

Di-cc-- DINC dogni<br />

Ds1ms-- DPMS miu, nostru<br />

Ds1mp-- DPMP mius<br />

Ds1fs-- DPFS mia<br />

Ds1fp-- DPFP mias<br />

Ds2ms-- DPMS tuu, vostru<br />

Ds2mp-- DPMP tuus<br />

Ds2fs-- DPFS tua<br />

Ds2fp-- DPFP tuas<br />

Ds3ms-- DPMS suu<br />

Ds3mp-- DPMP suus<br />

Ds3fs-- DPFS sua<br />

Ds3fp-- DPFP suas<br />

Ds3cp-- DPNP insoru<br />

Dr-cs-- DWNS cantu<br />

Dr-cp-- DWNP cantus<br />

Dt-cn-- DWNN cali<br />

Dt-cs-- DWNS cantu<br />

Dt-cp-- DWNP cantus<br />

De-cs-- DWMS cantu<br />

De-cp-- DWMP cantus<br />

D..fp--- DFP mias, custas, cantas ecc.<br />

D..fs--- DFS mia, custa, canta, ecc.<br />

D..mp--- DMP mius, custus, cantus, ecc.<br />

D..ms--- DMS miu, custu, cantu, ecc.<br />

72


Tabella 12: Coppie attributo-valore per la categoria “Articolo” in sardo<br />

Attributo Valore Esempio Codice<br />

Tipo definito su d<br />

indefinito unu i<br />

Genere maschile su m<br />

femminile sa f<br />

Numero singolare su s<br />

plurale is p<br />

Caso // // //<br />

Tabella 13: e per la categoria “Articolo” in sardo<br />

msd ctag esempio<br />

Tdms- RMS su<br />

Td[fm]p- RXP is<br />

Tdfs- RFS sa<br />

Tims- RIMS unu<br />

Tifs- RIFS una<br />

Tabella 14: Coppie attributo-valore per la categoria “Avverbio” in sardo<br />

Attributo Valore Esempio Codice<br />

Tipo _ _ _<br />

Grado positivo<br />

superlativo<br />

73<br />

chitzi<br />

malissimu<br />

Tabella 15: e per la categoria “Avverbio” in sardo<br />

msd ctag esempio<br />

R-p B mali<br />

R-s BS malissimu<br />

Tabella 16: Coppie attributo-valore per la categoria “Preposizione” in sardo<br />

Attributo Valore Esempio Codice<br />

Tipo preposizione in, de p<br />

p<br />

s


Tabella 17: e per la categoria “Preposizione” in sardo<br />

msd Ctag Esempio<br />

Sp SP in<br />

Tabella 18: Coppie attributo-valore per la categoria “Congiunzione” in sardo<br />

msd ctag esempio<br />

Cc CC ma<br />

Cs CS poita<br />

Tabella 19: e per la categoria “Congiunzione” in sardo<br />

msd ctag esempio<br />

Cc CC ma<br />

Cs CS poita<br />

Tabella 20: Coppie attributo-valore per la categoria “Numerale” in sardo<br />

Attributo Valore Esempio Codice<br />

Tipo cardinale centu c<br />

ordinale primu o<br />

Genere maschile primu m<br />

femminile prima f<br />

Numero singolare primu s<br />

plurale primus p<br />

Caso // // //<br />

Tabella 21: e per la categoria “Numerale” in sardo<br />

msd ctag esempio<br />

M.ms- NMS primu<br />

M.fs- NFS prima<br />

M.mp- NMP primus<br />

M.fp- NFP primas<br />

Mc--- N unu, centu<br />

Tabella 22: e per la categoria “Interiezione” in sardo<br />

msd ctag esempio<br />

I I ayo!<br />

74


Tabella 23: e per la categoria “Residuale” in sardo<br />

ctag esempio<br />

X simboli ecc.<br />

Tabella 24: e per la categoria “Punteggiatura” in sardo<br />

ctag esempio<br />

punct .,:!?…<br />

75


The Relevance of Lesser-Used Languages for<br />

Theoretical Linguitics: The Case of Cimbrian<br />

and the Support of the TITUS Corpus<br />

Ermenegildo Bidese, Cecilia Poletto<br />

and Alessandra Tomaselli<br />

On the basis of the TITUS Project, the following contribution aims at showing the<br />

importance of a lesser-used language, such as Cimbrian, for the theory of grammar. In<br />

Chapter 1, we present the goals of TITUS and its possibilities in order to analyse old<br />

Cimbrian writings. Furthermore, according to these possibilities, the second chapter<br />

will summarise some recent results of the linguistic research about relevant aspects<br />

of Cimbrian grammar, in particular the syntax of verbal elements, of subject clitics,<br />

and of subject nominal phrases. Chapter 3 and 4 discuss which relevance these results<br />

can have in the Generative framework, in particular with respect to a generalisation<br />

concerning the syntactic change in context of isolation and language contact.*<br />

1. The TITUS Project (http://titus.uni-frankfurt.de)<br />

The TITUS Project was conceived in 1987 during the Eighth Conference of Indo-<br />

European Studies in Leiden, when some of the participants had the idea to link their<br />

work together in order to create a text database for the electronic storage of writings/<br />

sources relevant to their discipline. 1 The name of the project was “Thesaurus of<br />

Indo-European <strong>Text</strong>ual Materials on Data Media” (Thesaurus indogermanischer<br />

<strong>Text</strong>materialien auf Datenträgern). In the first phase, the project aimed at preparing<br />

a collection of textual materials from old Indo-European languages, such as Sanskrit,<br />

Old Iranian, Old Greek, Latin, as well as Hittite, Old High German and Old English.<br />

In the beginning of the ’90s, the rapid increase of electronic storage capacities<br />

in data processing led to a second phase of the project in 1994. During the Third<br />

Working Conference for the Employment of Data Processing in the Historical and<br />

Comparative Linguistics, in Dresden, the newly-founded working group ‘Historisch-<br />

Vergleichende Sprachwissenschaft’ (Historic-Comparative Linguistics) of the Society<br />

for Computational Linguistics and Language Technology (Gesellschaft für Linguistische<br />

* The present contribution was written by the three authors in complete collaboration. For the formal<br />

definition of scholar responsibility, we declare that Ermenegildo Bidese draws up sections 1, 1.1 and<br />

1.2, 2, 2.1, Cecilia Poletto sections 2.2 and 2.3, Alessandra Tomaselli sections 3 and 4. We would like<br />

to thank the staff of <strong>EURAC</strong> for the opportunity to present our research.<br />

1 Cf. Gippert (1995)<br />

77


Datenverarbeitung) decided on an extension of the objectives for the ‘Thesaurus’,<br />

including further text corpora from other Indo-European and neighbouring languages,<br />

and introduced the new name ‘Thesaurus of Indo-European <strong>Text</strong>ual and Linguistic<br />

Materials’, shortened to the acronym from the German designation: TITUS (Thesaurus<br />

indogermanischer <strong>Text</strong>- und Sprachmaterialien). The addition, ‘linguistic materials’,<br />

emphasizes that TITUS understands itself no longer only as a text database, but also as<br />

a ‘data pool’. 2 On the TITUS server, you can find materials and aids for the analysis of<br />

the texts as well as, such as, among other things, a currently up-to-date bibliography<br />

with the newest publications in Indo-European studies, teaching materials, lexica,<br />

glossaries, language maps, audiovisual materials, software and fonts and heaps of<br />

helpful links. In fact, since 1995, owing to the above-mentioned conference, TITUS has<br />

been present on the World Wide Web with its own site at http://titus.uni-frankfurt.<br />

de. 3 Responsible for the project is the Institut für Vergleichende Sprachwissenschaft<br />

at the University Johann Wolfgang-Goethe in Frankfurt am Main/Germany (direction:<br />

Professor Jost Gippert) in connection with other European universities.<br />

The third phase in the development of the TITUS Project coincides with the explosive<br />

expansion of the Internet, and the new possibilities that online communication and<br />

Web performance offer. The new target of TITUS is the replacement of static data<br />

retrieval by an interactive one. 4 This means that in order to better comprehend and<br />

analyse the texts, further information about the writings are made available to the<br />

user, who can then become interactive with the text. Three issues are pursued:<br />

• a graphic documentation of the physical supports of the texts, usually<br />

manuscripts and inscriptions;<br />

• an automatic retrievement of word form correspondences in a single text<br />

or in an entire language corpus; and,<br />

• an automatic linguistic analysis of occurrences for the morphology of a<br />

word or for the basic forms of a verb. 5<br />

This interactive retrieval system is currently in development.<br />

1.1 The Cimbrian <strong>Text</strong>s in the TITUS Project<br />

The TITUS text database includes two Cimbrian texts provided by Jost Gippert,<br />

Oliver Baumann & Ermenegildo Bidese (1999). 6 They comprise the catechism of 1813<br />

2 Bunz (1998:12)<br />

3 Ibid.<br />

4 Cf. Gippert (2001)<br />

5 Cf. Ibid. Cf. the same for four illustrative examples.<br />

6 The direct links are: http://titus.uni-frankfurt.de/texte/etcs/germ/zimbr/kat1813d/ kat18.htm and<br />

http://titus.uni-frankfurt.de/texte/etcs/germ/zimbr/kat1842d/kat18. htm.<br />

78


(better known as the ‘short Cimbrian catechism’, written in the Cimbrian variety of<br />

the Seven Communities), and a new edition of the same text with slight alterations<br />

from 1842. 7 In fact, this catechism is a Cimbrian translation of the ‘Piccolo Catechismo<br />

ad uso del regno d’Italia’ (Short Catechism for the Italian Kingdom) of 1807. A critical<br />

edition of both the original Italian text and the two Cimbrian versions was provided<br />

by Wolfgang Meid. 8 The situation of Cimbrian knowledge at this time (with particular<br />

reference to the plateau of the Seven Communities) was very good, even though<br />

the use of the local Romance variety – in accordance with what the same text in the<br />

introduction testifies – was spreading. 9 For this reason, and in view of the possibility<br />

of comparing this text with the first Cimbrian catechism of 1602, (which represents<br />

the oldest Cimbrian writing 10 ), the ‘short catechism’ of 1813 and its later version in<br />

1842 are essential sources for studying and analysing the diachronic development of<br />

the Cimbrian language. 11<br />

On the basis of the above-mentioned critical edition by Professor Meid, we digitised<br />

the text in agreement with Meid’s linearization of the original version. Moreover, we<br />

provided a first linguistic structuring of the text marking, above all, for the prefix<br />

of the participle perfect, pronominal clitics, personal pronouns, and the existence<br />

particle -da. 12<br />

1.2 The Research of Linguistic Content of the Cimbrian <strong>Text</strong>s<br />

The first way of accessing the content of the Cimbrian texts is selecting the levels<br />

(chapters, paragraphs, verses and lines) into which the text is specifically subdivided<br />

in the entry form on the right frame of the text’s start page. In this way, you can<br />

precisely find any given passage of the Cimbrian text. 13<br />

7 Cimbrian is a German dialect commonly spoken today in the village of Lusern/Luserna in the region of<br />

Trentino in northern Italy. It is also found, albeit in widely dispersed pockets, in the Venetian communities<br />

of Mittoballe/Mezzaselva (Seven Communities) and Ljetzan/Giazza (Thirteen Communities), in the<br />

northeast of Italy. When the Cimbrian colonies were founded and where the colonists came from are<br />

still subjects of controversy, although the accepted historical explanation is that the Cimbrian colonies<br />

originated from a migration of people from Tyrol and Bavaria (Lechtal) at the beginning of the second<br />

millennium. For a general introduction about the Cimbrian question and this language, cf. Bidese<br />

(2004b).<br />

8 Cf. Meid (1985b)<br />

9 Cf. Cat.1813:17-21 in Meid (1985b:35)<br />

10 Cf. Meid (1985a). The first Cimbrian catechism is the translation of Cardinal Bellarmino’s ‘Dottrina<br />

cristiana breve’ (cristian short doctrine). In spite of the title, the text is remarkably longer than the<br />

1813’s ‘short catechism.’<br />

11 Moreover, in TITUS, there is the first part of Remigius Geiser’s (1999) self-learning Cimbrian course(cf.<br />

http://titus.fkidg1.uni-frankfurt.de/didact/zimbr/cimbrian.htm).<br />

12 Cf. for the linguistically analysed texts following links: http://titus.uni-frankfurt.de/texte/etcs/germ/<br />

zimbr/kat1813s/kat18.htm and http://titus.uni-frankfurt.de/texte/etcs/germ/ zimbr/kat1842s/<br />

kat18.htm.<br />

13 Cf. for a detailed description of all these possibilities Gippert (2002).<br />

79


A second possibility for content searching is obtained by using TITUS word search<br />

engine. By double-clicking on a given word of the Cimbrian text, for example, you can<br />

automatically look for its occurrences in the text, for the exact text references, and<br />

for the context in which this word is used (including orthographic variants).<br />

A third way of content searching in the Cimbrian texts consists of using a search<br />

entry form that you can find when you open the link Switch to Word Index on the right<br />

frame of the start page of the text. In the box, you can enter a word and obtain its<br />

occurrences in the Cimbrian text.<br />

In conclusion, we can state that the TITUS Project, with all the above-mentioned<br />

possibilities (and including the Cimbrian texts with a first linguistic structuring), offer<br />

a good starting-point for the research of the diachronic development of Cimbrian’s<br />

syntax.<br />

2. Some Relevant Aspects of Cimbrian Syntax<br />

In the last decade, three interrelated syntactic aspects of the Cimbrian dialects<br />

have become the subject of intensive descriptive studies, from both the diachronic<br />

and the synchronic point of view: a) the syntax of verbal elements; b) the syntax of<br />

subject clitics; and, c) the syntax of subject NPs. The theoretical relevance of these<br />

studies will be discussed in section 4.<br />

2.1 Verb Syntax<br />

As for the syntax of verbal elements, the following descriptive results can be taken<br />

for granted:<br />

i) Cimbrian is no longer characterised by the V2 restriction, which requires the<br />

second position of the finite verb in the main declarative clause. As the following<br />

examples show, the finite verb can be preceded by two or more constituents that are<br />

not rigidly ordered, as shown by the fact that both (1) (a and b) and (2) are grammatical.<br />

Similar cases of V3 (as in [1a]) or V4 (as in [1b]) are not acceptable, neither in Standard<br />

German (cf. 3), or in any other continen-tal Germanic languages: 14<br />

(1a) Gheistar in Giani hat gahakat iz holtz ime balje (/in balt) 15 (Giazza)<br />

Yesterday the G. has cut the wood in the forest<br />

(1b) De muotar gheistar kam Abato hat kost iz mel 16 (Giazza)<br />

The mother yesterday in Abato has bought the flour<br />

14 Cf. Scardoni (2000), Poletto & Tomaselli (2000), Tomaselli (2004), Bidese & Tomaselli (2005). In the<br />

catechism of 1602, there are few examples of V3 constructions, but this is probably due to the fact that<br />

there is no relevant context for the topic. Cf. for this problem Bidese and Tomaselli (2005:76ff.)<br />

15 Scardoni (2000:152)<br />

16 Ivi:157<br />

80


(2) Haute die Mome hat gekoaft die öala al mercà 17 (Luserna)<br />

Today the mother has bought the eggs at-the market<br />

(3) *Gestern die Mutter hat Mehl gekauft<br />

yesterday the mother has flour bought<br />

ii) A correlate of the V2 phenomenology forces the reordering of subject and<br />

inflected verb: in the Germanic languages, 18 the subject can be found in main clauses<br />

to the right of the inflected verb (but still to the left of a past participle, if the<br />

sentence contains one) when another constituent is located in first position, yielding<br />

the ordering XP Vinfl Subject … (Vpast part.). In Cimbrian, the phenomenon of subject<br />

- (finite) verb inversion is limited to subject clitics starting from the first written<br />

documents (i.e., the Cimbrian catechisms of 1602, here shortened in Cat.1602) (cf.<br />

4), and survived the loss of the V2 word order restriction for quite a long time (cf. 5<br />

and 6). Nowadays, in Giazza, it is only optionally present, and only for some speakers<br />

(cf. 7 and 8), while it survives in Luserna (cf. 9 and 10): 19<br />

... 21<br />

...<br />

(4) [Mitt der Bizzonghe] saibar ghemostert zò bizzan den billen Gottez. 20<br />

Through knowledge are-we taught to know the will of God.<br />

(5) [Benne di andarn drai Lentar habent gahört asó], haben-se-sich manegiart<br />

When the other three villages had heard this, had-they taken pains to<br />

(6) [Am boute] [gan ljêtsen] hense getrust gien … 22<br />

Once in Ljetzan have-they got to go …<br />

(7) In sontaghe regatz / In sontaghe iz regat 23 (Giazza)<br />

On Sunday rains-it / On Sunday it rains<br />

(8) Haute er borkofart de oiar / Haute borkofartar de oiar 24 (Giazza)<br />

Today he sells the eggs/today sells-he the eggs<br />

17 Grewendord & Poletto (2005:117)<br />

18 English has this possibility too, but it is restricted to main interrogatives, while in the other Germanic<br />

languages it is found also in declaratives.<br />

19 Bosco (1996) and (1999), Benincà & Renzi (2000), Scardoni (2000), Poletto & Tomaselli (2000), Tomaselli<br />

(2004), Bidese & Tomaselli (2005) and Grewendorf & Poletto (2005). That subject clitics continue<br />

to invert when nominal subjects cannot is a well-known generalisation confirmed in other language<br />

domains, such as Romance.<br />

20 Cat.1602:694–5 in Meid (1985a:87)<br />

21 Baragiola 1906:108<br />

22 Schweizer 1939:36<br />

23 Scardoni 2000:144<br />

24 Ivi:155<br />

81


(9) *Haüte geat dar Giani vort 25 (Luserna)<br />

Today goes the Gianni away<br />

(10) Haüte geatar vort (dar Gianni) 26 (Luserna)<br />

Today goes-he away (the John)<br />

This seems to indicate that the ‘core’ of the V2 phenomenon (i.e., the word order<br />

restriction) could be lost before one of its main correlates (i.e., pronomi-nal subject<br />

inversion).<br />

• Germanic languages can be OV (German and Dutch) or VO (Scandinavian<br />

and Yiddish). In Cimbrian, the discontinuity of the verbal complex is limited to the<br />

intervention of pronominal elements, negation (cf. 12), monosyllabic adverbs/<br />

verbal prefixes, 27 and bare quantifiers 28 (cf. 13). In fact, from a ty-pological point<br />

of view, Cimbrian belongs, without any doubt, to the group of VO languages:<br />

(11a) Haüte die Mome hat gebäscht di Piattn 29 (Luserna)<br />

Today the mother has washed the dishes<br />

(11b) *Haüte di Mome hat di Piattn gebäscht 30 (Luserna)<br />

(12) Sa hom khött ke dar Gianni hat net geböllt gian pit se 31 (Luserna)<br />

They have said that the G. has not wanted go with them<br />

(13a) I hon niamat gesek 32 (Luserna)<br />

I have nobody seen<br />

(13b) han-ich khoome gaseecht (Roana)<br />

have-I nobody seen<br />

• Residual word order asymmetries between main and subordinate clauses<br />

with respect to the position of the finite verb are determined by a) the syntax<br />

of some ‘light’ elements (cf. 14 and 15 for negation and pronominal); b) by the<br />

presence of clitics (cf. 14b and 15b versus 16 and 17); and, c) by the type of<br />

subordinate clause (cf. 14b and 15b versus 18 and 19):<br />

(14a) Biar zéteren nete33 We give in not<br />

25 Grewendorf & Poletto 2005:116<br />

26 Ibid.<br />

27 Cf. Bidese 2004a and Bidese & Tomaselli 2005<br />

28 Cf. Grewendorf & Poletto (2005)<br />

29 Ivi:117<br />

30 Ivi:121<br />

31 Ivi:122<br />

32 Ivi:123<br />

33 Baragiola 1906:108<br />

82


(14b) ’az se nette ghenan vüar 34<br />

that they don’t put forward<br />

(15a) Noch in de erste Lichte von deme Tage hevan-se-sich alle 35<br />

Even at the break of that day get-they all up<br />

(15b) ’az se sich legen in Kiete 36<br />

that they calm down<br />

(16) ’az de Consiliere ghen nette auf in de Sala 37<br />

that the advisers go not above into the room<br />

(17) ’az diese Loite richten-sich 38<br />

that these people arrange themselves<br />

(18) umbrume di andar Lentar saint net contente 39<br />

because the other villages are not glad<br />

(19) umbrume dear Afar has-sich gamachet groaz 40<br />

2.2 Clitic Syntax<br />

because the question has got great<br />

The Cimbrian dialect, contrary to other Germanic languages that only admit weak<br />

object pronouns, is characterized by a very structured set of pronominal clitics, like<br />

all northern Italian dialects. 41 One important piece of evidence that subject and<br />

object pronouns are indeed clitics is the phenomenon of clitic doubling, namely,<br />

the possibility to double a full pronoun or an NP with a clitic, already noted in the<br />

grammars:<br />

(20) az sai-der getant diar 42<br />

that it will be to you made to you<br />

(21) Hoite [de muuutar] hat-se gakhoofet de ojar in merkaten (Roana)<br />

Today the mother has-she bought the eggs at-the market<br />

From a diachronic point of view, this phenomenon already appears for subject<br />

clitics in Cat.1813, but is limited to interrogative sentences, while in Baragiola (1906)<br />

it also appears in declarative sentences. The phenomenon is, nowadays, according to<br />

34 Ivi:111<br />

35 Ivi:109-110<br />

36 Ivi:114<br />

37 Ivi:110<br />

38 Ivi:108<br />

39 Ivi:105<br />

40 Ivi:113<br />

41 For an exhaustive description of the positions of clitics and pronouns in Cimbrian cf. Castagna (2005).<br />

42 Schweizer (1952:27)<br />

83


the research of Scardoni (2000), no longer productive in Giazza, optional/possible in<br />

Luserna, 43 but still frequent in Roana. 44<br />

In main clauses, subject clitics are usually found in enclisis to the finite verb (in<br />

Giazza, only as a vestige, cf. the above sentences [7] and [8]): 45<br />

(22) Bia hoas-to (de) (du)? (Luserna)<br />

How call-you?<br />

(23) Hasto gi khoaft in ğornal? 46 (Luserna)<br />

Have-you bought the newspaper?<br />

(24) Ghestar han-ich ghet an libar ame Pieren (Roana) 47<br />

Yesterday have-I given a book to P.<br />

In embedded clauses, subject clitics occur either in enclitic position to the finite<br />

verb or in enclitic position to the conjunction, depending on two main factors: i) the<br />

Cimbrian variety under consideration (and the ‘degree’ of V2 preservation); and, ii) the<br />

different types of subordinate clauses. According to what our data suggest, nowadays,<br />

enclisis to the finite verb seems to be the rule in Roana (25-8), but Schweizer’s grammar<br />

(Schweizer 1952) gives evidence for a different distribution of the subject clitics in<br />

subordinate clauses. He observes that subject clitics in the variety of Roana usually<br />

occur (or occurred) at the Wackernagel’s position (WP) in enclisis to the subordinating<br />

conjunction (cf. 29-31; cf. the above sentences [14b] and [15b] as well): 48<br />

(25) Ist gant zoornig, ambrumme han-ich ghet an libarn ame Pieren (Roana)<br />

(He) has got angry, because have-I given a book P.<br />

(26) Gianni hatt-ar-mi gaboorset, benne khimmas-to hoam (Roana)<br />

Gianni has-he-me asked, when come-you home<br />

(27) Haban-sa-mich gaboorset, ba ghe-ban haint (Roana)<br />

Have-they-me asked, where go-we today evening<br />

(28) Haban-sa-mich khött, habat-ar gabunnet Maria nach im beeck (Roana)<br />

Have-they-(to)me said, have-you met M. on the road<br />

(29) bas-er köt 49 (Roana)<br />

43 Cf. Vicentini (1993:149-51) and Castagna (2005)<br />

44 Our data suggest that there may be a difference between auxiliaries and main verbs: with the auxiliary<br />

‘have’, doubling seems mandatory, while this is not the case with main verbs.<br />

45 Some ambiguous forms can also appear in first position; we assume here that when occurring in first<br />

position, the pronominal forms are not real clitics, but, at most, weak forms.<br />

46 Vicentini (1993:44)<br />

47 In the variety of Roana, when the subject is definite and preverbal, there is always an enclitic<br />

pronoun.<br />

48 Cf. Castagna (2005) as well<br />

49 Schweizer (1952:27)<br />

84


what-he says<br />

(30) ben-ig-en nox vinne 50 (Roana)<br />

if-I-him still meet<br />

(31) ad-ix gea au 51 (Roana)<br />

if-EXPl.-I (az-da-ich) go above<br />

All the same, Schweizer (1952) underlines that there are many irregularities in<br />

accordance to which subject clitics in embedded clauses can appear in enclisis to<br />

the finite verb, or in both positions (clitic doubling). Luserna Schweizer notes that<br />

all the pronouns have to be clitized to the complementizer. 52 But we found evidence<br />

for a construction (cf. 32) in which the subject clitic appears in enclisis to the finite<br />

verb, probably due to the presence of a constituent between the complementizer and<br />

the finite verb (a case of “residual” embedded V2). In this sentence, there is clitic<br />

doubling too:<br />

(32) Dar issese darzürnt obrom gestarn honne i get an libar in Peatar 53<br />

(Luserna)<br />

He has got angry because yesterday have-I I given a book P.<br />

In main clauses, object clitics are always in enclisis to the inflected verb:<br />

(33a) Der Tatta hat-se gekoaft 54 (Luserna)<br />

The father has-her bought<br />

(33b) Der Tatta *se hat gekoaft 55 (Luserna)<br />

(34) De muutari hat-sei-se gasecht (Roana)<br />

The mother has-she-her seen<br />

(35) Gianni hatt-an-se gaseecht (Roana)<br />

Gianni has-he-her seen<br />

The same is true for embedded declarative clauses:<br />

(36a) I woas ke der Tatta hatse (net) gekoaft 56 (Luserna)<br />

I know that the father has-her (not) bought<br />

(36b) I woas ke der Tatta *se hat gekoaft 57 (Luserna)<br />

I know that the father her has bought<br />

50 Ibid.<br />

51 Ibid.<br />

52 Ibid. This analysis is confirmed in the data of Vicentini (1993)<br />

53 Grewendorf & Poletto (2005:121)<br />

54 Ivi:122<br />

55 Ibid.<br />

56 Ivi:123<br />

57 Ibid.<br />

85


(37) Gianni hatt-ar-mi gaboorset, bear hat-ar-dich telephonaart (Roana)<br />

Gianni has-he-me asked, who has-he-you called<br />

(38) kloob-ich Gianni hatt-ar-me ghet nicht ad ander (Roana)<br />

believe-I (that) Gianni has-he-(to)me given nothing else<br />

(39) biss-i net, Gianni hat-an-en ghakhoofet (Roana)<br />

know-I not, (if) Gianni has-he-him bought<br />

While in Roana, enclisis to the finite verb is the rule in all embedded clauses (including<br />

embedded interrogatives), in Luserna, in relative and embedded interrogative clauses,<br />

subject and object clitics are usually found in a position located to the immediate right<br />

of the complementiser (or the wh-item). 58 This corresponds to Wackernagel’s position<br />

of the Germanic tradition, and is usually hosting weak pronouns in the Germanic<br />

languages, which are rigidly ordered (contrary to DPs, which can scramble):<br />

(40) ’s baibe bo-da-r-en hat geet an liber 59 (Luserna)<br />

the woman who-EXPL.-he-(to) her has given a book<br />

(41) dar Mann bo dar en (er) hat geet an libar (Luserna)<br />

the man who-EXPL.-he-him (he) has given a book<br />

(42) Dar Giani hatmar gevorst zega ber (da)de hat o-gerüaft (Luserna)<br />

The G. has-me asked compl. who you has phoned<br />

(43) I boas net ber-me hat o-gerüaft (Luserna)<br />

I know not who us has phoned<br />

(44) I vorsmaar zega bar me mage hom o-gerüaf (Luserna)<br />

I wonder COMPL. who me could have phoned<br />

Summarising the data illustrated so far, we can state that:<br />

• Both subject and object clitics are always in enclisis to the finite verb in<br />

main clauses in all varieties;<br />

• Currently in Roana, both subject and object clitics always occur in enclisis<br />

to the finite verb in all embedded clauses; and,<br />

• In Luserna, clitics occur in enclisis in embedded declaratives and in WP in<br />

relative and embedded interrogatives.<br />

From this we conclude that:<br />

• Luserna displays a split between embedded wh-constructions on the one<br />

hand and embedded declaratives on the other, while Roana (at least nowadays)<br />

does not; and,<br />

58 This means that no element can intervene between the element located in CP and the pronoun(s).<br />

59 Grewendorf & Poletto (2005:121)<br />

86


• No cases of proclisis to the inflected verb are ever found in any Cimbrian<br />

variety.<br />

In general, although Cimbrian, contrary to other Germanic languages, has<br />

developed a class of clitic pronouns, it does not seem to have ‘copied’ the syntactic<br />

behaviour of subject and object clitics of neighbouring Romance dialects, which<br />

realize consistently proclisis to the inflected verb for object clitics in all sentence<br />

types, and permit enclisis of subject clitics only in main interrogative clauses, and<br />

enclisis of object clitics only with infinitival verbal forms. 60 On the contrary, enclisis<br />

to the inflected verb seems to be the rule in Cimbrian. Proclisis to the inflected verb<br />

is not at all attested, and the only other position apart from enclisis is the Germanic<br />

WP position in some embedded clause types in the variety of Luserna.<br />

2.3 The Syntax of Subject NPs<br />

As regards the syntax of the subject NPs in Cimbrian, there is evidence of the<br />

following aspects:<br />

• Cimbrian is not a pro-drop language. As with standard German, English<br />

and French, it is characterised by: a) obligatory expression of the subject (cf.<br />

45);<br />

• the use of the expletive pronoun iz (cf. 46); and, c) (contrary to standard<br />

German) a VO typology and the consequent adjacency of the verbal complex (cf.<br />

47); and, d) a relatively free position of the finite verb: 61<br />

(45) i han gaarbat (/gaarbatat) ime balt / Haute hani gaarbatat ime balje 62<br />

(Giazza)<br />

Today I have worked in the forest / Today have-I worked in the forest<br />

(46) Haute iz regat / Haute regatz63 (Giazza)<br />

Today it rains / Today rains-it<br />

(47) Gheistar in Giani hat gahakat iz holtz ime balje (/in balt) 64 (Giazza)<br />

Yesterday G. has cut the wood in the forest<br />

• Languages requiring a mandatory expression of the subject, such as English<br />

or French, see the possibility of putting the subject NPs on the right of the verbal<br />

60 Note that there are Romance dialects that have enclisis to the inflected verb, such as the variety of<br />

Borgomanero, studied by Tortora (1997), but this is a Piedmontese dialect, which can not have been in<br />

touch with Cimbrian, so we can exclude that enclisis has been developed through language contact with<br />

Romance.<br />

61 Cf. Poletto & Tomaselli (2002) and Tomaselli (2004:543). Cf. Castagna (2005) as well.<br />

62 Scardoni (2000:155)<br />

63 Ivi:144<br />

64 Ivi:152<br />

87


complex only in very limited contexts. From this perspective, it is interesting to<br />

note that Cimbrian generally permits it (cf. 48 and 49), similarly to standard Italian<br />

(cf. 50), and in opposition to the neighbouring romance dialect, in which the post<br />

verbal subject co-occurs with a subject pronoun in a preverbal position (cf. 51 and<br />

52):<br />

(48) Gheistar hat gessat dain manestar iz diarlja 65 (Giazza)<br />

Yesterday has eaten your soup the girl<br />

(49) Hat gahakat iz holtz dain vatar 66 (Giazza)<br />

Has cut he wood your father<br />

(50) Lo hanno comprato al mercato i miei genitori<br />

It have bought at the market my parents<br />

(51) Algéri l’à magnà la to minestra la buteleta 67<br />

Yesterday she has eaten your soup the girl<br />

(52) L’à taià la legna to papà 68<br />

He has cut the wood your father<br />

3. Cimbrian Data and the Generative Grammar Framework<br />

The results of the syntactic description of some aspects of Cimbrian grammar are<br />

relevant for any theoretical framework. In particular, within the Generative Grammar<br />

theoretical approach, the data discussed so far is relevant from both a synchronic and<br />

a diachronic point of view.<br />

Cimbrian, having been in a situation of language contact for centuries, offers a<br />

privileged point of view for determining how phenomena are lost and acquired. A<br />

number of interesting observations can be made concerning language change induced<br />

by language contact.<br />

First, Cimbrian shows that the ‘correlates’ of a given phenomenon (in our case<br />

V2) are lost after the loss of the phenomenon itself. More specifically, Cimbrian has<br />

maintained the possibility of inverting subject pronouns, while losing the V2 linear<br />

restriction. On the other hand, we can also state that the correlates can be acquired<br />

before the phenomenon itself: although Cimbrian has not developed a fully-fledged<br />

pro drop system, it already admits subject free inversion of the Italian type (i.e., the<br />

subject inverts with the ‘whole’ verbal phrase).<br />

65 Ivi:165<br />

66 Ibid.<br />

67 Ibid.<br />

68 Ibid.<br />

88


Second, syntactic change does not proceed in parallel to the lexicon, where a word<br />

is simply borrowed and then adapted to the phonological system of the language. 69 The<br />

syntactic distribution of clitic elements in Cimbrian shows that they have maintained<br />

a Germanic syntax, allowing either enclisis to the verb or the complementizer (WP),<br />

but never proclisis to the inflected verb, as is the case for Romance. Therefore, even<br />

though Cimbrian might have developed (or rather ‘maintained’/’preserved’) a class<br />

of clitic elements due to language contact, it has not ‘copied’ the Romance syntax of<br />

clitics.<br />

Moreover, the study of Cimbrian also confirms two descriptive generalisations<br />

concerning the loss of the V2 phenomenology established on the basis of the evolution<br />

of Romance syntax: 70<br />

• Embedded wh-constructions constitute the sentence type that longer<br />

maintains asymmetry with main clauses. This is shown in Cimbrian by the possibility<br />

of having clitics in WP only in embedded interrogatives, and relatives in the variety<br />

of Luserna; and,<br />

• Inversion of NPs is lost before inversion of subject clitics, which persists<br />

for a longer period.<br />

More generally, Cimbrian also confirms the hypothesis first put forth by Lightfoot<br />

(1979), and mathematically developed by Clark & Roberts (1993), that the reanalysis<br />

made by bilingual speakers goes through ambiguous strings that have two possible<br />

structural analyses; the speaker tends to use the more economical one (in terms of<br />

movement) that is compatible with the set of data at his/her disposal.<br />

Also, from the synchronic point of view, Cimbrian is an interesting case study, at least as far as verb movement is concerned. In V2 languages, it is most probably an Agreement feature located in C that attracts the finite verb (see Tomaselli 1990 for a detailed discussion of this hypothesis). Cimbrian seems to have lost this property, as neither the linear V2 restriction nor NP subject inversion is possible at this stage.

On the other hand, it has not (yet) developed a ‘Romance’ syntax, because clitics are<br />

always enclitics in the main clause (both declarative and interrogative). It is a well-<br />

known fact (see, among others, Sportiche 1993 and Kayne 1991 & 1994) that in the<br />

higher portion of the IP layer, there is a (set of) position(s) for clitic elements, and<br />

that subject clitics are always located to the left of object clitics inside the template<br />

containing the various clitics.<br />

69 This hypothesis was already put forward by Brugmann (1917).

70 See Benincà (2005) for the first generalisation, and Benincà (1984), Poletto (1998) and Roberts (1993) for the second.



The position of the inflected verb in Cimbrian is neither the one found in V2 languages (within the CP domain), nor the lower one found in modern Romance (within the IP

domain). The syntax of clitics suggests that, in Cimbrian, the inflected verb moves to<br />

a position inside the clitic layer in the high IP (corresponding to the traditional WP),<br />

and precisely to the left of clitic elements both in main and embedded declarative<br />

clauses. 71 If this theoretical description proves tenable, we are now in a position to speculate about a possible explanation.

4. A New Theoretical Correlation ‘Visible’ in Cimbrian<br />

A further interesting field to explore has to do with the theoretical reason why<br />

Cimbrian could not develop a Romance clitic syntax. In other words, there must have<br />

been some restriction constraining the speakers to maintain enclisis.<br />

A striking difference between the neighbouring Romance dialects and Cimbrian is<br />

the past participle agreement phenomenon. Past participle agreement is mandatory<br />

(at least for some object clitics) in Northern Italian dialects (cf. 53), while it is<br />

completely absent in Cimbrian. The morphological structure of the Cimbrian past<br />

participle has simply preserved the invariant German model, that is, ge- … -t (cf. 54):

(53) (A) so k’el papá li ga visti<br />

I know that the father them-has seen<br />

(54) I woas ke der Tatta hatze (net) gekoaft (Luserna)<br />

I know that the father has-her (not) bought<br />

The existence of past participle agreement is usually analysed in the relevant<br />

literature as involving an agreement projection (AgrOP) to which both the object<br />

clitic and the verb move; the configuration of spec-head agreement between the two<br />

triggers the ‘passage’ of the number and gender features of the clitic onto the verb<br />

yielding agreement on the past participle (see Kayne 1991 and 1993).<br />

We believe that it is the presence of this lower agreement projection that is related to<br />

the possibility of having proclisis in Romance, and its absence that constrains Cimbrian<br />

to enclisis to the inflected verb. In Cimbrian, the clitic element moves directly to the<br />

higher clitic position (within the IP domain), while in Romance, this movement is<br />

always in two steps, the first being movement to the lower AgrO projection. In favour<br />

of this assumption is the fact that Cimbrian, like all other Germanic varieties, never<br />

showed past participle agreement of the Romance type.<br />

71 As we have already noted, the same is true for embedded interrogatives in Roana, while in Luserna,<br />

the verb is probably located lower in embedded interrogatives and relative clauses, leaving the clitic<br />

in WP alone.<br />




Abbreviations<br />

Cat.1602 Cimbrian Catechism of 1602 (cf. Meid 1985a)<br />

Cat.1813 Cimbrian Catechism of 1813 (cf. Meid 1985b)<br />

DP Determiner Phrase<br />

NP Nominal Phrase<br />

Vinfl Inflected Verb<br />

Vpast part. Past Participle Verb

Wh (interrogative element)<br />

XP X-phrase



References<br />

Baragiola, A. (1906). “Il tumulto delle donne di Roana per il ponte (nel dialetto di<br />

Camporovere, Sette Comuni)”. Padova: Tip, Fratelli Salmin, reprinted in Lobbia, N. &<br />

Bonato, S. (eds.) (1998). Il Ponte di Roana. Dez Dink vo’ der Prucka. Roana: Istituto<br />

di Cultura Cimbra.<br />

Benincà, P. (1984). “Un’ipotesi sulla sintassi delle lingue romanze medievali.“ Quaderni<br />

Patavini di Linguistica 4, 3-19.<br />

Benincà, P. (2005). “A Detailed Map of the Left Periphery of Medieval Romance.”<br />

Zanuttini, R. et al. (eds.) (2005). Negation, Tense and Clausal Architecture: Cross-<br />

linguistics Investigations. Georgetown University Press.<br />

Benincà, P. & Renzi, L. (2000). “La venetizzazione della sintassi nel dialetto cimbro.<br />

“ Marcato, G. (ed.) (2000). Isole linguistiche? Per un’analisi dei sistemi in contatto.<br />

Atti del convegno di Sappada/Plodn (Belluno), 1–4 luglio 1999. Padova: Unipress,<br />

137–62.<br />

Bidese, E. (2004a). “Tracce di Nebensatzklammer nel cimbro settecomunigiano.”<br />

Marcato, G. (ed.) (2000). I dialetti e la montagna. Atti del convegno di Sappada/<br />

Plodn (Belluno), 2–6 luglio 2003, Padova: Unipress, 269–74.<br />

Bidese, E. (2004b). “Die Zimbern und ihre Sprache: Geographische, historische und<br />

sprachwissenschaftlich relevante Aspekte.” Stolz, T. (ed.) (2004). “Alte“ Sprachen.<br />

Beiträge zum Bremer Kolloquium über “Alte Sprachen und Sprachstufen” (Bremen,<br />

Sommersemester 2003). Bochum: Universitätsverlag Dr. N. Brockmeyer, 3–42.<br />

Bidese, E. & Tomaselli, A. (2005). “Formen der ‚Herausstellung’ und Verlust der V2-<br />

Restriktion in der Geschichte der zimbrischen Sprache.” Bidese, E., Dow, J.R. & Stolz,<br />

T. (eds.) (2005). Das Zimbrische zwischen Germanisch und Romanisch. Bochum:<br />

Universitätsverlag Dr. N. Brockmeyer, 71-92.<br />

Bosco, I. (1996). ’Christlike unt korze Dottrina’: un’analisi sintattica della lingua


cimbra del XVI secolo. Final essay for the degree “Laureat in Modern Languages and<br />

Literature.” Unpublished Essay, University of Verona.<br />

Bosco, I. (1999). “Christlike unt korze Dottrina’: un’analisi sintattica della lingua<br />

cimbra del XVI secolo.” Thune, E.M. & Tomaselli, A. (eds.) (1999). Tesi di linguistica<br />

tedesca. Padova: Unipress, 29–39.<br />

Brugmann, K. (1917). “Der Ursprung des Scheinsubjekts ‘es’ in den germanischen und<br />

den romanischen Sprachen.” Berichte über die Verhandlungen der Königl. Sächsischen<br />

Gesellschaft der Wissenschaften zu Leipzig, Philologisch-historische Klasse 69/5.<br />

Leipzig: Teubner, 1–57.<br />

Bunz, C.M. (1998). “Der Thesaurus indogermanischer Text- und Sprachmaterialien (TITUS) – ein Pionierprojekt der EDV in der Historisch-Vergleichenden Sprachwissenschaft.” Sprachen und Datenverarbeitung 1(98), 11-30. http://titus.uni-frankfurt.de/texte/sdv198.pdf.

Castagna, A. (2005), “Personalpronomen und Klitika im Zimbrischen.” Bidese, E., Dow,<br />

J.R. & Stolz, T. (eds) (2005). Das Zimbrische zwischen Germanisch und Romanisch.<br />

Bochum: Universitätsverlag Dr. N. Brockmeyer, 93-113.<br />

Clark, R. & Roberts, I. (1993), “A Computational Model of Language Learnability and<br />

Language Change.” Linguistic Inquiry 24, 299-345.<br />

Geiser, R. (1999). “Grundkurs in klassischem Zimbrisch.” http://titus.fkidg1.uni-<br />

frankfurt.de/didact/zimbr/cimbrian.htm.<br />

Gippert, J. (1995). “TITUS. Das Projekt eines indogermanistischen Thesaurus.” LDV-<br />

Forum (Forum der Gesellschaft für Linguistische Datenverarbeitung) 12 (2), 35-47.<br />

http://titus.uni-frankfurt.de/texte/titusldv.htm.<br />

Gippert, J. (2001). Der TITUS-Server: Grundlagen eines multilingualen Online-Retrieval-Systems (aus dem Protokoll des 83. Kolloquiums über die Anwendung der Elektronischen Datenverarbeitung in den Geisteswissenschaften an der Universität Tübingen, 17. November 2001). http://www.zdv.uni-tuebingen.de/tustep/prot/prot831-titus.html.

Gippert, J. (2002). The TITUS Text Retrieval Engine. http://titus.uni-frankfurt.de/texte/textex.htm.

Grewendorf, G. & Poletto, C. (2005). “Von OV zu VO: ein Vergleich zwischen Zimbrisch<br />

und Plodarisch.” Bidese, E, Dow, J.R. & Stolz, T. (eds) (2005). Das Zimbrische zwischen<br />

Germanisch und Romanisch. Bochum: Universitätsverlag Dr. N. Brockmeyer, 114-128.<br />

Kayne, R.S. (1991). “Romance Clitics, Verb Movement, and PRO.” Linguistic Inquiry<br />

22, 647-686.<br />

Kayne, R.S. (1993). “Towards a Modular Theory of Auxiliary Selection.” Studia<br />

Linguistica 47, 3-31.<br />

Kayne, R.S. (1994). The Antisymmetry of Syntax. Cambridge, Mass.: MIT Press.<br />

Lightfoot, D. (1979). Principles of Diachronic Syntax. Cambridge, England: Cambridge<br />

University Press.<br />

Meid, W. (1985a). Der erste zimbrische Katechismus CHRISTLIKE UNT KORZE<br />

DOTTRINA. Die zimbrische Version aus dem Jahre 1602 der DOTTRINA CHRISTIANA<br />

BREVE des Kardinals Bellarmin in kritischer Ausgabe. Einleitung, italienischer und<br />

zimbrischer Text, Übersetzung, Kommentar, Reproduktionen. Innsbruck: Institut für

Sprachwissenschaft der Universität Innsbruck.<br />

Meid, W. (1985b). Der zweite zimbrische Katechismus DAR KLÓANE CATECHISMO VOR<br />

DEZ BÉLOSELAND. Die zimbrische Version aus dem Jahre 1813 und 1842 des PICCOLO<br />

CATECHISMO AD USO DEL REGNO D’ITALIA von 1807 in kritischer Ausgabe. Einleitung,<br />

italienischer und zimbrischer Text, Übersetzung, Kommentar, Reproduktionen. Innsbruck: Institut für Sprachwissenschaft der Universität Innsbruck. http://titus.uni-frankfurt.de/texte/etcs/germ/zimbr/kat1813d/kat18.htm.

Poletto, C. (1998). “L’inversione interrogativa come ‘verbo secondo residuo’: l’analisi<br />



sincronica proiettata nella diacronia.” Atti del XXX convegno SLI, Roma: Bulzoni, 311-<br />

327.<br />

Poletto, C. & Tomaselli, A. (2000). “L’interazione tra germanico e romanzo in due<br />

‘isole linguistiche’. Cimbro e ladino centrale a confronto.” Marcato, G. (ed.) (2000).<br />

Isole linguistiche? Per un’analisi dei sistemi in contatto. Atti del convegno di Sappada/<br />

Plodn (Belluno), 1–4 luglio 1999. Padova: Unipress, 163–76.<br />

Poletto, C. & Tomaselli, A. (2002). “La sintassi del soggetto nullo nelle isole tedescofone<br />

del Veneto: cimbro e sappadino a confronto.” Marcato, G. (ed.) (2002). La dialettologia<br />

oltre il 2001. Atti del convegno di Sappada/Plodn (Belluno), 1–5 Luglio 2001. Padova:<br />

Unipress, 237–52.<br />

Roberts, I. (1993). Verbs and Diachronic Syntax: A Comparative History of English and<br />

French. Dordrecht: Kluwer.<br />

Scardoni, S. (2000). La sintassi del soggetto nel cimbro parlato a Giazza. Final essay<br />

for the degree “Laureat in Modern Languages and Literature.” Unpublished Essay,<br />

University of Verona.<br />

Schweizer, B. (1939). Zimbrische Sprachreste. Teil 1: Texte aus Giazza (Dreizehn

Gemeinden ob Verona). Nach dem Volksmunde aufgenommen und mit deutscher<br />

Übersetzung herausgegeben. Halle/Saale: Max Niemeyer.<br />

Schweizer, B. (1952). Zimbrische Gesamtgrammatik. Band V: Syntax der zimbrischen

Dialekte in Oberitalien. Diessen am Ammersee. Unpublished typescript. Marburg/<br />

Lahn, Germany: Institut für die Forschung der Deutschen Sprache.<br />

Sportiche, D. (1993). “Clitic Constructions.” Rooryck, J. & Zaring, L. (eds) (1993).<br />

Phrase Structure and the Lexicon. Dordrecht: Kluwer, 213-276.<br />

Tomaselli, A. (1990). La sintassi del verbo finito nelle lingue germaniche. Padova:<br />

Unipress.<br />



Tomaselli, A. (2004). “Il cimbro come laboratorio d’analisi per la variazione linguistica<br />

in diacronia e sincronia.” Quaderni di lingue e letterature 28, Supplemento: Variis<br />

Linguis: Studi offerti a Elio Mosele in occasione del suo settantesimo compleanno,<br />

533–549.<br />

Tortora, C.M. (1997). “I Pronomi Interrogativi in Borgomanerese.” Benincà, P.<br />

& Poletto, C. (eds) (1997). Quaderni di Lavoro dell'ASIS (Atlante Sintattico Italia Settentrionale): Strutture Interrogative dell'Italia Settentrionale. Padova: Consiglio

Nazionale delle Ricerche, 83-88.<br />

Vicentini, R. (1993). Il dialetto cimbro di Luserna: analisi di alcuni fenomeni linguistici.<br />

Final essay for the degree “Laureat in Modern Languages and Literature.” Unpublished<br />

Essay, University of Trento.<br />



Creating Word Class Tagged Corpora<br />

for Northern Sotho by Linguistically<br />

Informed Bootstrapping<br />

Danie J. Prinsloo and Ulrich Heid<br />

To bootstrap tagging resources (tagger lexicon and training corpus) for Northern<br />

Sotho, a tagset and a number of modular and reusable corpus processing tools are<br />

being developed. This article describes the tagset and routines for identifying verbs<br />

and nouns, and for disambiguating closed class items. All of these are based on<br />

morphological and morphosyntactic specificities of Northern Sotho.<br />

1. Introduction<br />

In this paper, we report on ongoing work towards the parallel creation of<br />

computational linguistic resources for Northern Sotho, on the basis of linguistic<br />

knowledge about the language. Northern Sotho is one of the eleven official languages<br />

of South Africa, spoken by about 4.2 million speakers in the northeastern part of the<br />

country. It belongs to the Sotho family of the Bantu languages (S32; Guthrie 1971).

The three Sotho languages are closely related.<br />

The creation of Natural Language Processing (NLP) resources is part of an effort<br />

towards an infrastructure for corpus linguistics and computational lexicography and<br />

terminology for Northern Sotho, which is seen as an element of a broader action for<br />

the development of Human Language Technology (HLT) and NLP applications for the<br />

South African languages.<br />

Parallel resource creation has been attempted as part of our research and<br />

development agenda in order to speed up the resource building process, in the sense<br />

of rapid prototyping of a part-of-speech (=POS) tagset; a tagger lexicon and (manually<br />

corrected) reference corpus; and a statistical tagger. These constitute the first set of<br />

corpus linguistic tools to be developed (we report on the first three tools here). At the<br />

same time, we intend to verify to what extent ‘traditional’ corpus linguistic methods<br />

and tools (as used for European languages) can be applied to a Bantu language, an attempt that, to our knowledge, has not been made before.

Two text corpora are used as input to the study. The first is a 43,000-token corpus, a selection from the Northern Sotho novel Tša ka Mafuri (Matsepe 1974), and the

second is the Pretoria Sepedi Corpus (PSC) of 6 million tokens, a collection of 327<br />



Northern Sotho books and magazines. These are raw, unannotated corpora, compiled<br />

by means of optical character recognition (OCR), commonly known as ‘scanning’, with<br />

tokenization done per sentence. The PSC is still in the process of being cleaned of scanning errors. For details regarding the PSC and subsequent applications thereof,

see sources such as Prinsloo (1991), De Schryver & Prinsloo (2000, 2000a & 2000b), and<br />

Prinsloo & De Schryver (2001).<br />

In this paper, we will discuss our task at both a specific and general level. We report<br />

about the specific task of creating resources for Northern Sotho, and our examples<br />

and illustrative material will be taken from this language. More generally, we also<br />

analyse the exercise in terms of methods and strategies for the joint bootstrapping<br />

of different resources for an ‘unresourced’ language, trying to abstract away from<br />

language-specific details.<br />

This article is organised as follows: in section 2, we give a brief overview of some of<br />

the language-specific phenomena we exploit in resource building; section 3 deals with<br />

the component elements of a corpus linguistic infrastructure for Northern Sotho that<br />

are presently being constructed, with the steps and procedures used in the process and<br />

the characteristics of the resulting resources; section 4 is a methodological conclusion<br />

(order of steps in resource creation, role of linguistic knowledge, etc.) and an analysis<br />

of the processes in terms of generalisability and portability to other Sotho and Bantu languages, and possibly to completely unrelated languages.

2. Northern Sotho Linguistics Informing Corpus Technology<br />

A prerequisite to successful interpretation of the criteria for and output of a POS-<br />

tagger for Northern Sotho is a brief outline of certain basic linguistic characteristics<br />

of the language, especially of nouns and verbs. See Lombard et al. (1985), Louwrens<br />

(1991), and Poulos & Louwrens (1994) for a detailed grammatical description of this<br />

language.<br />

2.1 Noun System: Classifiers and Concords<br />

Nouns in Bantu languages are grouped into different noun classes. Compare Table<br />

1 for Northern Sotho.<br />

Table 1: Noun Classes of Northern Sotho with Examples

Class   Prefix   Example    Translation
1       mo-      monna      man
2       ba-      banna      men
1a      Ø        malome     uncle
2b      bo+      bomalome   uncles
3       mo-      monwana    finger
4       me-      menwana    fingers
5       le-      lesogana   young man
6       ma-      masogana   young men
7       se-      selepe     axe
8       di-      dilepe     axes
9       N-/Ø     nku        sheep (sg.)
10      di+      dinku      sheep (pl.)
11
12
13
14      bo-      bogobe     porridge
6       ma-      magobe     different kinds of porridge
15      go       go bona    to see
16      fa-      fase       below
17      go-      godimo     above
18      mo-      morago     behind
Nouns are subdivided into different classes, each with its own prefix, and the<br />

prefixes of the first ten classes mark singular versus plural forms. Classes 11-13 do<br />

not exist in Northern Sotho. The prefixes also generate a number of concords and<br />

pronouns that are used to complete phrases and sentences. Consider the following<br />

example from Class 1, given in Table 2.<br />

Table 2: Example of a Sentence Consisting of a Noun, Verb, Pronoun and Concords

Monna   noun Cl. 1                       'Man'
yo      demonstrative (pronoun) Cl. 1    'this'
o       subject concord Cl. 1            '(he)'
a       present tense marker             ()
di      object concord Class 8/10        'them'
rata    verb stem                        'loves'

This man loves them.

There are a few hundred closed class items such as the subject concords, object<br />

concords, demonstratives (pronouns) and particles. Prime criteria for detecting and tagging nouns will naturally be based on class prefixes and nominal concords and, to a limited extent, on nominal suffixes such as the locative -ng.


2.2 Verb System: Productivity in Morphology<br />

In the case of verbs, numerous derivations of a single verb stem exist, consisting of<br />

the root, plus one or more prefix(es) and/or suffix(es), as is clearly indicated in Table<br />

3, which reflects a subsection (five out of eighteen modules, cf. Prinsloo [1994]) of<br />

the suffixes and combinations of suffixes for the verb stem reka ‘buy.’ The complexity<br />

of this layout is evident.<br />

Verbal derivations such as those in the rightmost column of Table 3 can all simply<br />

be tagged as verbs, or, alternatively, first be morphologically analysed (cf. Taljard &<br />

Bosch 2005) and then tagged in terms of their specific verbal suffixes, cf. column 2<br />

versus column 3 in Table 4 with respect to the suffixal cluster 02 ANA in Table 3.<br />

Table 3: Selection of Derivations of the Verb reka

Module number and marker   Module composition                                        Abbreviations, stems and derivations
01                         root + standard modifications                             VR reka; VRPer rekile; VRPas rekwa; VRPerPas rekilwe
02 ANA                     root + reciprocal + standard modifications                VRRec rekana; VRRecPer rekane; VRRecPas rekanwa; VRRecPerPas rekanwe
03 ANTŠHA                  root + reciprocal + causative + standard modifications    VRRecCau rekantšha; VRRecCauPer rekantšhitše; VRRecCauPas rekantšhwa; VRRecCauPerPas rekantšhitšwe
04 ANYA                    root + alternative causative + standard modifications     VRAlt-Cau rekanya; VRAlt-CauPer rekantše; VRAlt-CauPas rekanywa; VRAlt-CauPerPas rekantšwe
05 EGA                     root + neutro-passive + standard modifications            VRNeu-Pas rekega; VRNeu-PasPer rekegile

(Per = Perfect tense; Pas = Passive)


Table 4: Alternatives in Tagging the Verb reka

02 ANA   rekana   'V'   rek 'Vroot' an 'Rec' a
         rekane   'V'   rek 'Vroot' an 'Rec' e 'Per'
         rekanwa  'V'   rek 'Vroot' an 'Rec' w 'Pas' a
         rekanwe  'V'   rek 'Vroot' an 'Rec' w 'Pas' e 'Per'

2.3 Quantitative Aspects of the Lexicon
There are a few marked tendencies in the quantitative distribution of lexical items<br />

in Northern Sotho, especially with respect to the relationship between frequency of<br />

use and ambiguity.<br />

In our 43,000 word corpus sample, we counted types and tokens, distinguishing<br />

nouns, verbs and closed class items. In Northern Sotho, only nouns and verbs allow<br />

for productive word formation (i.e., are open word classes), whereas function words,<br />

adverbs and adjectives are listed (i.e., belong to closed classes). Note that we did<br />

not consider numerals at all; the figures given are to be taken as tendencies. We<br />

separately counted forms that can be unambiguously identified as nouns, verbs or<br />

elements of one of the closed classes, as opposed to ambiguous forms where more<br />

than one word class can be assigned, depending on the context.<br />

All three have many more unambiguous types than ambiguous ones. As is likely in<br />

most languages, however, high frequency items are also highly ambiguous (cf. Table 5<br />

below). Nevertheless, while only slightly more than half of the potential verb occurrences in the sample are unambiguous (ca. 5,000 tokens), the percentage of unambiguous occurrences of noun candidates is as high as 90% (5,800 out of 6,300 tokens). Ambiguity with nouns is restricted to rather infrequent items. For closed class items, however, the inverse situation is observed: only a little more than 20% of the occurrences of closed class items in our sample are unambiguous, and a small set of closed class item types (88 types), with an average frequency of two hundred or more, constitutes about 40% of the total number of word forms in the sample. We expect that this distribution will be more or less generalisable to larger data sets of Northern Sotho. It will have a direct bearing on our approach to the bootstrapping of linguistic resources for this

language. Table 5 lists the most frequent (and at the same time most ambiguous)<br />

items from the 43,000 word corpus sample with their tags (according to the tagset<br />

described in section 3.2) and their absolute frequency in the sample.<br />



Table 5: Most Frequent and Most Ambiguous Items in the Sample<br />

Item Possible Tags Freq.<br />

a CDEM6:CO6:CS1:CS6:CPOSS1:CPOSS6:QUE:PRES 2261<br />

go CO2psg:CO15:CO17:CS15:CS17:CSindef:PALOC 2075<br />

ka CS1psg:PAINS:PATEMP:PALOC:POSSPRO1psg 1807<br />

le CDEM5:CO2ppl:CO5:CS2ppl:CS5:PACON:VCOP 1615<br />

ba AUX:CDEM2:CO2:CS2:CPOSS2:VCOP 1429<br />

o CO3:CS1:CS2psg:CS3 1192<br />

ke AUX:CS1psg:PAAGEN:PACOP 1107<br />
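To make the type/token relationship above concrete, the following minimal sketch (in Python) counts ambiguous types and their token mass from a tagger lexicon with colon-separated disjunctive tags of the kind shown in Table 5. This is an illustration only: the frequencies for a and ke are taken from Table 5, while the entries and frequencies for monna and rekile are invented for the example and do not come from the authors' data.

    from collections import Counter

    # Illustrative tagger lexicon with disjunctive tags (cf. Table 5); monna and
    # rekile, and their frequencies below, are invented for this example.
    lexicon = {
        "a": "CDEM6:CO6:CS1:CS6:CPOSS1:CPOSS6:QUE:PRES",
        "ke": "AUX:CS1psg:PAAGEN:PACOP",
        "monna": "N1",
        "rekile": "V",
    }
    freq = Counter({"a": 2261, "ke": 1107, "monna": 35, "rekile": 4})

    # A type is ambiguous if its lexicon entry lists more than one tag.
    ambiguous_types = [w for w, tags in lexicon.items() if ":" in tags]
    ambiguous_tokens = sum(freq[w] for w in ambiguous_types)
    print(len(ambiguous_types), ambiguous_tokens)   # 2 3368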

3. Elements of a Scenario for Resource Building for Northern Sotho<br />

3.1 Starting Point and Objectives<br />

For computational lexicography, a sufficiently large corpus is needed, annotated<br />

at least at the level of part-of-speech. For the development of automatic tools for<br />

syntactic analysis, a more detailed annotation is required. In this paper we concentrate<br />

on a step prior to both of these resources, that is, on the creation of smaller but generic resources to enable part-of-speech tagging.

Tagset design for Northern Sotho is based on distinctions in traditional Northern Sotho grammar and is carried out with a view to the kinds of information that would be extracted from a corpus once it has been tagged. As statistical tagging can only

be attempted when a sufficiently large training corpus is available, an adaptation<br />

of the tagset is likely to be needed when the automatic tagging is tested, since<br />

some distinctions from the grammar may not be identifiable in texts without deeper<br />

knowledge.<br />

In working towards an annotated training corpus, different procedures are possible<br />

in principle: one could manually annotate a significant amount of data, or one could<br />

opt for a mixed approach, where certain parts of the corpus would receive manual<br />

annotation, and others would be annotated in a semi-automatic fashion, where<br />

the results of an automatic pre-classification are manually corrected. Due to the<br />

morphological and distributional properties of Northern Sotho discussed in section 2,<br />

the following breakdown was chosen:<br />

• Closed class items, as well as other words of very high frequency, were<br />

introduced manually to the tagger lexicon, with a disjunctive tag annotation that<br />

indicates for each item all its possible tags (Table 5);<br />



• Nouns and verbs can be guessed in the text on the basis of their<br />

morphological properties; thus, separate rule-based guessers were developed, and<br />

their results were manually corrected in the training corpus; and,<br />

• The disambiguation of closed-class items in context is, to a considerable<br />

extent, possible on the basis of rules similar to 'local grammars.' A certain number of ambiguities in the training corpus have to be dealt with manually.

In the remainder of this section, we report on tagset design (section 3.2); on an<br />

architecture for the creation of a tagger lexicon and a training corpus (section 3.3);<br />

and, on verb and noun guessing and the disambiguation of closed class items (sections<br />

3.4 to 3.6).<br />

3.2 Tagset Design<br />

The tagset designed for Northern Sotho is organised as a logical tagset (similar to<br />

a type hierarchy); this opens up the possibility to formulate underspecified queries to<br />

the corpus.<br />

The tagset mirrors some of the linguistic specificities of Northern Sotho, but is also<br />

conditioned by considerations of automatic processability with a statistical tagger.<br />

The tagset reflects properties of the nominal system of classes and concords: as they<br />

are (mostly) lexically distinct, we introduced class-based subtypes for nouns, pronouns<br />

and concords, as well as for adjectives: N, ADJ, C (for concord) and PRO (for pronoun)<br />

have such subtypes. As concords and pronouns have functionally and/or semantically<br />

defined subtypes, we apply the class-based subdivision in fact to the types listed in<br />

Table 6:<br />

Table 6: Nominal Categories that have Class-related Subtypes

N          nouns
ADJ        adjectives
CS         subject concords
CO         object concords
CDEM       demonstrative concords
CPOSS      possessive concords
EMPRO      emphatic pronouns
POSSPRO    possessive pronouns
QUANTPRO   quantifying pronouns
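The class-based subtypes are what make underspecified querying possible. The following minimal sketch (in Python, with regular expressions standing in for corpus queries; it is not part of the authors' toolchain) shows how a pattern such as 'any subject concord, regardless of class' groups the class-specific tags; the tag list is taken from Table 5 and the examples in this section.

    import re

    tags = ["N1", "N2", "CS1", "CS2psg", "CS6", "CO15", "CDEM2", "V", "PALOC"]

    # 'Any subject concord, whatever its class' is an underspecified query.
    print([t for t in tags if re.fullmatch(r"CS.*", t)])   # ['CS1', 'CS2psg', 'CS6']

    # 'Any noun of class 1 or 2' is underspecified with respect to the exact noun.
    print([t for t in tags if re.fullmatch(r"N[12]", t)])  # ['N1', 'N2']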

Given the complexity of the system of verbal derivation (cf. Table 3 above), an attempt to subclassify verbal forms accordingly would have led to a number of tags (i.e., of distinctions) that would not be manageable with a statistical tagger. Furthermore, since, according to Northern Sotho orthographic conventions, concords, adjectives and pronouns are written separately from the nouns and verbs to which they are grammatically related (disjunctive writing), these elements receive their



own tags. Since verbal derivation is written conjunctively (like word formation in<br />

European languages), a single ‘verb’ tag (V) proved sufficient (cf. Table 4). As with<br />

parts of tense morphology and with word formation in European languages, an analysis<br />

of Northern Sotho verbal derivations is left to a separate tool (e.g. to a morphological<br />

analyser; see the discussion in Taljard & Bosch 2005).<br />

Other tags cover invariable lexical items:<br />

• adverbs (ADV) and numerals (NUM);<br />

• tense/mood/aspect markers for present tense (PRES), future (FUT), and<br />

progressive (PROG);<br />

• auxiliaries (AUX) and copulative verbs (VCOP);<br />

• ideophones (IDEO); and,<br />

• different (semantically defined) kinds of particles that mark a hortative<br />

(HORT), questions (QUE), as well as agentive (PAAGEN), connective (PACON),<br />

copulative (PACOP), instrumental (PAINS), locative (PALOC) and temporal (PATEMP)<br />

constructs.<br />

In principle, our approach to the design of tagsets for nouns and verbs is similar to that of Van Rooy and Pretorius (2003) for Setswana, but it is much less complex.

In the case of verbs we agree on the allocation of a single tag for verb stem plus<br />

suffix(es) as well as on separate tags for verbal prefixes:<br />

“[…] verbs are preceded by a number of prefixes, which are regarded as<br />

separate tokens for the purposes of tagging. The verb stem, containing the<br />

root and a number of suffixes (as well as the reflexive prefix) receives a single<br />

tag.“ (Van Rooy & Pretorius 2003:211)<br />

Likewise, for nouns, we are in agreement that at this stage in the development of<br />

tagsets, certain subclassifications such as the separate identification of deverbatives<br />

should be excluded (cf. Van Rooy & Pretorius 2003:210). Our approach differs from that of Van Rooy and Pretorius, among others, in that a much smaller tagset is compiled for both

verbs and nouns. In the case of verbs, we do not consider modal categories, and in the<br />

case of nouns, we honour subclasses but not divisions in terms of relational nouns and<br />

proper names. Consider the following examples illustrating basic differences in terms<br />

of the approaches as well as of the complexity of the tags:<br />

(1) Nouns

a) Mosadi 'woman'
Tswana (Van Rooy & Pretorius 2003:217):
Tag category: Common noun, singular; Label: NC1; Intermediate tagset: N101001
Northern Sotho: Noun; Tag: N1

b) Bomalome 'uncles'
Tswana (Van Rooy & Pretorius 2003:217):
Tag category: Relational noun, plural; Label: NR2; Intermediate tagset: N302001
Northern Sotho: Noun; Tag: N2

(2) Verbs

Tswana: kwala/kwalwa/kwadile; Northern Sotho: ngwala/ngwalwa/ngwadile 'write/be written/wrote'
Tswana (Van Rooy & Pretorius 2003:219):
Tag category: Lexical verb, indicative, present, active; Label: Vl0PA; Intermediate tagset: V0001111102000 kwala
Tag category: Lexical verb, indicative, present, passive; Label: Vl0PP; Intermediate tagset: V0001112102000 kwalwa
Tag category: Lexical verb, indicative, past, active; Label: Vl0DA; Intermediate tagset: V0001141102000 kwadile
Northern Sotho: verb; Tag: V

3.3 An Architecture for Parallel Resource Building<br />

Since we opted, as far as POS-tagging is concerned, for an attempt to apply Schmid’s<br />

(1994) statistical TreeTagger to Northern Sotho, both a tagger lexicon and a reference<br />

corpus for training were needed. Schmid's TreeTagger was chosen because it needs much less manually annotated training material than other statistical taggers. For European languages (German, French, English, Dutch, and Italian), training corpora of 40,000 to 100,000 words have proven sufficient to obtain the 96-97% tagging accuracy that

is standard in current applications. Tagging quality of the TreeTagger also depends<br />

upon the number of different tags and on the size of the tagger lexicon. It thus seems<br />

obvious to bootstrap lexicon and corpus in parallel.<br />

Given the grammatical and distributional properties of Northern Sotho, we opted<br />

for the overall approach as sketched above in section 3.1: a list of closed class items<br />

and their possible tags is created manually, whereas nouns and verbs are guessed on<br />

the basis of morphological rules, and closed class item disambiguation is performed<br />

semi-automatically, based on rules, and possibly also on frequency-based heuristics.<br />



Figure 1 shows the strands of corpus annotation, where the (upper) strand leading<br />

to the training corpus is meant to be carried out once, whereas the general strand<br />

(below) can be repeated for each newly acquired corpus.<br />

Figure 1: Strands of Corpus Annotation<br />

The workflow involves a number of modular tools (developed in the course of the<br />

preparation of the training corpus) that can be reused with any additional Northern<br />

Sotho corpus. These include a sentence tokenizer; the tagger lexicon and a tool to<br />

project its contents (i.e., potentially ambiguous annotations for individual word<br />

forms) against the corpus words; guessers for nouns and verbs; and, disambiguation<br />

rules for closed class item disambiguation in context.<br />
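A minimal sketch (in Python) of the projection step is given below. It is not the authors' tool, and the miniature lexicon is restricted to a few entries from Table 5; word forms not found in the lexicon are left unannotated, so that they can be handed to the noun and verb guessers described in sections 3.4 and 3.5.

    # Illustrative tagger lexicon with disjunctive tag annotations (cf. Table 5).
    tagger_lexicon = {
        "a": "CDEM6:CO6:CS1:CS6:CPOSS1:CPOSS6:QUE:PRES",
        "o": "CO3:CS1:CS2psg:CS3",
        "ke": "AUX:CS1psg:PAAGEN:PACOP",
    }

    def project(tokens, lexicon):
        # Attach the (possibly disjunctive) tag string to every known word form;
        # unknown forms are marked for the rule-based guessers.
        return [(tok, lexicon.get(tok.lower(), "UNKNOWN")) for tok in tokens]

    for token, tags in project(["Monna", "o", "a", "di", "rata"], tagger_lexicon):
        print(token, tags)
    # With this miniature lexicon, 'Monna', 'di' and 'rata' come out as UNKNOWN and
    # would be handed to the guessers; 'o' and 'a' receive their disjunctive tags.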

The procedure sketched here, and depicted in Figure 1, is in fact a combination of<br />

rule-based symbolic tagging and statistical tagging, whereby a number of ambiguities<br />

are solved by the rule-based component before the statistical tagger is used. This<br />

setup is similar to Klatt’s (2005) work on a corpus processing suite for German.<br />



3.4 Verb Guesser<br />

In Table 3 above, a few selected examples of derived verb forms of Northern Sotho<br />

are given. Except for very frequent forms of a few verbs, most verb forms are marked<br />

by unambiguous derivational and inflectional affixes. For example, a word form found<br />

in a corpus that ends in -antšwe will almost inevitably be a verb form (cf. rekantšwe<br />

in Table 3).<br />

Consequently, many verb forms can be identified by simple pattern matching.<br />

Based on the grammatical system of verb affixation sketched in Prinsloo (1994),<br />

we developed a verb form guesser. It compares each candidate form with a list of<br />

unambiguous verbal affixes to distinguish verb forms from forms of other categories.<br />

Given the productivity of verbal derivation in Northern Sotho (cf. section 2 above),<br />

this guesser will be needed on any new corpus of Northern Sotho to be annotated.<br />

If required, the grammatical information encoded in the verbal affixes can be made<br />

explicit in the annotation (cf. Table 4 above).<br />
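The following minimal sketch (in Python) illustrates the idea of such suffix-based guessing. It is not the authors' implementation, and the small list of endings, taken from the derivations of reka in Table 3, merely stands in for the full inventory of unambiguous verbal affixes.

    # Illustrative endings from Table 3; the real guesser compares candidates with
    # a much fuller list of unambiguous derivational and inflectional affixes.
    VERB_ENDINGS = ("antšwe", "antšhitšwe", "antšhitše", "antšhwa", "antšha",
                    "anywa", "anwe", "anwa", "egile", "ega")

    def guess_verb(word_form):
        # Return the tag 'V' if the form ends in one of the listed verbal affixes.
        return "V" if word_form.lower().endswith(VERB_ENDINGS) else None

    for form in ("rekantšwe", "rekega", "monna"):
        print(form, guess_verb(form))
    # rekantšwe V; rekega V; monna None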

3.5 Noun Guesser<br />

Suffixal derivation appears in nouns only to denote locatives, augmentatives/feminines and diminutives. Given the low frequency of these derivations, with the

possible exception of the locative, a noun detection strategy based on pattern<br />

matching alone, in analogy to that of the verb guesser, will have low recall, even<br />

though its precision will be very high.<br />

But nouns are characterised by their class prefixes (cf. Table 1 above); prefixes of<br />

classes 1 to 10 indicate singular (classes 1,3,5,7 and 9) versus plural (classes 2,4,6,8<br />

and 10). The prefixes are not, however, unambiguous with respect to classes (mo-: classes 1, 3 and, less relevantly, class 18; di-: classes 8 and 10; etc.). Not all words starting

with a syllable that can be a noun prefix are indeed nouns (cf. e.g. the verb form<br />

letetše ‘wait(ed) for’ where the first syllable le- is not the prefix of class 5).<br />

What is indeed a highly unambiguous indicator of a noun form is its syntagmatic<br />

environment, as well as the alternation pattern between singular and plural. Very<br />

often, nouns are accompanied by concords or adjectives, as illustrated by the example<br />

in Table 2, where the noun monna is followed by a demonstrative and a subject<br />

concord, both of which show agreement with the noun with respect to the class.<br />

Adjectives also show this agreement.<br />

We exploit this regularity in our noun form guesser as follows: to identify items<br />

of a given pair of singular/plural classes, we apply word sequence patterns to the<br />

corpus data, which rely on the presence of concords, pronouns, adjectives, and so<br />



forth in the neighbourhood of the noun candidates. We check for the existence of<br />

such patterns in parallel for singular forms and for their potential plural counterparts.<br />

The search is approximative, in so far as it checks the presence of agreement-bearing<br />

elements within a window of up to three words left or right of the noun candidate.<br />

The rules can, in principle, be triggered either by singular or by plural items (with the<br />

exception of class 9 versus class 10, where it is preferable to start from the plural).<br />
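A minimal sketch (in Python) of this idea for the class 7/8 pair is given below; the actual rules are CQP queries of the kind shown in Table 7, and the parallel test on the di- (class 8) counterpart is omitted here for brevity. The candidate and indicator words are taken from Table 7.

    # Class-7 agreement-bearing indicators from Table 7 (concords, demonstratives,
    # adjectives such as segolo); the list is illustrative, not exhaustive.
    CL7_INDICATORS = {"sa", "se", "segolo", "sekhwi", "sengwe", "seo", "sona"}

    def has_indicator(tokens, i, indicators, window=2):
        # True if an indicator occurs within `window` tokens of position i.
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        return any(tok in indicators for j, tok in enumerate(tokens[lo:hi], lo) if j != i)

    def guess_class7_nouns(tokens):
        # Yield se- forms supported by a nearby class-7 agreement-bearing item.
        for i, tok in enumerate(tokens):
            if tok.startswith("se") and tok not in CL7_INDICATORS \
                    and has_indicator(tokens, i, CL7_INDICATORS):
                yield tok

    # Schematic token sequence built from candidate and indicator words of Table 7.
    print(list(guess_class7_nouns(["selo", "se", "segolo"])))   # ['selo']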

Table 7 contains an example of a noun guessing query (simplified, as many potential<br />

agreement-bearing indicator items are left out), formulated in the notation of the<br />

CQP corpus query language, which underlies the CWB Corpus WorkBench, (Christ et<br />

al. 1999), used in our experiments as a corpus representation and infrastructure. We<br />

indicate (parts of) the queries that extract nouns of classes 7 (and 8).<br />

Table 7: Sample Query for the Identification of Noun Candidates of Classes 7 + 8

(
  [word = 'sego|selo|sebatakgomo|...|setšhaba|seatla|sello']
  []{0,2}
  [word = 'sa|se|segolo|sekhwi|sengwe|seo|sona|...']
)
|
(
  [word = 'sa|se|segolo|...']
  []{0,2}
  [word = 'sego|selo|sebatakgomo|...']
);

(....)

First part of the query: candidate se- words, noted as a disjunction, followed at a distance of 0 to 2 by class-7 indicators, also noted as a disjunction; or (second part of the query): a choice of indicators, followed at a distance of 0 to 2 by candidate words. The abbreviated final part (....) applies the analogous procedure to noun candidates created by replacing se- with di- (plus class 8 concords).

When applied to the 43,000-word corpus sample, the query yields, among others, the results displayed in Table 8.


Table 8: Sample Results of Noun Guessing for Classes 7 and 8<br />

Class 7 cands. Class 8 cands. N? Equivalent(s)<br />

selo dilo + thing, things<br />

setšhaba ditšhaba + nation, nations<br />

sello dillo + (out)cry, outcries<br />

sepetše *dipetše — walked<br />

sekelela dikelela — recommend, disappear<br />

The checking tool is robust towards non-existent forms (cf. *dipetše) and towards forms that are not nominal, due to the context constraint on agreement-bearing items (cf. sekelela versus dikelela).

A first qualitative evaluation of the noun guessing routines on all candidates from the 43,000-word corpus sample suggests that the tool only fails on lexicalized irregular forms (e.g. mong - beng, 'owner(s)', instead of the hypothetical mong - *bang), and on nouns that, mostly for semantic reasons, do not have both a singular and a plural form (such as Sepedi 'Pedi language and culture', or leboa 'North'). As with the verb guesser, the noun guesser can be, and for quantitative reasons has to be, applied to any new corpus to be annotated.

3.6 Rules for the Disambiguation of Closed Class Items<br />

Given the high degree of ambiguity in closed class items (see section 2.3), there is<br />

a major need for disambiguation strategies for these items. Even though a statistical<br />

tagger is designed for this type of disambiguation, a rule-based preprocessing, leading<br />

at least to a partial reduction of ambiguity, seems necessary.<br />

We use context-based disambiguation rules, in the spirit of Gross and Silberztein’s<br />

local grammars (Silberztein 1993) and of rule-based tagging. As with the noun guessing<br />

queries, disambiguation rules are implemented as queries in the format of the CQP<br />

language. Some extraction rules exclusively rely on lexical contexts (cf. the topmost<br />

part of Table 9), while others involve lexemes and word class tagged items (middle<br />

row), or a combination of lexical, categorical and morphological constraints (including,<br />

for example, the presence of certain affixes [cf. lower part of Table 9]). The examples<br />

in Table 9 all relate to the disambiguation of the form a, the most frequent and most<br />

ambiguous item in our sample (cf. Table 5).<br />



Table 9: Examples of Disambiguation Queries for the Form a

'o|O' 'be' 'a'
    Sequence of o be a ('he/she was')
    Hypothesis: a: CS1
    Coverage: 109 instances; Precision: 109 (100%)

[pos = 'N.{1,2}'] 'a' [pos = 'N.{1,2}'];
    a between two nouns ('of': possessive)
    Hypothesis: a: CPOSS6
    Coverage: 42 instances; Precision: 42 (100%)

'a' [pos = 'V' & word = '.*go'];
    a preceding a verb form ending in -go (relative marker)
    Hypothesis: a: CS1 or CS6 or CO6
    Coverage: 75 instances; 63 (80.8%) CS1; 9 (15.4%) CS6; 3 (3.8%) CO6

The examples show that some rules do not fully disambiguate, but leave a<br />

set of options. Since we use the rules as a preparatory step to statistical tagging<br />

(and to manual disambiguation in the preparation of the training corpus), partial<br />

disambiguation is still useful to reduce the effort needed at a subsequent stage (cf.<br />

the third example of Table 9, where the choice of eight tags for a is reduced to a four-<br />

way ambiguity).<br />
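As an illustration only (the actual rules are CQP queries, cf. Table 9), the following Python sketch applies the three patterns of Table 9 to a pre-annotated token sequence, ordered from the most specific to the most general; the example sentence at the end is schematic and serves only to exercise the second rule.

    import re

    # Tokens are (word, tags) pairs; tags is a colon-separated disjunction (cf. Table 5).
    def disambiguate_a(tokens):
        out = []
        for i, (word, tags) in enumerate(tokens):
            if word == "a":
                prev = tokens[i - 1] if i > 0 else ("", "")
                nxt = tokens[i + 1] if i + 1 < len(tokens) else ("", "")
                if i >= 2 and prev[0] == "be" and tokens[i - 2][0].lower() == "o":
                    tags = "CS1"                    # sequence o be a ('he/she was')
                elif re.fullmatch(r"N.{1,2}", prev[1]) and re.fullmatch(r"N.{1,2}", nxt[1]):
                    tags = "CPOSS6"                 # a between two nouns (possessive)
                elif nxt[1] == "V" and nxt[0].endswith("go"):
                    tags = "CS1:CS6:CO6"            # partial disambiguation only
            out.append((word, tags))
        return out

    # Schematic example: 'a' flanked by two noun-tagged tokens.
    sentence = [("monna", "N1"),
                ("a", "CDEM6:CO6:CS1:CS6:CPOSS1:CPOSS6:QUE:PRES"),
                ("masogana", "N6")]
    print(disambiguate_a(sentence)[1])   # ('a', 'CPOSS6')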

4. Methodological Considerations<br />

4.1 Sequencing of Processing Steps<br />

We use semi-automatic procedures to create tagging resources for Northern Sotho.<br />

As raw corpora are the only available input, a first step in the project is to define a<br />

tagset that underlies all subsequent work (cf. section 3.2).<br />

The creation of the tagger lexicon and the annotation of the training corpus<br />

mostly run in parallel. We classify word forms from the corpus, store their (possibly<br />

disjunctive) description in the lexicon, and annotate them at the same time in the<br />

upcoming training corpus. (We annotate each word form in the corpus with the<br />

respective entry from the tagger lexicon.) While the disjunctive annotations remain<br />

in the tagger lexicon, context-based rules are used to partly disambiguate the corpus<br />

occurrences (cf. sections 3.4 to 3.6).

To get the process started efficiently, we first manually annotated the thousand<br />

most frequent word forms in the corpus, aiming at a complete coverage of their<br />



potential word class features. This information can be provided easily on the basis of<br />

Northern Sotho grammar, as many of them are function words.<br />

Subsequently, we employed semi-automatic procedures (automatic pre-classification<br />

of data, followed by manual verification) that focus on high precision, allowing, at<br />

the same time, for efficient data production: we capitalised on unambiguous verb<br />

and noun forms, thereby covering more than one fourth of all corpus occurrences (tokens), and obtaining in parallel a stock of approximately 2,800 additional entries for the tagger lexicon (word form types).

Once nouns and verbs were annotated, disambiguation rules for closed class items<br />

were formulated (based on regularities of the Northern Sotho grammar) and applied;<br />

many of these contextual constraints involve verbs and nouns. The rules are ordered<br />

by specificity: as in many other NLP applications, the most specific cases are handled<br />

first; at the end of the cascade, more general rules are applied, which may also be less 'safe' and less effective, that is, have lower precision and/or recall.

In conclusion, the strategy may be characterised as ‘easy-first’ and ‘safety-first’: for<br />

example, as disambiguation rules cannot overwrite (previously verified) lexical data,<br />

the overall process is one of monotonic accumulation of information. A bootstrapping<br />

procedure proved most efficient, where the validated results of each of the above-<br />

mentioned steps are persistently represented in both corpus and lexicon, such that<br />

they are available as input for the subsequent steps.<br />

4.2 Reusability of the Created Resources<br />

As mentioned in section 3, our verb and noun guessers can be applied to other<br />

Northern Sotho corpora, as can the tool projecting lexical descriptions onto corpus<br />

word forms. Given the productivity of verbal derivation and the number of nouns to

be expected in larger corpora, we assume that both tools will prove useful in the<br />

preparation of an annotated version of the PSC. Moreover, even though statistical<br />

taggers are designed to both disambiguate in context and guess word class values<br />

for unknown words (i.e., those not contained in the system’s lexicon), reducing the<br />

number of the latter may improve overall output quality.

Obviously, the parallel growth of lexicon and corpus will continue when larger corpora are treated. At a later stage, we envisage the parallel enhancement of both resources, not only in coverage, but also with respect to the degree of detail covered: some morphological details of nouns (locatives, feminines/augmentatives,

and diminutives) and verbs (cf. Tables 3 and 4) can be identified, but are not yet<br />

accounted for in our resources. Thus, the tagger lexicon may become part of an NLP-<br />



oriented dictionary that would explicitly store such properties. As far as the corpus<br />

is concerned, a multilevel annotation would be more appropriate than the current<br />

monodimensional view: without changes to the current annotation, extra layers<br />

may be added for the above-mentioned features of nouns and verbs, but also for an<br />

appropriate treatment of fused forms (cf. dirang, ‘do what?’ from dira + eng) and of<br />

multiword items, for example, idiomatic expressions (cf. bona kgwedi ‘see the moon’<br />

i.e., ‘menstruate’). As Northern Sotho orthography is not yet fully standardised, a<br />

distinction between standard orthography and observed (possibly deviant) orthography<br />

may be introduced through additional layers.<br />

5. Conclusions and Future Work<br />

We reported on an ongoing research and development project for the creation<br />

of tagging resources for Northern Sotho. In this context, modular components of a<br />

two-layered architecture were created, which are needed in the first place for the<br />

preparation of a training corpus for statistical tagging, but which will prove equally<br />

useful, we hope, for the later development of larger corpora.<br />

We bootstrap the training corpus and the tagger lexicon in parallel, using semi-<br />

automatic procedures consisting of a rule-based automatic pre-classification and<br />

subsequent manual validation: the procedures concern the identification of verbal<br />

and nominal forms and the disambiguation of closed class items. These procedures are<br />

applied one after the other in order of their expected precision ('easy-first', 'safety-first'), thereby leading to a partly disambiguated corpus. For the creation of the

training corpus, the remaining ambiguities are removed manually, whereas this task is<br />

supposed to be left to the statistical tagger in the later creation of larger corpora.<br />

Linguistic knowledge about the language is extensively used in the definition of the<br />

automatic procedures: morphological and morpho-syntactic regularities in the local<br />

context provide the starting point for their formulation.<br />

Future work on the tools described in this paper will be devoted to the development<br />

of further disambiguation rules, to the finalisation of a fully disambiguated training<br />

corpus, and to tagger training and tests. This will allow us to (i) assess tagging quality<br />

as obtained by the use of the statistical tagger only in a setup with our rule-based<br />

pre-processing, (ii) to stabilise the proposed tagset on the basis of experience with<br />

statistical tagging, and (iii) to undertake tagging of the PSC, which could then serve<br />

for lexicographic exploration.<br />

A well-designed POS-tagger for Northern Sotho would provide a flying start to the<br />

development of similar taggers for the other Sotho languages, the Nguni languages,<br />



and Bantu languages in general. It is expected that only minor adjustments will be<br />

required to adapt a POS-tagger for Northern Sotho to the other two Sotho languages<br />

(Tswana and Southern Sotho) because these languages are closely related. Pending<br />

certain morphological parsing for the Nguni languages (i.e., Zulu, Xhosa, Swazi<br />

and Ndebele) the tagger will be equally usable, since these languages do not differ<br />

structurally from the Sotho languages. It could finally be extended to other Bantu<br />

languages, since Bantu languages in general have a common structure.<br />

6. Acknowledgements<br />

This work was carried out as a joint project between the Department of<br />

African languages of the University of Pretoria and the Institut für maschinelle<br />

Sprachverarbeitung of Universität Stuttgart. We would like to thank Elsabé Taljard<br />

(Pretoria) who contributed to the design of the tagset and who cross-checked a large<br />

proportion of the output of our tools. Furthermore, we would like to thank Gertrud<br />

Faaß (Stuttgart) for her invaluable help with the implementation of the noun and verb<br />

guessers and of the tagging support tools.<br />




References<br />

Christ, O., Schulze, B.M. & König, E. (1999). Corpus Query Processor (CQP). User’s<br />

Manual. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, Stuttgart,

Germany.<br />

http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/.<br />

De Schryver, G.M. & Prinsloo, D.J. (2000). “The Compilation of Electronic Corpora,<br />

with Special Reference to the African Languages.” Southern African Linguistics and<br />

Applied Language Studies 18(1-4), 89–106.<br />

De Schryver, G.-M. & Prinsloo, D.J. (2000a). “Electronic Corpora as a Basis for the<br />

Compilation of African-language Dictionaries, Part 1. The Macrostructure.” South<br />

African Journal of African Languages 20(4), 291–309.<br />

De Schryver, G.-M. & Prinsloo, D.J. (2000b). “Electronic Corpora as a Basis for the<br />

Compilation of African-language Dictionaries, Part 2: The Microstructure.” South<br />

African Journal of African Languages 20(4), 310–30.<br />

Guthrie, M. (1971). Comparative Bantu: an Introduction to the Comparative Linguistics<br />

and Prehistory of the Bantu Languages. Vol. 2: The Comparative Linguistics of the<br />

Bantu Languages, London: Gregg Press.<br />

Klatt, S. (2005). Textanalyseverfahren für die Korpusannotation und

Informationsextraktion. Aachen: Shaker.<br />

Lombard, D.P., Van Wyk, E.B. & Mokgokong, P.C. (1985). Introduction to the Grammar<br />

of Northern Sotho. Pretoria: J.L. van Schaik.<br />

Louwrens, L.J. (1991). Aspects of Northern Sotho Grammar. Pretoria: Via Afrika<br />

Limited.<br />

Matsepe, O.K. (1974). Tša ka mafuri. Pretoria: Van Schaik.


Poulos, G. & Louwrens, L.J. (1994). A Linguistic Analysis of Northern Sotho. Pretoria:<br />

Via Afrika Limited.<br />

Prinsloo, D.J. (1994). “Lemmatization of Verbs in Northern Sotho.” SA Journal of<br />

African Languages 14(2), 93-102.<br />

Prinsloo, D.J. (1991). “Towards Computer-assisted Word Frequency Studies in Northern<br />

Sotho.” SA Journal of African Languages 11(2).<br />

Prinsloo, D.J. & de Schryver, G.-M. (2001). “Monitoring the Stability of a Growing<br />

Organic Corpus, with Special Reference to Sepedi and Xitsonga.” Dictionaries: Journal<br />

of The Dictionary Society of North America 22, 85–129.<br />

Schmid, H. (1994). “Probabilistic Part-of-Speech Tagging Using Decision Trees.”<br />

Proceedings of the International Conference on New Methods in Language Processing.<br />

Manchester, UK, 44-49.<br />

Taljard, E. & Bosch, S.E. (this volume). “A Comparison of Approaches Towards Word<br />

Class Tagging: Disjunctively versus Conjunctively Written Bantu Languages”, 117-131.<br />

Silberztein, M. (1993). Dictionnaires électroniques et analyse automatique de textes:<br />

le système INTEX. Paris: Masson.<br />

Van Rooy, B. & Pretorius, R. (2003). “A Word-Class Tagset for Setswana.” Southern<br />

African Linguistics and Applied Language Studies 21(4), 203-222.<br />



A Comparison of Approaches to Word Class
Tagging: Disjunctively Versus Conjunctively
Written Bantu Languages

Elsabé Taljard and Sonja E. Bosch<br />

Northern Sotho and Zulu are two South African Bantu languages that make use of<br />

different writing systems, namely, a disjunctive and a conjunctive writing system,<br />

respectively. In this paper, it is argued that the different orthographic systems obscure<br />

the morphological similarities, and that these systems impact directly on word class<br />

tagging for the two languages. It is illustrated not only that different approaches are needed for word class tagging, but also that the sequencing of tasks is, to a large

extent, determined by the difference in writing systems.<br />

1. Introduction<br />

The aim of this paper is to compare approaches towards word class tagging in two orthographically distinct Bantu languages. The disjunctive versus

conjunctive writing systems in the South African Bantu languages have direct<br />

implications for word class tagging. For the purposes of this discussion we selected<br />

Northern Sotho to represent the disjunctive writing system, and Zulu as an example<br />

of a conjunctively written language. These two languages, which belong to the South-<br />

Eastern zone of Bantu languages, are two of the eleven official languages of South<br />

Africa. Northern Sotho and Zulu are spoken by approximately 4.2 and 10.6 million<br />

mother-tongue speakers, respectively. Both languages belong to larger groupings, namely the Sotho and Nguni language groups, respectively.

Languages belonging to the same language group are closely related, and to a large<br />

extent, mutually intelligible. Furthermore, since all three languages belonging to the<br />

Sotho group follow the disjunctive method of writing, the methodology utilised for<br />

part-of-speech tagging in Northern Sotho would to a large extent be applicable to the<br />

other two Sotho languages (Southern Sotho and Tswana) as well. The same holds true<br />

for Zulu with regard to the other Nguni languages (i.e., Xhosa, Swati and Ndebele),<br />

which are also conjunctively written languages. The South African Bantu languages<br />

are not yet fully standardised with regard to orthography, terminology and spelling<br />

rules, and, when compared to European languages, these languages cannot boast a<br />

wealth of linguistic resources. A limited number of grammar books and dictionaries<br />



are available for these languages, while computational resources are even scarcer. In<br />

terms of natural language processing, the Bantu languages, in general, undoubtedly<br />

belong to the lesser-studied languages of the world.<br />

In this paper, a concise overview is first given of the relevant Bantu morphology,<br />

and reference is made to the differing orthographical conventions. In the subsequent<br />

section, the available linguistic and computational resources for the two languages<br />

are compared, followed by a comparison between the approaches towards word class<br />

tagging for Northern Sotho and Zulu. In conclusion, future work regarding word class<br />

tagging for Bantu languages is discussed.<br />

2. Bantu Morphology and Orthography<br />

According to Poulos & Louwrens (1994:4), “there are numerous similarities that<br />

can be seen in the structure (i.e., morphology), as well as the syntax of words<br />

and word categories, in the various languages of this family.” These languages are<br />

basically agglutinating in nature, since prefixes and suffixes are used extensively in<br />

word formation.<br />

The focus in this concise discussion on aspects of Bantu morphology is on the<br />

two basic morphological systems: the noun class system, and the resulting system of<br />

concordial agreement.<br />

2.1 Noun Classes and Concordial Agreement System<br />

The noun class system classifies nouns into a number of noun classes, as signalled<br />

by prefixal morphemes, also known as noun prefixes. For ease of analysis, these noun<br />

prefixes have been grouped into numbered classes by historical Bantu linguists,<br />

following an internationally accepted numbering system. In general, noun<br />

prefixes indicate number, with the uneven class numbers designating singular and the<br />

corresponding even class numbers designating plural. The following are examples of<br />

Meinhof’s (1932:48) numbering system of some of the noun class prefixes:<br />



Table 1: Noun Class System: An Illustrative Excerpt<br />
Class   | N. Sotho prefix | N. Sotho example         | Zulu prefix | Zulu example<br />
1 (sg)  | mo- | motho “person”           | umu- | umuntu “person”<br />
2 (pl)  | ba- | batho “persons”          | aba- | abantu “persons”<br />
1a (sg) | Ø-  | makgolo “grandmother”    | u-   | udokotela “doctor”<br />
2b (pl) | bo- | bomakgolo “grandmothers” | o-   | odokotela “doctors”<br />
3 (sg)  | mo- | mohlare “tree”           | umu- | umuthi “tree”<br />
4 (pl)  | me- | mehlare “trees”          | imi- | imithi “trees”<br />
7 (sg)  | se- | setulo “chair”           | isi- | isitsha “dish”<br />
8 (pl)  | di- | ditulo “chairs”          | izi- | izitsha “dishes”<br />
14      | bo- | botho “humanity”         | ubu- | ubuntu “humanity”<br />
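
Purely as an illustration (not drawn from any existing implementation), the idealised singular/plural pairing behind this numbering system can be sketched as follows; the prefix tables reproduce only the excerpt in Table 1, and the pairing rule is a simplification that ignores classes 1a/2b and the irregularities discussed below.<br />

    # Illustrative noun class prefix tables (excerpt from Table 1 only).
    NORTHERN_SOTHO_PREFIXES = {1: "mo-", 2: "ba-", 3: "mo-", 4: "me-",
                               7: "se-", 8: "di-", 14: "bo-"}
    ZULU_PREFIXES = {1: "umu-", 2: "aba-", 3: "umu-", 4: "imi-",
                     7: "isi-", 8: "izi-", 14: "ubu-"}

    def plural_class(noun_class):
        """Idealised pairing: uneven class numbers are singular and the
        corresponding even numbers plural; class 14 is not associated
        with number. Irregular classes (1a/2b, 11/10) are not modelled."""
        if noun_class == 14:
            return None
        return noun_class + 1 if noun_class % 2 == 1 else noun_class

    print(plural_class(1))  # 2: mo-/umu- (motho/umuntu) -> ba-/aba- (batho/abantu)
    print(plural_class(7))  # 8: se-/isi- -> di-/izi-
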

The correspondence between singular and plural classes is not, however, perfectly<br />

regular, since some nouns in so-called plural classes do not have a singular form; in<br />

Zulu, class 11 nouns take their plurals from class 10, while a class such as 14 is not<br />

associated with number.<br />

The significance of noun prefixes is not limited to the role they play in indicating<br />

the classes to which the different nouns belong. In fact, noun prefixes play a further<br />

important role in the morphological structure of the Bantu languages, in that they<br />

link the noun to other words in the sentence. This linking is manifested in a system<br />

of concordial agreement, which is the pivotal constituent of the whole sentence<br />

structure, and governs grammatical agreement in verbs, adjectives, possessives,<br />

pronouns, and so forth. The concordial morphemes are derived from the noun prefixes<br />

and usually bear a close resemblance to the noun prefixes, as illustrated by the<br />

morphemes printed in bold in the following Northern Sotho example:<br />

Figure 1: Concordial Agreement – Northern Sotho<br />

In this sentence, three structural relationships can be identified. The class 2 noun<br />

bašemane ‘boys’ governs the subject concord ba- in the verb ba ka bala ‘they may<br />

read’ (1), as well as the class prefix ba- in the adjective bagolo ‘big’ (2), and the<br />



demonstrative pronoun ba, preceding the adjective. The corresponding Zulu example<br />

would be as follows, where (1) indicates subject-verb agreement and (2) is agreement<br />

between the noun and the adjective concord aba- in the qualificative abakhulu.<br />

The class 10 noun izincwadi ‘books’ determines concordial agreement of the object<br />

concord -zi- in the verb (3).<br />

Figure 2: Concordial Agreement – Zulu<br />

The predominantly agglutinating nature of the Bantu languages is clearly illustrated<br />

in the above sentences, where each word consists of more than one morpheme. This<br />

complex morphological structure will be discussed very briefly by referring to two of<br />

the most complex word types, namely nouns and verbs.<br />

2.2 Morphology of Nouns<br />

Nouns as well as verbs in the Bantu languages are constructed by means of the two<br />

generally recognised types of morphemes, namely roots and affixes, with the latter<br />

subdivided into prefixes and suffixes. The majority of roots are bound morphemes,<br />

since they do not constitute words by themselves, but require one or more affixes to<br />

complete the word. The root is generally regarded to be “the core element of a word,<br />

the part which carries the basic meaning of a word” (Poulos & Msimang, 1996:170).<br />

For instance, in the Northern Sotho example dipuku ‘books’, the root that conveys<br />

the semantic significance of the word is -puku ‘book’, the morpheme di- being the<br />

class prefix of class 10. In the Zulu word izincwadi, the prefixes are i- and -zin-, with<br />

-ncwadi carrying the basic meaning ‘book.’ By adding the suffixes –ng (Northern Sotho)<br />

and -ini (Zulu), and the prefix e- (in the case of Zulu) to the noun, a locative meaning<br />

is imparted:<br />

Northern Sotho: dipukung di-puku-ng ‘in the books’<br />

Zulu: ezincwadini e-(i)-zin-ncwadi-ini ‘in the books’<br />
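
A toy sketch of the Northern Sotho locative rule just illustrated is given below; it is our simplification, and the corresponding Zulu form would additionally require the prefix e-, the suffix -ini and vowel coalescence, which simple concatenation cannot capture.<br />

    def ns_locative(noun):
        """Toy Northern Sotho locative: suffix -ng to the noun
        (dipuku -> dipukung 'in the books'). Further morphophonological
        adjustments found in real data are not modelled here."""
        return noun + "ng"

    print(ns_locative("dipuku"))  # dipukung
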



2.3 Verbal Morphology<br />

In the case of the verb, the core element that expresses the basic meaning of the<br />

word is the verb root. The essential morphemes of a Bantu verb are a subject concord<br />

(except in the imperative and infinitive), a verb root, and an inflectional ending. Over<br />

and above the subject concord (s.c.), the form of which is determined by the class<br />

of the subject noun, a number of other morphemes may be prefixed to a verb root.<br />

These include morphemes such as object concords (o.c.), potential and progressive<br />

morphemes, as well as negative morphemes. Compare the following example in this<br />

regard:<br />

Table 2: Verbal Morphology - Northern Sotho & Zulu<br />
N.S. | ba ka di bala | ba | ka | di | bal- | -a<br />
Zulu | bangazifunda | ba- | -nga- | -zi- | -fund- | -a<br />
“they can read them” | | s.c. cl 2 | potential morpheme | o.c. cl 10 | verb root | inflectional ending<br />

It should be noted that whereas object concords also show concordial agreement<br />

with the class of the object noun, all other verbal affixes are class independent.<br />

Furthermore, verbal affixes have a fixed order in the construction of verb forms, with<br />

the object concord prefixed directly to the verb root.<br />

Derivational suffixes may be inserted between the verb root and the inflectional<br />

ending. In the following examples, it will be noted that the inflectional ending has<br />

changed to the negative –e/-i in accordance with the negative prefix ga-/a-, for<br />

example:<br />

Table 3: Verbal Derivation by Means of Suffixes<br />
N.S. | ga ba rekiše | ga | ba | rek- | -iš- | -e<br />
Zulu | abathengisi | a- | -ba- | -theng- | -is- | -i<br />
“they do not sell” | | negative morpheme | s.c. cl 2 | verb root | suffix | inflectional ending<br />

3. Conjunctive Versus Disjunctive Writing Systems<br />

Following this explanation of the morphological structure of the Bantu languages,<br />

a few observations will be made regarding the different writing systems that are<br />

followed in the Bantu languages, with specific reference to Northern Sotho and Zulu.<br />



These different writing systems impact directly on POS-tagging, as will be explained<br />

below. The following example illustrates the difference in these writing systems:<br />

Table 4: Conjunctivism Versus Disjunctivism<br />
     | Orthographical representation | Morphological analysis<br />
N.S. | ke a ba rata | ke | a | ba | rat- | -a<br />
Zulu | ngiyabathanda | ngi- | -ya- | -ba- | -thand- | -a<br />
“I like them” | | s.c. 1p.sg | PRES | o.c. cl 2 | verb root | inflectional ending<br />

The English translation ‘I like them’ consists of three orthographic words, each of<br />

which is also a linguistic word, belonging to a different word category. In the case of<br />

the Zulu sentence, where the conjunctive system of writing is adhered to, we observe<br />

one orthographic word that corresponds to one linguistic word, which is classified<br />

by Zulu linguists as a verb. The orthographic word ngiyabathanda is therefore also a<br />

linguistic word, belonging to a particular word category. This correspondence between<br />

orthographic and linguistic words is a characteristic feature of Zulu that distinguishes<br />

it from Northern Sotho. In the disjunctively written Northern Sotho sentence, four<br />

orthographic words constitute one linguistic word that is again classified as a verb.<br />

In other words, in the latter case, four orthographic elements making up one word<br />

category are written as separate orthographic entities.<br />

The reason for the utilisation of different writing systems is based on both historical<br />

and phonological considerations. When Northern Sotho and Zulu were first put to<br />

writing, mainly by missionaries in the second half of the nineteenth century, they<br />

intuitively opted for disjunctivism when writing Northern Sotho, and conjunctivism<br />

when writing Zulu. Thus, an orthographic tradition was initiated that prevails even<br />

today. Although based on intuition, the decision to adopt either a conjunctive or<br />

a disjunctive writing system was probably guided by an underlying realisation that<br />

the phonological systems of the two languages necessitated different orthographical<br />

systems. As Wilkes (1985:149) points out, the presence of phonological processes<br />

such as vowel elision, vowel coalescence and consonantalization in Zulu makes a<br />

disjunctive writing system highly impractical: the disjunctive representation of the<br />

sentence Wayesezofika ekhaya ‘He would have arrived at home’ as W a ye s’ e zo fika<br />

ekhaya is almost impossible to read and/or to pronounce. In Northern Sotho, these<br />

phonological processes are much less prevalent, and, furthermore, most morphemes<br />

in this language are syllabic, and therefore pose no problems for disjunctive writing.<br />



What needs to be pointed out at this stage, however, is that there is indeed some<br />

overlap with regard to the orthographical systems used by the two languages, and<br />

that Northern Sotho and Zulu should rather be viewed as occupying different positions<br />

on a continuum ranging from complete conjunctivism to complete disjunctivism.<br />

The diagrams below illustrate the degree of overlap between the writing systems<br />

of the two languages (dashed lines indicate morphological units, solid lines indicate<br />

orthographical units). It can be observed that the disjunctive writing convention in<br />

Northern Sotho is mainly applicable to prefixes preceding the class prefix and prefixes<br />

preceding the verb root.<br />

Figure 3: Overlap Between Conjunctivism and Disjunctivism<br />



At this stage, it is important to note that the different writing systems utilised by<br />

the two languages actually obscure the underlying morphological similarities. These<br />

disjunctive versus conjunctive writing systems in the Bantu languages have direct<br />

implications for word class tagging, as will be demonstrated later in this paper. In<br />

the next section, the available computational resources for the two languages are<br />

compared.<br />

4. Computational Linguistic Resources<br />

Existing linguistic and computational resources should be exploited as far as<br />

possible in order to facilitate the task of word class tagging. Both languages have<br />

unannotated electronic corpora at their disposal – approximately 6.5 million tokens<br />

for Northern Sotho, and 5.2 million tokens for Zulu. These corpora were compiled<br />

in the Department of African Languages at the University of Pretoria and consist of<br />

a mixed genre of texts, including samples of most of the different literary genres,<br />

newspaper reports, academic texts, as well as Internet material. Since most of the<br />

texts incorporated in the corpora were not available electronically, OCR scanning was<br />

done, followed by manual cleaning of scanned material.<br />

The corpora have so far been utilised, among others, for the generation of frequency<br />

lists, which are of specific importance for the development and testing of word class<br />

tagging, especially in disjunctively written languages. In Northern Sotho, for instance,<br />

the top 10,000 types by frequency in the corpus represent approximately 90% of the<br />

tokens, whereas in Zulu the top 10,000 types represent only 62% of the tokens. This<br />

observation is directly related to the conjunctive versus disjunctive writing systems.<br />

Since frequency counts in an unannotated corpus are based on orthographical units,<br />

a large orthographic chunk such as ngiyabathanda found in Zulu would have a much<br />

lower frequency rate than the corresponding units ke, a, ba and rata in Northern Sotho.<br />

This implies that the correct tagging of the top 10,000 tokens in Northern Sotho, be it<br />

manual, automatic, or a combination of both, results in a 90% correctly tagged corpus.<br />
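
The coverage figures quoted above can be computed directly from an unannotated corpus; the sketch below shows the calculation, with the corpus file name as a placeholder and naive whitespace splitting standing in for proper tokenisation.<br />

    from collections import Counter

    def top_type_coverage(tokens, n=10000):
        """Proportion of corpus tokens covered by the n most frequent
        orthographic types."""
        counts = Counter(tokens)
        covered = sum(count for _, count in counts.most_common(n))
        return covered / len(tokens)

    # Placeholder corpus file; splitting on whitespace approximates
    # orthographic units in a disjunctively written language.
    with open("corpus.txt", encoding="utf-8") as f:
        tokens = f.read().split()
    print(f"Top 10,000 types cover {top_type_coverage(tokens):.1%} of the tokens")
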

The lower token coverage of the top types in Zulu, however, results in a much<br />

smaller percentage, that is, only 62% of the corpus being tagged. It furthermore impacts<br />

directly on the methodology used for word class tagging in the two languages: the high<br />

type/token ratio in Zulu necessitates the use of an additional tool (such as a<br />

morphological analyser prototype as described in Pretorius & Bosch 2003) to achieve<br />

a higher percentage in the automatic tagging of the Zulu corpus. Let us look at the<br />

following examples, which have been analysed by the above-mentioned analyser:<br />

amanzi ‘water/that are wet’<br />

a[NPrePre6]ma[BPre6]nzi[NStem]<br />



a[RelConc6]manzi[RelStem]<br />

yimithi ‘they are trees’<br />

yi[CopPre]i[NPrePre4]mi[BPre4]thi[NStem]<br />

ngomsebenzi ‘with work’<br />

nga[AdvForm]u[NPrePre3]mu[BPre3]sebenzi[NStem]<br />

bangibona ‘they see me’<br />

ba[SC2]ngi[OC1ps]bon[VRoot]a[VerbTerm]<br />

abathunjwa ‘(they) who are taken captive/they are not taken captive’<br />

aba[RelConc2]thumb[VRoot]w[PassExt]a[VerbTerm4]<br />

a[NegPre]ba[SC2]thumb[VRoot]w[PassExt]a[VerbTerm4]<br />

Examples with more than one analysis exhibit morphological ambiguity that,<br />

in most cases, can only be resolved by contextual information. Nevertheless, a<br />

morphologically analysed corpus provides useful clues for determining word class<br />

tags, since the output of the morphological analysis is a rich source of significant<br />

information that facilitates the identification of word classes. For example, the above<br />

morphologically analysed words lead to the following information regarding further<br />

processing on word class level:<br />

Table 5: Zulu Morphological Analysis and Word Classes<br />
Output of morphological analysis | Word class | Examples<br />
[NPrePre] and/or [BPre] + [NStem] + … | NOUN | amanzi<br />
[CopPre] + [NStem] + … | COPULATIVE | yimithi<br />
[SC] + [VRoot] + … OR [NegPre] + [SC] + [VRoot] + … | VERB | bangibona; abathunjwa<br />
[RelConc] + … | QUALIFICATIVE | abathunjwa; amanzi<br />
[AdvForm] + … | ADVERB | ngomsebenzi<br />
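
One possible way of turning such analyser output into word class tags, following the patterns summarised in Table 5, is sketched below; the mapping function is our illustration and not the tagger actually used for Zulu.<br />

    import re

    def word_class_from_analysis(analysis):
        """Assign a word class to one Zulu morphological analysis string,
        following the patterns in Table 5 (illustrative only)."""
        tags = re.findall(r"\[([A-Za-z0-9]+)\]", analysis)
        if not tags:
            return "UNKNOWN"
        first = tags[0]
        if first.startswith("CopPre"):
            return "COPULATIVE"
        if first.startswith("RelConc"):
            return "QUALIFICATIVE"
        if first.startswith("AdvForm"):
            return "ADVERB"
        if first.startswith(("NPrePre", "BPre")):
            return "NOUN"
        if first.startswith("SC") or (first.startswith("NegPre")
                                      and any(t.startswith("SC") for t in tags)):
            return "VERB"
        return "UNKNOWN"

    print(word_class_from_analysis("yi[CopPre]i[NPrePre4]mi[BPre4]thi[NStem]"))
    # COPULATIVE
    print(word_class_from_analysis("a[NegPre]ba[SC2]thumb[VRoot]w[PassExt]a[VerbTerm4]"))
    # VERB
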


Concerning the tags used in the above morphological analysis, it should be noted<br />

that “tags were devised that consist of intuitive mnemonic character strings that<br />

abbreviate the features they are associated with.” (Pretorius & Bosch 2003:208).<br />

The word class tagset for Zulu is based on the classification by Poulos and Msimang<br />

(1996:26). More will be said about this tagset further on in the paper. The features<br />

and tags concerned are as follows:<br />

Table 6: Zulu Tags – An Illustrative Excerpt<br />
Tag | Feature<br />
[AdvForm] | Adverbial formative<br />
[BPre6] | Basic prefix class 6<br />
[CopPre] | Copulative prefix<br />
[NegPre] | Negative prefix<br />
[NPrePre6] | Noun preprefix class 6<br />
[NStem] | Noun stem<br />
[OC1ps] | Object concord 1st pers singular<br />
[PassExt] | Passive extension<br />
[RelStem] | Relative stem<br />
[SC2] | Subject concord class 2<br />
[VRoot] | Verb root<br />
[VerbTerm] | Verb terminative<br />

In this paper, it is argued that the difference in writing systems dictates the need<br />

for different architectures, specifically for a different sequencing of tasks for POS-<br />

tagging in Northern Sotho and Zulu. The approaches followed to implement word class<br />

taggers for Northern Sotho and Zulu will be presented in the following section.<br />

5. Comparison of Approaches Towards Word Class Tagging for Northern Sotho<br />

and Zulu<br />

With regard to Northern Sotho, the term POS-tagging is used in a slightly wider<br />

sense, following Voutilainen (2003:220), who states that POS-taggers usually<br />

produce more information than simply parts of speech. He indicates that the term<br />

‘POS-tagger’ is often regarded as being synonymous with ‘morphological tagger’,<br />

‘word class tagger’ or even ‘lexical tagger.’ POS-tagging for Northern Sotho results in<br />

a hybrid system, containing information on both morphological and syntactic aspects,<br />

although biased towards morphology. This approach is dictated, at least in part, by<br />

the disjunctive method of writing, in which bound morphemes such as verbal prefixes<br />



show up as orthographically distinct units. As a result, in Northern Sotho, orthographic<br />

words do not always correspond to linguistic words, which traditionally constitute word<br />

classes or parts of speech. Rather than seeing this as a disadvantage, it was decided<br />

to make use of the morphological information already implicit in the orthography,<br />

thus doing morphological tagging in parallel to a more syntactically-oriented word<br />

class tagging. It is, therefore, not necessary to develop a tool for the separation<br />

of morphemes, since this is largely catered for by the disjunctive orthography of<br />

Northern Sotho. As a result, all verbal prefixes can, for example, be tagged by making<br />

use of standard tagging technology, even though they are actually bound morphemes<br />

belonging to a complex verb form. A further motivation for the tagging of these bound<br />

morphemes is the fact that they are grammatical words or function words belonging<br />

to closed classes that normally make up a large percentage of any Northern Sotho<br />

corpus. Tagging of these forms would therefore result in a large proportion of the<br />

corpus being tagged. The decision to tag all orthographically distinct surface forms,<br />

regardless of whether these are free or bound morphemes, resulted in a tagset that<br />

is somewhat larger than normal: even though only nine word classes are traditionally<br />

distinguished for Northern Sotho, the proposed tagset contains thirty-three tag types.<br />

This number is further increased by the distinction of class-based subtypes for some of<br />

these tag types: the category EMPRO (emphatic pronoun) for example, has seventeen<br />

subtypes in order to account for the pronouns of the first and second person, as well<br />

as those of the different noun classes. The total number of tags comes to 155. (For a<br />

full discussion of the tagset design, see Prinsloo & Heid in this volume.)<br />
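
Purely to illustrate how class-based subtypes inflate a tagset, the sketch below expands a single tag type over a set of agreement labels; the label inventory is our own assumption and does not reproduce the actual Northern Sotho tagset.<br />

    # Hypothetical agreement labels; the real inventory of first/second person
    # and noun class subtypes (17 for EMPRO) is assumed, not quoted.
    AGREEMENT_LABELS = ["1sg", "1pl", "2sg", "2pl", "cl1", "cl2", "cl3", "cl4"]

    def class_subtypes(tag, labels=AGREEMENT_LABELS):
        """Expand one class-sensitive tag type (e.g. EMPRO) into class-based
        subtype tags, the mechanism by which 33 tag types grow to 155 tags."""
        return [f"{tag}_{label}" for label in labels]

    print(class_subtypes("EMPRO")[:4])
    # ['EMPRO_1sg', 'EMPRO_1pl', 'EMPRO_2sg', 'EMPRO_2pl']
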

However, the existence of complex morphological units whose parts are not<br />

realised as surface forms necessitates a multi-level annotation. A separate tool such<br />

as a morphological analyser would be needed for the analysis of inter alia verbal<br />

derivations of Northern Sotho. Typical examples that would need to be analysed by<br />

such a tool would be verbal suffixes. Such a multi-level approach could be represented<br />

as follows:<br />



Figure 4: Multi-level Approach Towards Word Class Tagging<br />

It should be noted that there are cases where the object concord appears within<br />

the verbal structure, notably the object concord of the first person singular. This<br />

particular object concord distinguishes itself from other object concords in that it is<br />

phonologically and orthographically fused to the verbal root. All other object concords<br />

are written separately from the verbal root and are thus easily identifiable, except for<br />

the object concord of class 1 before verb stems commencing with b-, for example, mo<br />

+ bona > mmona ‘see him/her.’ A procedure similar to the one illustrated above would<br />

be needed for these cases.<br />
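
A toy rule for recognising the fused class 1 object concord described above could look as follows; it is our illustration only, and in running text the pattern is of course ambiguous and would need further checks.<br />

    def split_fused_object_concord(word):
        """Undo the fusion of the class 1 object concord with b-initial verb
        stems in Northern Sotho (mo + bona > mmona 'see him/her').
        Returns (object_concord, verb_form) or None if the rule does not apply;
        real text would need extra checks, since mm- is not always a fusion."""
        if word.startswith("mm") and len(word) > 2:
            return ("mo", "b" + word[2:])
        return None

    print(split_fused_object_concord("mmona"))  # ('mo', 'bona')
    print(split_fused_object_concord("bala"))   # None
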

In the case of Zulu, morphological aspects need not be included in the word class<br />

tagging, since these are already accounted for in the morphological analysis. This<br />

difference in approach to the tagsets can be mainly ascribed to the different writing<br />

systems. The word class tagset for Zulu used for purposes of illustration above is based<br />

on the classification by Poulos & Msimang (1996:26) according to which “words which<br />

have similar functions, structures and meanings (or significances) would tend to be<br />

classified together as members of the same word category […]” The tagset comprises<br />

the following: Noun, Pronoun, Demonstrative, Qualificative, Verb, Copulative, Adverb,<br />

Ideophone, Interjection, Conjunction, and Interrogative. It is well known that the<br />



degree of granularity of a tagset should be appropriate to the purposes of the tagged<br />

corpus (Allwood et al. 2003:230).<br />

The following diagram is a summary of the distinct approaches towards word class<br />

tagging as exemplified in the two Bantu languages, Northern Sotho and Zulu. The tasks<br />

that need to be performed are similar, but the approaches and sequencing of tasks<br />

differ significantly. It is noticeable that, in Northern Sotho, no dedicated tool is needed<br />

for the separation of morphemes, since this is already implicit in the disjunctive<br />

writing system. The tagger caters to a certain extent for morphophonological rules,<br />

but is especially significant for the second level, where morphosyntactic classification<br />

of morphemes takes place. Analysis of word formation rules would only need to be<br />

done on level II, for which a morphological analyser is needed.<br />

In the case of Zulu, the morphological analyser plays a significant role in levels I<br />

and II, where constituent roots and affixes are separated and identified by means of<br />

the modelling of two general linguistic components. The morphotactics component<br />

contains the word formation rules, which determine the construction of words from the<br />

inventory of morphemes (roots and affixes). This component includes the classification<br />

of morpheme sequences. The morphophonological alternations component describes<br />

the morphophonological changes between lexical and surface levels (cf. Pretorius &<br />

Bosch 2003:273-274). Finally, Northern Sotho and Zulu are on a par in level III, where<br />

the identification of word classes, associated with the assigning of tags, takes place.<br />

Figure 5: Task Sequencing in Northern Sotho and Zulu<br />



6. Conclusion and Future Work<br />

In this paper, a comparison of approaches towards word class tagging in two<br />

orthographically distinct Bantu languages, namely Northern Sotho and Zulu, was drawn.<br />

The disjunctive versus conjunctive writing systems in these two South African Bantu<br />

languages have direct implications for word class tagging. Northern Sotho on the one<br />

hand resorts to a hybrid system, which contains information on both morphological<br />

and syntactic aspects, although biased towards morphology. In the case of Zulu, on<br />

the other hand, morphological aspects need not be included in the word class tagging,<br />

since these are already accounted for in the morphological analysis. Word class tags<br />

for Zulu are associated with syntactic information. The work described in this paper is<br />

of crucial importance for pre-processing purposes, not only for automatic word class<br />

taggers of Northern Sotho and Zulu, but also for the other languages belonging to the<br />

Sotho and Nguni language groups.<br />

Regarding future work, two significant issues have been identified. First, cases of<br />

ambiguous annotation require the application of disambiguation rules based mainly<br />

on surrounding contexts. A typical example of ambiguity is that of class membership,<br />

due to the agreement system prevalent in these languages. For instance, in Northern<br />

Sotho as well as Zulu, the class prefix of class 1 nouns is morphologically similar<br />

to that of class 3 nouns, that is, mo- (N.S) and umu- (Z). This similarity makes it<br />

impossible to correctly assign class membership of words such as adjectives, which<br />

are in concordial agreement with nouns, without taking the context into account.<br />

Secondly, the standardisation of tagsets for use in automatic word class taggers of<br />

the Bantu languages needs serious attention. A word class tagset based on standards<br />

proposed by the Expert Advisory Group on Language Engineering Standards (EAGLES)<br />

was recently proposed for Tswana, a Bantu language belonging to the Sotho language<br />

group, by Van Rooy & Pretorius (2003). Similarly, Allwood et al. (2003) propose a<br />

tagset to be used on a corpus of spoken Xhosa, a member of the Nguni language group.<br />

In order to ensure standardisation, and therefore achieve reusability of linguistic<br />

resources such as word class tagsets, this initial research on the standardisation of<br />

tagsets needs to be extended to all the Bantu languages.<br />

7. Acknowledgements<br />

We would like to thank Uli Heid for unselfishly sharing his knowledge and expertise<br />

with us. His comments on an earlier version of this paper added immeasurable value<br />

to our effort.<br />




References<br />

Allwood, J., Grönqvist, L. & Hendrikse, A.P. (2003). “Developing a Tagset and Tagger<br />

for the African Languages of South Africa with Special Reference to Xhosa.” Southern<br />

African Linguistics and Applied Language Studies 21(4), 223-237.<br />

“EAGLES.” Online at: http://www.ilc.cnr.it/EAGLES/home.html.<br />

Meinhof, C. (1932). Introduction to the Phonology of the Bantu Languages. (trans. van<br />

Warmelo, N.). Berlin: Dietrich Reimer/Ernst Vohsen.<br />

Poulos, G. & Louwrens, L.J. (1994). A Linguistic Analysis of Northern Sotho. Pretoria:<br />

Via Afrika Limited.<br />

Poulos, G. & Msimang, T. (1996). A Linguistic Analysis of Zulu. Pretoria: Via Afrika<br />

Limited.<br />

Pretorius, L. & Bosch, S.E. (2003). “Computational Aids for Zulu Natural Language<br />

Processing.” Southern African Linguistics and Applied Language Studies 21(4), 267-<br />

82.<br />

Prinsloo, D.J. & Heid, U. (this volume). “Creating Word Class Tagged Corpora for<br />

Northern Sotho by Linguistically Informed Bootstrapping”, 97-115.<br />

Van Rooy, B. & Pretorius, R. (2003). “A Word-class Tagset for Setswana.” Southern<br />

African Linguistics and Applied Language Studies 21(4), 203-222.<br />

Voutilainen, A. (2003). “Part-of-Speech Tagging.” Mitkov, R. (ed.)(2003). The Oxford<br />

Handbook of Computational Linguistics. Oxford University Press: Oxford, 219-232.<br />

Wilkes, A. (1985). “Words and Word Division: A Study of Some Orthographical Problems<br />

in the Writing Systems of the Nguni and Sotho Languages.” South African Journal of<br />

African Languages 5(4), 148-153.


Grammar-based Language Technology<br />

for the Sámi Languages<br />


Trond Trosterud<br />

Language technology projects are often either commercial (and hence closed for<br />

inspection), or small projects that run with no explicit infrastructure. The present<br />

article presents the Sámi language technology project in some detail and is our<br />

contribution to a concrete discussion on how to run medium-scale, decentralised,<br />

open-source language technology projects for minority languages.<br />

1. Introduction<br />

This article presents a practical framework for grammar-based language<br />

technologies for minority languages. Such matters are seldom the topic of discussion;<br />

one usually goes directly to the scientific results. In order to obtain these results,<br />

however, a good project infrastructure is needed. Moreover, for minority languages,<br />

the bottleneck is often represented by the lack of human expertise, that is people with<br />

a knowledge of the language, linguistics, and language technology. In such situations,<br />

we need to organise work in order to facilitate cooperation and avoid duplication of<br />

effort. Although the model presented here can hardly be considered the ultimate one,<br />

it is the result of accumulated experience gained from different types of projects,<br />

commercial, academic and grass-roots Open source, and we hereby present it as a<br />

possible source of inspiration.<br />

The Sámi languages make up one of the seven subbranches of the Uralic language<br />

family, Finnish and Hungarian being the most well-known members of two of the other<br />

sub-branches. From the point of view of typology, the Sámi languages have many<br />

properties in common with the other Uralic languages, but several non-segmental<br />

morphological processes have entered the languages as well. There are six Sámi<br />

literary languages: North, Lule, South, Kildin, Skolt and Inari Sámi. All of them are<br />

written with the Latin alphabet (including several additional letters), except Kildin<br />

Sámi, which uses the Cyrillic alphabet.<br />

Prior to our project, the main focus within Sámi computing was the localisation<br />

issue. Four of the six Sámi languages have letters that are not to be found in the Latin<br />

1 (or Latin 2) code table. At present, this issue is more or less resolved and North Sámi<br />

is the language with the fewest speakers that is nonetheless localised out of the<br />

box on all three major operating systems. No other language technology application<br />

existed prior to our work.<br />

2. Project Status Quo, Goals and Resources<br />

The work is organised in two projects, with slightly different goals. It started out<br />

as a university-based project, with the goal of building a morphological parser and<br />

disambiguator for North, Lule and South Sámi, in order to use it for scientific purposes<br />

(i.e., creating a tagged corpus with a Web-based graphical interface and using it for<br />

syntactic, morphological and lexical research, publishing reverse dictionaries, etc.).<br />

In 2003, the Norwegian Sámi parliament asked for advice on how to build a Sámi<br />

spellchecker. They considered the construction of this tool as vital for the use of<br />

North Sámi as an administrative language. As a result of this, there are now three<br />

people working on the University project and four and a half people working on the Sámi<br />

parliament project. These projects will run with the present financing for another 2<br />

years.<br />

The status quo is that we have a parser with a recall of 80 - 93% on the grammatical<br />

analysis of words in running text (modulo genre) and we disambiguate the morphological<br />

output with a recall of 99% and a precision of 93%; the outcome is slightly worse<br />

for syntactic analysis. The parsers behind these results contain approximately 100<br />

morphophonological rules, 600 continuation lexica and 2000 disambiguation rules.<br />

The figures below show the output for the morphological parser for the sentence<br />

Mii háliidit muitalit dan birra “We would like to tell about it”.<br />



Figure 1: Morphological Analysis of a Sámi Sentence<br />

Figure 2 shows the same sentence in disambiguated mode. Here, all irrelevant<br />

morphological readings are removed, and in addition, syntactic information has been<br />

added on the basis of the information given by the morphological disambiguator.<br />

Figure 2: Disambiguated Version of the Same Sentence<br />

As for the speller project, there is an alpha version, made with the aspell utility.<br />

The parser has also been put to use in interactive pedagogical programs, and there<br />

are concrete plans for making a Sámi text-to-speech application.<br />



3. Choice of Approach<br />

3.1 A Grammatical Versus Statistical Approach<br />

We use a grammar-based, rather than a statistical approach (proponents of the<br />

statistical approach often refer to this dichotomy as a choice between a ‘symbolic’<br />

and a ‘stochastic’ approach), which means that our parsers rely on a set of grammar-<br />

based, manually written rules that can be inspected and edited by the user. There<br />

are several reasons for our choice:<br />

• We think some of the prerequisites for good results with the statistical<br />

approach are not present in the Sámi case;<br />

• We want our work to produce grammatical insight, not only functioning<br />

programs, and,<br />

• On the whole, we think the grammatical approach is better for our<br />

purposes.<br />

Addendum to (1): Good achievements with a statistical approach require both<br />

large corpora and a relatively simple morphological structure (low wordform/lemma<br />

ratio), as is the case for English. Sámi, like many other languages, has<br />

a rich morphological structure and a paucity of corpus resources, whereas the basic<br />

grammatical structure of the language is reasonably well understood.<br />

Addendum to (2): Our work is a joint academic and practical project. Work on<br />

minority languages will typically be carried out as cooperative projects between<br />

research institutions and in-group individuals or organisations devoted to the<br />

strengthening of the languages in question. Whereas private companies will look<br />

at the ratio of income to development cost and care less about the developmental<br />

philosophy, it is important for research institutions to work with systems that are<br />

not ‘black boxes’, but that are able to give insight into the language beyond merely<br />

producing a tagger or a synthetic voice.<br />

Addendum to (3): We are convinced that grammar-based approaches to both parsing<br />

and machine translation are superior to the statistical ones. Studies comparing the<br />

two approaches, such as Chanod & Tapanainen (1994), support this conclusion.<br />

This does not mean that we rule out statistical approaches. In many cases, the<br />

best results will be achieved by combining grammatical and statistical approaches.<br />

A particularly promising approach is the use of weighted automata, where frequency<br />

considerations are incorporated into the arcs of the transducers. We plan to apply<br />

standalone statistical methods only where the grammatical analysis gives in. In other<br />

words, the cooperation should be ruled by the motto: ‘Don’t guess if you know.’<br />



3.2 Choosing Between a ‘Top-down’ and a ‘Bottom-up’ Approach<br />

Within grammatical approaches to parsing there are two main approaches, which<br />

we may brand ‘top-down’ and ‘bottom-up’. The top-down approach tries to map a<br />

possible sentence structure upon the sentence, as a possible outcome of applying<br />

generative rules on an initial S node. If successful, the result is a syntactic tree<br />

displaying the hierarchical structure of the sentence in question.<br />

The bottom-up approach, on the other hand, takes the incoming wordforms and<br />

the set of their possible readings as input. It then disambiguates multiple readings<br />

based upon context and builds structures.<br />

We chose a bottom-up approach, because it proved robust, was able to analyse any<br />

input and gave good results.<br />

4. Linguistic Tools<br />

4.1 The Tools Behind our Morphological Analyser<br />

For our morphological analyser, we build finite-state transducers and use the<br />

finite-state tools provided by Xerox (documented in Beesley & Karttunen 2003). For<br />

morphophonological analysis, we have the choice of using the parallel, two-level<br />

morphology model (dating back to Koskenniemi 1983) with twolc or the sequential<br />

model (presented in Karttunen et al.1992) with xfst. Xerox’ advice is to use<br />

the latter; we use the former, but we see this mainly as a matter of taste. The<br />

morphophonological and lexical tools are composed into a single transducer during<br />

compilation, as described in the literature (cf. the figure below).<br />

Figure 3: A Schematic Overview of the Lexicon and Morphophonology of the Parser.<br />



A more serious question is the choice of Xerox tools versus Open Source tools. In our<br />

project, we have no wish to modify the source code of the rule compilers themselves,<br />

but we notice that all binary files compiled by the xfst, lexc and twolc compilers are<br />

copyrighted property of the Xerox Corporation. It is as if you were to write your own ‘C’<br />

program, but the compiled version of your program was copyright-owned by Kernighan<br />

and Ritchie, the authors of the C compiler. That said, it has been a pleasure working<br />

with Xerox: they have been very helpful, and as they see no commercial potential in<br />

Sámi, we notice no practical consequences of using proprietary compilers.<br />

4.2 The Tools Behind our Disambiguator<br />

For disambiguating the output of the morphological transducer we use constraint<br />

grammar. This is a framework dating back to Karlsson (1990), the leading idea<br />

being that, for each wordform of the output, the disambiguator looks at the context<br />

and removes any reading that does not fit the context. The last reading can never be<br />

removed and, in the successful case, only the appropriate reading is left. The Brill<br />

tagger can be seen as a machine-learning variety of the constraint grammar parser.<br />
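
A highly simplified sketch of this remove-if-the-context-disallows idea is given below; the cohorts, tags and the single rule are invented for illustration and bear no relation to the actual Sámi rule set or to the CG-2 rule syntax.<br />

    def apply_remove_rule(cohorts, target_tag, test):
        """Remove readings carrying target_tag from each cohort when test(context)
        holds, but never remove the last remaining reading (the core safety
        principle of constraint grammar)."""
        for i, (wordform, readings) in enumerate(cohorts):
            if test(cohorts, i):
                kept = [r for r in readings if target_tag not in r]
                if kept:  # the last reading can never be removed
                    cohorts[i] = (wordform, kept)
        return cohorts

    # Invented toy example: remove a verb reading immediately after a determiner.
    sentence = [("the", [("the", "Det")]),
                ("run", [("run", "N"), ("run", "V")])]
    after_det = lambda cohorts, i: i > 0 and any("Det" in r for r in cohorts[i - 1][1])
    print(apply_remove_rule(sentence, "V", after_det))
    # [('the', [('the', 'Det')]), ('run', [('run', 'N')])]
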

There are several versions of the constraint grammar compilers. The original one<br />

was written in Lisp by Fred Karlsson. Later, Pasi Tapanainen wrote a compiler in ‘C’,<br />

called CG-2; this version may be licensed from http://www.connexor.com. We use an<br />

open source version of this compiler, made by Eckhard Bick. It must be stressed that<br />

the debugging facility of the Connexor compiler is superior to that of its competitors.<br />

The optimal implementation would probably be to write the constraint grammar<br />

as a finite state transducer, as suggested in the Finite State Intersection Grammar<br />

framework. So far, nothing has come out of this work.<br />

4.3 One-base, Multi-purpose Parsers<br />

Working with minority languages, the lack of human resources is often as hard a<br />

problem as the lack of financial ones. With this in mind, avoiding duplicating work<br />

becomes crucial. The most time-consuming task in any linguistic project is building<br />

and maintaining a lexicon, be it in the form of a paper dictionary, a semantic wordnet,<br />

or the lexicon for a parser. The optimal solution is to keep only one version of the<br />

lexicon and extract relevant information from it, in order to automatically build paper<br />

and electronic dictionaries, orthographical wordlists or parsers. In our project, this<br />

has not yet been implemented, but we are trying out prototype models<br />

in order to make this work for new languages. Our plan is to use XML as text storage,<br />

and various scripts to extract the relevant lexicon versions.<br />



It goes without saying that we use only one source for the morphological transducer used for<br />

linguistic analysis, pedagogical programs, spellers, and so forth. These applications<br />

often need slightly different transducers, in which case we mark the source code so<br />

that it is possible to compile different transducers from the same source code. For<br />

the academic project we make a tolerant parser that analyses as much of the attested<br />

variation as possible. The spellchecker has a totally different goal: here we build a<br />

stricter version that only accepts the forms codified in the accepted standard. This<br />

approach is even more appropriate, as we are the only language technology project<br />

working on Sámi. Any further application will build upon our work, and our goal is to<br />

make it flexible enough to facilitate this.<br />
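
The single-source idea could be realised along the following lines; the XML schema, attribute names and entries are our assumptions, not the project’s actual storage format.<br />

    import xml.etree.ElementTree as ET

    # Hypothetical single-source lexicon (schema and entries invented for
    # illustration): each entry may be marked as standard and/or attested.
    LEXICON_XML = """
    <lexicon>
      <entry lemma="muitalit" standard="yes"/>
      <entry lemma="háliidit" standard="yes"/>
      <entry lemma="oldvariant" standard="no" attested="yes"/>
    </lexicon>
    """

    def extract(xml_text, for_speller=False):
        """Extract a word list from the single source: the speller version keeps
        only forms of the codified standard, the analyser version keeps all
        attested forms (tolerant versus normative transducer)."""
        root = ET.fromstring(xml_text)
        return [e.get("lemma") for e in root.iter("entry")
                if not for_speller or e.get("standard") == "yes"]

    print(extract(LEXICON_XML))                    # tolerant analyser word list
    print(extract(LEXICON_XML, for_speller=True))  # strict speller word list
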

5. Infrastructure<br />

5.1 Computer Platform<br />

Our project is run on Linux and Mac OS X (Unix). The Xerox tools come in a Windows<br />

version as well, but the lack of a decent command-line environment and automatic<br />

routines for compiling makes it impractical. The cvs base is set up on a central Linux<br />

machine. We also use portable Macintoshes, both because they have a nice interface<br />

and because they offer programs that make it easier to work from different locations,<br />

such as the SubEthaEdit program mentioned below.<br />

5.2 Character Set and Encoding<br />

Most commercially interesting languages are covered by one of the 8-bit ISO<br />

standards. Many minority languages fall outside of this domain. It is our experience<br />

that it is both possible and desirable to use UTF-8 (multi-byte Unicode) in our source<br />

code (i.e., build the parser around the actual orthography of the language in question,<br />

rather than constructing some auxiliary ASCII representation). With the latest versions<br />

of the Linux and Unix operating systems and shells, we have access to tools that are<br />

UTF-8 aware and, although it takes some extra effort to tune the development tools<br />

to multi-byte input, the advantage is a more readable source code (with correct<br />

letters instead of digraphs) and an easier input/output interface, as UTF-8 now is the<br />

de facto standard for digital publishing.<br />

There is one setting where one could consider using a transliteration, namely, for<br />

languages using syllabic scripts, such as Inuktitut and Cherokee. In a rule stating that<br />

a final vowel is changed in a certain environment, a syllabic script will not give any<br />

single vowel symbol to change; rather than changing, for instance, a to e in a certain<br />

context, the rule must change the syllabic symbol BA to BE, DA to DE, FA to FE, GA to<br />



GE, and so forth. It still may be better, however, to use the original orthography; each<br />

case requires its own evaluative process.<br />
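
The point can be made concrete with a toy rule; the syllable labels and the tiny mapping table are purely illustrative and do not correspond to real Cherokee or Inuktitut data.<br />

    # Toy mapping: for a syllabic script, "change final a to e" must be stated
    # once per consonant series, not as a single vowel rule.
    FINAL_A_TO_E = {"BA": "BE", "DA": "DE", "FA": "FE", "GA": "GE"}

    def change_final_vowel(syllables):
        """Apply the 'final a -> e' rule to a word given as a list of syllable
        symbols; unknown final syllables are left unchanged."""
        if not syllables:
            return syllables
        *body, last = syllables
        return body + [FINAL_A_TO_E.get(last, last)]

    print(change_final_vowel(["KO", "GA"]))  # ['KO', 'GE']
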

5.3 Directory Structure<br />

We have put some effort into finding a good directory structure for our files. The<br />

philosophy is as follows: different types of files are kept separate. (The source files<br />

have their own directory, and binary and developer files are kept separate.)<br />

Figure 4: Directory Structure<br />

All our source and documentation files are under version control using cvs. This<br />

means that the original files are stored on our central computer (with backup routines),<br />

and that each co-worker ‘checks out’ a local copy that becomes his or her version to<br />

work on. After editing, the changed files are then copied back, or ‘checked in’ to the<br />

central repository. For each check-in, a short note on what has been done is written.<br />

We have also set up a forwarding routine so that all co-workers get a copy of all cvs<br />

log messages via email.<br />



Figure 5: Quote from cvs Log<br />

Using cvs (or some other version control system) is self-evident to any programmer,<br />

and it may be seen as quite embarrassing that such a trivial fact should be even<br />

mentioned here. It is our experience that the use of version control systems is by<br />

no means standard within academic projects, and we strongly urge anyone not using<br />

such tools to consider doing so. Backup routines become easier, and, when expanding<br />

from one-person projects to large projects, it is a prerequisite when several co-<br />

workers collaborate on the same source files. We even recommend cvs for one-person<br />

projects. Using cvs, it is easier to document what has been done earlier, and to go<br />

back to previous versions to find out when a particular error may have crept in.<br />

5.5 Building with ‘Make’<br />

Another self-evident programmer’s tool is the use of makefiles, via the program<br />

make. In its basic form, make functions like a script and saves the work of typing the<br />

same long set of compilation commands again and again. With several source files, it<br />

becomes important to know whether one should compile or not. The make program<br />

keeps track of the age of the different files, and compiles a new set of binary files<br />

only when the source files are newer than the target binary files. The figure below shows<br />

the dependencies between the different source and binary files.<br />
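
The core of this behaviour, rebuilding a target only when one of its sources is newer, can be imitated in a few lines of Python; the file names and the compile command below are placeholders, and the project itself of course relies on make.<br />

    import os
    import subprocess

    def build_if_needed(sources, target, command):
        """Run command when the target is missing or older than any of its
        sources, mimicking make's timestamp comparison."""
        if (not os.path.exists(target)
                or any(os.path.getmtime(s) > os.path.getmtime(target) for s in sources)):
            subprocess.run(command, check=True)

    # Placeholder usage; file names and the compile script are hypothetical.
    # build_if_needed(["sme-lex.txt", "twol-sme.txt"], "sme.fst",
    #                 ["sh", "compile-sme.sh"])
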



Figure 6: Dependencies in the Project’s Makefile<br />

5.6 Tracing Bugs<br />

As the project grows, so does the number of people working on it, and hence the number<br />

of bugs and errors. We have set up a bug database, in our case Bugzilla, which<br />

keeps track of errors. The database can be found at the address<br />

http://giellatekno.uit.no/bugzilla/. Interested persons may visit the URL. There is a requirement that<br />

you log in with an e-mail account and (preferably) a name, but otherwise the bug<br />

database is open for inspection.<br />

5.7 Internal Communication in a Decentralised Project<br />

We have co-workers in Tromsø, Kautokeino and Helsinki. Crucial for the project’s<br />

progress is the possibility of coordinating our work. For that, we have the following<br />

means:<br />



• We have made a project-internal newsgroup. Discussion is carried out<br />

in this environment rather than in personal emails, since more than one person<br />

may have something to say on the issue, and since it is easier to go back to earlier<br />

discussions using the newsgroup format.<br />

• For simultaneous editing of the same document, be it source code or a<br />

meeting memo, we use a program called SubEthaEdit<br />

(http://www.codingmonkeys.de/subethaedit/ [only for Mac OS X]). This program makes it possible for several<br />

users to edit the same file at the same time. Combined with the use of the telephone<br />

(or voice chat!), we may discuss complicated matters on a common rule set while<br />

editing together, even though we sit in different countries.<br />

• For informal discussions, we use chat programs. The built-in Mac OS X<br />

chat application iChat also facilitates audio and video chats with decent to high<br />

video and sound quality (mainly restricted by the available bandwidth).<br />

• We have meetings over the phone; although we planned to conduct them<br />

using iChat (with up to ten participants in the same audio chat), technical problems<br />

with a firewall have stopped us from doing this.<br />

• The cvs version control and Bugzilla error database also facilitate working<br />

in several locations.<br />

5.8 Source Code and Documentation<br />

In our experience, a systematic approach to documentation is required even<br />

when the project engages only one worker, and it is indispensable when the number of<br />

workers grows beyond two. Working on the only Sámi language technology project in<br />

the world, we acknowledge that all future work will take our work as a starting point.<br />

We thus work in a one-hundred-year perspective, and write documentation so that<br />

those who follow us will be able to read what we have done.<br />

We document:<br />

• The external tools we use (with links to the documentation provided by<br />

the manufacturer);<br />

• The infrastructure of our project; and,<br />

• Our source files and the linguistic decisions we have made.<br />

In an initial phase, we used to write documentation in HTML, which was available<br />

only internally on the project machines. We now write documentation in XML, and<br />

convert it to HTML via the XML publishing framework Forrest (http://forrest.apache.org/).<br />

Documentation can be published in many ways, but it is our experience that it<br />

is most convenient to read the documentation in a hypertext format such as HTML.<br />



Since the documentation has grown, we also use a search engine (provided by Forrest)<br />

to find what we have written on a given topic.<br />

The internal documentation of our project is open for inspection at the website<br />

http://divvun.no/ (the proofing tools project), as well as http://giellatekno.uit.no<br />

(the academic disambiguator project). The technical documentation is in English, and<br />

can be found under the tab TechDoc.<br />

Our source code is open as well; it is downloadable via anonymous cvs, as described in our<br />

technical documentation. We believe that publishing the source code of projects like<br />

this will lead to progress within the field, not only in general, but especially for<br />

minority language projects.<br />

By publishing the documentation and the source code, we make it easy to explain<br />

what we do; we hope that it will inspire others to perhaps give us some constructive<br />

feedback as well. The only possible drawback of this openness is that it exposes our<br />

weaknesses to the whole world. So far, we have not noticed any negative effects in<br />

this regard.<br />

6. Costs<br />

Except for the computers themselves and the operating system and applications<br />

that come with them, we have mostly used free or open-source software for all our<br />

tasks. In the few cases where we have paid for software, there are free or open-source<br />

alternatives. The notable exception is the set of compilers for morphophonological<br />

automata. For analysing running text and generating stray forms, the Xerox tools can<br />

be used in the versions found in Beesley & Karttunen (2003). For our academic project,<br />

these tools have proven good enough, but in order to generate larger paradigms, the<br />

commercial version of the tools is needed.<br />

As for the computers, the only really demanding task is compiling the transducers.<br />

If one is willing to wait five minutes for the compilation, any recent computer will do<br />

fine, but the top models cut compilation time to less than half of what the cheapest<br />

models can do. Macs turned out to be a good choice for our projects, and the cheapest<br />

Mac can be bought for roughly 500 USD/EUR. One good investment, though, is to buy<br />

more RAM than what can be found on the standard configuration.<br />

7. Summary<br />

When doing language technology for minority languages, we are constantly faced<br />

with the fact that there are few people working with each language, and that different<br />

language projects set off in different directions, often due to mere coincidence. Our<br />

answer to this challenge is to share both our experience and our infrastructure with<br />



others. By doing this, we hope that people will borrow from us, and comment upon<br />

what we do and how we do it. We also look forward to being confronted with other<br />

solutions and to borrowing improvements back.<br />




References<br />

Beesley, K.R. & Karttunen, L. (2003). Finite State Morphology. Stanford: CSLI<br />

Publications. http://www.fsmbook.com/.<br />

Bick, E. (2000). The Parsing System “Palavras”: Automatic Grammatical Analysis of<br />

Portuguese in a Constraint Grammar Framework. Dr. phil. thesis, Aarhus University<br />

Press.<br />

Brill, E. (1992). “A Simple Rule-based Part of Speech Tagger.” Proceedings of the Third<br />

Conference on Applied Natural Language Processing. ACL, Trento, Italy, 1992.<br />

“Bugzilla.” Online at http://giellatekno.uit.no/bugzilla/.<br />

“Bures boahtin sámi giellateknologiija prošektii.” Online at<br />

http://giellatekno.uit.no.<br />

Chanod, J.P. & Tapanainen, P. (1994). “Tagging French: Comparing a Statistical and<br />

a Constraint-based Method.” Seventh Conference of the European Chapter of the<br />

Association for Computational Linguistics, 149-156.<br />

“Connexor.” Online at http://www.connexor.com.<br />

“Divvun - sámi korrektuvrareaiddut.” Online at http://divvun.no/.<br />

“Forrest.” Online at http://forrest.apache.org/.<br />

Jelinek, F. (2004). “Some of my best friends are linguists.” LREC 2004.<br />

http://www.lrec-conf.org/lrec2004/doc/jelinek.pdf.<br />

Karlsson, F. (1990). “Constraint Grammar as a Framework for Parsing Running Text.” Karlgren,<br />

H. (ed.) (1990). Papers presented to the 13th International Conference on Computational<br />

Linguistics, 3, 168–173, Helsinki, Finland, August. ICCL, Yliopistopaino, Helsinki.


Karttunen, L., Kaplan, R.M. & Zaenen, A. (1992). “Two-level morphology with<br />

composition.” COLING’92, Nantes, France, August 23-28, 141-148.<br />

Koskenniemi, K. (1983). Two-level Morphology: A General Computational Model for<br />

Word-form Production and Generation. Publications of the Department of General<br />

Linguistics, University of Helsinki.<br />

Samuelsson, C. & Voutilainen, A. (1997). “Comparing a linguistic and a stochastic<br />

tagger.” Proceedings of the 35th Annual Meeting of the Association for Computational<br />

Linguistics, 1997.<br />

“SubEthaEdit.” <strong>Online</strong> at http://www.codingmonkeys.de/subethaedit/.<br />

Voutilainen, A., Heikkilä, J. & Anttila, A. (1992). Constraint Grammar of English: A<br />

Performance-Oriented Introduction, 21. Helsinki: Department of General Linguistics,<br />

University of Helsinki.<br />



The Welsh National Online<br />

Terminology Database<br />

Dewi Bryn Jones and Delyth Prys<br />

Terminology standardisation work has been on-going for the Welsh language for<br />

many years. At an early date, a decision was taken to adopt international standards<br />

such as ISO 704 and ISO 860 for this work. It was also decided to store the terminologies<br />

in a standard format in electronic databases, even though the demand in the early<br />

years was for traditional paper-based dictionaries. Welsh is now reaping the benefits of<br />

those far-seeing early decisions. In 2004, work began on compiling a national database<br />

of bilingual (Welsh/English) standardised terminology. Funded by the Welsh Language<br />

Board, it will be made freely available on the World Wide Web. Electronic databases<br />

already in existence have been revisited and reused for this project, with a view to<br />

updating them to conform to an ISO terminology mark-up framework (TMF) standard.<br />

An additional requirement of this project is that the term lists should be packaged and<br />

made available in a compatible format for downloading into popular Termbase systems<br />

found in translation tool suites such as Trados, Déjà Vu and Wordfast. As far as we<br />

know, this is the first time that a terminology database has been developed to provide<br />

a freely available Termbase download utility at the same time as providing an online<br />

searchable facility. Parallel work of utilising an ISO lexical mark-up framework (LMF)<br />

compliant standard for another project, namely, the LEXICELT Welsh/Irish dictionary,<br />

has provided the opportunity to research similarities and differences between a<br />

terminological concept-based approach and a lexicographical lexeme-based one.<br />

Direct comparisons between TMF and LMF have been made, and both projects have<br />

gained from new insights into their strengths and weaknesses. This paper will present<br />

an overview of a simple implementation for the online database, and attempt to show<br />

how frugal reuse of existing resources and adherence to international standards both<br />

help to maximise sparse resources in a minority language situation.<br />

1. Introduction

Terminology work for Welsh has seen increased activity over the last ten years, in particular in government administration and the public sector, following the passing of the Welsh Language Act 1993 and the establishment of the National Assembly for Wales. Many bilingual Welsh/English dictionaries have been published by organisations operating in Wales, covering subject fields within secondary education, academia, health, social services and public administration. Welsh terminology is generally perceived merely as an aid to standardised translation of English terms (cf. Prys 2003).

Depending on the organisation responsible for commissioning a dictionary, dissemination to translators and to the public at large is done via paper-based and/or electronic means. As a result, the provision of standardised terminology is fragmented and dispersed. Translators have to keep and maintain their own collection of paper-based dictionaries, and/or keep track of where and how to access electronic versions. Finding a Welsh translation may involve laboriously looking in more than one dictionary.

Meanwhile, the public at large, who do not have such a collection of terminology dictionaries, cannot take part in a determinologisation process, in which specialised terms become incorporated into the general language, thereby safeguarding or increasing the presence of Welsh in the commercial, printed media and popular culture sectors.

Thus, the Welsh Language Board commissioned the e-Welsh team to develop the Welsh National Online Terminology Database, in order to centralise terminology dictionary dissemination and make it more efficient.

2. Requirements for the Welsh National Online Terminology Database

The requirements for the Welsh National Online Terminology Database project comprised two parts. First, previously published dictionaries of standardised terminology, in particular those commissioned by the Welsh Language Board, were to be compiled and stored in a new online terminology database system. This new online terminology database constituted the second part of the requirements: its purpose would be to provide a freely available, central Web-based resource supporting the dissemination of Welsh standardised terminology via the greatest possible number of formats, channels and mechanisms.

The system would support:

• searching for and presenting term translations across one or more dictionaries stored in the system;

• downloading complete dictionaries in various formats for offline use and for integration into translators' own Translation Memory environments' termbase tools, such as Trados MultiTerm, Déjà Vu, Wordfast and Cysgeir, a dictionary product developed by e-Welsh; and,

• presentation of its data as XML for possible incorporation into other online terminology systems.

Since the database system would be a repository of published standardised terminology, there were no requirements for wider-scoped terminology management functionality such as editing and standardisation process support.

3. Standards in Welsh Terminology

ISO international standards have been adopted in Welsh terminology work from an early date. In 1998, members of the e-Welsh team compiled a guideline document on the use of ISO standards (Prys & Jones 1998) for all terminology standardisation providers in Wales.

The document mandated the use of the principles and methods of ISO 704 and ISO 860. The guidelines helped to raise the discipline of terminology standardisation for Welsh above what might otherwise be typical of a lesser-used and lesser-resourced language, where:

• the work may be led by linguists with insufficient subject specialist knowledge, or by subject specialists with insufficient linguistic expertise;

• less technically competent subject specialists may independently develop a paper-based dictionary in a word processing application; and,

• new words and terms are too easily coined, along with spelling reforms, in a misguided attempt to widen the appeal of the language.

The guidelines also mandated the use of databases in accordance with ISO/TR 12618:1994 for terminology. The guideline document advised on the structure of such databases, as well as on the fields (or data categories) to be populated for any Welsh terminology dictionary.

Dictionaries would be developed in tabular format, with columns to store fields such as terms, term plurals, disambiguators for homonyms and Welsh grammatical information such as part-of-speech. Crucially, each row represented the concept level.

Thus, by employing a single table in a simple database application, no special terminological tools are required. A consistent bidirectional dictionary is easily created, while the report and macro functionality found in office productivity software can be used to create printed versions. A single table is also simple to host on a website.

Below is a typical example of a Welsh/English terminology dictionary entry:

home help (of person)   cynorthwyydd cartref eb cynorthwywyr cartref;
home help (of service)  cymorth cartref eg
cynorthwyydd cartref eg cynorthwywyr cartref   home help;

In effect, this simple adoption of recommendations from the ISO standards made all previous terminology dictionaries commissioned in Wales 'future proof': they were available to the Welsh National Online Terminology Database project electronically and already in a database format.

Over the course of numerous dictionary commissions, however, the guidelines have become outdated compared with the latest ISO standards. Further needs have been identified, including the need to improve the interoperability of data and the reuse of software components. Weaknesses were identified in the guidelines both in the structure of the terminological data and in the specification of the data category selection. More specifically:

Structure – a single database table is too rigid a structure. Columns were sometimes duplicated as a means of overcoming this inflexibility in order to store multiple terms for a single concept. This is poor database design and practice.

Data category selection – although a data category selection was specified, the actual names of the categories for use in database tables were not. Therefore, across the many dictionary database tables, the field containing, for example, the English term has different names, such as [English], [Saesneg] and [Term Saesneg].

Thus, the Welsh National Online Terminology Database project presented an opportunity to update and extend our adoption of standards.

4. Welsh National Online Terminology Database

4.1 Use of Standards for the Welsh National Online Terminology Database

The Welsh National Online Terminology Database would need an improved structure in order to scale up to the number of terms, and the richness of data, that terminological entry records may be expected to store in the future. The improved structure would come from ISO 16642, also known as the Terminological Markup Framework (TMF).

TMF defines an abstract model, or meta-model, for terminological entries. From the meta-model, Terminological Markup Languages (TMLs) can be derived to facilitate the representation and transfer of terminology data. Adoption of the TML data model, illustrated in the following figure, thus provides a much-improved representation for all terminology entries in the Welsh National Online Terminology Database.

Figure 1: ISO 16642 / TMF Meta-Model Structure for Terminological Entries

The structure is concept-based: a hierarchical structure containing multiple language sections, each of which contains one or more terms in that language that represent the concept in question. At the concept level, conceptual or classification system data can be added to increase the granularity of term classification beyond the containing dictionary and the subject field implied by the dictionary title.

Language sections provide the means for storing terms from one, two or any number of languages. This opens the way to multilingual Welsh terminology dictionaries and to incorporating languages that are related to Welsh or widely used by Welsh speakers, such as Spanish and the other Celtic languages.

The term section provides an efficient means of associating the many terms that represent a concept in a particular language. TMF also specifies the mechanism by which its structure is populated with data categories selected from a data category registry, or catalogue, as defined in ISO 12620:1999.

The catalogue contains well-defined data categories and picklist values for use within a TML structural representation. Essentially, ISO 12620:1999 provides standardised names for data categories.
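The concept-oriented hierarchy can be pictured with a minimal sketch in Python (purely illustrative; the element and attribute names of an actual TML such as TBX differ):

from dataclasses import dataclass, field
from typing import List

@dataclass
class TermSection:
    """One term that designates the concept in a given language."""
    term: str
    part_of_speech: str = ""   # e.g. 'eg' (masculine noun) for a Welsh term
    plural: str = ""

@dataclass
class LanguageSection:
    """All terms for the concept in one language."""
    lang: str                  # e.g. 'cy' or 'en'
    terms: List[TermSection] = field(default_factory=list)

@dataclass
class TerminologicalEntry:
    """Concept level: one entry per concept, however many languages and terms."""
    disambiguator: str = ""    # e.g. '(of person)'
    languages: List[LanguageSection] = field(default_factory=list)

# A bilingual entry for the 'home help (of person)' concept:
entry = TerminologicalEntry(
    disambiguator="(of person)",
    languages=[
        LanguageSection("en", [TermSection("home help")]),
        LanguageSection("cy", [TermSection("cynorthwyydd cartref", "eg", "cynorthwywyr cartref")]),
    ],
)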


4.2 Implementing TMF: a Simple First Implementation

a) TMF Structure Compliance

A very simple implementation was completed that allowed us to use and benefit from aspects of TMF quite easily. We did not intend to derive our own terminology representation from TMF, preferring instead to use an existing XML format. The TMF XML implementation chosen was TBX (http://www.lisa.org/tbx). TBX complies with the requirements of the TMF meta-model. It is a flexible format that allows users to specify their own selection of data categories from ISO 12620, and to define their own data categories, via its eXtensible Constraint Specification (XCS).

A number of resources are available from the TBX home page at http://www.lisa.org to aid in its adoption, in particular documentation describing the standard further. An XML Schema definition is provided to describe the structure, and a collection of ISO 12620 data categories is described and provided in a default eXtensible Constraint Specification file (TBXDv04.xml).

There is limited tool support for TBX, but with xsd.exe, a freely available XML Schema Definition tool within the Microsoft .NET framework (http://msdn.microsoft.com/library/en-us/cptools/html/cpconXMLSchemaDefinitionToolXsdexe.asp), we were able to generate serializable C# classes. The generated C# code is available to any other code written to construct TMF-compliant representations of terminological entries for inclusion in the Welsh National Online Terminology Database system. Construction simply involves creating object instances of the generated TBX classes and assigning values to their member variables. When such object instances are serialized via the .NET framework's XML serializer, the resulting XML conforms to the original TBX schema. The following shows the generated C# class for the top-level TBX termEntry element:

// C# class generated by xsd.exe from the TBX XML Schema
[System.Xml.Serialization.XmlRootAttribute(Namespace="", IsNullable=false)]
public class termEntry
{
    public noteText_impIDLangTypTgtDtyp descrip;

    [System.Xml.Serialization.XmlElement]
    public descripGrp[] descripGrp;

    public noteText_impIDLangTypTgtDtyp admin;

    public adminGrp adminGrp;

    public transacGrp transacGrp;

    public noteText_impIDLang note;

    public @ref @ref;

    public xref xref;

    [System.Xml.Serialization.XmlElementAttribute("langSet")]
    public langSet[] Items;

    [System.Xml.Serialization.XmlAttributeAttribute(DataType="ID")]
    public string id;
}

b) Data Category Selection

Data categories, and in particular TBX's eXtensible Constraint Specification support, also need to be available for the construction of terminological entries. However, since there are not yet a great number of data categories used by Welsh dictionaries, we decided that it was simpler to hardcode the placement and setting of data categories into the TBX structure in wrapper code around the generated C# TBX code. The wrapper code thus provides easy access to the superset of all fields, or data categories, from all imported dictionaries.

The default selection of data categories given by TBX corresponds to most of the fields used in previous dictionaries. Newly utilised categories from ISO 12620 aid in the machine processing of terms. For example, SortKey is used to implement Welsh sort order for all exported lists and dictionaries.
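The Welsh alphabet treats the digraphs ch, dd, ff, ng, ll, ph, rh and th as single letters, so a plain byte-wise sort gives the wrong order for Welsh terms. The following is a minimal sketch of such a sort key in Python (an illustration only, not the project's actual SortKey values, and it ignores subtleties such as the non-digraph n+g sequence inside some compounds):

# Welsh collation: digraphs count as single letters of the alphabet.
WELSH_ALPHABET = [
    'a', 'b', 'c', 'ch', 'd', 'dd', 'e', 'f', 'ff', 'g', 'ng', 'h',
    'i', 'j', 'l', 'll', 'm', 'n', 'o', 'p', 'ph', 'r', 'rh', 's',
    't', 'th', 'u', 'w', 'y',
]

def welsh_sort_key(term: str) -> list:
    """Map a Welsh term to a list of alphabet positions usable as a sort key."""
    term = term.lower()
    key, i = [], 0
    while i < len(term):
        # Prefer a digraph match over a single letter (e.g. 'll' before 'l').
        if term[i:i + 2] in WELSH_ALPHABET:
            key.append(WELSH_ALPHABET.index(term[i:i + 2]))
            i += 2
        elif term[i] in WELSH_ALPHABET:
            key.append(WELSH_ALPHABET.index(term[i]))
            i += 1
        else:
            i += 1  # skip spaces, hyphens, etc.
    return key

# 'llan' sorts after 'lwc', because 'll' follows 'l' in the Welsh alphabet:
print(sorted(['llan', 'lwc', 'chwarae', 'cath'], key=welsh_sort_key))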

Some data categories were created in order to store the typical Welsh/English dictionary entry given earlier efficiently:

• Welsh part-of-speech (WelshPartOfSpeech)

The standard picklist of values representing part-of-speech for Welsh terms could be maintained with the addition of the WelshPartOfSpeech data category.

• Welsh plural (WelshPlural)

Further grammatical information for a term, such as its plural, can be stored under this data category.

• Disambiguator (concept-disambiguator)

A simple disambiguating field, containing a brief explanation of the context of a term, had been mandated by the previous guidelines in cases of Welsh or English homonyms, for example:

primate (=bishop) archesgob
primate (=monkey) primat

The default language for the concept-disambiguator data category is English. However, when a Welsh language disambiguator needs to be included, it is contained within the appropriate TBX XML element.

• Dictionary (originatingDictionary)

The dictionary from which a term originates can be noted in the TBX representation via this new category.

• Responsible organisation (responsibleOrganisation)

The organisation responsible for the standardisation of the term can be noted in this new category.

The specification of the additional data categories is given in the extract below from TBXDv04_CY.xml:

[XCS extract from TBXDv04_CY.xml not reproduced here: its XML markup was not preserved. It defines the additional data categories, including the WelshPartOfSpeech picklist (adf ans be eb eg eg/b ell adj n v), and specifies the levels (termEntry, langSet, term) at which each category may be used.]

An example of a TBX XML string with the Welsh data categories in action:

[TBX termEntry listing not reproduced here: its XML markup was not preserved. The entry encodes the 'home help (of person)' concept, with working status consolidatedElement; its English language section contains the term 'home help' (term type entryTerm, grammatical number singular, sort key home help$sf$(of person)$sf$), and its Welsh language section contains the term 'cynorthwyydd cartref' (eg, plural cynorthwywyr cartref, singular, sort key cynorthwyydd cartref$sf$$sf$eg).]


c) Database Design

The TBX XML representation needs to be stored in a relational database. Database schemas can usually be generated from XML Schemas. However, since the Welsh National Online Terminology Database has the simple purpose of being a storage facility for the dissemination and presentation of standardised, and thus fixed or published, terminology data in various formats, a sufficient yet effective solution is to store the entire TBX representation in the database as a string. The database design therefore contains a single table holding all terminological entries.

Table 1: Database Table Design for Storing the Terminological Entries

TermEntries
PK  TermEntry_ID
    TermEntry

Term data inside XML strings is not accessible in a relational manner for searching and querying. Therefore, an index table is added, containing, for each term, an XPath that points to its location within the containing TBX string stored under TermEntry_ID.

Table 2: Database Table Design for the Index to Terminological Entries

TermIndex
PK  TermIndex_ID
PK  TermEntry_ID
PK  Language_ID
    Term
    TermEntry_XPath
    SortKey


Table 3: An Example Entry in the TermIndex Table

TermIndex         Value
TermIndex_ID      2614
TermEntry_ID      1328
Language_ID       cy
Term              golwg grŵp
TermEntry_XPath   //termGrp[@id='tid-tg-2554-cy-1']
SortKey           golwg grwp$sf$$sf$eb
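Given an index row like the one above, lookup reduces to a relational query on TermIndex followed by an XPath extraction from the stored TBX string. The following is only a language-neutral sketch of that flow (SQLite and lxml are used here purely for illustration; the actual system runs on SQL Server and .NET):

import sqlite3
from lxml import etree

def lookup(db_path: str, term: str, lang: str = "en"):
    """Find matching terms via the TermIndex table, then pull each full
    terminological entry out of its stored TBX XML string."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT ti.TermEntry_XPath, te.TermEntry "
        "FROM TermIndex ti JOIN TermEntries te "
        "ON ti.TermEntry_ID = te.TermEntry_ID "
        "WHERE ti.Term = ? AND ti.Language_ID = ?",
        (term, lang),
    ).fetchall()
    results = []
    for xpath, tbx_string in rows:
        entry = etree.fromstring(tbx_string)
        # The stored XPath locates the matching termGrp inside the TBX entry.
        results.append((entry.xpath(xpath), entry))
    return results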

4.3 TBX Transformations

Presenting and transforming the TBX strings in the database into various other formats, such as HTML and CSV, involves simple XSLT transformations.

• CSV

A review of the import processes of the commercial termbase products used by Welsh translators showed that, in one way or another, all of them support importing terminology data from CSV-formatted files.

The following stylesheet creates English-to-Welsh CSV data for the term 'group view' (XPath = termGrp id="tid-tg-2554-en-1"):

[XSLT stylesheet listing not reproduced here: its markup was not preserved.]

The stylesheet outputs CSV lines that provide each Welsh term translation for a specific English term for a specific concept (and concept-disambiguator value). A specific XPath (and therefore a specific XSLT) is required for each terminological entry transformation; this has made a negligible difference to the performance of transforming entire dictionary contents and/or multiple entries.
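Since the stylesheet itself is not reproduced above, the following sketch shows the same kind of per-entry transformation done directly in Python with lxml rather than via XSLT; the element names (termEntry, langSet, term, descrip) follow TBX, but the exact structure of the project's TBX files may differ:

from lxml import etree

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def entry_to_csv_rows(tbx_string: str, src: str = "en", tgt: str = "cy"):
    """Yield one (source term, disambiguator, target term) row per term pair
    found in a single TBX termEntry string."""
    entry = etree.fromstring(tbx_string)
    by_lang = {}
    for lang_set in entry.iter("langSet"):
        lang = lang_set.get(XML_LANG) or lang_set.get("lang")
        by_lang[lang] = [t.text for t in lang_set.iter("term") if t.text]
    disambiguator = " ".join(d.text for d in entry.iter("descrip") if d.text)
    for s in by_lang.get(src, []):
        for t in by_lang.get(tgt, []):
            yield (s, disambiguator, t)

# Usage: for each TermEntry string fetched from the database,
#   rows = list(entry_to_csv_rows(tbx_string))
# and write the rows out with csv.writer.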

• Trados

Trados' termbase product, MultiTerm, imports terms marked up in its own MultiTerm XML format, as well as CSV. Importing terms from CSV files in most translation memory systems (including Trados) is quite a manual and complex task for some translators, despite the wizards provided to help map CSV fields onto termbase fields and structures. Fortunately, Trados MultiTerm has its own XML mark-up language to facilitate importing terminology. Although not entirely compliant with TMF or TBX, MultiTerm XML does follow the TMF meta-structure, and data category selection and definition are described in XML Termbase Definition files (with the extension .xdt).

Thus, the Welsh National Online Terminology Database employs a simple XSLT to construct MultiTerm XML from TBX by mapping data categories across the two near-identical structures, and provides a fixed accompanying .xdt file.

[Extract of the XSLT mapping TBX to MultiTerm XML not reproduced here: its markup was not preserved.]

• Déjà Vu

Déjà Vu's termbase can import terminology from a number of simple formats, for example CSV text, Excel and Access. It does not support the import of terminological data in any flavour of XML.

Its import process contains target templates that are similar in structure and data categories to other common termbases such as Eurodicautom and TBX. However, the user is left with the manual task of mapping the CSV/Excel flat structure onto the destination template. Simplifying this import process is under continued investigation.

• Wordfast

Wordfast does not support CSV, but its terminology support is a simple bilingual glossary text file, with a source column and a target column, that is easily loaded into the Wordfast toolbar. The XSLT transformation for exporting to this format is trivial, as the format is effectively a subset of CSV.

• Cysgeir

Cysgeir is a Welsh/English dictionary software product produced by the e-Welsh team that incorporates many Welsh NLP features, such as a lemmatizer to aid in searching for Welsh words and terms. Cysgeir installs onto Windows PCs and integrates with popular office productivity software and some translation memory packages, such as Wordfast. Cysgeir supports loading terminology and browsing multiple dictionaries described in Cysgeir's own proprietary dictionary file format. Functionality for exporting Cysgeir dictionary files from the TBX descriptions stored in the Welsh National Online Terminology Database system is under development.

• Raw TBX

Simple ASP.NET code can expose the raw TBX XML strings to other online terminology systems via specially crafted URLs. An example URL supported by the Welsh National Online Terminology Database is http://www.e-gymraeg.co.uk/bwrdd-yr-iaith/termau/tbx.aspx?en=view.

The result returned is simply the concatenation of all TBX strings found in the database that contain the English term 'view'. The system adds header information, along with a link to the data category selection described in our eXtensible Constraint Specification file, TBXDv04_CY.xml.

4.4 Current State

The national online database has been online since August 2005, and was launched with standardised terminology for information technology. The URL is http://www.e-gymraeg.org/bwrdd-yr-iaith/termau. Despite being a very simple use and implementation of TMF on a Microsoft .NET / ASP.NET / SQL Server platform, it has already proven effective in its support of Welsh translators and software localisers. The website provides a simple search panel and presents results to the right of the screen, as seen in the screenshot.

Figure 2: Screenshot of Search Results from the Welsh National Online Terminology Database


During a period of ten weeks between August and mid-October 2005, over six hundred queries were made to this one dictionary, and it has been downloaded in its entirety as a CSV file over thirty times, with twenty downloads requesting Welsh as the source language. The work of importing other pre-existing electronic dictionaries is about to begin, and two new dictionaries covering ecology and retail will soon be added.

5. Perspectives and Conclusion

This paper has illustrated, for the benefit of other lesser-resourced languages, a very simple implementation of the online dissemination of terminology and of the adoption of the TMF standards. International standards have helped to steer terminology standardisation for Welsh (despite its being a lesser-spoken and lesser-resourced language) on a productive and sound course. Adoption of these standards in the past facilitated the creation of many dictionaries that have since proved easy to include in the Welsh National Online Terminology Database. Frugal reuse of existing resources is key to the development of language technology for lesser-resourced languages such as Welsh.

In the meantime, the scope of ISO TC37 has expanded to cover the principles, methods and applications relating to terminology, language resources and knowledge organisation for the multilingual information society. A family of standards is being developed with common principles that deal with lexicons, terminology, morpho-syntactic annotation, word segmentation and data category management. We therefore consider that continued gradual adoption of the ISO standards on terminology and lexicography will maximise the reuse of linguistic data and software components even further.

Parallel work, also conducted by the e-Welsh team, within the LEXICELT Welsh/Irish dictionary project, where the ISO lexicography standard (the Lexical Markup Framework, ISO/CD 24613) has been used, identified similarities with the ISO TMF, and opportunities for the reusability and merging of language resources.

LMF aims to be a standard for MRD and NLP lexicons, and therefore defines, in a similar way to TMF, a meta-structure along with mechanisms for data category selection from metadata registries. Nevertheless, whereas TMF is concept-oriented, LMF is lexeme-based. Most existing lexical standards are said to fit with LMF, including the Text Encoding Initiative (TEI), OLIF, CLIPS, WordNet and FrameNet.


Ultimately, the same structure and content can be used for a number of different purposes, from speech technology and printed dictionaries to machine translation, where at the moment individual, standalone, application-specific standards exist. Data from interoperable terminology and lexicography resources would not only enrich each other, but could also share the same software components and systems.

A bridge identified by many between the terminological and lexical meta-models links a term entry in TMF to a lexeme entry in LMF. Thus, either of the two database implementations could be linked, or a superstructure could be constructed containing a superset of term entry and sense entry data categories, organised on a concept or lexeme basis.

Figure 3: Bridging ISO's TMF and LMF Meta-Model Structures

Linking lexical data and terminological data can enrich a term's grammatical information beyond the capabilities of TMF. As noted above, such a link would also aid the Welsh terminology standardisation process: in adhering to the ISO 860 recommendations, term candidates are evaluated for their inflected forms, and such inflected forms may already exist under the corresponding lexeme in its Form->Inflections section, or may simply be added to improve the lexical database.

Larger opportunities for reuse exist if we realise that the adoption of standards in Welsh terminology could go beyond merely standardising translations, in particular by adopting more advanced aspects of the standards, such as concept modelling.


Classification of terms, and their finer-grained organisation into a conceptual system, may offer users of the Welsh National Online Terminology Database better discovery and comprehension of terms in Welsh than in English. It may even contribute to improving English terminology standardisation, since many organisations and/or terminology dictionaries have different terms representing identical concepts, which hampers and frustrates cooperation.

LMF supports translation by linking senses, in the same way that TMF's term entries are linked as subordinates to the same concept entity. Simple, direct, bidirectional links between senses in the respective LMF entries may be sufficient for simple bilingual dictionaries, but more complicated, multilingual lexical dictionaries require an interlingual concept system to handle the various levels of concept precision. Thus, further opportunities exist for the reuse of software components for concept modelling.

Such observations, and the gradual expansion of the adoption of international standards for Welsh language technology, can only further 'future-proof' Welsh for future developments and applications in language technology such as the semantic web, knowledge bases and machine translation.

6. Acknowledgements

The Welsh National Online Terminology Database project is funded by the Welsh Language Board.


References

"Cronfa Genedlaethol o Dermau." Online at http://www.e-gymraeg.org/bwrdd-yr-iaith/termau.

"Déjà Vu." Online at http://www.atril.com/default.asp.

"Microsoft .NET framework - xsd.exe." Online at http://msdn.microsoft.com/library/en-us/cptools/html/cpconXMLSchemaDefinitionToolXsdexe.asp.

ISO/TR 12618:1994 Computational Aids in Terminology – Creation and Use of Terminological Databases and Text Corpora (TC37/SC3).

ISO 860:1996 Terminology Work – Harmonization of Concepts and Terms (TC37/SC1).

ISO 12620:1999 Computer Applications in Terminology – Data Categories (TC37/SC3).

ISO 704:2000 Terminology Work – Principles and Methods (TC37/SC1).

ISO 16642:2003 Computer Applications in Terminology – Terminological Markup Framework (TC37/SC3).

ISO/CD 24613 Language Resource Management – Lexical Markup Framework (TC37/SC4).

Prys, D. (2003). "Setting the Standards: Ten Years of Welsh Terminology Work." Proceedings of the Terminology, Computing and Translation Conference, Swansea University, March 27-28, 2004. Swansea: Elsevier.

Prys, D. & Jones, J.P.M. (1998). The Standardization of Terms Project. Report prepared for the Welsh Language Board.

"TBX TermBase eXchange." Online at http://www.lisa.org/tbx.

"Trados." Online at http://www.trados.com/.

"Wordfast." Online at http://www.wordfast.net/.


SpeechCluster: A Speech Data Multitool

Ivan A. Uemlianin

When collecting and annotating speech data, to build a database for example, speech researchers face a number of obstacles. The most obvious of these is the sparseness of data, at least in a usable form. A less obvious obstacle, but one that is surely familiar to most researchers, is the plethora of available tools with which to record and process the raw data. Example packages include EMU, Praat, SFS, JSpeechRecorder, Festival, HTK and Sphinx. Although prima facie an embarrassment of riches, each of these tools proves to address a slightly different set of problems, to be slightly (or completely) incompatible with the other tools, and to demand a different area of expertise from the researcher. At best, this is a minor annoyance; at worst, a project must expend significant resources to ensure that the necessary tools can interoperate. As this work is no doubt repeated in unrelated projects around the world, an apparently minor problem becomes a possibly major, and undocumented, drag on progress in the field. This danger is especially acute in research on minority and lesser-spoken languages, where a lack of resources or expertise may preclude research completely. Researchers need some way of abstracting away from all these differences, so that they can conduct their research. The simplest approach is to provide an interface that can read and write the existing formats, and provide other facilities as required.

On the WISPR project, which is developing speech processing resources for Welsh and Irish, we have adopted this approach in developing SpeechCluster. The intention behind SpeechCluster is to enable researchers to focus on research rather than on file conversion and other low-level, but necessary, preprocessing. SpeechCluster is a freely available software package, released and maintained under an open-source license.

In this paper, we present SpeechCluster (reviewing the requirements it addresses and its overall design), we demonstrate SpeechCluster in use, and finally we evaluate its impact on our research and outline some future plans.

1. The Context

Lesser-used languages (LULs) are often lesser-resourced languages. Majority languages have wealthy patron states, with the money and the labour force to develop language resources as required. For example, Microsoft alone has over 6000 hours of US English speech data at its disposal (Huang 2005). Patron organisations of lesser-used languages often do not have access to such power, and must use their resources wisely.

Research and development in language technology brings many stimulating challenges. With LULs especially, these challenges may include considerations about the status and use of the language: users and patrons of the language are likely to take an active interest, and often language technology research can become part of the life of the language itself.

Research and development in language technology also brings a great deal of tedious labour. Data must be collected and archived, and there are several layers of processing to be done before any 'interesting' R&D can begin. Often the physical forms of the storage and the processing tools, that is, the file formats and software implementations, provide obstacles of their own.

Since these obstacles are contingent upon the machinery rather than the research problem, they are often categorised as 'chores' and tackled quite differently from other tasks on the project. At worst, these obstacles will be tackled manually; at best, scripts will be written ad hoc as the need arises, to be discarded (or 'archived') at the end of the project. These approaches are inefficient, especially when compared to the conscious and investigative approach taken with other parts of the project. Resources are wasted, and specialists can spend significant portions of their time on inappropriate (and, more importantly, unpleasant) activities.

In the speech research department of a large corporation, the costs associated with this waste can be passed on to the customer; in smaller research establishments, these costs may preclude speech research altogether.

2. Our Problem

Our goals on the WISPR project are to develop speech technology resources for Welsh and Irish that we can make freely available to commercial companies. There are currently no such resources at all for Irish, and very limited resources for Welsh (the language resources available for Welsh include two text resources: CEG, a one-million-word balanced and tagged corpus (Ellis et al. 2001), and a large collection of webpages (Scannell 2004), both of which are for non-commercial use only; a telephone speech corpus (Jones et al. 1998); and a small, experimental speech database (Williams 1999)).

The project must therefore begin from the bottom, starting with data collection and annotation, moving on to developing the necessary speech databases, acoustic models (AMs) and language models (LMs), and finally developing packaged software artefacts that can be used by external developers. With limited resources of time, money and labour, every administrative chore added to the workflow reduces the resources available for more delivery-oriented tasks.

Our first decision regarding the problem of 'chores' was that the solution should be a speech processing resource in its own right. The solution should consist of a set of reusable, extensible, shareable tools to be made available to (a) ourselves on future projects, and (b) other teams working on speech processing projects around the world.

2.1 Requirements

The main design goals of our solution are as follows:

• researchers should be able to work independently of data format restrictions;
• necessary, complicated, but uninteresting tasks should be automated;
• interesting, but complicated, tasks should be made simple;
• researchers should be able to address linguistic problems with linguistic solutions;
• the toolkit should be increasingly simple to maintain and develop; and,
• the toolkit should encourage its own use and development.

• Researchers should be able to work independently of data format restrictions.

Data can be collected, transcribed and stored in a range of formats. Each of the range of available tools for language technology research accepts or generates data in its own format, or in a limited range of standard formats. Researchers should not have to worry about which format works with which application: they should be able to pick the application necessary (or preferred) for the research problem, and the data should be readily accessible in the correct format.

• Necessary, complicated, but uninteresting tasks should be automated.

This applies to life in general, of course.

• Interesting, but complicated, tasks should be made uncomplicated.

Due to the nature of the field, where it is often necessary to process large sets of data, many of the more interesting problems (e.g. building AMs for speech recognition) involve procedures that are repetitive (e.g. those that have to be applied to every item in a corpus) or complicated (e.g. initialising a system). Researchers learning about a new area are hampered when these tasks dominate learning time.

• Researchers should be able to address linguistic problems with linguistic solutions.

Often, a linguistic problem, or a problem initially described in language terms (e.g. retranscribing the data using a different phoneset), has to be redescribed in programming terms before it can be addressed. Problems should be addressable in the terms in which they occur.

• The toolkit should be increasingly simple to maintain and develop.

Over its lifetime, any toolkit increases in functionality: new problems occur and new tasks become possible. If extensions are increasingly difficult to implement, the toolkit eventually disintegrates (e.g. into a library of loosely-related scripts), becomes impossible to maintain, and falls into disuse. A well-designed toolkit can avoid this fate.

• The toolkit should encourage its own use and development.

It should be preferable to use the toolkit than to revert to the bad old ways. At the same time, further use of the toolkit should stimulate researchers to confront it with new problems, and to think of new areas in which the toolkit might be used. If possible, the toolkit should be extensible by the researchers themselves, rather than having to rely on a separate maintainer. To this end, the design of the toolkit should promote the writing of readable, reusable code.

3. A Solution

3.1 Introduction

Our first (and current) attempt at a software artefact that meets these requirements is the SpeechCluster package (Uemlianin 2005a). SpeechCluster is a collection of small programs written in the Python programming language. Python has a very clear, readable syntax, and is especially suited to projects with several programmers, or with novice programmers. As such, it suited our aim of encouraging non-programmers to write their own tools.

The SpeechCluster package consists of a main SpeechCluster module providing the basic API, and a number of modules that can be used as command-line tools. The tools are intended to be used as such, but they can also serve as examples, or as a basis for customisation or further programming with SpeechCluster.

Table 1 lists the tools currently available as part of SpeechCluster. Below, we look at two of these in more detail before exploring the use of SpeechCluster as an API. Finally, we look at SpeechCluster in a larger system.


Table 1: SpeechCluster command-line tools

Tool         Function
segFake      'Fake autosegmentation' of a speech audio file
segInter     Interpolates labels into a segmented but unlabelled segment tier
segMerge     Merges separate label files
segReplace   Converts labels in a label file
segSwitch    Converts label file format
splitAll     Splits audio/label file pairs

3.2 Using SpeechCluster I: The Tools

a) SegSwitch

SegSwitch is a label file format converter. It converts label files between any of the formats supported by SpeechCluster (currently Praat TextGrid, esps and the various HTK formats, i.e. the simple .lab format and the multi-file .mlf format). This kind of format conversion is a very common task. For example, HTK requires files to be in its own esps-like format, but our team prefers to hand-label files in Praat, which outputs its own TextGrid format. Festival uses an esps-like format that is slightly different from HTK's.

SegSwitch has a simple command-line interface (see Table 2), with which single files or whole directories can be converted easily and reliably.

Table 2: segSwitch usage

Usage:
  segSwitch -i <input filename> -o <output filename>
  segSwitch -d <input directory> -o <output format>

Examples:
  segSwitch -i example.lab -o example.TextGrid
  segSwitch -d labDir -o textGrid

A simple facility like this has a remarkable effect on the efficiency of a team. The team no longer has to worry about which file format they have to work in; they can concentrate on the research task, converting files into and out of particular formats as needed. In a sense, the two parts of the work, the research and the bookkeeping, have been separated, and the bookkeeping is done by the tools. This division of labour is repeated between the tools and the SpeechCluster module itself: as much of the low-level data manipulation as possible is carried out by SpeechCluster, so that the tools themselves can be written in simple, task-oriented terms.

Table 3 shows the main code for segSwitch (excluding the command-line parsing and the loop over files in a directory): all of the work of file format conversion is done by the code shown. Looking past the Python syntax, this code is a direct implementation of an intuitive statement of the task (see Table 4).

Table 3: Simplified Python code for segSwitch

Line  Code
1     from speechCluster import *
2
3     def segSwitch(inFn, outFn):
4         """
5         Args: string inFn: input filename
6               string outFn: output filename
7         Returns: None
8         Uses filename extensions to determine input
9         & output formats.
10        """
11        spcl = SpeechCluster(inFn)
12        ofext = os.path.splitext(outFn)[1]
13        outFormat = SpeechCluster.formatDict[ofext.lower()]
14        out = spcl.write_format(outFormat)
15        open(outFn, 'w').write(out)

Table 4: segSwitch task statement

Line(s)  Task
11       read in an input file
12-13    work out from the output filename what the output format should be
14       generate the output format data
15       write the data out to a file, using the output filename given

All of the hard programming is hidden in the SpeechCluster module, which is imported in line 1, and which provides useful facilities such as formatDict(ionary) and write_format(format).

b) SplitAll

One of the special features of SpeechCluster is that it will treat a sound file (i.e. recorded speech) and its associated label file as a pair, and can manipulate them together. SplitAll shows this in action.

SplitAll addresses the problem of a researcher who requires a long speech file, along with its associated label file, to be split into smaller units; for example, one may require a long utterance containing pauses to be split into its constituent phrases. Of course, data can be recorded or segmented into shorter units before it is labelled, but when data is reused, its requirements often change.

This task is a minor inconvenience if you just have one or two files, but if you have five hundred (or even just fifty), it becomes important to automate it. Furthermore, it is psychologically better if the researcher can envisage this as a single task, rather than as two related tasks (i.e. splitting the wav file, then splitting the label file to match). The best option is to delegate the task to a machine.

As with segSwitch, splitAll has a simple command-line interface (see Table 5).

Table 5: splitAll usage

Usage:
  splitAll -n <number> -t <unit type> [-l <label>] inDir outDir

Example                                        Splits into
splitAll -n 5 -t Phone inDir outDir            5-phone chunks
splitAll -n 1 -t Word inDir outDir             single words
splitAll -n 5 -t Second inDir outDir           5 s chunks
splitAll -n 1 -t Phone -l sil inDir outDir     split by silence
SplitAll is intended to be used on directories of files and can process hundreds of<br />

speech/label file pairs in moments. Again, the effect is to separate the researcher<br />

from the drudgery of looking after files.<br />

Apart from a function that parses the command-line parameters into the variable<br />

splitCriteria, the code for splitAll is just as simple as that for segSwitch (see Table<br />

6). The excerpt seen here loops through the filestems in a directory, a filestem being<br />

a filename without its extension (e.g. example.wav and example.lab have the same<br />

filestem ‘example’). Line 8 generates a speechCluster from a filestem: this means<br />

that all files with the same filestem– (e.g. a wav file and a lab file) are read into the<br />

one speechCluster. Line 9 then calls split, saving the results into the given output<br />

directory.<br />

Table 6: Simplified python code for splitAll<br />

Line Code<br />

1 from speechCluster import *<br />

2<br />

3 def splitAll(splitCriteria, inDir, outDir):<br />

4 stems = getStems(inDir)<br />

5 for stem in stems:<br />

6 fullstem = ‘%s%s%s’ % (inDir, os.path.sep, stem)<br />

7 print ‘Splitting %s.*’ % fullstem<br />

8 spcl = SpeechCluster(fullstem)<br />

9 spcl.split(splitCriteria, outDir)<br />

177


This code walk tells you nothing about how SpeechCluster.split(splitCriteria) works, but that is the point. The SpeechCluster module provides facilities such as split() that allow the researcher to phrase problems and solutions in task-oriented rather than programming-oriented terms.

3.3 SpeechCluster as an API

The two main design features of SpeechCluster are:

1. it stores segmentation details internally in an abstract format; and,
2. it can treat an associated pair of sound and label files as a unit.

In terms of the facilities SpeechCluster provides, these features translate into the methods shown in Table 7.

Interface (i.e. read/write) methods<br />

read_format(fn) write_format(fn)<br />

read_ESPS(fn) write_ESPS(fn)<br />

read_HTK_lab(fn) write_HTK_lab(fn)<br />

read_HTK_mlf(fn) write_HTK_mlf(fn)<br />

read_HTK_grm(fn) write_HTK_grm(fn)<br />

read_stt(fn) write_stt(fn)<br />

read_<strong>Text</strong>Grid(fn) write_<strong>Text</strong>Grid(fn)<br />

read_wav(fn) write_wav(fn)<br />

Methods for manipulating label and sound files (and label/sound file pairs)<br />

merge(other)<br />

replaceLabs(replaceDict)<br />

setStartEnd(newStart, newEnd)<br />

split(splitCriteria, saveDir, saveSegFormat)<br />

When programming with SpeechCluster as a library, the researcher/developer can program using the linguistic terms of the problem rather than the terms of the programming language.
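As an illustration, the following short sketch uses the methods listed in Table 7 to read a hand-labelled recording, rename its silence labels and write the result out in HTK's .lab format (the exact argument conventions are assumptions based on the tables above, not a verbatim extract from the package):

from speechCluster import SpeechCluster

# Reading by filestem pulls in example.wav and example.TextGrid together.
spcl = SpeechCluster('example')

# A linguistic operation, phrased linguistically: rename the silence labels.
spcl.replaceLabs({'sil': 'pau'})

# Write the label data back out in HTK's .lab format.
out = spcl.write_format(SpeechCluster.formatDict['.lab'])
open('example.lab', 'w').write(out)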

Documentation is available (Uemlianin 2005a), and the Python pydoc facility allows the researcher/developer to access the documentation interactively (see Figure 1).

3.4 Using SpeechCluster II: Making a New Script

Although tools are provided as part of the SpeechCluster package, the SpeechCluster module itself presents an accessible face, and it is hoped that researcher/developers will be able to use SpeechCluster to build new tools for new problems.

a) SegFake

SegFake provides an example of using SpeechCluster to help write a script to address a specific problem. Hand-labelling is passé: it is laborious, tedious and error-prone; but sometimes researchers working on LULs have to do it. If there are no AMs to do time-alignment, there seems to be no alternative to labelling the files by hand.

When labelling prompted speech (e.g. recited text), the phone labels are more or less given (i.e. from a phonological transcription of the text); the labeller is not really providing the labels, only the boundary points. It would be helpful if the task could be reduced to specifying phone boundaries in a given label file; in other words, if the task could be divided between SpeechCluster and the human: SpeechCluster generates a label file in a requested format with approximate times, and the human corrects it.

This was the idea behind segFake. SegFake detects the end-points of the speech in the wav file (currently it assumes a single continuous utterance) and spreads a string of labels evenly across the intervening time. A resulting TextGrid is shown in Figure 1. Clearly, the probability of any single boundary being correct approaches zero, but the task facing the human labeller has been substantially simplified.

Fig. 1

We can phrase a more explicit description of the problem (see Table 8); once the problem has been specified in this way, translating it into Python is simple (see Table 9), and the tool can then be used from the command line (see Table 10). segFake results, viewed in Praat, are shown in Figure 2.

Table 8: Pseudocode representation of the segFake problem

GIVEN: wav file
       list of N labels
In the wav file, identify the endpoints of speech: START, END
T = END - START
L = T / N
Specify label boundaries, starting at START and incrementing by L


Fig. 2

Table 9: segFake in Python

Line  Code

1     def fakeLabel(fn, phoneList, tierName='Phone', outFormat='TextGrid'):
2         seg = SpeechCluster(fn)
3         segStart, segEnd, fileEnd = seg.getStartEnd()
4         width = (segEnd - segStart)*1.0 / len(phoneList)
5         tier = SegmentationTier()
6         # start with silence
7         x = Segment()
8         x.min = 0
9         x.max = segStart
10        x.label = SILENCE_LABEL
11        tier.append(x)
12        for i in range(len(phoneList)):
13            x = Segment()
14            x.label = phoneList[i]
15            x.min = tier[-1].max
16            x.max = x.min + width
17            tier.append(x)
18        # end with silence
19        x = Segment()
20        x.min = tier[-1].max
21        x.max = fileEnd
22        x.label = SILENCE_LABEL
23        tier.append(x)
24        tier.setName(tierName)
25        seg.updateTiers(tier)
26        outFormat = SpeechCluster.formatDict['.%s' \
                      % outFormat.lower()]
27        return seg.write_format(outFormat)

Table 10: segFake usage

Usage:
  segFake.py -f <wav file> -o (TextGrid | esps | htklab) <label list>
  segFake.py -d <wav directory> -t <transcription file> -o <output format>

Examples:
  segFake.py -f amser012.wav -o TextGrid
    m ai hh ii n y n j o n b y m m y n y d w e d i yy n y b o r e
  segFake.py -d wav -t trans.txt -o TextGrid

3.5 Using SpeechCluster III: A Bigger Example: PyHTK

So far, SpeechCluster has been shown in fairly limited contexts, essentially as a file management tool protecting researchers from administrative drudgery. This was one of the key goals of SpeechCluster. We have seen this producing a quantitative effect: giving the researcher a bit more time, but not really changing the kind of work a researcher can do. The next example shows that SpeechCluster can have a qualitative effect too.

The Hidden Markov Model Toolkit (HTK) (Woodland et al. 1994) is a toolkit for building HMMs, primarily for Automatic Speech Recognition (ASR), but it is also beginning to be used for research in speech synthesis, or Text-to-Speech (TTS). HTK also provides facilities for the language modelling used in ASR, and is increasingly being applied to problems in other domains, such as DNA sequencing (e.g. Kin 2003). It is a de facto standard in academic speech technology research, and no doubt has similar penetration into commercial research and development, particularly among small and medium-sized enterprises (SMEs). Although it is not open-source software, it is available free of charge, and the models generated can be used commercially with no licence costs. Compared with other such toolkits (e.g. Sphinx and ISIP) it is usable, powerful and accurate. Nevertheless, it is still not easy to use. HTK is:

• Difficult technically: the ideal HTK user is a computer scientist who understands HMM internals, is comfortable with the command line, and can write supporting scripts as required; and,

• Complicated and time-consuming: use of HTK involves writing long chains of heavily parametrised commands, tests, adjustments and iterations.

This is no criticism of HTK, of course (HMM building is complicated), but the<br />

consequence is that its use is limited to computer scientists already working on speech<br />

technology research projects (mostly ASR). This is normal (all part of the academic<br />

way of institutionalising specialisation); however, it acts as a limit on the usability of<br />

language resources (i.e. corpora), and on the potential of language researchers.<br />

PyHTK (Uemlianin 2005b) aims to change all that. PyHTK is a Python wrapper

around HTK, hiding the complexities of building and using HTK models behind a very<br />

simple command-line interface. A selection of commands from an HTK session is shown<br />

in Figure 3.<br />

These commands are roughly equivalent to the command pyhtk.py -a hmm4.

Nobody would type out all those HTK commands longhand. As in the case of some<br />

of the functions of SpeechCluster, each project will write their own little scripts<br />

to generate the commands. As with SpeechCluster, this duplication of code is a huge and invisible waste of effort, and even more so here than with SpeechCluster: writing scripts to run HTK requires familiarity with HTK and at least some familiarity with the ins and outs of HMMs. HTK is very far from being ‘plug-n-play’.
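To make that concrete, here is a minimal sketch, not the actual pyHTK source, of the kind of per-project driver script described above: each training stage is one long, heavily parametrised HTK command line, and a small wrapper hides the chain behind a single call. The configuration and file names (config, train.scp, phones.mlf, monophones, proto) are illustrative only.

import subprocess

def run(cmd):
    print(">>", cmd)
    subprocess.run(cmd, shell=True, check=True)   # stop on the first HTK error

def train(reestimations=4):
    # Flat-start a prototype model, then run several Baum-Welch passes.
    run("HCompV -C config -S train.scp -M hmm0 proto")
    for i in range(1, reestimations + 1):
        run(f"HERest -C config -I phones.mlf -S train.scp "
            f"-H hmm{i-1}/hmmdefs -M hmm{i} monophones")

if __name__ == "__main__":
    train()   # roughly the chain that a call like 'pyhtk.py -a hmm4' hides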

With pyHTK all that is required to get started is a speech corpus (i.e., a set of wav<br />

files) and some level of transcription. PyHTK uses SpeechCluster to put everything into<br />

the formats required by HTK, and then runs the necessary HTK commands to build a<br />

model and/or conduct recognition. In other words, SpeechCluster acts as an interface<br />

between HTK and your data, and pyHTK acts as an interface between you and HTK.<br />

With pyHTK as an interface, HTK can be used with no knowledge or understanding<br />

at all of the underlying technology. It is perhaps true that ‘a little knowledge is a<br />

dangerous thing,’ but pyHTK should not be seen as promoting a lack of understanding.<br />

Rather, with pyHTK you can:<br />

• ‘Try out’ ASR research and get more seriously involved if it looks<br />

worthwhile;<br />



Fig. 3: A selection of commands from an HTK session.

• Learn about the technology at your own pace, while you work, instead of<br />

having to cram up-front; and,<br />

• Start working in a new area without having to hire a new team.<br />

As part of pyHTK, SpeechCluster can have a qualitative effect on a team’s<br />

potential: new areas of research and development (ASR, TTS and language modelling)<br />

become accessible. For example, we have built a diphone voice with Festival (Black et al. 1999). We first gathered the data, recording a phonetically balanced corpus of around 3000 nonsense utterances. Before the voice could be built, the data had to be labelled (i.e., provided with time-stamped phonological transcriptions). Labelling all the data by hand would have taken around 100 person-hours, and in a small team this kind of labour-time is not available.

Instead, using SpeechCluster and pyHTK, we have been able to do the following:<br />

• Use segFake to generate a starter segmentation of the data (we also manually<br />

transcribed just under 100 of the files, i.e., about 3% of the data).<br />

• Iterate pyHTK twelve times overnight on the given segmentation. Each iteration involved building an acoustic model (AM) from the current segmentation, re-labelling the segFake'd training data with that AM, and saving the generated segmentation for the next iteration (a minimal sketch of this loop follows).
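A minimal sketch of this loop, with the segFake and pyHTK steps abstracted into caller-supplied functions (the function and parameter names below are ours, not the actual SpeechCluster or pyHTK interfaces):

def bootstrap_segmentation(wav_files, transcriptions,
                           segfake, build_acoustic_model, relabel,
                           iterations=12):
    """Iteratively refine an evenly-spaced 'fake' segmentation by training an
    acoustic model on it and re-aligning the data with that model."""
    segmentation = segfake(wav_files, transcriptions)       # starter segmentation
    for _ in range(iterations):
        am = build_acoustic_model(wav_files, segmentation)  # train AM on current labels
        segmentation = relabel(wav_files, am)               # re-label the data with that AM
    return segmentation                                     # input to the Festival voice build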

The resulting segmentation would not satisfy a linguist (Figure 4 compares a segFake segmentation with the output of this process), but the boundaries are sufficiently accurate to build a voice with Festival. Although we manually labelled a very small

proportion of the data, we hypothesise that this had little effect on the quality of the<br />

final voice. In other words, SpeechCluster and pyHTK have enabled an almost fully<br />

automated build of a synthetic voice.<br />

4. Conclusion<br />

Fig. 4: A segFake segmentation compared with the segmentation produced by the iterated pyHTK process.

Using SpeechCluster can save time and avoid a lot of stress. Developing SpeechCluster<br />

(using resources that would have been spent on repetitive chores) has resulted in<br />

a deliverable artefact: a reusable, shareable and extensible software package for<br />

manipulating speech data.<br />

SpeechCluster has been developed as part of the WISPR project, and the facilities<br />

it offers reflect the tasks we have faced on WISPR. In future, SpeechCluster will<br />

accompany our work in a similar way; consequently it is not possible to predict entirely<br />

the development of SpeechCluster. However, two directions can be indicated:<br />

• Corpora: It would be useful if SpeechCluster could treat a speech corpus<br />

with the same abstraction as it treats a sound/label file pair. In this case, the<br />

corpus, as a unit, could be described (e.g. counts of and relations between various<br />



units) and manipulated (e.g. subset selection). This direction will include a layer<br />

for compatibility with EMU.<br />

• Festival: We are developing a Python wrapper for Festival, similar to pyHTK<br />

for HTK. This development may have implications for changes in SpeechCluster.<br />

SpeechCluster can be downloaded from the address given in the documentation<br />

(Uemlianin 2005a). We are making it available under an open-source (BSD) license.<br />

We take seriously the proposition that SpeechCluster should be a usable, shareable<br />

resource. We encourage researchers and developers in the field to use SpeechCluster,<br />

and we shall as far as possible maintain and update SpeechCluster in line with users’<br />

requests.<br />

5. Acknowledgements<br />

This work is being carried out as part of the project ‘Welsh and Irish Speech<br />

Processing Resources’ (WISPR) (Williams et al. 2005). WISPR is funded by the Interreg<br />

IIIA European Union Programme and the Welsh Language Board. I would also like<br />

to acknowledge support and feedback from other members of the WISPR team, in<br />

particular Briony Williams and Aine Ni Bhrian.<br />


References<br />

Black, A. W., Taylor, P. & Caley, R. (1999). The Festival Speech Synthesis System.<br />

http://www.cstr.ed.ac.uk/projects/festival/.<br />

Ellis, N. C. et al. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 Million Word Lexical<br />

Database and Frequency Count for Welsh.<br />

http://www.bangor.ac.uk/ar/cb/ceg/ceg_eng.html.<br />

Huang, X. (2005). Challenges in Adopting Speech Technologies. CSTR-21. Edinburgh,<br />

September 2005.<br />

Jones, R.J. et al. (1998). “SpeechDat Cymru: A Large-Scale Welsh Telephony Database.” Proceedings of the Workshop on Language Resources for European Minority Languages, May 27th 1998, Granada, Spain.

Kin, T. (2003) “Designing Kernels for Biological Sequence Data Analysis.” Doctoral thesis.<br />

School of Knowledge Science, Japan Advanced Institute of Science and Technology.<br />

Scannell, K.P. (2004). Corpus Building for Minority Languages. Online at http://borel.slu.edu/crubadan/index.html

Uemlianin, I. (2005a). SpeechCluster Documentation. Online at http://www.bangor.ac.uk/~cbs007/speechCluster/README.html
Uemlianin, I. (2005b). PyHTK Documentation. Online at http://www.bangor.ac.uk/~cbs007/pyhtk/README.html

Williams, B. (1999). “A Welsh Speech Database: Preliminary Results.” Proceedings of<br />

Eurospeech 99, September 1999, Budapest, Hungary.<br />

Williams, B., Prys, D. & Ni Chasaide, A. (2005). “Creating an Ongoing Research<br />

Capability in Speech Technology for two Minority Languages: Experiences from the


WISPR Project.” Interspeech 2005. Lisbon.<br />

http://www.bangor.ac.uk/ar/cb/wispr.php<br />

Woodland, P.C. et al. (1994). “Large Vocabulary Continuous Speech Recognition Using<br />

HTK.” Acoustics, Speech, and Signal Processing, ii, 125-128.<br />

http://htk.eng.cam.ac.uk/<br />



XNLRDF: The Open Source Framework for<br />

Multilingual Computing<br />

Oliver Streiter and Mathias Stuflesser<br />

XNLRDF (Natural Language Resource Description Framework) attempts to collect, formalise and describe NLP resources for the world’s writing systems so that these resources can be used automatically by language-related applications like Web-browsers, mail-tools, Web-crawlers, information retrieval systems, or computer-assisted language learning systems. XNLRDF is free software, distributed in XML-RDF or as a database dump. It proposes to replace idiosyncratic ad-hoc solutions for

Natural Language Processing (NLP) tasks within the aforementioned applications with<br />

a standard interface to XNLRDF. The linguistic information in XNLRDF extends the<br />

information offered by Unicode so that basic NLP tasks like language recognition,<br />

tokenization, stemming, tagging, term-extraction, and so forth can be performed<br />

without additional resources. With more than 1000 languages in use on the Internet

(and their number continually rising), the design and development of such software<br />

has become a pressing need. In this paper, we describe the basic design of XNLRDF,<br />

the metadata, the type of information the first prototype already provides, and our<br />

strategies to further develop the resources.<br />

1. XNLRDF as a Natural Extension of Unicode<br />

1.1 The Advantages of Unicode<br />

Unicode has simplified the completion of NLP tasks for many of the world’s writing<br />

systems. Whereas, in the past, specific implementations were required, nowadays<br />

programming languages like Java, C++, C or Perl provide an interface to Unicode<br />

properties and operations. Unicode not only describes ‘code elements’ 1 of scripts<br />

by assigning the code element a unique code point, but it also assigns properties like<br />

uppercase, lowercase, decimal digit, mark, punctuation, hyphen, separator, or the<br />

script to the code elements. In addition, it defines operations on the code elements<br />

such as uppercasing, lowercasing and sorting. Thus, computer applications, especially<br />

those operating in multilingual contexts, are better off when processing texts in<br />

Unicode than in traditional encodings such as LATIN1, BIG5 or KOI-R.

1 Similar, but not identical to characters (cf. our discussion in section 2.1).<br />



1.2 The Inadequacy of the Notions of Unicode for NLP Metadata

On the other hand, the conceptual framework of Unicode is limited. Its principal<br />

notions are code elements and scripts. Important notions such as character, language<br />

or writing system have, astonishingly, no place in Unicode. As Unicode describes<br />

mainly scripts, two languages that use the same script (e.g., Italian and French) are<br />

essentially the same! The fact that French uses a character unknown in Italian, ‘ç’ (c with cedilla), is not formally described in Unicode. For this reason, additional

knowledge (e.g., about languages, regions or legacy encodings) has been integrated<br />

into Unicode/Internationalisation programming libraries for a limited number of<br />

languages (e.g., ICU, Basis Technology Rosette, Lextek Language Identifier). 2<br />

As for the notion of language, it is not only absent from the formal framework of<br />

Unicode, but to our knowledge, nobody has attempted, except for limited purposes,<br />

a large-scale mapping between Unicode scripts and the world’s most important<br />

language identification standards (i.e., ISO 639 and the SIL-codes of Ethnologue).<br />

This is astonishing, as neither the language code, nor the code of the locality of a<br />

document, nor the script taken in isolation are sufficiently rich to serve as metadata.<br />

Metadata in language-related applications serve to map a document to be processed to the appropriate NLP resources. In XNLRDF, the notion of writing system

is used for this purpose. It represents a first large-scale attempt to map scripts onto<br />

language identification standards.<br />

1.3 The Writing System in XNLRDF as Metadata<br />

In XNLRDF, the writing system of a text document is an n-tuple of metadata categories,<br />

which include the language, the locality and the script as the most distinguishing ones.<br />

In Belgium, for example, text documents are produced (at least) in Dutch, French<br />

and German. The locality, therefore, is not enough as a single discriminative feature<br />

of these documents. Neither is the category language taken by itself, since Dutch,<br />

French and German are written in other countries as well. Furthermore, even the<br />

tuple language_locality, as it is frequently used (e.g., FR_be, NL_be), is not sufficient<br />

for all text documents and NLP resources. There exist localities that have two or<br />

more alternative scripts for the same language. For example, Serbian in Serbia and<br />

Montenegro is written with the Latin or Cyrillic scripts, and Hausa in Nigeria in the<br />

Latin or Ajami scripts.<br />

An extended analysis of the world’s writing systems reveals that at least four more<br />

categories are required for an unambiguous specification of the writing system. These<br />

2 For a detailed survey, see Unicode Inc. (2006).<br />



categories are: the orthography, the writing standard, the time period of the writing<br />

system, and (for transliterations), a reference to another writing system.<br />
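As a concrete illustration of this n-tuple (our own sketch, not XNLRDF’s actual schema; all class and field names are invented), a writing system can be pictured as a small record in which unset categories remain open:

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class WritingSystem:
    language: str                                 # e.g. an ISO 639 code or language name
    locality: Optional[str] = None                # e.g. an ISO 3166 code; None = default locality
    script: Optional[str] = None                  # e.g. a Unicode script name
    orthography: Optional[str] = None             # e.g. 'old' vs 'new' German spelling
    standard: Optional[str] = None                # e.g. 'Bokmål', 'Nynorsk'
    period: Optional[Tuple[int, int]] = None      # validity interval, e.g. (1926, 1928)
    reference: Optional["WritingSystem"] = None   # for transliterations (recursive)

# One of the two historical Latin writing systems for Abkhaz discussed below:
abkhaz_latin_1926 = WritingSystem(language="Abkhaz", script="Latin", period=(1926, 1928))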

Supporting evidence for the necessity of these categories comes, for example,<br />

from Abkhaz. Not only has Abkhaz been written with two different Cyrillic alphabets,<br />

but also with two different Latin alphabets, one between 1926 and 1928, and another<br />

between 1928 and 1937. One might want to distinguish these writing systems by their<br />

name (the standard) or by the time period. In such cases, we do not exclude the first<br />

solution, although there is frequently no standard name for the standard. If possible,<br />

we prefer to use the time period, as it offers the possibility to calculate intersections<br />

with other time constraints (e.g., the production date of the document).<br />

The writing standard is best explained by the different, concurrent, isochronic writing standards for Norwegian: Nynorsk, Bokmål, Riksmål and Høgnorsk are different contemporaneous conventions essentially representing the same language (http://en.wikipedia.org/wiki/Norwegian).

The orthography is best illustrated by the spelling reform of German, where the new<br />

orthography came into force in different localities at different times, and overlapped<br />

with the old spelling for a different number of years. Again, use of the time period<br />

is a nice feature, but it does not allow dispensing with the category of orthography.<br />

Unfortunately, orthographies, also frequently lacking a standard name, are referred<br />

to at the time of their introduction as ‘new’ in opposition to ‘old.’ This denomination<br />

of orthographies, however, becomes meaningless in a diachronic perspective where<br />

each ‘new’ orthography will eventually grow ‘old.’<br />

1.4 The Case of Braille and other Transliterations<br />

Reference is necessary to correctly represent transliterations, understood here not only as one-to-one mappings but also as one-to-many or many-to-many mappings. Reference introduces recursiveness into the metadata of

the writing system, a complexity that is hard to avoid. Braille is a good example of<br />

a transliteration system that changes with the standards and spelling reforms of the<br />

referenced writing system. There exists a Norwegian Braille derived from Nynorsk, and<br />

a Norwegian Braille derived from Bokmål. By the same principle, Braille of the new<br />

German orthography is different from Braille based on the old German orthography.<br />

Similarly, Braille changes with respect to the locality of the Braille documents<br />

that might differ in origin from the locality of the referenced writing system. For<br />

example, Spanish Braille in a Spanish-speaking country is different from the Spanish<br />

of a Spanish-speaking country represented as Braille in the USA. We can only handle<br />



this complexity precisely when we allow writing systems to refer to each other<br />

recursively. Thus Braille, as with other transliteration systems, is represented as a<br />

writing system with its own independent locality, script, standard (e.g., contracted<br />

and non-contracted), and time period. The language of the transliteration and the<br />

referenced writing system are nevertheless the same, although XNLRDF would allow<br />

this to change for the transliteration as well.<br />

A transliteration is thus marked by a reference to another writing system and, in the descriptive data, by mappings between these two systems in the form of a mapping table (e.g., between Bokmål and Bokmål Braille). Mappings between writing systems are a natural component in the description of all writing systems, even if they do not represent transliterations of each other (e.g., mappings between hànyŭ pīnyīn, Wade-Giles, Gwoyeu Romatzyh and bopomofo/zhùyīn fúhào). The Braille of Mandarin

in the People’s Republic, incidentally, is a transliteration of hànyŭ pīnyīn.<br />
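A deliberately tiny illustration of such a mapping table (the representation as a Python dictionary is ours, not XNLRDF’s data format): the one-to-one part of a Latin-to-Braille letter correspondence.

# Standard six-dot Braille patterns for the first few Latin letters.
braille_map = {
    "a": "\u2801",  # ⠁ dot 1
    "b": "\u2803",  # ⠃ dots 1-2
    "c": "\u2809",  # ⠉ dots 1-4
    "d": "\u2819",  # ⠙ dots 1-4-5
    "e": "\u2811",  # ⠑ dots 1-5
}

def transliterate(text):
    """Apply only the one-to-one mappings; real Braille tables also need
    one-to-many entries (contractions) and context-sensitive rules."""
    return "".join(braille_map.get(ch, ch) for ch in text.lower())

print(transliterate("bad"))  # ⠃⠁⠙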

To sum up, the metadata needed to identify the appropriate or best NLP resources<br />

for the processing of a text-document are much more complex than what current<br />

standards have defined. In other words, relying on only one part of the metadata,<br />

such as the Unicode scripts or the language codes combined with locality codes,<br />

is not always accurate and thus not completely reliable for automated NLP-tasks.<br />

If NLP-technologies on the Web have, until now, not suffered from this important<br />

misconception (e.g., in the metadata specification in the HTTP or XML header), it<br />

is because they either target about two dozen common languages (applying default<br />

assumptions that prevent less frequently used writing systems from being correctly<br />

processed), or because a linguistically informed human mediates between documents<br />

and resources.<br />

2. XNLRDF and Information Needs Beyond Unicode<br />

Let us assume, for expository purposes, that an NLP-application can correctly<br />

identify the writing system of a document to be processed, and that this writing system<br />

contains references to Unicode scripts or code points. In effect, little follows from<br />

this, as Unicode defines only a very limited number of operations, and defines them

only for a script and not a writing system. The task of XNLRDF is thus to reformulate<br />

the operations defined in Unicode in the terms of a writing system, and, secondly, to<br />

enlarge the linguistic material so that more operations than those defined in Unicode<br />

become possible.<br />



2.1 Unicode and Characters: Uppercasing, Lowercasing and Sorting<br />

Contrary to a common sense understanding of Unicode, the conceptual design of<br />

Unicode avoids the notion of character, since this is a language-specific notion, and<br />

languages are not covered by Unicode. Unicode refers instead to code elements,

which frequently coincide with characters, but also contain combining characters<br />

such as diacritics. Characters and code elements further differ if ligatures (Dutch ‘ij’, Spanish ‘ll’, ‘ch’, Belorussian Lacinka ‘dz’) are to be treated as one character in a language. Uppercasing of ligatures is thus essentially undefined, and will uniformly produce either ‘Xy’ or ‘XY’ from ‘xy’, without regard to the requirements of the writing system. It is thus obvious that specifying the character set of writing systems and

describing the mapping between the characters (e.g., for uppercasing and lowercasing)<br />

is one principal task in XNLRDF, just as lowercasing, for example, is an important step

in the normalisation of a string (e.g., for a lexicon lookup or information retrieval).<br />
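A minimal sketch of what such a writing-system-specific casing description buys (our own illustration, not XNLRDF code): a per-writing-system table that overrides the default character-by-character uppercasing.

from typing import Dict, Optional

DUTCH_TITLECASE = {"ij": "IJ"}   # Dutch treats the digraph 'ij' as one character

def titlecase(word: str, casing_table: Optional[Dict[str, str]] = None) -> str:
    for ligature, upper in (casing_table or {}).items():
        if word.startswith(ligature):
            return upper + word[len(ligature):]
    return word[:1].upper() + word[1:]

print(titlecase("ijsland"))                   # 'Ijsland' -- wrong for Dutch
print(titlecase("ijsland", DUTCH_TITLECASE))  # 'IJsland' -- Dutch writing system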

Similarly, the sorting of characters, the second operation defined in Unicode (e.g.,<br />

for the purpose of presenting dictionary entries or creating indices), depends on the<br />

writing system, and can only be approximately defined on the basis of the script.<br />

Thus, Unicode might successfully sort ‘a’ before ‘b’, but even the position of ‘á’ (after ‘a’ or after ‘z’) is specific to each writing system. Another example is the Spanish

‘ll.’ Although it is no longer considered a character, it maintains its specific position<br />

between ‘l’ and ‘m’ in a sorted list. Thus, sorting requires basic writing system-specific<br />

information, which XNLRDF sets out to describe. What this example also shows is<br />

that the definition of collating sequences for the characters of a writing system is<br />

independent from the status of the character (base character, composed character,<br />

contracted character, contracted non-character, context-sensitive character, foreign<br />

character, swap character, etc.).<br />
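Writing-system-specific sorting can likewise be sketched as a custom collation key (again our own illustration, not XNLRDF code); here ‘á’ is given its own slot directly after ‘a’, a choice some writing systems make, and ‘ll’ sorts as a unit after ‘l’, as in traditional Spanish dictionary order.

import re

COLLATION = ["a", "á", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
             "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]
RANK = {ch: i for i, ch in enumerate(COLLATION)}
TOKEN = re.compile("|".join(sorted(COLLATION, key=len, reverse=True)))  # match 'll' before 'l'

def sort_key(word):
    return [RANK[t] for t in TOKEN.findall(word.lower())]

print(sorted(["luz", "llave", "mano", "árbol", "agua"], key=sort_key))
# ['agua', 'árbol', 'luz', 'llave', 'mano']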

2.2 Linguistic Information: What Else?<br />

The operations covered by Unicode are limited, and most NLP-applications would<br />

require much more linguistic knowledge when processing documents in potentially<br />

unknown writing systems. First, an application might need to identify the encoding<br />

(e.g. KOI-R), the script (Cyrillic), the language (Russian), the standard (civil script),<br />

and orthography (after 1917) of a document. Part of this information might be drawn<br />

from the metadata available in the document, from the Unicode range, or the URL<br />

of a document (in our example, http://xyz.xyz.ru), but filling in the remaining gaps,<br />

(e.g., mapping from the encoding KOI-R to the language Russian, from the language<br />

to potential scripts, or from a script to a language) requires background information<br />

about the legacy encodings and writing systems. This background information is<br />



available in XNLRDF. The automatic identification of writing systems with no or incomplete metadata will also be supported by XNLRDF in the

form of character n-grams. These n-grams are compiled from classified text samples<br />

or corpora within or outside XNLRDF. Thus, for each writing system, XNLRDF can point to URLs of other documents (of the same writing system), to raw

text collections, and to elaborated corpora.<br />
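As a rough sketch of how such character n-gram data can drive identification (the profile size and the overlap score are our own simplification, not XNLRDF’s method):

from collections import Counter

def trigram_profile(text, size=300):
    """The most frequent character trigrams of a classified text sample."""
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {g for g, _ in grams.most_common(size)}

def identify(text, profiles):
    """Return the writing system whose profile overlaps most with the text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda ws: len(sample & profiles[ws]))

# profiles would be compiled from classified samples or corpora, e.g.
# profiles = {"Welsh (Latin, UK)": trigram_profile(welsh_sample), ...}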

From the identified writing system, the application can start to retrieve additional<br />

resources that support the segmentation, stemming, hyphenation, and so forth of<br />

the document. A Web-crawler, for example, would try to find those text units (words<br />

and phrases) that are suitable for indexing. In most cases, the document will be<br />

segmented into words using a limited number of writing-specific word-separating<br />

characters (e.g., empty space, comma, hyphen, etc.). Although Unicode should provide<br />

this information, writing systems also differ as to which characters are unambiguous<br />

word separators, ambiguous ones, or not word separators at all. Thus, within those<br />

languages using Latin script, some integrate an empty space into a word, for example,<br />

Northern Sotho (Prinsloo & Heid 2006), while others like Lojban integrate ‘,’ and<br />

‘.’ in the middle of a word. Unconditionally splitting a text in these languages with<br />

the empty space character ( ), a comma (‘,’), or a period (‘.’) would cut words into<br />

undefined chunks.<br />
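A minimal sketch of the kind of writing-system-aware tokenizer this information makes possible (the separator sets below are invented for illustration and are not XNLRDF’s actual data):

import re

SEPARATORS = {
    "English (Latin)": r"[ \t\n,.;:!?]+",
    "Lojban (Latin)":  r"[ \t\n]+",    # ',' and '.' may occur inside Lojban words
}

def tokenize(text, writing_system):
    return [tok for tok in re.split(SEPARATORS[writing_system], text) if tok]

print(tokenize("la .alis. cu klama", "Lojban (Latin)"))
# ['la', '.alis.', 'cu', 'klama']  -- the periods stay inside the word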

For writing systems that do not mark word boundaries (e.g., Han-characters or<br />

Kanji), the Web-crawler should index either each character individually (this is what<br />

Google does), or identify words through word lists and/or rules. Spelling variants<br />

(humour, humor), writing variants (Häuser, Haeuser, H&auml;user or 灣, 湾, Wan),

inflected word forms (come, came), abbreviations (European Union, EU) should be<br />

mapped onto their base forms to improve the quality of document retrieval. All these<br />

are basic operations XNLRDF sets out to cover.<br />

3. Difficulties in Obtaining Information beyond Unicode<br />

The need for a linguistic extension of Unicode has long been recognised, and<br />

most of the information that applications such as the one sketched above require is available from online resources. Thus, NLP-applications, at least theoretically, could get it automatically from the Web. If this were without problems, XNLRDF would

be a redundant copy of other online information. However, for several reasons,<br />

the resources on the Web or the information contained within cannot be accessed,<br />

extracted and integrated by these applications (and by humans only with difficulty).<br />

First, there may be difficulties in finding and accessing the resources:
• Resources cannot be found because metadata are not available; or,



• The resource is not directly accessible for applications: for example, accessing it<br />

requires transactions like registering, submitting passwords, entering the credit card<br />

number, etc.<br />

Then, once a resource is found and accessed, there may be difficulties in extracting or understanding the necessary information, such as:

• The resource is not formally structured;<br />

• The information within the resource is formally structured, but the syntax of the<br />

structure is not defined: for example, fields are separated by a special character, but<br />

the character used is not specified;<br />

• The information is ambiguous, as in the following example: “Abkhaz is a North<br />

West Caucasian language with about 105,000 speakers in Georgia, Turkey and Ukraine<br />

(...) The current Cyrillic-based system” (http://www.omniglot.com/writing/abkhaz.htm), which does not specify which region is actually using or not using the Cyrillic-

based script at present;<br />

• The syntax is defined, but the semantics of the units are not defined, as they could be through the use of XML namespaces, and so forth. With a namespace, a NOUN-tag can

be linked to the URL containing the definition of the tag. Thus, different NOUN-tags<br />

could be used without confusion;<br />

• The information in the different resources is not compatible, that is, the notion<br />

of language varies greatly between resources. To give one example, what the Office<br />

of the High Commissioner for Human Rights (Universal Declaration of Human Rights)<br />

describes as Zapoteco is not covered in Omniglot, and is split into more than fifty<br />

languages by Ethnologue and the Rosetta Project; or

• Most resources are language-centred and do not put the writing system into the<br />

centre of the description. To understand how serious this misconception is, imagine searching for a Chinese document and getting Chinese in Braille, which is Chinese to the

same degree as what you expected to get.<br />

In view of all this, there is an enormous need to bring the available resources<br />

together and make them compatible, available and parsable; otherwise, the information<br />

will be barely usable for NLP-applications. This compiling work necessarily involves<br />

a combination of the linguists’ careful classification, description, and automated<br />

approaches to knowledge acquisition. Both techniques will first exploit other resources<br />

relevant for XNLRDF.<br />



4. Related Activities and Available Resources<br />

Fortunately, XNLRDF is embedded in a wide field of research activities that create,<br />

document and make accessible natural language resources. What makes XNLRDF<br />

stand out in its field is its focus on Natural Language Processing resources on the<br />

one hand, and fully-automated access to the data by an application on the other.<br />

Nevertheless, XNLRDF will try to profit from related projects and to comply with<br />

available standards.<br />

Repositories of the world’s languages are available online. Figuring most<br />

prominently among them are: Omniglot, Ethnologue, The Rosetta Project, TITUS,

and the Language Museum (http://www.language-museum.com). Although these<br />

resources offer rich information on scripts and languages, they are almost unusable<br />

for computer applications, as they are designed for human users. The difficulties in<br />

using Ethnologue, for example, derive from its focus on spoken languages and its<br />

tendency to introduce new languages where others just see regional variants of the<br />

same language. This problem has been inherited by the Rosetta Project and the World<br />

Atlas of Language Structures (Haspelmath et al. 2005). In addition, some sites (e.g.,<br />

the Language Museum) use scanned images of characters, words and texts that of<br />

course are almost impossible to integrate into NLP resources. Still other sites (e.g.,<br />

TITUS) use mainly transcriptions or transliterations that are equally worthless without<br />

a formal definition of the mappings applied. Currently, the information available on<br />

these sites is checked and integrated manually into the XNLRDF data structure.<br />

OLAC, the Open Language Archives Community project, is setting up a network

of interoperating repositories and services for hosting and accessing NLP resources.<br />

The project’s aims and approaches are very close to those of XNLRDF, and we foresee<br />

a considerable potential for synergy. The metadata and their definition are what will

be most relevant to XNLRDF. However, the OLAC user scenario assumes a human user<br />

looking for resources and tools, whereas XNLRDF is designed to allow applications to<br />

find resources autonomously given a text document to be processed and a task to be<br />

achieved.<br />

Closely related to OLAC is the E-MELD project, which supports the creation of<br />

persistent and reusable language resources. In addition, queries over disparate<br />

resources are envisaged. To what extent XNLRDF can profit from E-MELD has yet to be

investigated in detail.<br />

Data consortia like ELRA or LDC host NLP resources that can be identified<br />

through the machine-readable metadata in OLAC. However, resources are not freely<br />

accessible. Commercial transactions are required between the identification of the<br />



resource and the access to the resource. For this reason, these resources will remain

unexplored, even if prices are modest. Although ELRA and LDC have their merits, for<br />

small languages, better solutions are available for the hosting of data (cf. Streiter<br />

2005).<br />

Project Gutenberg provides structured access to its 16,000 documents (comprising<br />

about thirty languages) through an XML-RDF. Unfortunately, information characterising<br />

text T1 as translation of T2 is still not provided, that is, although parallel corpora are<br />

implicitly present, they are not identifiable as such. In theory, the documents of<br />

Project Gutenberg could be used to build up corpora in XNLRDF. Such a copying of<br />

resources, however, might only be justifiable for writing systems for which little corpus<br />

material is available. More important might be a mapping from the writing system of<br />

XNLRDF to the documents of Project Gutenberg, thus translating the available XML-<br />

RDF in terms of XNLRDF.<br />

Free monolingual and parallel corpora are available at a great number of sites,<br />

most prominently at http://www.unhchr.ch/udhr/navigate/alpha.htm (Universal<br />

Declaration of Human Rights in 330 languages), http://www.translatum.gr/bible/download.htm (The Bible), and The European Parliament Proceedings Parallel Corpus (http://people.csail.mit.edu/koehn/publications/europarl), among others. Those

documents that support otherwise underrepresented writing systems will be integrated<br />

into XNLRDF in the form of corpora.<br />

The Wikipedia project is interesting for XNLRDF for a number of reasons. First,<br />

it provides documents that can be used to build corpora without infringing upon<br />

copyrights. Second, as the Wikipedia is available in more than one hundred languages,<br />

thousands of quasi-parallel texts become accessible. Third, the model of cooperation<br />

in Wiki projects, and the underlying software, will indicate the way XNLRDF will go.<br />

Thus, XNLRDF will gradually enlarge the community of researchers involved to the<br />

point that the world’s linguists will be able to collect the data they need for their<br />

writing systems. This issue will be further discussed below.<br />

5. Conceptual Design of XNLRDF<br />

The purpose of XNLRDF is to find adequate NLP resources to process a text document.<br />

To this end, the metadata of the document and the resource are matched. The better<br />

the match, the more suitable the resource is for the processing of the document. The<br />

metadata matched are those categories that make up the writing system.<br />



5.1 Finding Resources via the Writing System<br />

The writing system has a function similar to SUBJECT.LANGUAGE in the OLAC-<br />

metadata, defined in Simons & Bird (2001) as “[…] a language which the content of

the resource describes or discusses.” A writing system in XNLRDF is defined by the n-tuple of the categories language, locality, script, orthography, standard, time period,

and reference to another writing system. The writing system is a property of the<br />

text document and the resource. In XNLRDF, for each writing system there is a more<br />

abstract writing system (e.g., without constraints in locality) as a fallback solution to<br />

fill in empty categories with default assumptions. In general, for each language there<br />

is one writing system without a locality that provides a default locality in the event<br />

that no locality can be derived from the document. (Cf. Plate 1: Different Writing<br />

Systems for Mandarin Chinese. The first row is the fallback with the default locality.)<br />

These underspecified writing systems are currently also used (and perhaps incorrectly)<br />

for supranational writing systems (e.g., English-based writing in the UN).

Plate 1: Writing Systems for Mandarin Chinese. Note the first row showing Chinese without<br />

locality as a super-regional language. In case of doubt, the application has to assume<br />

China as the locality where the text-document originated.<br />
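The matching and defaulting described above can be sketched roughly as follows (the scoring and the dictionary representation are our own simplification, not XNLRDF’s specification); unset categories act as wildcards and the most specific compatible record wins:

FIELDS = ("language", "locality", "script", "orthography", "standard")

def match_score(doc, ws):
    """Number of agreeing categories, or None if any category conflicts."""
    score = 0
    for f in FIELDS:
        d, w = doc.get(f), ws.get(f)
        if d is None or w is None:
            continue            # unset category acts as a wildcard
        if d != w:
            return None         # explicit conflict: not a candidate
        score += 1
    return score

def best_writing_system(doc, writing_systems):
    scored = [(match_score(doc, ws), ws) for ws in writing_systems]
    scored = [(s, ws) for s, ws in scored if s is not None]
    return max(scored, key=lambda p: p[0])[1] if scored else None

systems = [
    {"language": "Mandarin Chinese"},                          # fallback row of Plate 1
    {"language": "Mandarin Chinese", "locality": "Taiwan"},
]
print(best_writing_system({"language": "Mandarin Chinese", "locality": "Taiwan"}, systems))
# {'language': 'Mandarin Chinese', 'locality': 'Taiwan'}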

The inclusion of a writing system into XNLRDF is pragmatically handled. Included<br />

are all writing systems for which text documents with yet uncovered combinations of<br />

language, locality, and so forth can be found. The same pragmatic approach is used to<br />

(or not to) distinguish languages and dialects. Thus, dialects are treated identically to<br />

languages, whenever documents of that variant are found (e.g., Akan Akuapem, Akan<br />

Asante and Akan Fante). A document that claims to represent a language family is

registered with a writing system of this language family. The same goes for localities;<br />

whenever a document is reasonably associated with one region - even if that region<br />

is not a recognised geographical, administrative or economic body - the region will

be included as locality.<br />



5.2 The Names of Metadata Categories<br />

All this leads to the overall problem that, for the main categories of the writing<br />

system, no standardised identifiers are available. We already discussed the lack of<br />

standard names for the standard and orthography of a writing system. But in addition,<br />

languages, localities and scripts do not necessarily have standard names or standard<br />

codes, although XNLRDF tries to integrate the ISO 639 codes for languages (ISO 639 2006) (the 2-letter code ISO-639-1 and the 3-letter code ISO-639-2), the SIL codes of Ethnologue (Version 14), the Unicode naming of scripts, and

ISO-3166 (ISO 3166 2006) encoding of localities (countries, regions, and islands).<br />

A number of limitations, however, make these codes difficult to use: ISO-639-1<br />

covers only a few languages; ISO-639-2 assigns more than one code to one language;<br />

both ISO norms assign one code to sets of languages, language families, and so on;<br />

and, SIL-codes change from version to version (about every four years), and do not<br />

cover historic languages, artificial languages, language groups or languages that exist<br />

only as written standard.<br />

The situation for the encoding of languages will improve with the adoption of<br />

the draft ISO/DIS 639-3 as a standard (presumably in 2006), as it will combine the<br />

respective advantages of the SIL-codes and the ISO-codes. Until then, applications<br />

will continue to use the RFC 3066 standard for HTTP headers, HTML metadata and in<br />

the XML lang attribute. 2- and 3-letter codes are interpreted as ISO-639-1 or ISO-639-2 respectively. ISO-639-1 can be mapped onto ISO-639-3, and ISO-639-2 is identical to ISO-639-3, so that, in the future, only ISO-639-1 (transitional) and ISO-639-3 will be needed (for more information on this development, consult the webpages http://en.wikipedia.org/wiki/ISO_639-3, http://www.ietf.org/rfc/rfc3066.txt and http://www.ethnologue.com/codes/default.asp). SIL-codes will then become superfluous

in XNLRDF, and languages that are not written can be removed from XNLRDF. The<br />

advantage of ISO-639-3 is that it can group together individual spoken languages (such<br />

as two dozen spoken Arabic languages) into ‘macro languages’ (Arabic), thus preventing

writing systems from being fragmented due to the fragmentation of languages.<br />

Most reliably, however, the categories of the writing system can be accessed with<br />

their natural language name in one of the world’s major writing systems, for which

XNLRDF guarantees an unambiguous match. As a consequence of this recursion, as<br />

outlined in Gödel’s ‘Incompleteness Theorems’, neither the names nor the categories<br />

can be formally defined; they can only be explained by the use they are put to (e.g.,<br />

the material that is attached to a name). Fortunately, this problem is not specific to XNLRDF, but is shared by other classification standards like ISO norms and SIL

codes.<br />

5.3 Linguistic Information for Writing Systems<br />

A writing system is associated via a resource type with the corresponding resources.<br />

Writing systems stand in a many-to-many relation to encoding (Plate 2), numerals (Plate<br />

3), and function words (Plate 4); characters; sentence separators; word separators;<br />

URLs (classified according to genres); dictionaries; monolingual and parallel corpora;<br />

and, n-gram statistics.<br />

Plate 2: A writing system (Mandarin Chinese in Taiwan) related to ENCODING.<br />

Plate 3: A writing system (Thai) related to NUMERALS.<br />



Plate 4: A writing system (Thai) related to FUNCTION_WORDS.<br />

5.4 Methods and Implementation<br />

The data-model is implemented in a relational database, which provides all means<br />

to control the coherence of the data, create backups, and allow people from different<br />

parts of the world to work on the same set of data. For applications working with<br />

relational databases, this data can be downloaded under the GNU Public License as<br />

database dump (PostgreSQL). An interface to the database has been created as a

working tool for the creation and maintenance of data.<br />

An additional goal is to make XNLRDF available in XML-RDF. RDF, a framework for<br />

the description of resources, has been designed by the W3C to describe resources<br />

with their necessary metadata for applications rather than for people (Manola & Miller<br />

2004). Whereas, in the relational database, the defaulting mechanism is programmed<br />

as a procedure, in XML-RDF defaults are compiled out. In this way, the information in<br />

XNLRDF can be accessed through a simple look-up with a search string such as ‘Thai’,<br />

‘Thailand’, ‘Thai;Thailand’, and so forth.<br />
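The contrast between the procedural defaulting in the database and the compiled-out defaults in the XML-RDF version can be pictured roughly like this (keys and values invented for illustration; this is not the actual XNLRDF serialisation):

# Compiled-out defaults: every plausible search string is pre-expanded to the
# same writing-system record, so a client needs only a flat look-up.
COMPILED = {
    "Thai":          "writing-system: Thai / Thailand / Thai script",
    "Thailand":      "writing-system: Thai / Thailand / Thai script",
    "Thai;Thailand": "writing-system: Thai / Thailand / Thai script",
}

def lookup(search_string):
    return COMPILED.get(search_string)

print(lookup("Thai;Thailand"))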



6. Envisaged Usage and Impact<br />

In order to give a word-to-word translation, for example, within a Web-browser, the<br />

Web-browser has to know where to find a dictionary and how to use it. With only one<br />

such resource, a special function within the Web-browser might handle this (e.g., a<br />

number of FIREFOX add-ons do exactly this). But with hundreds of language resources,<br />

a more general approach is required that not only involves adequate resources, but<br />

also metadata with an NLP-specific metadata dictionary and metadata syntax. NLP-<br />

operations like tagging or meaning disambiguation for annotated reading then have

to be defined recursively in the metadata syntax: in this way, a tagger can call a<br />

tokenizer if it can’t perform tokenization itself.<br />
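The recursive definition of operations can be sketched as a small dependency-resolution step (operation names and the registry layout are invented for illustration; this is not a specification of the envisaged metadata syntax):

REGISTRY = {
    # operation: (prerequisite operations, resource needed)
    "tokenize":     ((), "word list and separator characters"),
    "tag":          (("tokenize",), "tagger model"),
    "disambiguate": (("tag",), "sense inventory"),
}

def plan(operation, writing_system):
    """Ordered list of (operation, resource) steps, prerequisites first."""
    prereqs, resource = REGISTRY[operation]
    steps = []
    for p in prereqs:
        steps.extend(plan(p, writing_system))
    steps.append((operation, f"{resource} for {writing_system}"))
    return steps

print(plan("tag", "Thai;Thailand"))
# [('tokenize', 'word list and separator characters for Thai;Thailand'),
#  ('tag', 'tagger model for Thai;Thailand')]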

The substantiation of the concept of XNLRDF will thus consist of compiling XNLRDF<br />

into a Mozilla-compatible RDF and integrating it into an experimental Mozilla module.

Not only is Mozilla a base for a great number of very popular applications (e.g.,<br />

Firefox, Thunderbird, Bugzilla, Netscape, Mozilla Browser, and Mozilla e-mail), but it<br />

also has at its disposal an RDF machine that can be accessed via JavaScript and XPConnect

(Boswell et al. 2002). A minor test-application of XNLRDF in Mozilla might thus have<br />

a tremendous impact.<br />

Less spectacular than the still pending integration into Mozilla is the testbed where<br />

XNLRDF is currently used. It serves as a linguistic database for Gymn@zilla, a CALL<br />

system that handles about twenty languages, with new languages added on a regular<br />

basis (Streiter et al. 2005). In general, CALL systems are very likely to be the first<br />

applications to profit from XNLRDF. They are frequently applicable to many languages<br />

and require relatively uncomplicated operations. In fact, many CALL modules are<br />

freely available, and, to some extent, language independent (e.g., Hot Potatoes).<br />

In practice, however, they are often only suited for an undefined group of languages<br />

(e.g., they require a blank to separate words). With the linguistic intelligence of<br />

XNLRDF, such modules could not only extend the range of languages, but also generate<br />

better exercises and provide better feedback.<br />

Web-crawlers and IR systems are other candidates that will certainly profit from<br />

XNLRDF. While most IRs may be tuned to one or a few languages, they generally lack<br />

the capacity to process a wide range of languages. The large number of NLP systems

integrated in Google shows the importance of linguistic knowledge in IR.<br />

To sum up, we not only hope to bring many more languages to text document<br />

processing applications, but hope to do this in a standard format that can be easily<br />

processed by XML or XML-RDF-enabled applications.<br />



7. Status of the Project and Future Developments<br />

The project is still an unfunded garage project. In the previous project phase,<br />

we defined the base and implemented the first model in a relational database.<br />

An interface to that database has been created to allow new data to be entered<br />

via the WWW. After inserting more than 1000 writing systems and getting a better<br />

understanding of the framework necessary to describe a writing system, we are<br />

currently adding linguistic information to describe the writing systems. The data<br />

structures for characters, corpora, dictionaries, and so forth are still changing when<br />

new requirements or linguistic complexities are encountered. URLs and corpora are<br />

collected to support the description of the writing system and as useful material to be<br />

integrated in XNLRDF for NLP-applications (e.g., for the creation of word lists).

In the meantime, we hope to attract more researchers to collaborate in the project.<br />

It is impossible to say now whether or not the project will be as open as the

Wikipedia. It is certain, however, that this endeavour will require the collaboration of<br />

a wide range of researchers around the globe. Very likely, small tools will be created<br />

around XNLRDF that will illustrate the use the resource can be put to, and motivate<br />

linguists to enter data for their language (writing system). Such tools will have the<br />

additional advantage of checking the accuracy and completeness of the data.

8. Glossary<br />

Language<br />

Language is one of the discriminating features that defines a writing system in XNLRDF.<br />

XNLRDF uses language identification standards such as ISO-639-1 and ISO-639-2 to map<br />

language names to unambiguous language codes.<br />

Locality<br />

Locality is one of the discriminating features that defines a writing system. ISO 3166<br />

is the standard that defines locality codes. However, XNLRDF pragmatically includes<br />

a region as locality, whenever there is a document that is reasonably associated with<br />

the region. This applies even if the region is not a recognised geographical,

administrative or economic body.<br />

Metadata Categories<br />

Metadata in language-related applications serve to map a document to

be processed to the appropriate NLP (natural language processing) resources. XNLRDF<br />

uses the categories of language, locality, script, orthography, standard, time period,<br />



and reference to another writing system. Together, these NLP metadata categories in<br />

XNLRDF define a writing system.<br />

Natural Language Resource<br />

A natural language resource in XNLRDF refers to structured linguistic information<br />

and/or NLP applications that are accessible to machines via a clearly defined writing<br />

system. Types of resources include, for example, encoding, numerals, function words,<br />

characters, sentence separators, word separators, URLs, dictionaries, corpora, and n-<br />

gram statistics, as well as applications for basic NLP tasks such as language recognition,<br />

tokenization, stemming, tagging, segmentation, hyphenation, and indexing, or complex NLP implementations such as term extraction, document retrieval, meaning disambiguation, and Computer-Assisted Language Learning tools.

Orthography<br />

Orthographies can sometimes be tracked using the time period category. However,<br />

different orthographies might coexist for a certain time span (e.g., in German,<br />

after the latest orthography reform). Therefore, orthography is one of the discriminating

features that defines a writing system.<br />

Reference<br />

Reference is used in XNLRDF to describe transliterations. The transliteration is a<br />

writing system on its own, but can only be understood and correctly processed when<br />

referring to another underlying writing system. For example, a text written in Braille<br />

can only be understood and processed when referred to the underlying writing system<br />

(e.g., Braille referring to standard German in Austria in ‘new orthography’). Reference<br />

is a recursive category. It is one of the discriminating features that defines a writing<br />

system.<br />

Script<br />

In Unicode, legacy scripts are named (e.g., Latin, Arabic and Cyrillic). XNLRDF uses<br />

these script names as a discriminating feature to define writing systems.<br />

Time Period<br />

The time period offers the possibility to calculate intersections with other time<br />

constraints (e.g., between the validity of an orthography and the production date<br />

of the document). Therefore, time period is one of the discriminating features that<br />

defines a writing system.<br />



Writing Standard<br />

Sometimes the same language can be written in different, concurrent, isochronic

writing standards. For example, Nynorsk, Bokmål, Riksmål and Høgnorsk are different<br />

contemporaneous conventions that represent Norwegian. Therefore, writing standard<br />

is one of the discriminating features that defines a writing system.<br />

Writing System<br />

The writing system helps to map a document to the adequate NLP resources necessary<br />

to process the document. A writing system in XNLRDF is defined by the n-tuple of<br />

language, locality, script, orthography, standard, time period and reference to another<br />

writing system. In XNLRDF, for each writing system there are also more abstract writing<br />

systems (e.g., those without constraints in locality) as a fallback to fill in the missing

information with default assumptions.<br />

XNLRDF<br />

XNLRDF stands for ‘Natural Language Resource Description Framework.’ It is an Open<br />

Source Framework for Multilingual Computing, designed to allow applications to find<br />

language resources autonomously, given a text document to be processed and a task<br />

to be achieved. XNLRDF is distributed either in XML-RDF or as database dump.<br />


References<br />

Boswell, D. et al. (2002). Creating Applications with Mozilla. Sebastopol: O’Reilly.<br />

“E-MELD”. Online at http://emeld.org.
“Ethnologue”. Online at http://www.ethnologue.com.
“European Parliament Proceedings Parallel Corpus 1996-2003”. Online at http://people.csail.mit.edu/koehn/publications/europarl.

Haspelmath, M. et al. (eds) (2005). The World Atlas of Language Structures. Oxford:<br />

Oxford University Press.<br />

“Hot Potatoes”. Online at http://web.uvic.ca/hrd/halfbaked.
ISO 639 (1989). Code for the representation of the names of languages.

ISO 3166-1 (1997). Codes for the representation of names of countries and their<br />

subdivisions -- Part 1: Country codes.<br />

ISO 3166-2 (1998). Codes for the representation of names of countries and their<br />

subdivisions -- Part 2: Country subdivision code.<br />

ISO 3166-3 (1999). Codes for the representation of names of countries and their<br />

subdivisions -- Part 3: Code for formerly used names of countries.<br />

Manola, F. & Miller, E. (eds) (2004). RDF Primer. W3C Recommendation, 10 February 2004. http://www.w3.org/TR/rdf-primer/.

Norwegian (2006, March 7). In Wikipedia, The Free Encyclopedia. Retrieved March 7, 2006. Online at http://en.wikipedia.org/wiki/Norwegian.
“OLAC, the Open Language Archives Community project”. Online at http://www.language-archives.org/documents/overview.html.

“Omniglot”. Online at http://www.omniglot.com.

Prinsloo, D. & Heid, U. (this volume). “Creating Word Class Tagged Corpora for Northern<br />

Sotho by Linguistically Informed Bootstrapping”, 97-115.<br />

“Rosetta Project”. Online at http://www.rosettaproject.org.
Simons, G. & Bird, S. (eds) (2001). OLAC Metadata Set. http://www.language-archives.org/OLAC/olacms.html.

Streiter, O. (this volume). Implementing NLP-Projects for Small Languages: Instructions<br />

for Funding Bodies, Strategies for Developers, 29-43.<br />

Streiter, O. et al. (2005). “Dynamic Processing of Texts and Images for Contextualized Language Learning”. Proceedings of the 22nd International Conference on English Teaching and Learning in the Republic of China (ROC-TEFL), Taipei, June 4-5, 278-98.

“TITUS”. Online at http://titus.uni-frankfurt.de.
“Translatum”. Online at http://www.translatum.gr/bible/download.htm.
“Unicode Enabled Products”. Online at http://www.unicode.org/onlinedat/products.html.
“Universal Declaration of Human Rights”. Online at http://www.unhchr.ch/udhr/index.htm.



Speech-to-Speech Translation for Catalan<br />

Victoria Arranz, Elisabet Comelles and David Farwell<br />

This paper focuses on a number of issues related to adapting an existing interlingual<br />

representation system to the idiosyncrasies of Catalan in the context of the FAME

Interlingual Speech-to-Speech Machine Translation System for Catalan, English and<br />

Spanish. The FAME translation system is intended to assist users in making hotel<br />

reservations when calling or visiting from abroad. Following a brief presentation of<br />

the Catalan language, we describe the system and review the results of a major<br />

user-centered evaluation. We then introduce Interchange Format (IF), the interlingual<br />

representation system underlying the translation process, and discuss six types of<br />

language-dependent problems that arose in extending IF to the treatment of Catalan,<br />

along with our approach to dealing with these problems. They include the lack of<br />

dialog-level structural relationships, conceptual gaps, the lack of register distinctions<br />

(e.g. specifically and formality), the treatment of proper names, the lack of a method<br />

for dealing with partitives and conceptual overgranularity. Finally, we summarise the<br />

contents and suggest some future directions for research and development.<br />

1. Introduction<br />

The goal of this paper is to review a number of problems that arose in adapting<br />

an existing Interlingua, Interchange Format (IF), to the treatment of Catalan, and<br />

to describe our approach to dealing with them. As classes, these problems are not<br />

peculiar to Catalan per se, but the language presents an interesting case study in<br />

terms of their particular manifestation and how they might be dealt with. They<br />

include the need for representing dialogue-level structural relations, dealing with<br />

conceptual gaps, the need for representing register distinctions, a semi-productive<br />

method for dealing with proper names, the need for representing partitive references,<br />

and dealing with conceptual overgranularity. This effort was part of the development<br />

of the FAME Interlingual Speech-to-Speech Machine Translation System for Catalan,<br />

English and Spanish, which was carried out between 2001 and 2004.<br />

In section 2, we provide a background for the discussion, giving some information<br />

on the Catalan language, and describing the project and the translation system. In section

3, we briefly describe an evaluation procedure, and present some results from a<br />

major user-centred evaluation. This section proves the feasibility of the adaptation<br />

and the success of the system, which was publicly demonstrated at the 2004 Forum<br />



of Cultures in Barcelona (with a very positive outcome). In section 4, we discuss<br />

Interchange Format (IF), the interlingua underlying the translation process, the<br />

inadequacies encountered and the modifications made while adapting the framework<br />

to Catalan and Spanish. Finally, in Section 5, we summarize the results and conclude<br />

with a discussion of future directions.<br />

2. Background<br />

The Catalan language, with all its variants, is spoken in the Països Catalans, which include the Spanish regions of Catalonia, Valencia and the Balearic Islands, the French department of the Pyrénées-Orientales, and the Italian area of Alghero. Within Spanish territory, Catalan is also spoken in some parts of Aragon and Murcia. Catalan is a Romance language and shows similarities with other languages belonging to the Romance family, in particular Spanish, Galician and Portuguese. Nowadays, Catalan is understood by 9 million people and spoken by 7 million people.

The FAME Interlingual Speech-to-Speech Translation System for Catalan, English<br />

and Spanish was developed at the Universitat Politècnica de Catalunya (UPC), Spain,<br />

as part of the recently completed European Union-funded FAME project (Facilitating<br />

Agents for Multicultural Exchange) that focused on the development of multi-modal<br />

technologies to support multilingual interactions (see http://isl.ira.uka.de/fame<br />

for details). The FAME translation system is an extension of the existing NESPOLE!<br />

translation system (Metze et al. 2002; Taddei et al. 2003) to Catalan and Spanish in<br />

the domain of hotel reservations. At its core is a robust, scalable, interlingual speech-<br />

to-speech translation system having cross-domain portability that allows for effective<br />

translingual communication in a multi-modal setting. Although the system architecture<br />

was initially based on NESPOLE!, all the modules have now been integrated on an<br />

Open Agent platform (Holzapfel et al. 2003; for details see http://www.ai.sri.com/<br />

~oaa). This type of multi-agent framework offers a number of technical features for<br />

a multi-modal environment that are highly advantageous for both system developers<br />

and users.<br />

Broadly speaking, the FAME translation system consists of an analysis component<br />

and a generation component. The analysis component automatically transcribes spoken<br />

source language utterances and then maps that transcription into an interlingual<br />

representation. The generation component maps from interlingua into target language<br />

text that, in turn, is passed to a speech synthesiser that produces a spoken version<br />

of the text. The central advantage of this interlingua-based architecture is that, in<br />

adding additional languages to the system (such as Catalan and Spanish), it is only<br />

necessary to develop new analysis and generation components for each new language<br />



in order to be able to translate into and out of all of the other existing languages in<br />

the system. In other words, no source-language-to-target-language specific transfer<br />

modules are required, as would be the case for transfer systems, with the result that<br />

the development task is considerably simplified.<br />

For both Catalan and Spanish speech recognition, we used the JANUS Recognition<br />

Toolkit (JRTk) developed at Universität Karlsruhe and Carnegie Mellon University<br />

(Woszczyna et al. 1993). For the text-to-text component, the analysis side utilises<br />

the top-down, chart-based SOUP parser (Gavaldà 2000) with full domain action level<br />

rules to parse input utterances. Natural language generation is done with GenKit, a<br />

pseudo-unification-based generation tool (Tomita et al. 1988). For both Catalan and<br />

Spanish, we use a Text-to-Speech (TTS) system fully developed at UPC, which uses a

unit-selection-based, concatenative approach to speech synthesis.<br />

The Interchange Format (Levin et al. 2002), the interlingua used by the C-STAR<br />

Consortium (see http://www.c-star.org for details), has been adapted for this effort.<br />

Its central advantage in representing dialogue interactions such as those typical of<br />

speech-to-speech translation systems is that it focuses on identifying the speech acts<br />

and the various types of requests and responses typical of a given domain. Thus,<br />

rather than capturing the detailed semantic and stylistic distinctions, it characterises<br />

the intended conversational goal of the interlocutor. Even so, in mapping into or<br />

out of IF, it is necessary to take into account a wide range of structural and lexical<br />

properties related to Catalan and Spanish.<br />

For the initial development of the Spanish analysis grammar, the already existing<br />

NESPOLE! English and German analysis grammars were used as a reference point.<br />

Despite using these grammars, great efforts had to be made to overcome important<br />

differences between English, German and the Romance languages in focus. The<br />

Catalan analysis grammar, in turn, was adapted from the Spanish analysis grammar,<br />

and, in this case, the process was rather straightforward. The generation grammars<br />

for Catalan and Spanish were mostly developed from scratch, although some of the<br />

underlying structure was adapted from that of the NESPOLE! English generation<br />

grammar. Language-dependent properties such as word order, gender and number<br />

agreement, and so forth needed to be dealt with representationally, but on the whole,<br />

starting with existing structural descriptions proved to be useful. On the other hand,<br />

the generation lexica play a significant role in the generation process and these had to<br />

be developed from scratch. As for the generation grammars, however, a considerable<br />

amount of work took place in parallel for both Romance languages, which contributed<br />

to a more efficient development of both the Catalan and Spanish generation lexica.<br />



3. Evaluation<br />

The evaluation was performed with real users of the speech-to-speech

translation system, in order to both:<br />

• examine the performance of the system in as real a situation as possible, as if<br />

it were to be used by a real tourist trying to book accommodation in Barcelona;<br />

and,<br />

• study the influence of using speech input, and thus Automatic Speech Recognition<br />

(ASR), in translation.<br />

3.1 Task-Oriented Evaluation Metrics<br />

A task-oriented methodology was developed to evaluate both the end-to-end<br />

system (with ASR and TTS) and the source language transcription to target language<br />

text subcomponent. An initial version of this evaluation method had already proven<br />

useful during system development, since it allowed us to analyse content and form<br />

independently, and thus contributed towards practical system improvements.<br />

The evaluation metric used recognises three main categories (Perfect, Ok and<br />

Unacceptable), where the second was further subdivided into Ok+, Ok and Ok-. During<br />

the evaluation, this metric was independently applied to two separate parameters,<br />

form and content. In order to evaluate form, only the generated output (text or<br />

speech) was considered by the evaluators. To evaluate content, evaluators took into<br />

account both the input utterance or text and the output text or spoken utterance.<br />

Accordingly, the meaning of the metrics varies depending on whether they are being<br />

used to judge form or to judge content:<br />

• Perfect: well-formed output (form) or communication of all the information the<br />

speaker intended (content).<br />

• Ok+/Ok/Ok-: acceptable output, ranging from only a minor error of form (e.g. a

missing determiner) or some minor uncommunicated information (Ok+) to a

more serious problem of form or of uncommunicated information content (Ok-).

• Unacceptable: unacceptable output, either essentially unintelligible (form) or<br />

information unrelated to the input (content).<br />

3.2 Evaluation Results<br />

The results obtained from the evaluation of the end-to-end translation system for<br />

the different language pairs are shown in Tables 1, 2, 3 and 4. The results obtained<br />

from the translation of clean audio-transcriptions are summarised in Tables 5, 6, 7<br />

and 8. From the results, we can conclude that many of the errors are caused by the<br />



ASR component. This is particularly so when translating from English 1 into Catalan<br />

or Spanish. For instance, if we consider the form parameter, Tables 7 and 8 show<br />

that there are no unacceptable translations when using the text-to-text interlingual<br />

translation system for the English-Catalan and English-Spanish pairs, while Tables 3 and 4

show that 5.99% and 9.6% of the outputs, respectively, become unacceptable when using the

speech-to-speech system.

In fact, the interlingual translation component performs very well when used on<br />

text input and degrades when using speech input. However, it should be pointed<br />

out that, even so, results remain rather good for the end-to-end system. For the<br />

worst of our language pairs (English-Spanish), a total of 62.4% of the utterances were<br />

judged acceptable in regard to content. This is comparable to evaluation results of<br />

other state-of-the-art systems such as NESPOLE! (Lavie et al. 2002), which obtained<br />

slightly lower results and was performed on Semantic Dialog Units (see below) instead<br />

of utterances (UTT), thus simplifying the translation task. The Catalan-English and<br />

English-Catalan pairs were both quite good with 73.1% and 73.5% of the utterances<br />

being judged acceptable, respectively, and the Spanish-English pair performs very<br />

well with 96.4% of the utterances being acceptable.<br />
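
As a concrete illustration of how these percentages relate to the score categories, the short sketch below (hypothetical Python, not part of the actual evaluation tooling) simply sums every category other than Unacceptable; applied to the content column of Table 4 below, it reproduces the 62.4% figure quoted for English-Spanish.

    # Sketch: 'acceptable' = everything except the Unacceptable category.
    # The figures are the content scores of Table 4 (English-Spanish, with ASR).
    content_scores = {
        "Perfect": 17.60,
        "OK+": 10.40,
        "OK": 18.40,
        "OK-": 16.00,
        "Unacceptable": 37.60,
    }

    acceptable = sum(v for k, v in content_scores.items() if k != "Unacceptable")
    print(f"Acceptable: {acceptable:.1f}%")   # Acceptable: 62.4%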

Table 1: Evaluation of End-to-End Translation (with ASR)<br />

for the Catalan-English Pair. Based on 119 UTTs.<br />

SCORES FORM CONTENT<br />

Perfect 70.59% 31.93%<br />

OK+ 5.04% 15.12%<br />

OK 6.72% 9.25%<br />

OK- 9.25% 16.80%<br />

Unacceptable 8.40% 26.90%<br />

Table 2: Evaluation of End-to-End Translation (with ASR) for the Spanish-English Pair.<br />

Based on 84 UTTs.<br />

SCORES FORM CONTENT

Perfect 92.85% 71.42%

OK+ 4.77% 11.90%

OK 1.19% 7.14%

OK- 0% 5.96%

Unacceptable 1.19% 3.58%

1 It should be pointed out that the efforts to develop the ASR systems were focused on the Catalan and Spanish language models. The language model for the English ASR was used as is, when provided by the NESPOLE! partners. As a result, the English ASR was not as domain sensitive and, consequently, more error prone. The only work done was to enlarge its lexicon.

Table 3: Evaluation of End-to-End Translation (with ASR) for the English-Catalan Pair. Based<br />

on 117 UTTs.<br />

SCORES FORM CONTENT<br />

Perfect 64.96% 34.19%<br />

OK+ 15.39% 11.97%<br />

OK 8.54% 14.52%<br />

OK- 5.12% 12.82%<br />

Unacceptable 5.99% 26.50%<br />

Table 4: Evaluation of End-to-End Translation (with ASR) for the English-Spanish Pair.<br />

Based on 125 UTTs.<br />

SCORES FORM CONTENT<br />

Perfect 64.80% 17.60%<br />

OK+ 4.80% 10.40%<br />

OK 12.00% 18.40%<br />

OK- 8.80% 16.00%<br />

Unacceptable 9.60% 37.60%<br />

Table 5: Evaluation of Translation for Audio Transcription of the Catalan-English Pair. Based<br />

on 119 UTTs.<br />

SCORES FORM CONTENT<br />

Perfect 85.72% 73.10%<br />

OK+ 5.89% 13.45%<br />

OK 2.52% 4.20%<br />

OK- 4.20% 6.73%<br />

Unacceptable 1.69% 2.52%<br />



Table 6: Evaluation of Translation for Audio Transcription of the Spanish-English Pair. Based<br />

on 84 UTTs.<br />

SCORES FORM CONTENT<br />

Perfect 96.42% 91.66%<br />

OK+ 2.38% 3.57%<br />

OK 0% 0%<br />

OK- 0% 3.57%<br />

Unacceptable 1.20% 1.20%<br />

Table 7: Evaluation of Translation for Audio Transcription of the English-Catalan Pair. Based<br />

on 117 UTTs.<br />

SCORES FORM CONTENT<br />

Perfect 89.75% 88.89%<br />

OK+ 8.55% 1.70%<br />

OK 1.70% 0.85%<br />

OK- 0% 4.28%<br />

Unacceptable 0% 4.28%<br />

Table 8: Evaluation of Translation for Audio Transcription of the English-Spanish Pair. Based<br />

on 125 UTTs.<br />

SCORES FORM CONTENT<br />

Perfect 95.2% 82.4%<br />

OK+ 4% 7.2%<br />

OK 0.8% 3.2%<br />

OK- 0% 5.6%<br />

Unacceptable 0% 1.6%<br />

4. Interchange Format<br />

In this section, we discuss the use of the Interchange Format (IF) for Machine<br />

Translation, and then we examine some problems of applying IF to new languages such<br />

as Catalan and Spanish.<br />

4.1 Introduction to Interchange Format and Discussion<br />

IF is based on Searle’s Theory of Speech Acts (Searle 1969). It tries to represent<br />

the speaker’s intention rather than the meaning of the sentence per se. In the hotel<br />

reservation domain, there are several speech acts, such as giving information about<br />



a price, asking for information about a room type, verifying a reservation, and so<br />

on. Since domain concepts such as prices, room type and reservation are included in<br />

the representation of the act, in our interlingua, such speech acts are referred to as<br />

Domain Actions (DAs), and they are the type of actions discussed here. These DAs

are formed by different combinatory elements expressing the semantic information<br />

that needs to be communicated.<br />

Generally speaking, an IF representation has the following elements:<br />

Speaker’s Tag + DA + Arguments<br />

The Speaker’s Tag may be ‘a’ for the agent’s contributions, or ‘c’ for the client’s.

Inside the DA we find the following elements:<br />

• Speech Act: a compulsory element that can appear alone or followed by other<br />

elements. Examples of Speech-Acts include: give-information, negate, request-<br />

information, etc.<br />

• Attitude: an optional element that represents the attitude of the speaker when<br />

explicitly present. Some examples are: +disposition, +obligation, and so on.<br />

• Main Predication: a compulsory element that represents what is talked about.<br />

Examples of these elements are: +contain, +reservation, and so on; and,<br />

• Predication Participant: optional elements that represent the objects talked<br />

about, for instance, +room, +accommodation, and so on.<br />

The DA is followed by a list of arguments. These elements are expressed by<br />

argument-value pairs positioned inside a list and separated by a “,”.<br />

By way of example, an IF representation of the sentence in (1) contains all the<br />

elements mentioned above:<br />

(1) Would you like me to reserve a room for you?<br />

IF: a: request-information+disposition+reservation+room (for-whom=you,

who=i, disposition=(who=you, desire), room-spec=(quantity=1, room))

From this representation we know that the speaker is the agent and that he is<br />

asking for some information. The attitude expressed here is a desire of a second<br />

person singular, that is, the client. The main predication is to make a reservation and<br />

the predication participant is one room.<br />
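
To make the shape of such representations more concrete, the sketch below (an illustrative Python fragment, not part of the FAME implementation) assembles the IF string of example (1) from the elements just described; the argument list and its nesting are taken directly from (1).

    # Sketch: composing 'speaker: speech-act+concepts (arg=value, ...)' from its parts.
    def build_if(speaker, speech_act, concepts, arguments):
        domain_action = "+".join([speech_act] + concepts)
        args = ", ".join(f"{name}={value}" for name, value in arguments)
        return f"{speaker}: {domain_action} ({args})"

    # Example (1): "Would you like me to reserve a room for you?"
    print(build_if(
        speaker="a",                       # the agent is speaking
        speech_act="request-information",
        concepts=["disposition", "reservation", "room"],
        arguments=[
            ("for-whom", "you"),
            ("who", "i"),
            ("disposition", "(who=you, desire)"),
            ("room-spec", "(quantity=1, room)"),
        ],
    ))
    # a: request-information+disposition+reservation+room (for-whom=you, who=i,
    #    disposition=(who=you, desire), room-spec=(quantity=1, room))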



Interchange Format is heavily influenced by English, and this may cause problems<br />

when using it to represent Romance languages such as Catalan or Spanish. Most of<br />

these problems are solvable, however, and in general, IF works rather well to represent<br />

both languages. The following subsections describe six different issues that have been<br />

encountered when adapting the IF to Catalan and Spanish.<br />

4.2 Dialogue Context Ambiguity<br />

The meaning of an expression sometimes changes depending on the dialog context.<br />

That is to say, a unique expression can have different meanings depending on its place<br />

in the conversation. This is the case, for instance, of the Catalan expression digui’m.<br />

In Catalan, it has a different meaning depending on whether it is used when answering a

telephone call or when responding to a suggestion. This difference in meaning is seen<br />

here in examples (2) and (3):<br />

(2) 9-ENG-CLIENT: Shall I give you my Visa number then?<br />

10-CAT-AGENT: Digui’m.<br />

Go ahead.<br />

10-IF: a: request-action+proceed (who=you, communication-mode=phone)

(3) CAT-AGENT: Viatges Fame, digui’m?

Fame Travel. Hello?<br />

IF: a: introduce-self (who=name-viajes_fame)<br />

a: dialog-greet (who=you, to-whom=I, communication-mode=phone)<br />

In example (2), the agent uses the expression to indicate to the client that he is<br />

ready and that the client may proceed to give his visa number. However, in (3), the<br />

expression appears at the beginning of a conversation, as a kind of greeting indicating<br />

to the client that the agent is already listening to him.<br />

There is currently no way to represent dialog structure information within the<br />

interlingual formalism, and so only one of the translations (go ahead) is used as a<br />

default; but a solution would not be difficult to implement. The first step would be to<br />

represent various types of conversational contexts (opening, response-to-offer, etc.),<br />

and then to modify the analysis grammars to parse differently according to context. In<br />

this case, the analyser recognises that it is parsing (and thus interpreting) a dialogue-opening

segment (indicating one meaning, i.e., hello) or a post-offer-information

segment (indicating a different meaning, i.e., go ahead).<br />
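
The following sketch (illustrative Python only; in the real system this would live in the SOUP analysis grammars rather than in a lookup table) shows the intent of such a context-sensitive analysis; the context labels are the ones suggested above.

    # Sketch: pick an IF representation for an ambiguous expression according to
    # its position in the dialogue; the table below is purely illustrative.
    CONTEXT_SENSITIVE_IF = {
        ("digui'm", "dialogue-opening"):
            "a: dialog-greet (who=you, to-whom=I, communication-mode=phone)",
        ("digui'm", "response-to-offer"):
            "a: request-action+proceed (who=you, communication-mode=phone)",
    }

    def analyse(expression, dialogue_context):
        # Current behaviour: fall back to the single default reading ('go ahead').
        default = "a: request-action+proceed (who=you, communication-mode=phone)"
        return CONTEXT_SENSITIVE_IF.get((expression, dialogue_context), default)

    print(analyse("digui'm", "dialogue-opening"))    # greeting reading, cf. (3)
    print(analyse("digui'm", "response-to-offer"))   # 'go ahead' reading, cf. (2)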



4.3 Formality Feature<br />

In Catalan and Spanish, there is a distinction between formal and informal personal<br />

pronouns, especially for second person singular and plural. However, as the IF is<br />

influenced by English, this distinction is not reflected in this interlingua. In example<br />

(4), the verbal form ajudar-lo (to help you) implies a formal relationship between<br />

the speaker and singular addressee, while in (5) ajudar-te (to help you), the implied<br />

relationship is familiar.<br />

(4) CAT-AGENT: ¿En què puc ajudar-lo?

IF: a: offer+help (help=(who=i, to-whom=you))

(5) CAT-AGENT: ¿En què puc ajudar-te?

IF: a: offer+help (help=(who=i, to-whom=you))

But if we inspect the IF representations for both examples, we see that they are<br />

the same. This is due to the lack of a formality feature in this interlingua. This does<br />

not pose any problem when translating from Catalan/Spanish into English, as the

latter does not have any formal register; but it could cause a loss of meaning when<br />

translating from Catalan into Spanish or vice versa, for instance, or from either of<br />

these two languages into French, for example, which also makes a second person<br />

register distinction.<br />

To solve this problem of representing register, we can add a new argument-<br />

value pair to the IF with the argument [formal=] and the values (yes) or (no). When<br />

implementing this new feature, the IF representation for examples (4) and (5) would<br />

be (6) and (7), respectively.<br />

(6) IF: a: offer+help (help=(who=i, (to-whom=you, formal=yes)))

(7) IF: a: offer+help (help=(who=i, (to-whom=you, formal=no)))

Through the use of these new argument-value pairs, we would be able to<br />

communicate the feature of formality and have it available in the target language, if<br />

applicable.<br />
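
The sketch below (illustrative Python; the actual generation is done with GenKit grammars and lexica) shows the intended effect of the [formal=] feature: the generator selects the register-appropriate Catalan form from the same domain action.

    # Sketch: selecting a Catalan second-person form from the formality feature.
    # The lexical table is illustrative, not the real GenKit lexicon.
    CATALAN_HELP_FORM = {True: "ajudar-lo", False: "ajudar-te"}   # formal=yes / formal=no

    def generate_offer_help(formal):
        return f"¿En què puc {CATALAN_HELP_FORM[formal]}?"

    print(generate_offer_help(formal=True))    # ¿En què puc ajudar-lo?   cf. (4)/(6)
    print(generate_offer_help(formal=False))   # ¿En què puc ajudar-te?   cf. (5)/(7)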

4.4 Conceptual Gaps in Catalan and Spanish<br />

Another problem we had to overcome when developing the Catalan and Spanish<br />

grammars had to do with the lexicons. Since IF was developed with English as a point

of reference, there are IF values that refer to lexical items that do not exist in<br />

Catalan or Spanish per se. In essence, the semantic field is not divided equivalently<br />

between the languages. Sometimes, it is a word or an expression, such as Christmas<br />

crackers, that does not exist either in the Catalan or Spanish culture. When facing<br />

this problem we maintain the same English word, as there is no cultural equivalent.<br />



Sometimes the solution is not that straightforward, however, given that the word<br />

without equivalent in Catalan or Spanish is an important word in the dialogue. This is<br />

the case of king-size bed and queen-size bed, as shown in example (8). Both words are<br />

rather important within the hotel reservation domain we work in, and what’s more,<br />

the client is supposed to be an English speaker, so he would most definitely use them. As a

consequence, we could not adopt the solution proposed in the previous example, and<br />

we had to introduce phrasal equivalents based on already existing Catalan/Spanish<br />

words referring to bed types to cover those two values. The Catalan and Spanish<br />

equivalents would be un llit extragran and un llit gran, for Catalan, and una cama<br />

extragrande and una cama grande, for Spanish.<br />

(8) ENG-CLIENT: I would like a room with a king-size bed.<br />

IF: c: give-information+disposition+bed (disposition=(who=i, desire),

room-spec=(quantity=1, room, contain=(quantity=1, king-bed)))
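
In the generation lexica, this simply means that one IF value maps onto a language-specific phrase rather than a single word, roughly as sketched below (illustrative Python; the real lexica are GenKit resources, and the value name queen-bed is assumed by analogy with king-bed).

    # Sketch: phrasal target-language equivalents for IF values with no
    # one-word counterpart in Catalan or Spanish.
    GENERATION_LEXICON = {
        "king-bed":  {"eng": "a king-size bed",  "cat": "un llit extragran",
                      "spa": "una cama extragrande"},
        "queen-bed": {"eng": "a queen-size bed", "cat": "un llit gran",      # value name assumed
                      "spa": "una cama grande"},
    }

    def realise(if_value, language):
        return GENERATION_LEXICON[if_value][language]

    print(realise("king-bed", "cat"))   # un llit extragran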

4.5 Proper Nouns<br />

Currently, all proper names are included in the IF by the use of values. That is<br />

to say, each proper name is represented by a different value. In our domain, proper<br />

names are mainly person names, street names, city names, hotel names, names of<br />

monuments, museums and other attractions, and so on. For instance, the proper<br />

name Hotel Duc de la Victoria is represented in the IF under the class *barcelona-<br />

hotel-names* by the value [name-hotel_duc_de_la_victoria]. Although this is a good<br />

way to represent proper names when they are well known to the interlocutors, we<br />

should also point out that it implies a great effort on the developer’s part. Whenever<br />

a new proper name is added, it should be first included in the IF Specification files,<br />

and then both analysis and generation grammars of all languages have to be updated<br />

to include this new proper name. As a consequence, all developers have to be aware<br />

of the new values included in the IF specifications, especially those working on the<br />

analysis side. Otherwise, this proper name will not be analysed.<br />

Moreover, when developing our analysis grammars, we had to deal with the<br />

phenomena of bilingualism in Barcelona, and in Catalonia in general. In Barcelona,<br />

both Catalan and Spanish are spoken, and when using a proper name there is a certain<br />

degree of code mixing. As a result the name may be in Catalan, in Spanish, or in a<br />

mixture of both languages, as shown in (9). When including proper names in Catalan<br />

analysis grammars, we took into account all those forms: the Catalan name, the<br />

Spanish name and the hybrid version were added under the IF value representing the<br />

proper name, as shown in (10):<br />

(9) CAT: Carrer Pelai (Pelai Street)<br />



SPA: Calle Pelayo<br />

SPA/CAT: Calle Pelai<br />

(10) [name-carrer_pelai]<br />

(carrer pelai)<br />

(calle pelayo)<br />

(calle pelai)<br />

In any case, while this way of treating proper names is adequate, a good way to<br />

avoid the significant effort it entails would be to generate proper name representations<br />

automatically, or scrap proper name representations altogether and pass proper name<br />

forms directly to the target language as strings. Either way, the key is to deal with<br />

proper name translation independently from translation of other expressions.<br />
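
For instance, the analysis-side treatment of code-mixed variants shown in (9)-(10) can be pictured as the small table in the sketch below (illustrative Python, not the SOUP grammar format); the pass-through alternative would instead hand the recognised string to the target side unchanged.

    # Sketch: mapping code-mixed surface variants of a proper name to one IF value,
    # with string pass-through as the alternative strategy.
    PROPER_NAME_VARIANTS = {
        "carrer pelai": "[name-carrer_pelai]",
        "calle pelayo": "[name-carrer_pelai]",
        "calle pelai":  "[name-carrer_pelai]",
    }

    def analyse_proper_name(surface_form, pass_through=False):
        if pass_through:
            return surface_form            # keep the name as a string
        return PROPER_NAME_VARIANTS.get(surface_form.lower())

    print(analyse_proper_name("Calle Pelai"))                      # [name-carrer_pelai]
    print(analyse_proper_name("Calle Pelai", pass_through=True))   # Calle Pelai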

4.6 Catalan ‘de’ Partitive<br />

In Catalan there is a phenomenon called the de partitive. This construction is used

when a qualifying adjective has an elided head, or when it is used in construction<br />

with the pronoun en, as shown in example (11a). In sentence (11b),

there is a noun phrase llit extragran (king-size bed) formed by a head noun (llit – bed)<br />

and a qualifying adjective (extragran – king-size). In Catalan, this noun phrase can be<br />

transformed by eliding the head noun llit, inserting the pronoun en in its place and<br />

introducing the adjective extragran by the preposition de (in this case d’).<br />

(11a) CAT: En tenim un d’extragran

ENG: We have one in king-size.

(11b) CAT: Tenim un llit extragran

ENG: We have a king-size bed

At first sight, if we want to have an interlingua representation for (11a), it would<br />

be example (12). In this representation, we have an [object-spec=] argument that<br />

contains a subargument [size=] with the value (king-bed-size). The focus of the<br />

representation is on the size of the object without explicitly mentioning the type per<br />

se.<br />

(12) give-information+existence+object (provider=we, object-spec=<br />

(quantity=1, size=king-bed-size))<br />

However, (11a) actually continues to refer to a king-size bed. Furthermore, ideally,<br />

the interlingua should be language independent. It would not be fair to create a new<br />

value such as king-bed-size only for Catalan, especially since we could represent this

sentence through already existing values and arguments, as in example (13).<br />



(13) give-information+existence+object (provider=we, object-spec=<br />

(quantity=1, bed-spec=king-bed))<br />

In this representation, the segment un d’extragran is represented by the<br />

subargument [bed-spec=], used to represent the types of beds.<br />

4.7 Excess of Conceptual Granularity<br />

The Interchange Format is a formalism intended to express the meanings of<br />

different parts of a sentence. In some cases, however, the representation of this<br />

meaning is too specific with respect to one or another of the languages in the system.<br />

This is especially common in regard to representing modifiers such as adjectives.<br />

For example, (14) shows two different IF values that, with respect to Catalan and<br />

Spanish, can both be taken to mean the same thing: ‘old.’ The corresponding term

for both values is vell in Catalan and viejo in Spanish. The difference between the

English lexical counterparts has less to do with semantics as such than with

their distributional properties.<br />

(14) [ancient]<br />

[antique]<br />

In this case then, the IF values are too specific, since the meaning they convey<br />

could be included under one value. The solution is to introduce an IF class value<br />

for values [ancient] or [antique], and then to map the Catalan or Spanish lexeme<br />

to or from this class value. In translating from English to Catalan or Spanish, one<br />

simply moves from the particular value to the class value, since there are no possible<br />

equivalents associated with the particular values. When translating from Catalan or<br />

Spanish into English, the appropriate English lexeme is selected on the basis of the class

of the head element modified.
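
Schematically, the class-value solution can be pictured as below (illustrative Python; in the system itself this mapping lives in the analysis and generation grammars, and both the class name old and the head classes used here are assumed for the example).

    # Sketch: collapsing over-specific IF values into one class value and back.
    CLASS_VALUE = {"ancient": "old", "antique": "old"}        # class name 'old' assumed

    # Catalan / Spanish only need the class value.
    ROMANCE_LEXEME = {("old", "cat"): "vell", ("old", "spa"): "viejo"}

    # English re-selects a specific value from the class of the modified head noun
    # (the head classes used here are invented for illustration).
    ENGLISH_BY_HEAD_CLASS = {("old", "building"): "ancient", ("old", "furniture"): "antique"}

    def english_to_romance(if_value, language):
        return ROMANCE_LEXEME[(CLASS_VALUE[if_value], language)]

    def romance_to_english(class_value, head_class):
        return ENGLISH_BY_HEAD_CLASS[(class_value, head_class)]

    print(english_to_romance("antique", "cat"))    # vell
    print(romance_to_english("old", "building"))   # ancient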

5. Conclusions<br />

This presentation began with a brief description of the FAME Speech-to-Speech<br />

Machine Translation System, and the results of a user-oriented evaluation of the<br />

system for both voice (with ASR) and clean audio-transcription inputs. It was observed<br />

that the interlingual translation component performs very well when used on clean<br />

input and that, as expected, worsens in performance when used on spoken input.<br />

Nonetheless, we are satisfied with system performance, although we acknowledge that<br />

further work should be done, especially in order to improve the performance of the ASR component.

Next, Interchange Format (IF) was introduced, and we examined a number of<br />

language-particular problems that arose while applying IF to the representation of<br />



Catalan. In each case, we described the solutions we used, or propose to use, to<br />

overcome these problems, including improvements to IF that should widen its coverage<br />

and make it easier to be used by developers.<br />

In the future, we hope to continue to develop the system along three general lines:

• We would like to implement the changes and improvements proposed for<br />

IF, and see how they work and in which way they help to widen the coverage of the<br />

interlingua;<br />

• We would like to improve the ASR component of our translation system,<br />

and try to find solutions to overcome possible problems due to spontaneous speech<br />

and disfluencies; and,<br />

• We also expect to extend the coverage of our grammars and lexica,<br />

not only to other areas of the travel domain, but also to other domains such as<br />

medicine.<br />

6. Acknowledgments<br />

This research has been partially financed by the FAME (IST-2001-28323) and ALIADO<br />

(TIC2002-04447-C02) projects. We would especially like to thank Climent Nadeu and<br />

Jaume Padrell, for all their help and support in numerous aspects of the project.<br />

We are also grateful to other UPC colleagues, such as José B. Mariño and Adrià de

Gispert, and to our colleagues at CMU, Dorcas Alexander, Donna Gates, Lori Levin, Kay<br />

Peterson and Alex Waibel, for all their feedback and assistance.<br />

References

“C-STAR.” Online at http://www.c-star.org.

“FAME.” Online at http://isl.ira.uka.de/fame.

Gavaldà, M. (2000). “SOUP: A Parser for Real-world Spontaneous Speech.” Proceedings<br />

of the 6th International Workshop on Parsing Technologies (IWPT-2000), Trento,<br />

Italy.<br />

Holzapfel, H. et al. (2003). FAME Deliverable D3.1: Testbed Software, Middleware<br />

and Communication Architecture.<br />

Lavie, A. et al. (2002). “A Multi-Perspective Evaluation of the NESPOLE! Speech-to-<br />

Speech Translation System.” Proceedings of ACL-2002 Workshop on Speech-to-Speech<br />

Translation: Algorithms and Systems. Philadelphia, PA, 121-128.<br />

Levin, L. et al. (2002). “Balancing Expressiveness and Simplicity in an Interlingua<br />

for Task based Dialogue.” Proceedings of ACL-2002 Workshop on Speech-to-Speech<br />

Translation: Algorithms and Systems. Philadelphia, PA, 53-60.<br />

Metze, F. et al. (2002). “The NESPOLE! Speech-to-Speech Translation System.”<br />

Proceedings of HLT-2002, San Diego, California.<br />

Searle, J. (1969). Speech Acts: An Essay in the Philosophy of Language. Cambridge,<br />

UK: Cambridge University Press.<br />

Taddei, L. et al. (2003). NESPOLE! Deliverable D17: Second Showcase Documentation.<br />

http://nespole.itc.it.<br />

“The Open Agent Architecture™.” Online at http://www.ai.sri.com/~oaa.

Tomita, M. & Nyberg, E.H. (1988). “Generation Kit and Transformation Kit, Version 3.2,


User’s Manual.” Technical Report CMU-CMT-88-MEMO, Center for Machine Translation,<br />

Carnegie Mellon University, Pittsburgh, PA.<br />

Woszczyna, M. et al. (1993). “Recent Advances in JANUS: A Speech Translation System.”<br />

Proceedings of Eurospeech-1993, Berlin.<br />



Computing Non-Concatenative<br />

Morphology: The Case of Georgian 1<br />


Olga Gurevich<br />

Georgian (Kartvelian) is a less commonly studied language, with a complex, non-<br />

concatenative verbal morphology. This paper examines characteristics of Georgian<br />

that make it a challenge for language learners and for current approaches to<br />

computational morphology. We present a computational model for generation and<br />

recognition of Georgian verb conjugations, and describe one practical application of<br />

the model to help language learners.<br />

1. Introduction<br />

Georgian (Kartvelian) is the official language of the Republic of Georgia and has

about 4 million native speakers. Georgian morphology is largely synthetic, with<br />

complex verb forms that can often express the meaning of a whole sentence. Georgian<br />

has sometimes been called agglutinative (Hewitt 1995), but such a classification does

not fully describe the complexity of the language.<br />

Descriptions of Georgian verbal morphology emphasise the large number of<br />

inflectional categories, the large number of elements that a verb form can contain,<br />

the dependencies between the occurrence of various elements, and the large number<br />

of regular, semi-regular, and irregular patterns of formation of verb inflections. All of<br />

these factors make computational modeling of Georgian morphology a rather daunting<br />

task. To date, no successful large-scale models of parsing or generation of Georgian<br />

are available.<br />

In this paper, I propose a computational model for parsing and generation of a<br />

subset of Georgian verbal morphology that relies on a templatic, word-based analysis<br />

of the verbal system, rather than assuming compositional rules for combining<br />

individual morphemes. I argue that such a model is viable, extensible, and capable of<br />

capturing the generalisations inherent in Georgian verbal morphology at various levels<br />

of regularity.<br />

1 This research was in part supported by the Berkeley Language Center. Thanks to Mark Kaiser, Claire<br />

Kramsch, Lisa Little, Nikolas Euba, David Malinowski, and Sarah Roberts for many hours of productive<br />

discussion and wonderful suggestions, and to Aaron Siegel for technical support. I am eternally grateful<br />

to Vakhtang Chikovani and Shorena Kurtsikidze for help in creating the website, and for introducing me<br />

to Georgian. I alone am to blame for any errors and omissions.


I begin with a brief overview of Georgian verbal morphology, emphasising the<br />

factors that complicate its computational modelling. I present an analysis grounded<br />

in word-based approaches to morphology and Construction Grammar, and suggest that<br />

this type of analysis lends itself more easily to computational implementations than<br />

analyses that assume morpheme-based compositionality. Following a brief overview<br />

of existing approaches to computational morphology, I propose a model for Georgian<br />

and describe it in detail. The model is currently implemented as a cascade of finite-<br />

state transducers (Beesley & Karttunen 2003), but probabilistic and connectionist<br />

extensions or alternative implementations are plausible. Finally, I describe a practical<br />

application of this model for language learning: an online database of Georgian verb<br />

conjugations.<br />

2. An Overview of Georgian Verbal Morphology<br />

The morphosyntax of Georgian verbs is characterised by a variety of lexical<br />

(irregular), semi-regular, and completely regular patterns. The verb forms themselves<br />

are made up of several kinds of morphological elements that recur in different<br />

formations. These elements can be formally identified in a fairly straightforward<br />

fashion; however, their function and distribution defy a simple compositional analysis,<br />

but instead are determined by the larger morphosyntactic and semantic contexts in<br />

which the verbs appear (usually tense, aspect, and mood) and by the lexical properties<br />

of the verbs themselves. The combination of morphosyntactic and lexical factors also<br />

determines the case marking on the verb’s arguments.<br />

The specific types of morphological elements and peculiarities in their function<br />

and distribution are described below. The main point of this section is that a language<br />

learner and a computational model are faced with patterns in which formal elements<br />

(morphs) do not have identifiable, context-independent meanings that can be combined<br />

compositionally to form whole words. Rather, they must contend with a variety of<br />

patterns at various degrees of regularity. In computational terms, this amounts to a<br />

series of rules of varying specificity, backed up by defaults.<br />

The linguistic analysis at the core of the computational model splits Georgian<br />

verbs into several lexical classes. The lexical classes are described on the basis of<br />

example paradigms, using frequent verbs belonging to each class. This is in contrast<br />

to a more rule-oriented description in which lexical classes may be identified by some<br />

morphological or syntactic feature. In the rest of this section, I argue that an example-<br />



based description is the only one plausible for learners of Georgian, and provides a

good basis for computational modeling as well.<br />

2.1 Series and Screeves<br />

Georgian verbs inflect in tense / mood / aspect (TAM) paradigms called screeves<br />

(from mck’rivi ‘row’). There are a total of eleven screeves in Modern Georgian,<br />

although only ten are actively used. Screeves can be grouped into three series based<br />

on morphological and syntactic commonalities, as in Table 1:<br />

Table 1 – Series and Screeves

Series I (Present sub-series) | Series I (Future sub-series) | Series II (aorist) | Series III (perfect)

Present | Future | Aorist | Perfect

Imperfect | Conditional | -- | Pluperfect

Present subjunctive | Future subjunctive | Aorist subjunctive | (Perf. subj.) *

Knowing the series and screeve of a verb form is essential for being able to conjugate<br />

it. Screeve formation exhibits a number of lexical, semi-regular, and regular patterns,<br />

some of which are examined below.<br />

Georgian verbs are often divided into four conjugation classes, based mostly on<br />

valency (cf. Harris 1981). For now, I will concentrate on transitive verbs; it will be<br />

necessary to mention the other classes (unergative, unaccusative, and indirect) in the<br />

discussion of case-marking below. The structure of a verb form can be described using<br />

the following (simplified) template:<br />

(Preverb)-(Pron1)-(PRV)-root-(TS)-(Scr)-(Pron2) 2

The approximate function of each element is as follows:<br />

• Preverb – marks aspectual distinctions, lexically associated with each<br />

verb (similar to verbal prefixes in Slavic or German).<br />

• Pron1 – Prefixal pronominal agreement slot.<br />

• PRV – pre-radical vowel slot, serves a variety of functions in different<br />

contexts.<br />

• Root – the only required part of the verb form.<br />

• TS – Thematic Suffix. Participates in the formation of several tenses,<br />

predicts certain inflectional properties of the verb.<br />

* The Perfect Subjunctive is almost never used in contemporary Georgian.<br />

2 Cf. Hewitt 1995.<br />



• Scr – Screeve marker. This is a screeve (tense) ending which may depend<br />

on verb class and agreement properties.<br />

• Pron2 – suffixal agreement slot.<br />

The preverb, root, and thematic suffix must be lexically specified in all cases,<br />

although their distribution follows a somewhat regular pattern described in the next<br />

section. Other elements in the template are distributed according to more or less<br />

regular principles, although some lexical exceptions do exist.<br />
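
Read naively, the template amounts to a slot-filling procedure like the one sketched below (illustrative Python; the actual model is implemented with xfst, as described in section 4, and the slot assignment of the endings shown is approximate).

    # Sketch: filling the (Preverb)-(Pron1)-(PRV)-root-(TS)-(Scr)-(Pron2) template.
    # Only the root is obligatory; every other slot may be empty.
    def fill_template(root, preverb="", pron1="", prv="", ts="", scr="", pron2=""):
        return "".join([preverb, pron1, prv, root, ts, scr, pron2])

    print(fill_template(root="xat'", ts="-av"))                  # xat'-av     (Present)
    print(fill_template(root="xat'", preverb="da-", ts="-av"))   # da-xat'-av  (Future)
    print(fill_template(root="xat'", ts="-av", scr="-di"))       # xat'-av-di  (Imperfect)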

The templatic composition of the Georgian verb forms suggests, at first blush, an<br />

agglutinative structure. However, a closer examination of the morphological elements<br />

in the verbal template and their function provides evidence against such an analysis. In<br />

particular, the morphological elements do not have identifiable meanings independent<br />

of context, and their meanings do not compositionally comprise the meanings of the<br />

words in which they participate. As argued in Gurevich (2003), the morphological<br />

elements of Georgian cannot be thought of as morphemes, or smallest meaningful<br />

elements of form. Rather, word-level constructions determine both the meaning of<br />

the whole word, and the collection of morphological elements that comprise the word.<br />

This combination of templatic morphological structure and non-compositional meaning<br />

construction makes Georgian inflectional morphology look non-concatenative.<br />

As an illustration, let us examine the formation of the verb xat’va ‘paint’ in<br />

Table 2. The screeves (and, more generally, series) govern the distribution of the<br />

morphological elements.<br />

Table 2: Screeves of xat’va ‘paint’

Series | Screeve | 2SgSubj, 3Obj form

I (Pres. subseries) | Present | xat’-av ‘You paint’

I (Pres. subseries) | Imperfect | xat’-av-di ‘You were painting’

I (Pres. subseries) | Pres. Subj. | xat’-av-de ‘You should paint’

I (Fut. subseries) | Future | da-xat’-av ‘You will paint’

I (Fut. subseries) | Conditional | da-xat’-av-di ‘You would paint’

I (Fut. subseries) | Fut. Subj. | da-xat’-av-de ‘If you could paint’

II | Aorist | da-xat’-e ‘You painted’

II | Aor. Subj. | da-xat’-o ‘You have to paint’

III | Perfect | da-g-i-xat’-avs ‘You have painted’

III | Pluperfect | da-g-e-xat’-a ‘You should have painted’

In addition to the multitude of morphological elements in any given verb form, the<br />

distribution and lexical dependency of the elements make a learner’s task difficult.

Preverbs, thematic suffixes and screeve endings present particular difficulties.<br />



The preverbs form a closed class of about eight. A preverb (da- for the verb ‘paint’)<br />

appears on forms from the Future subgroup of series I, and on all forms of series II<br />

and III in transitive verbs. The preverbs are by origin spatial prefixes that now mark<br />

perfective aspect. However, the presence of a preverb on a verb form signals more<br />

than just a change in aspect. For example, the preverb differentiates the Conditional<br />

from the Imperfect, and the meaning of the two screeves differs in more than aspect.<br />

An additional difficulty is in the lexical connection between prefixes and verb roots,<br />

similar to the verbal prefixes in Slavic or German. Table 3 demonstrates some of<br />

the lexically-dependent morphological elements, including several different preverbs<br />

(row ‘Future’).<br />

Similarly, thematic suffixes (otherwise known as screeve suffixes or screeve<br />

formants) form a closed class and are lexically associated with verb roots. In general,<br />

thematic suffixes do not appear to have independent meaning. Rather, they serve to<br />

mark the inflectional class of the verb, because they determine certain patterns of<br />

inflectional behavior in different screeves.<br />

On transitive verbs, thematic suffixes appear in all series I forms. Their behavior<br />

in other series differs by individual suffix: in series II, most suffixes disappear, though<br />

some seem to leave partial ‘traces’. In series III, all suffixes except –av/-am disappear<br />

in the Perfect screeve; and in Pluperfect, all suffixes disappear, but the inflectional<br />

ending that takes their place does depend on the original suffix (rows ‘Present’ and<br />

‘Perfect’ in Table 3).<br />

The next source of semi-regular patterns comes from the inflectional endings in<br />

the individual screeves and the corresponding changes in some verb roots (row ‘Aorist’<br />

in Table 3).<br />

Finally, another verb form relevant for learners is the masdar, or verbal noun,<br />

which is the closest substitute of the infinitive in Georgian. The masdar may or may<br />

not include the preverb and/or some variation of the thematic suffix (last row in Table<br />

3). The formation of the masdar is particularly important, as it is the reference form<br />

listed in most Georgian dictionaries, even though it might not even start with the<br />

same letter as an inflected verb form.<br />

Table 3: Lexical Variation

Form | ‘Bring’ | ‘Paint’ | ‘Eat’

Present | igh-eb-s | xat’-av-s | ch’-am-s

Future | c’amo-ighebs | da-xat’avs | she-ch’ams

Aorist, 3Sg Subject | c’amoigh-o | daxat’-a | shech’am-a

Perfect | c’amough-ia | dauxat’-avs | sheuch’am-ia

Masdar (verbal noun) | c’amo-gh-eba | da-xat’-va | ch’-am-a



In many cases, the inflectional endings and root changes can be determined if we<br />

know the thematic suffix of the verb (cf. the painstakingly detailed description of such<br />

patterns in Hewitt 1995). However, there are exceptions to most such connections,<br />

and learning the patterns based on explicit rules seems virtually impossible.<br />

On the other hand, screeve formation in some instances presents amazing<br />

regularity. Thus, the Imperfect and First Subjunctive screeves are regularly formed<br />

from the Present. Similarly, the Conditional and Future Subjunctive are formed from<br />

the Future. And for most (though not all) transitive verbs, the Future is formed from<br />

the Present via the addition of a preverb.<br />

Additionally, the number of possible combinations of inflectional endings, root<br />

changes and other irregularities is also finite, and some choices tend to predict other<br />

choices in the paradigm of a given verb (e.g. the selection of thematic suffix or Aorist<br />

2Sg Subj ending often predicts the Aorist Subjunctive ending). Although the rule-<br />

based analysis is unproductive, Georgian verbs can be classified according to several<br />

example paradigms, or inflectional (lexical) classes. This is similar to the inflectional<br />

class distinctions made in Standard European languages; the major difference is that<br />

the number of classes is much greater in Georgian than in other languages. One such<br />

classification is presented in Melikishvili (2001), distinguishing seventeen inflectional<br />

classes for transitive verbs alone, and over sixty classes overall. While the exact<br />

number of inflectional classes is still in question (see the discussion in section 4.4),<br />

the general example-based approach seems the only one viable for Georgian.<br />

The next section deals with subject and object agreement, a completely regular<br />

yet non-concatenative phenomenon.<br />

2.2 Subject and Object Agreement<br />

A Georgian verb can mark agreement with both its subject and its object via a<br />

combination of prefixal and suffixal agreement markers, as in Table 4:<br />

Table 4: Agreement in Present

Subj \ OBJECT | 1SG | 1PL | 2SG | 2PL | 3

1SG | -- | -- | g-xat’av | g-xat’av-t | v-xat’av

1PL | -- | -- | g-xat’av-t | g-xat’av-t | v-xat’av-t

2SG | m-xat’av | gv-xat’av | -- | -- | xat’av

2PL | m-xat’av-t | gv-xat’av-t | -- | -- | xat’av-t

3SG | m-xat’av-s | gv-xat’av-s | g-xat’av-s | g-xat’av-t | xat’av-s

3PL | m-xat’av-en | gv-xat’av-en | g-xat’av-en | g-xat’av-en | xat’av-en

The distribution and order of attachment of agreement affixes has been the subject<br />

of much discussion in theoretical morphological literature (Anderson 1992; Halle &<br />

Marantz 1994; and Stump 2001). To simplify matters for the computational model, I<br />

assume here that the prefixal and suffixal markers attach to the verb stem at the same<br />

time, and indicate the combined subject and object properties of a paradigm cell.<br />

While the prefixal markers and the suffix –t appear in all screeves, the suffixes<br />

in 3Sg and 3Pl Subject forms are screeve-dependent (cf. row ‘Aorist’ in Table 3).<br />

These suffixes therefore belong to the semi-regular patterns, while the rest of the<br />

agreement system is completely regular.<br />
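
The regular core of this system can be sketched as a prefix-suffix pair chosen jointly for each subject/object combination, as below (illustrative Python covering only a few cells of Table 4; the real model states these patterns as xfst rules).

    # Sketch: combined subject/object agreement realised as a prefix and a suffix
    # attached to the stem at the same time (a few cells of Table 4 only).
    AGREEMENT = {
        # (subject, object): (prefix, suffix)
        ("1SG", "3"):   ("v-",  ""),
        ("1PL", "3"):   ("v-",  "-t"),
        ("2SG", "3"):   ("",    ""),
        ("3SG", "1SG"): ("m-",  "-s"),
        ("3PL", "2PL"): ("g-",  "-en"),
    }

    def agree(stem, subject, obj):
        prefix, suffix = AGREEMENT[(subject, obj)]
        return f"{prefix}{stem}{suffix}"

    print(agree("xat'av", "1SG", "3"))      # v-xat'av
    print(agree("xat'av", "3SG", "1SG"))    # m-xat'av-s
    print(agree("xat'av", "3PL", "2PL"))    # g-xat'av-en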

Another difficulty arises in series III for transitive verbs. Here, the subject and<br />

object agreement appears to be the inverse of that in series I and II (Table 5; notice<br />

the different designation of rows and columns). This phenomenon, called inversion,<br />

corresponds to a reverse case marking of the nominal arguments (see next section).<br />

Several analyses have been proposed suggesting that, in inversion, the semantic subject<br />

corresponds to a ‘surface’ indirect object, and the semantic object corresponds to a<br />

‘surface’ subject (Harris 1981). However, a simple difference in linking does not fully<br />

explain the paradigm composition. In inverted paradigms, plural number agreement<br />

is still sensitive to the semantic arguments (namely, the semantic subject / agent<br />

triggers plural agreement regardless of other agreement or case-marking facts).<br />

Table 5: Agreement in Perfect

Object \ Subject | 1SG | 1PL | 2SG | 2PL | 3

1SG | -- | -- | g-xat’av | g-xat’av-t | v-xat’av

1PL | -- | -- | g-xat’av-t | g-xat’av-t | v-xat’av-t

2SG | m-xat’av | gv-xat’av | -- | -- | xat’av

2PL | m-xat’av-t | gv-xat’av-t | -- | -- | xat’av-t

3SG | m-xat’av-s | gv-xat’av-s | g-xat’av-s | g-xat’av-t | xat’av-s

3PL | m-xat’av-en | gv-xat’av-en | g-xat’av-en | g-xat’av-en | xat’av-en

2.3 Subject and Object Case Marking<br />

Case marking of nominal arguments in Georgian is not constant, but depends on<br />

the conjugation (valency) class of the verb and the series / screeve of the verb forms.<br />

Transitive verbs can follow one of three patterns, depending on series:<br />

(1) k’ac-i dzaγl-s xat’avs<br />

man-NOM dog-DAT paint.Pres.3SgSubj


“The man paints / is painting the dog.” (Series I, Present – Pattern A)<br />

(2) k’ac-ma dzaγl-i daxat’a<br />

man-ERG dog-NOM paint.Aor.3SgSubj<br />

“The man painted the dog.” (Series II, Aorist – Pattern B)<br />

(3) k’ac-s dzaγl-i t’urme dauxat’avs<br />

man-DAT dog-NOM apparently paint.Perf.3SgSubj<br />

“The man has painted the dog.” (Series III, Perfect – Pattern C)<br />

Table 6 demonstrates the case-marking patterns by series for all four conjugation<br />

classes. Only transitive and unergative (active intransitive) verbs show variability by<br />

series. Unaccusative verbs always follow Pattern A (similar to the standard nominative/<br />

accusative pattern in European languages), and indirect verbs always follow Pattern<br />

C (the inverse pattern). In order to assign correct case marking, a learner of Georgian<br />

must recognise the conjugation class of each verb, as well as the series / screeve for<br />

some of the verb classes.<br />

Table 6 – Case-Marking Patterns

Series | Transitive | Unaccusative | Unergative | Indirect

I | A | A | A | C

II | B | A | B | C

III | C | A | C | C
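
Table 6 is in effect a small lookup table, and could be encoded directly as in the sketch below (illustrative Python, not part of the model described in section 4); the case assignments per pattern follow examples (1)-(3).

    # Sketch: case-marking pattern as a function of conjugation class and series,
    # following Table 6; subject/object cases per pattern follow examples (1)-(3).
    CASE_PATTERN = {
        ("transitive", "I"):   "A", ("transitive", "II"):   "B", ("transitive", "III"):   "C",
        ("unaccusative", "I"): "A", ("unaccusative", "II"): "A", ("unaccusative", "III"): "A",
        ("unergative", "I"):   "A", ("unergative", "II"):   "B", ("unergative", "III"):   "C",
        ("indirect", "I"):     "C", ("indirect", "II"):     "C", ("indirect", "III"):     "C",
    }
    SUBJECT_OBJECT_CASE = {"A": ("NOM", "DAT"), "B": ("ERG", "NOM"), "C": ("DAT", "NOM")}

    def case_marking(conjugation_class, series):
        return SUBJECT_OBJECT_CASE[CASE_PATTERN[(conjugation_class, series)]]

    print(case_marking("transitive", "II"))   # ('ERG', 'NOM')   cf. example (2)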

2.4 Summary<br />

The formation of the screeves exhibits several irregular, semi-regular, and regular<br />

patterns. The morphological elements in the Georgian verb template are easy to<br />

identify, suggesting an agglutinative structure. However, closer inspection reveals<br />

that the morphological elements may not have easily identifiable meanings or<br />

functions (cf. preverbs, thematic suffixes, and screeve endings). Moreover, even if<br />

we manage to find meanings for these elements, the meanings will not predict the<br />

distribution of such elements across different verbs, verb types, and screeves. Such<br />

non-compositionality in meaning makes Georgian more similar to morphologically<br />

non-concatenative languages such as Arabic and Hebrew.<br />

On the basis of the data above, it is argued in (Gurevich 2003) and (Blevins<br />

forthcoming 2006) that a word-based morphological theory is more appropriate for<br />

Georgian. In such a theory, word formation is determined by whole-word patterns,<br />

such that the whole word carries morphosyntactic properties, and they need not be<br />

assigned to individual morphemes. Gurevich (forthcoming 2006) suggests that such<br />



patterns may be represented as constructions, or form-meaning pairings in which the<br />

elements of form need not match the elements of meaning one-to-one. The analysis is<br />

based on insights of Construction Grammar (Fillmore 1988; Goldberg 1995). It is argued<br />

that the main organising unit and the best level for morphosyntactic constructions in<br />

Georgian is the series. The series provides a base for expressing the more or less<br />

regular patterns of Georgian morphosyntax. The less regular and more lexicalised<br />

information, on the other hand, is best expressed using inflectional (lexical) classes<br />

of verbs.<br />

3. Approaches to Computational Morphology<br />

3.1 Standard Assumptions and Difficulties Presented by Georgian<br />

Many contemporary approaches to computational morphology are based on, or can<br />

be easily translated into, finite-state networks (FSN). In such approaches, an arc in the<br />

FSN often corresponds to a phoneme or morpheme, and the recognition or generation<br />

of each arc advances the state in the network. Many approaches, including Beesley &<br />

Karttunen (2003), are implemented as two-way finite-state transducers (FST) in which<br />

each arc corresponds to a mapping of two elements, for example, a phoneme and its<br />

phonetic realisation, or a morpheme and its meaning. As a result, FST morphology<br />

very often assumes morpheme-level compositionality, the idea that the meaning of<br />

a word is compositionally made up from the meanings of its constituent morphemes.<br />

FST morphology has, for the most part, been applied to concatenative morphological<br />

systems like Finnish, although there have been some recent applications to templatic<br />

morphology such as Arabic (Beesley & Karttunen 2003).<br />

As demonstrated in the previous section, assumptions of morphemic compositionality<br />

do not serve well to describe the verbal morphology of Georgian. The Georgian verb<br />

forms are made up of identifiable morphological elements (i.e., elements of form), but<br />

the meaning of these elements is not easily identifiable, and does not stay constant in<br />

different morphosyntactic contexts.<br />

A computational system appropriate for Georgian should be able to accommodate<br />

the templatic nature of Georgian verb forms and its patterns of regularity and sub-<br />

regularity. Overall it should be able to describe the following:<br />

• Meaning carried by a whole word form rather than by individual<br />

morphemes;<br />

• Lexical root alternations and suppletion;<br />



• Lexical class-dependent screeve formation (e.g. the endings in the<br />

Aorist);<br />

• The dependency between the formation of some screeves from that of<br />

others (e.g. the Imperfect from the Present); and<br />

• The multiple exponence of agreement, that is, the use of suffixes and<br />

prefixes simultaneously, and the simultaneous expression of subject and object<br />

agreement.<br />

The linguistic analysis of Georgian verbal morphology suggested in the previous<br />

section relies on insights from Construction Grammar. Unfortunately, there are<br />

currently no computational implementations of CG capable of handling complex<br />

morphological systems. Bryant (2003) describes a constructional syntactic parser,<br />

based on general principles of chart parsing. However, this parser cannot yet handle<br />

morphological segmentation, and adapting it for Georgian would require substantial<br />

revision.<br />

Fortunately, FST tools for computational morphology have advanced to the point<br />

where they can handle some aspects of non-concatenative morphology. The next<br />

section briefly describes the approach in Beesley & Karttunen (2003) and what makes it<br />

a possible candidate for modelling at least a subset of Georgian verbal morphology.<br />

3.2 Xerox Finite-State Morphology Tools<br />

Beesley & Karttunen (2003) present the state-of-the-art set of tools for creating<br />

finite-state morphological models. The book is accompanied by implementations of<br />

the two Xerox languages: xfst (designed for general finite-state manipulations) and<br />

lexc (designed more specifically for defining lexicons). Since our goal was to reproduce<br />

morphotactic rules of word formation rather than the structure of the lexicon, xfst<br />

was used.<br />

Xfst provides all of the basic commands for building up single or two-level finite-<br />

state networks (i.e., transducers), such as concatenation, intersection, and so forth.<br />

In addition, xfst has several built-in shortcuts that make network manipulation<br />

easier, such as various substitution commands. Xfst distinguishes between words of a<br />

natural language (composed of single characters) and multi-character symbols, used<br />

in our model to indicate morphosyntactic properties such as person or number. Each<br />

completed arc in a finite-state network compiled using xfst represents a mapping<br />

between a set of morphosyntactic and semantic properties (on the upper side) and a<br />

full word form that realises those properties (on the lower side).<br />



Another very useful feature of xfst is the ability to create scripts with several<br />

commands in a sequence. The later commands can operate on the output of earlier<br />

commands, and can thus create a cascade of finite-state transducers. Xfst also provides<br />

convenient ways of outputting all the words recognised by a given transducer, which<br />

proved very useful in the creation of the online reference (see section 5). An updated<br />

version of xfst (Beesley & Karttunen forthcoming 2006) also includes support for UTF-8.

While finite-state technology is very good at generating and recognising regular languages, it has a harder time capturing other features of natural language, such as non-concatenative morphological structure. The next section describes some adaptations that allow FST to handle many of the non-concatenative patterns in Georgian.

In addition, FST is not designed to represent a dynamic, living mental lexicon of an actual speaker. It does not provide any mechanisms for probabilistic decisions, or for the recognition and generation of novel inflectional forms. The concluding section discusses some possible future developments in this area.

4. Computational Model of the Georgian Verb

4.1 General Idea

As argued above, Georgian verb morphology can be described as a series of patterns at various levels of regularity. Most of the patterns specify particular morphosyntactic or semantic properties of verb forms and the corresponding combinations of elements in the morphological templates. In the model proposed here, screeve formation is viewed as lexical or semi-regular, and pronominal agreement is viewed as completely regular.

Screeve formation differs considerably across the conjugation classes (transitive, unergative, unaccusative, and inverse) in Georgian, and so each conjugation class is implemented as a separate network. Nevertheless, the principles for composing each network are the same.

The model is implemented as a cascade of finite-state transducers, that is, as several levels of FST networks such that the result of composing a lower-level network serves as input to a higher-level network. The levels correspond to the division of templatic patterns into completely lexical (Level 1) and semi-regular (Level 2). Level 3 contains completely regular patterns that apply to the results of both Level 1 and Level 2. The result of compiling Level 3 patterns is the full set of conjugations for the verbs whose lexical information is included in Level 1. The FST model can be used both for the generation of verbal inflections and for the recognition of complete forms.
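As a rough sketch of the cascading idea (in Python rather than xfst, with deliberately abstract placeholder tags and forms, not the model's actual data), each level can be thought of as a function that takes the (analysis, form) pairs produced so far and returns an enlarged set, so that Level 1 output feeds Level 2, and Level 2 output feeds Level 3:

def level1():
    # Level 1: lexically listed key forms of an example verb (placeholder data).
    return {("V+Fut", "FUT-FORM")}

def level2(pairs):
    # Level 2: semi-regular patterns, e.g. a Conditional built on the Future.
    conditional = {(a.replace("+Fut", "+Cond"), f + "-da")
                   for a, f in pairs if "+Fut" in a}
    return pairs | conditional

def level3(pairs):
    # Level 3: completely regular agreement applied to Level 1 and Level 2 output.
    return {(a + "+3SgSubj", f + "-AGR") for a, f in pairs}

print(sorted(level3(level2(level1()))))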

In general, the most specific or irregular information is contained at the lower levels. The higher levels, by contrast, contain defaults that apply if there is no more specific information. The verbs explicitly mentioned in the lexical level (Level 1) are representative examples of lexical classes, as posited by the linguistic analysis in section 2. Through the use of diacritics and replacement algorithms, other verbs are matched to their lexical classes and are included in the resulting network.

The main advantage of this implementation is in the separation of lexical, or irregular, verb formation patterns from the semi-regular or completely regular patterns. The initial input to the FST cascade includes only the necessary lexical information about each verb and verb class; the computational model does the rest of the work.

The model described here served as the basis for an online reference on Georgian verb conjugation, described in section 5. This practical application underlies some of the specific choices in implementing the model.

The current implementation of the model focuses on transitive verbs; however, there are obvious ways of extending the model to apply to other verb classes.

4.2 Level 1: The Lexicon

The first level of the FST model contains lexically specific information. There are two separate networks. The first network contains information about the gloss and masdar or the verb stem.

The second network contains several complete word forms for each verb stem, providing all the lexically specific information needed to infer the rest of the inflections. For the most regular verbs, these are:

• Present screeve, no overt agreement (corresponds to 2Sg Subject, 3Sg Object);
• Future screeve, no overt agreement;
• Aorist screeve, no overt agreement;
• Aorist, 3Sg Subject, no overt object agreement; and
• Aorist Subjunctive.

Some verbs need additional forms in order to describe their paradigms:

• Present screeve, 3Pl Subject (most verbs have the ending -en, but some end in -ian); and
• Perfect screeve.


The inflected forms are represented as two-level finite-state arcs, with the verb stem and morphosyntactic properties on the upper side, and the inflected word on the lower side, as in Figure 1. The purpose of the stem is to uniquely identify each verb. Verb roots in Georgian are often very short and ambiguous; therefore a combination of the verb root plus thematic suffix was used. In some cases, even this combination is insufficient to identify the verb uniquely; in such cases, the preverb may be necessary as well. It is only important that the verb stem can be uniquely matched in the network containing glosses; thus, the stem has no theoretical significance in this model.

Another challenge is posed by the non-concatenative nature of verb agreement. Recall from section 2 that verb agreement is realised by a pre-stem affix and a final suffix. Since many of the word forms in Level 1 contain preverbs, the agreement affix would need to be infixed into the verb form at a later level. Beesley & Karttunen provide some fairly complex mechanisms for doing infixation in FST; however, the fixed position of the agreement affixes in the Georgian verb template allows for a much simpler solution. The forms on Level 1 contain a placeholder "+Agr1" for the prefixal agreement marker (Figure 1), which is replaced by the appropriate marker in the later levels.

The Level 1 network is produced via scripts from a table of verb forms containing only the necessary lexical information. Redundancy in human input is thus minimised.
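A minimal sketch of what such a Level 1 table might look like (given here in Python for readability; the transliterations are illustrative stand-ins, and the real model stores these entries as xfst networks rather than a dictionary):

# Level 1 entries for one example verb (illustrative transliterations only).
# Upper side: stem plus screeve label; lower side: a listed word form in which
# the prefixal agreement slot is held open by the placeholder "+Agr1+".
LEVEL1 = {
    "write+Pres":     "+Agr1+c'er",     # Present, no overt agreement
    "write+Fut":      "da+Agr1+c'er",   # Future, no overt agreement
    "write+Aor":      "da+Agr1+c'ere",  # Aorist, no overt agreement
    "write+Aor3Sg":   "da+Agr1+c'era",  # Aorist, 3Sg Subject
    "write+AorSubj":  "da+Agr1+c'ero",  # Aorist Subjunctive
}

for analysis, form in LEVEL1.items():
    print(analysis, "->", form)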

Figure 1 – Simplified FST Script

4.3 Level 2: Semi-regular Patterns

The purpose of Level 2 is to compile inflectional forms that are dependent on other forms (introduced in Level 1), and to provide default inflections for regular screeve formation patterns.

An example of the first case is the Conditional screeve, formed predictably from the Future screeve. The FST algorithm is as follows:

• Compile a network consisting of Future forms;
• Add the appropriate inflectional suffixes (-di for 1st and 2nd person subject, -da for 3rd person subject);
• Replace the screeve property "+Fut" with "+Cond"; and
• Add the inflectional properties where needed.

The replacement of screeve properties is done using the 'substitute symbol' command in xfst; the other operations are performed using simple concatenation commands.
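The same derivation can be rendered as the following hedged sketch (Python rather than xfst; the form and the tag names are invented for illustration, and the prefixal agreement slot "+Agr1+" is still unfilled at this level):

# Level 2 derivation of the Conditional from the Future (illustrative data).
FUTURE = {"write+Fut": "da+Agr1+c'er"}

def conditional(future_pairs):
    out = {}
    for analysis, form in future_pairs.items():
        base = analysis.replace("+Fut", "+Cond")   # relabel the screeve property
        out[base + "+1/2Subj"] = form + "di"       # -di for 1st/2nd person subjects
        out[base + "+3Subj"] = form + "da"         # -da for 3rd person subjects
    return out

print(conditional(FUTURE))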

An example of the latter is the addition of the 3Pl Subject forms of the Present screeve. The default suffix is -en, which is added to all verbs unless an exception is specified at Level 1. The basic algorithm is as follows:

• Compile a network of Present forms, excluding the verbs for which 3Pl Subject forms are already specified;
• Add the suffix -en; and
• Add the morphosyntactic property "+3PlSubj".

All of the patterns defined at Level 2 are then compiled into a single network, which serves as input to Level 3.

4.4 Level 3: Regular Patterns

The purpose of Level 3 is to affix regular inflection, namely, subject and object agreement. As described in section 2, agreement in Georgian is expressed via a combination of a prefix and a suffix that are best thought of as attaching simultaneously and working in tandem to express both subject and object agreement. Thus, the compilation of Level 3 consists of several steps, each of which corresponds to a paradigm cell.

In each step, all of the word forms from Level 2 are taken as input. The placeholder for the pre-stem agreement affix is then replaced by the appropriate affix (in some cases, this is null), and the appropriate suffix is attached at the end, as in Figure 1. The resulting networks are then compiled into a single network.
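The following sketch (Python; the paradigm cells and affixes are illustrative stand-ins rather than the full set of Georgian markers) shows the placeholder replacement plus suffixation that each paradigm-cell step performs:

# One Level 3 step per paradigm cell: fill the prefixal agreement slot and
# attach the matching final suffix at the same time (illustrative affixes).
CELLS = {
    "+1SgSubj+3Obj": ("v", ""),   # (prefix, suffix) for this paradigm cell
    "+2SgSubj+3Obj": ("", ""),
    "+3SgSubj+3Obj": ("", "s"),
}

def level3(pairs):
    out = {}
    for cell, (prefix, suffix) in CELLS.items():
        for analysis, form in pairs.items():
            out[analysis + cell] = form.replace("+Agr1+", prefix) + suffix
    return out

print(level3({"write+Pres": "+Agr1+c'er"}))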

The only difficulty at this level arises when dealing with the 'inverted' screeves (Perfect and Pluperfect). As demonstrated in section 2, the morphological agreement in these screeves is sensitive to the case-marking of the nominal arguments, which is the reverse of the regular pattern. However, the composition of the agreement paradigm is sensitive to the semantic roles played by the arguments: plural number agreement is still triggered by the semantic agent. In this case, the computational implementation was motivated by the practical application of the model to the online reference. A separate set of paradigm cells was created for the inverted tenses, interpreting the properties 'Subject' and 'Object' as semantic. The resulting FST network thus shows no relation between inverted and non-inverted forms (i.e., it does not capture the generalisation behind inversion). Such an interpretation was sufficient for the purposes of the conjugation reference. However, the model could easily be amended to incorporate a different analysis of inversion that relies on the distinction between semantic and morphological arguments.

4.5 Treatment of Lexical Classes

The input to Level 1 contains a representative for each lexical class, supplied with a diacritic feature indicating the class number. Other verbs that belong to those classes could, in principle, be input along with the class number, and the FST model could substitute the appropriate roots in the process of compiling the networks. There are, however, several challenges to this straightforward implementation:

• Verbs belonging to the same class may have different preverbs as well as different roots, thus complicating the substitution;
• For many verbs, screeve formation involves stem alternations such as syncope or vowel epenthesis, again complicating straightforward substitution; and
• Suppletion is also quite common in Georgian, requiring completely different stems for different screeves.

As a result, even for a verb whose lexical class is known, several pieces of information must be supplied to infer the complete inflectional paradigm. The FST substitution mechanisms are fairly restricted, and so the compilation of new verbs is currently done using Java scripts performing simple string manipulations. Such an implementation still makes use of the division into lexical classes. The scripts make non-example verbs look like example verbs in Level 1 of the FST network by creating the necessary inflected forms, but the human input to the scripts need only include the information necessary to identify the lexical class of the verb. Future improvements to the computational model may include a more efficient method of identifying lexical classes within FST itself.
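A rough sketch of that idea, in Python rather than the Java scripts mentioned above, and with stems chosen purely for illustration: given an example paradigm for a class and the minimal information identifying a new verb of the same class, the new verb's Level 1 forms are produced by string substitution.

# Instantiating a new verb from its class example by substituting preverb and
# root into the example's Level 1 forms (stems here are illustrative only).
EXAMPLE = {                        # class example: preverb "da", root "c'er"
    "+Pres": "+Agr1+c'er",
    "+Fut":  "da+Agr1+c'er",
    "+Aor":  "da+Agr1+c'ere",
}

def instantiate(example, old_root, old_preverb, new_root, new_preverb):
    """Make a new verb of the same class 'look like' the class example."""
    out = {}
    for screeve, form in example.items():
        form = form.replace(old_preverb + "+Agr1+", new_preverb + "+Agr1+")
        form = form.replace(old_root, new_root)
        out[screeve] = form
    return out

# A hypothetical new verb of the same class, with root "xat'" and preverb "da".
print(instantiate(EXAMPLE, "c'er", "da", "xat'", "da"))

As the challenges listed above make clear, a realistic version would also need to handle stem alternations and suppletive stems, which is why the human input must include more than the class number alone.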

The exact number of lexical classes has not been established with full certainty. Melikishvili (2001) relies entirely on morphological characteristics of verb inflection and categorises verb forms into sixty-three different classes; seventeen of those are for transitive verbs. This classification, however, makes some distinctions that can be merged in the computational model; for example, certain types of non-productive stem extensions can be considered part of the lexically specified verb stem.

Another issue is the psychological reality of the lexical classes. A pilot survey of morphological productivity, conducted with adult speakers of Georgian, suggests that speakers conjugating nonce verbs rely more on frequent inflectional patterns than on a rule-based comparison with existing verbs based on morpho-phonological similarities with the nonce verbs (Gurevich forthcoming 2006). Such a reliance on frequency is not reflected in Melikishvili's classification. The computational model proposed here takes a small step in this direction by relying on frequent verbs as example paradigms; however, the model does not have any built-in way to accommodate the relative frequency of different inflectional patterns. The concluding section suggests some possible improvements for the future.

4.6 Case Frames

As described in section 2, another complicating feature of the Georgian verb is the variability of case-marking patterns for the verb's arguments. For the purposes of the online conjugation reference, it was necessary to present the case-marking information with each verb. Fortunately, the case-marking patterns depend almost entirely on the conjugation class and TAM series of the verb.³ Since the goal of the online reference is to describe the morphosyntactic patterns of Georgian, it was sufficient to simply mention the case-marking pattern for each verb type.

³ One of the exceptions is the verb ic'is 'he/she knows'. Although this verb is transitive, its subject must always be in the Ergative and its object in the Nominative.

If the purpose of the morphological transducer is to supplement a syntactic parser, the case-marking information could be represented as a feature structure and associated with each verb type.
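For instance, such a feature structure could be as simple as the following Python sketch (the attribute names and the choice of encoding are illustrative, not the paper's own representation):

# A case frame as a simple feature structure, keyed by conjugation class and
# TAM series (attribute names and values here are illustrative).
CASE_FRAMES = {
    ("transitive", "Series I"):   {"subject": "Nominative", "object": "Dative"},
    ("transitive", "Series II"):  {"subject": "Ergative",   "object": "Nominative"},
    ("transitive", "Series III"): {"subject": "Dative",     "object": "Nominative"},
}

def case_frame(conjugation_class, series):
    return CASE_FRAMES[(conjugation_class, series)]

print(case_frame("transitive", "Series II"))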

4.7 Summary

The computational model presented here accommodates many properties of Georgian verbal conjugation that make it challenging: the templatic structure of the verb forms; the non-concatenative nature of word meaning construction; the large number of irregular and semi-regular word formation patterns; and the interaction between word formation and case marking on the verb's arguments. The model crucially relies on the classification of verbs into lexical classes with example paradigms for each class. The two-level mappings inherent in FST mean that the model can be used for generation as well as recognition.

5. Practical Application: An Online Reference

5.1 Purpose

The computational model described here serves as the basis for an online reference on Georgian verb conjugation. The goal of the online reference is to aid learners of Georgian in a number of ways:

• It provides complete conjugation tables for two hundred frequently used verbs;
• The verb database can be searched using any verb form or its English translation;
• For many verb forms, real-life examples from the Internet, as well as audio and video sources, are provided (along with translations); and
• Several types of exercises are available on the website; answers are automatically checked for correctness.

The online reference is meant as an addition to classroom or self-study using a textbook, such as Kurtsikidze (forthcoming 2006).

5.2 Website Design

The website is divided into four sections: 'Verb Conjugation', 'Examples', 'Exercises', and 'Resources'.

The section on verb conjugation is the core of the reference tool. It provides complete tables of verb conjugations, accessible by browsing by individual verb (in Georgian or in English) or by searching. The conjugated forms are produced using the FST model described in the previous section; the forms are then automatically loaded into a MySQL database and displayed on the website using PHP. In addition to displaying verb forms, the site allows the user to search for a given verb form, using the recognition capabilities of the FST network. This search capability demonstrates a major advantage of online resources over print.
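As an illustration of that pipeline (a sketch only: it uses Python with an in-memory SQLite table as a stand-in for the MySQL/PHP setup described above, and the two rows are invented placeholders):

import sqlite3

# Illustrative loading of FST-generated (analysis, form) pairs into a
# relational table, followed by a lookup by surface form, mimicking the
# website's search over conjugated forms.
rows = [
    ("write+Pres+3SgSubj", "c'ers"),
    ("write+Aor+3SgSubj", "dac'era"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forms (analysis TEXT, surface TEXT)")
conn.executemany("INSERT INTO forms VALUES (?, ?)", rows)

# Look up a conjugated form typed in by a learner.
for analysis, surface in conn.execute(
        "SELECT analysis, surface FROM forms WHERE surface = ?", ("dac'era",)):
    print(analysis, surface)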

Many of the verb forms are accompanied by handpicked examples of usage from print sources (mainly online newspapers and chat rooms), audio (from recorded naturalistic dialogues), and movie clips. The examples are provided as complete sentences and short paragraphs; translations are available for all examples. Audio and video examples are likewise accompanied by transcriptions and translations. I am very grateful to Vakhtang Chikovani for finding and translating the examples.

The 'Examples' section of the website provides a different way to access the print, audio, and video examples. This can be done by browsing by verb, or by searching (again, in Georgian or in English).

The 'Exercises' section contains several different types of exercises to provide additional practice in using and conjugating verbs. Many of the exercises are generated from the conjugated forms or the handpicked examples, and so the correctness of the answers can be checked automatically.

Finally, the 'Resources' section contains links to various online and bibliographical resources about Georgian, as well as technical suggestions for using Georgian fonts. The website will be operational in spring 2006; anyone interested in using it should contact the author.

6. Conclusions and Further Work

This paper represents a first attempt at modelling Georgian verbal morphology using easily available, off-the-shelf technology such as FST. Using some adaptations to accommodate the templatic and non-compositional structure of Georgian verbs, we were able to make significant progress and produce one practical application of the computational model for language learners. In short, the model provides a convenient method for representing the existing lexicon for computational applications such as parsing or generation.

Naturally, each technology has its drawbacks. FST provides no way to incorporate frequency information about the Georgian lexicon and, in general, is not an accurate model of how verbs are learned. Unfortunately, creating a statistically sensitive model of the Georgian lexicon is not currently an easy proposition, as there are no available corpora of Georgian, and no immediate ways of obtaining statistical distributions.

This project will develop in several ways in the future. First, the existing model will be enriched with more verb types and more inflectional parameters (such as the use of pre-radical vowels and productive passivisation and causativisation processes). Second, I plan to explore ways to incorporate statistical information into the model, either through the use of connectionist networks or by putting numerical transition probabilities on the different arcs of the FST transducers. The eventual goal would be to create a model that can be used for learning Georgian verb conjugations and that could produce a finite-state network of complete word forms. We also hope that this model and the online reference and collection of examples can serve as the basis for the creation of a corpus of spoken Georgian. Information collected in the corpus can then be used to inform and improve future computational models.

References

Anderson, S.R. (1992). A-Morphous Morphology. Cambridge / New York: Cambridge University Press.

Beesley, K. & Karttunen, L. (forthcoming 2006). Finite-State Morphology. Second Edition. Cambridge / New York: Cambridge University Press.

Beesley, K. & Karttunen, L. (2003). Finite-State Morphology. Cambridge / New York: Cambridge University Press.

Blevins, J.P. (forthcoming 2006). "Word-Based Morphology." Journal of Linguistics.

Boeder, W. (1969). "Über die Versionen des georgischen Verbs." Folia Linguistica 2, 82-152.

Fillmore, C.J. (1988). "The Mechanisms of 'Construction Grammar'." BLS 14, 35-55.

Goldberg, A.E. (1995). Constructions: A Construction Grammar Approach to Argument Structure. Chicago: University of Chicago Press.

Gurevich, O. (2003). "On the Status of the Morpheme in Georgian Verbal Morphology." Berkeley Linguistics Society 29, 161-172.

Gurevich, O. (forthcoming 2006). Constructional Morphology: The Georgian Version. PhD Dissertation, UC Berkeley.

Halle, M. & Marantz, A. (1994). "Distributed Morphology and the Pieces of Inflection." In Hale, K. & Keyser, S.J. (eds), The View from Building 20: Essays in Linguistics in Honor of Sylvain Bromberger. Cambridge, MA: MIT Press.

Kurtsikidze, S. (forthcoming 2006). Essentials of Georgian Grammar. München: LINCOM Europa.

Melikishvili, D. (2001). Kartuli zmnis ughlebis sist'ema [Conjugation system of the Georgian verb]. Tbilisi: Logos presi.

Stump, G.T. (2001). Inflectional Morphology: A Theory of Paradigm Structure. Cambridge / New York: Cambridge University Press.

The Igbo Language and Computer Linguistics: Problems and Prospects

Chinedu Uchechukwu

Computer Linguistics is a wholly undeveloped and an almost unknown area of research in the study of Nigerian languages. Two major reasons can be given for this state of affairs. The first is the lack of training of Nigerian linguists in this discipline, and the second is the general newness of computer technology in the country as a whole. This situation, however, is most likely to change as a result of the increasing introduction of computer technology in the country, and in the institutions of higher learning in particular. Such a change is highly promising and most welcome, but it also makes obvious three main aspects of computer technology that have to be properly addressed before one can speak with confidence of computer linguistics in connection with any Nigerian language. These three aspects are: appropriate font programs, good input systems, and compatible software. This paper looks at the Igbo language in the light of these points. Section 1, which serves as an introduction, presents the major problems confronting the language with regard to its realisation in the new technology. Section 2 presents the strategies adopted to take care of these problems. Section 3 examines the benefits of such strategies for the development of an Igbo corpus and lexicography, as well as the issue of computer linguistic tools (like spell checkers) for the language. Finally, section 4, the conclusion, examines the prospects of full-fledged computer linguistics in the Nigerian setting.

1. Introduction

There are several issues that constitute a major hindrance to the development of computer linguistics in Nigeria. These range from the implementation problems that confront the information technology policies of the different Nigerian governments, to the lack of harmony between such policies and the Nigerian educational systems, as well as the effects of all these on the different Nigerian languages.

With regard to government policy, Nigeria has already had two different computer-related policies. The first was the Nigerian Computer Policy of 1988, and the second is the newly enacted Nigerian National Policy for Information Technology (IT) of 2001. As regards the Nigerian National Computer Policy, a comparison of its goals with actual practice has revealed that the policy has not been fully implemented. While Yusuf (1998) sees its lack of success as resulting from teachers' incompetence, Jegede & Owolabi (2003) have come to the following conclusions in their own survey: the policy's software and hardware stipulations are completely outdated and not maintained, its provision for teachers' in-service training has never been put into practice, and the stipulated number of computers per secondary school is rarely to be found. All this can only confirm the conclusion that "the current pronouncements are obsolete and need to be updated within the dynamic world of computers" (Jegede & Owolabi 2003: 9).

The recent Nigerian National Policy for Information Technology has not fared any better. With the assumption that information technology can be said to have started more intensively in the country with the return of democracy in 1999 (Ajayi 2003), the failure of the previous national computer policy is often overlooked, so that the failings of the older policy are simply being repeated. But some of what is being presented as the 'achievements' of the new policy actually diverts attention from simple core issues that need to be addressed before such achievements can be effective nationwide. The first example of such an achievement is a concentration of energy on readily visible Internet access and on IT workshops for high officials of the government at the federal level (while leaving the average civil servants of the lower cadre to find the means of helping themselves). However, it is especially those of the lower cadre who are sure to be involved in the actual implementation and execution of the policies. The second example is the establishment of the National Information Technology Development Agency (NITDA) with the sum of about $10 million (Nigerian National Policy for Information Technology 2001: vii). One of the agency's achievements is its 'Nigerian Keyboard' project, which has only led to the production of a single downloadable keyboard DLL file for the Microsoft operating system (NITDA: http://www.nitda.org/projects/kbd/index.php). It shall be shown below how the effort being made by the private sector, with little or no financial support, is yielding more benefit. Finally, just like the computer policy of 1988, the more recent policy has also not contributed much to the educational sector. In his examination of the impact of the most recent policy on Nigeria's educational system, Yusuf (2005) sees it as not providing any specific provision for (or application to) education, as being market driven, dependent on imported software, and without any specific direction at the institutional level. His conclusions are that:

The need for integration in teaching and learning, the need for quality professional development programs for pre-service and serving teachers, research, evaluation and development, and the development of local context software are not addressed (Yusuf 2005: 320).

The overall conclusion from this overview is that the two policies, as well as their implementation, have not contributed much to the Nigerian educational system. While agreeing with the analysis of the authors cited above, I would add, however, that one should also bear in mind that "nobody can give what he does not have." For example, one should not expect the bureaucrats, who have always had their secretaries type their letters, to suddenly understand and fully implement IT policies that they did not encounter in the course of their training. The same also applies to the institutions of higher learning. Here one should not, for example, expect lecturers in Departments of Linguistics, who did not do any computer-based research work in the course of their studies, to suddenly start supervising PhD projects in computer linguistics.

In other words, two groups are involved: the civil service and the teaching force. The civil service is a system that has operated over many decades without the help of computer technology and that consequently can neither fully appreciate nor implement the computer technology-related policies as they affect the educational sector. That also explains why much of the input into making computer technology immediately relevant to the Nigerian educational system is coming from channels other than the federal civil service system itself. The second group is the teaching force within the Nigerian educational institutions. The majority of Nigerian linguists of the past and present generations were not trained in the area of computer-based research, for the simple reason that the ordinary typewriter was usually the best machine available to them at the time they were trained. It is obvious that these scholars would train their successors in line with what they knew. For this simple reason, it is unjustifiable to expect them, as well as those they trained, to suddenly start teaching computational linguistics. The conclusion is that, just as within the civil service, the proper use of computer technology within the educational sector, especially with regard to the Nigerian languages, has to come through channels other than those established by the government, and would involve the inevitable but voluntary contribution of both private institutions and individuals. This is the simple reality that confronts most Nigerian languages today.

Finally, the above state of affairs in both the civil service and the educational sector can be described as human-resources related. It is also from this angle that most analysts of the Nigerian National Computer Policy (1988) and the Nigerian National Policy for Information Technology (2001) have looked at it. A further confirmation of this is the drive of one single state of the federation, Jigawa State, to simply ignore the slowly grinding federal structure and invest seriously in information and communication technology hardware, as well as in the training of its civil servants and teachers. This exemplary position, which other states of the federation are now imitating,¹ was facilitated through an agreement between the Jigawa state government and Informatics Holdings of Singapore in 2001.² A further boost to the effort of this single state is the recent W 21 (Wireless Internet) award assigned to the State Governor, Alhaji Ibrahim Saminu Turaki, by the Wireless Internet Institute of the United States of America, which is an international recognition of Jigawa State's investment in ICT and human resources development. However, taking care of the human resources issue does not simultaneously take care of the technical needs of the Nigerian languages; on the contrary, it makes these technical needs ever more apparent. Thus, even the limited increase in computer literacy is enough to make it apparent that the Nigerian languages are confronted with two main problems: (1) an appropriate input system in the form of keyboards; and (2) fonts for a satisfactory combination of diacritics for the scripts of the individual languages.

¹ http://www.onlinenigeria.com/articles/ad.asp?blurb=117
² http://www.e-lo-go.de/html/modules.php?name=News&file=article&sid=7512

In the next section, I shall give an overview of the effort made so far to take care of these two problems for the Igbo language.

2. Computer-Related Problems and their Solutions

The input system and the appropriate font to display the input characters are so intertwined that progress in one cannot take place without similar progress in the other. For the Igbo language (as well as other Nigerian languages), this has meant a constant pendulum movement between fixing the input system and fixing the font. Generally, however, the effort to solve the input problem for the Igbo language and other Nigerian languages has had two main phases: the typewriter phase and the computer phase.

2.1 The Typewriter Phase

The typewriter phase was engineered by Kay Williamson and her colleagues, first at the University of Ibadan and later at the University of Port Harcourt. It involved removing certain foreign symbols from the standard typewriter keyboard and replacing them with special symbols used in Nigerian languages, like the hooked letters of Hausa, or with diacritics like tone marks and subdots. The diacritic keys were changed to become 'dead' keys, so that the diacritic (tone mark or subdot or both) was typed before the letter that bears it. With the start of the National Computer Policy in 1988 this effort no longer had any further support and consequently came to an end.


2.2 The Computer Phase

The computer phase, on the other hand, witnessed different stages. The first stage, from 1985 onwards, was not concerned with a physical input system (i.e. a keyboard), but mainly with the development of an appropriate font whose characters could be entered with the available English keyboard. This effort was led by Victor Manfredi on behalf of the Journal of the Linguistic Association of Nigeria (JOLAN), supported by Edward Ọgụejiofor, a Macintosh programmer in Boston. The first version of the font was called JolanPanNigerian; it was expanded to include symbols for other major languages of West Africa, and was consequently renamed PanKwa. The main drawback of PanKwa was its restriction to Macintosh computers, which were not used in Nigeria; it has also not been possible to adapt it to other operating systems such as DOS, Windows or Linux (for further details see Uchechukwu 2004). This situation remained unchanged until 2000.

The next stage was aided by a convergence of favourable factors during the 21st century, including the availability of virtual keyboards, the founding of Unicode, and the drive towards the development of a physical keyboard for Nigerian languages. These have led to four major lines of effort: the aforementioned Nigerian Keyboard Project of the National Information Technology Development Agency (NITDA), the independent endeavours of KỌYIN/Lancor and Alt-I, and my collaboration with Andrew Cunningham (http://www.openroad.net.au).

2.2.1 The Nigerian Keyboard Project

The keyboard layout produced by NITDA is messy. It is not fully Unicode compatible and does not provide the means for adding some diacritics, like a macron. It is not surprising that not much work has gone into the project since the release of its downloadable DLL file for the Microsoft OS. Finally, the effort mentioned in the sections below can only buttress the point that NITDA's Nigerian Keyboard Project has become another white elephant.
become another white elephant.<br />

2.2.2 The KỌYIN Keyboard

The KỌYIN keyboard is not a Nigerian keyboard layout project, but a business venture of LANCOR Technologies of Boston, MA in the United States. The LANCOR Multilingual Keyboard Technology (LMKT) (http://www.lancorltd.com/konyin.html) has been used by the company to create multilingual keyboards for different languages. The KỌYIN keyboard involved the use of the LMKT to create a physical multilingual keyboard for the Nigerian languages.

Some observable changes that the keyboard has undergone can be summarised as follows. First of all, for the characters (especially the vowels) that are combined with diacritics (in the form of definite symbols placed under the vowels), the company initially used the Yoruba Ife system, which involved the use of a short vertical line under the appropriate character. This was demonstrated on the company's website with an instruction on how to key in the Yoruba name of the president, which contains such characters. Later, the vertical line was replaced with a dot, which is more widespread in the scripts of other Nigerian languages, including Igbo. This could be seen as an improvement, as it would also increase the marketability of the company's keyboard in Nigeria.

2.2.3 The ALT-I Keyboard

The African Languages Technology Initiative (Alt-I) can be described as an organisation whose aim is to appropriate modern ICTs for use in African languages. The organisation hopes to achieve this through ICT advocacy and the delivery of ICT-related services. More relevant here, however, is the organisation's effort to produce a Yoruba keyboard. The organisation has listed some of its achievements in this regard on its website. These include:

• The production of an installable keyboard driver, which it hoped to start marketing by 2004;
• Demonstrations (including installation) of its keyboard driver at the following seven universities in the western part of Nigeria: University of Ibadan; Olabisi Onabanjo University, Ago-Iwoye; University of Ilorin; Lagos State University, Ojo; University of Lagos; Adekunle Ajasin University, Akungba; and Obafemi Awolowo University, Ile Ife;
• The endorsement of its keyboard by the Yoruba Studies Association of Nigeria (YSAN) at the association's 2003 Annual Conference, held between the 4th and 8th of November 2003; and
• The 2003 IICD award of the African Information Society Initiative (AISI) on Local Content Applications.

Finally, the organisation has started to reach out to other Nigerian languages outside the western and predominantly Yoruba-speaking part of the country. There is no doubt that the aim is a 'Nigerian Keyboard'. A hint in this direction is the comparison of the KỌYIN keyboard with the ALT-I keyboard: "the Alt-I keyboard is superior to the Lancor product on the grounds that Alt-I considered a lot of human factor engineering and other social issues, which Lancor seems to have overlooked in their keyboard design" (Adegbola 2003). I do not know the details of the issues between the two efforts, but with a population of about 120 million the Nigerian market is large enough for more keyboards. I now turn to the development of the Igbo keyboard.

2.2.4 The Igbo Keyboard

This is simply an effort that arose through my collaboration with Andrew Cunningham. The effort is not supported by any business or charity organisation. From the outset, the focus was to find a solution that would exploit already available keyboard layouts and adapt them for the Igbo language, without building a physical keyboard from scratch.

There are many virtual keyboards on the net that could be altered to that effect, but Tavultesoft's 'Keyman' program (www.tavultesoft.com) was found to be the best. Two possible physical keyboards came into consideration at the initial stage: the German keyboard and the English keyboard. The drawback of the English keyboard is the requirement to hold down or combine no fewer than three different keys in order to realise a single subdotted character. Such a method is tedious and not particularly appealing. That is why I chose the German keyboard. The special German characters can thus be replaced with specific Igbo characters, as shown in Table 1 below. The third column of Table 1 shows a further combination of the subdotted characters with tone marks.

Through the collaboration with Andrew Cunningham, all these and many other changes (especially with regard to the consonants) were incorporated and used to build an Igbo keyboard layout that can be freely downloaded from the Tavultesoft website. Later, a similar keyboard map was also made for people who have access only to the English keyboard, although, as has already been pointed out, users of the English keyboard simply have to cope with the tedious key combinations. I have therefore donated the Tavultesoft keyboard program, together with physical German keyboards, to the Department of Nigerian Languages at the University of Nigeria, Nsukka, as well as to some other Igbo scholars; since then I have been receiving feedback that has been incorporated to refine the program for both the average user and the linguist's most complicated needs. This has led to the development of the second version of the program. Due to the use of the English language in Nigeria, the third version of the program has now been made QWERTY-based, like the English and Danish keyboards, thus replacing the QWERTZ layout of the German keyboard. In addition, it also has a much better display of the characters than is shown in the third column of Table 1. Like the previous versions, the program shall also be freely available.

Table 1: Some Igbo Characters
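To illustrate what such a keyboard ultimately has to produce, the following Python sketch (an illustration only, not the Keyman source) builds the subdotted vowels and attaches tone marks with Unicode combining characters, the kind of representation assumed throughout this paper:

import unicodedata

# Igbo subdotted vowels and tone marks as Unicode combining sequences.
# U+0323 = combining dot below, U+0301 = combining acute (high tone),
# U+0300 = combining grave (low tone).
DOT_BELOW, ACUTE, GRAVE = "\u0323", "\u0301", "\u0300"

def subdot(vowel):
    """Return the NFC form of a vowel with a dot below (e.g. i -> U+1ECB)."""
    return unicodedata.normalize("NFC", vowel + DOT_BELOW)

def with_tone(letter, tone_mark):
    """Attach a combining tone mark to a (possibly subdotted) letter."""
    return unicodedata.normalize("NFC", letter + tone_mark)

print(subdot("i"), subdot("o"), subdot("u"))          # the subdotted vowels
print(with_tone(subdot("o"), ACUTE))                  # subdotted o with high tone
print(with_tone(subdot("u"), GRAVE))                  # subdotted u with low tone
print([hex(ord(c)) for c in with_tone(subdot("u"), GRAVE)])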

Finally, while the Igbo Keyman keyboard is a virtual keyboard, its transformation into a physical Igbo keyboard shall be taken up at the appropriate time. For the time being, it has contributed to taking care of the language's input system. The problem of appropriate font programs to go with the keyboard has also been taken care of through the increasing number of Unicode-based font programs. Thus, the two aspects of (1) an appropriate input system and (2) the fonts, mentioned at the end of section 1 of this paper, have been addressed. The next step is to use these facilities to tackle specific linguistic problems of the language. In the next section, I shall present my efforts in this direction, especially with regard to the development of a corpus model for the Igbo language.
model for the Igbo language.<br />

3. Computer Technology and the Igbo Corpus Model

Of all the different activities involved in developing the Igbo Corpus Model,³ obtaining the proper OCR software was indeed difficult, but more difficult was (and still is) finding an adequate corpus development, manipulation or query system and using such software to process Igbo texts.

³ Partly supported through the Laurence Urdang Award 2002 of the European Association for Lexicography.

Some of the programs I initially experimented with were either theory dependent, required some manipulation of the system, or required personally writing the internal components needed; all this would involve an initial preoccupation with more theoretical issues than with the practical development of the corpus itself. It is for this reason that I chose the following three pieces of software: WordSmith, Corpus Presenter, and the Word Sketch Engine (WSE). Some factors influenced my choice.


These programs are relatively theory neutral, have friendly GUIs, and are explicit in their claim to be Unicode-based.

With regard to the Igbo texts, one can differentiate between two types. The first type is made up of texts without tone marks:

Ị ka nọrịrị Obinna? Obinna tụgharịrị hụ Ogbenyeanụ...

The second type is tone-marked, like the text below:

Ị̀ kà nọ̀rị̀rị̀ Óbìńnà? Óbìńnà tụ̀ghàrị̀rị̀ hụ́ Ógbènyèánụ

A typical Igbo text written by native speakers for fellow native speakers is usually not tone-marked, simply because many find it tedious, although tone marking Igbo texts would make a great deal of difference. However, for any serious linguistic work or research, Igbo texts are usually tone-marked. The tone-marked Igbo text above was produced with version 2.0 of the Igbo Keyman program. The rendition in version 3, which is soon to be released, is much better. However, I used mainly Igbo texts without tone marks in my work with the three corpus programs named above. I examined them on the basis of (1) how they handled text input, and (2) how they handled Igbo words, with and without diacritics. I shall present the programs individually.
words, with and without diacritics. I shall present the programs individually.<br />

3.1 WordSmith

The basic problem encountered with WordSmith is that not ALL components of the software are able to handle the Igbo texts appropriately.

For text input, it is not possible to add words with diacritics, either directly through the Igbo Keyman keyboard or simply by copying and pasting. For example, a keyboard input of the word ọgụ results in 'ügß' (see the WordSmith concordance screenshot), while pasting the word only yields a blank in the entry box. In both cases, activating 'Search' does not yield any results.

Figure 1: Text Input in WordSmith

In terms of processing an entry, the concordance component of the software does not function properly. As long as a word to be searched does not bear any diacritics (like a subdot), the program sorts it appropriately, as can be seen in the concordance of ONWE.

Figure 2: Concordance Search for 'onwe' in WordSmith

Generally, WordSmith 4.0 is much better than its previous version, but it still has the problem of not being fully Unicode compatible at all levels. Its use for the Igbo language is therefore extremely limited.

3.2 Corpus Presenter

The problems encountered here are the same as in WordSmith: not all components are fully Unicode-based.

Inputting the Igbo word ndị with the Keyman program, or simply by copying and pasting, yields nd?, the same result as in WordSmith (this can be seen in the screenshot of the Corpus Presenter search):

Figure 3: Text Input in Corpus Presenter

In its Dataset component, the program processes an Igbo text in a different manner. The text is usually well displayed as a dataset, as can be seen from the screenshot of the Corpus Presenter Dataset.

Figure 4: Corpus Presenter Dataset

But switching to the 'Text Retrieval Level' for the manipulation of the displayed text simply turns the characters with subdots into question marks. This can be seen from the screenshot of the Text Retrieval component.

Figure 5: Corpus Presenter Text Retrieval

Finally, one major point of difference between WordSmith and Corpus Presenter is in the making of word lists. While WordSmith can do this without loss of data, Corpus Presenter simply leaves out ALL the characters with diacritics. This can be seen in the screenshots of the Corpus Presenter and WordSmith word lists.

Figure 6: Corpus Presenter Word List
Figure 7: WordSmith Word List

Generally, the two programs are very good for manipulating texts of European languages, with Corpus Presenter also having the further advantage of the capacity for POS-tagging. But with regard to the Unicode scripts of an African language like Igbo, they both have their limitations.

I shall now turn to the next program, which is the most promising, the most user-friendly, and the most suitable for use in teaching corpus linguistics at an elementary or advanced stage.

3.3 Word Sketch Engine (WSE)

WSE is so far the only relatively theory-neutral program that has been able to handle Igbo texts without tone marks. The different components of the program are also reliable. For example, words could be keyed in or copied into the Concordance component of the program without any loss or distortion. This applies both to words without subdots and to those with subdots, as can be seen from the two screenshots:

Figure 8: A Word Without Subdots in WSE

Figure 9: A Word With Subdots in WSE

The collocation analysis of WSE does not present any difficulties; neither does it distort the characters of the language. This can be seen in the collocation screenshot:

Figure 10: Collocation in WSE

Finally, the texts processed with the three different programs are Igbo texts without tone marks. The way each program handles such a text determines the extent to which the program can be taken into consideration for texts with greater character combination and complexity. This simply means that, at the level of texts without tone marks, WordSmith and Corpus Presenter are of limited use. For WSE, however, an investigation is still to be made into its further use for processing Igbo texts that have been complicated through the combination of more diacritics with the subdotted words. In the next section, I shall briefly discuss the effect of the above problems on Igbo lexicography, as well as the efforts to develop a spellchecker for the language.

3.4 Igbo Lexicography and the Igbo Spellchecker

The three lexicographic works of the language to be examined here are Williamson's Igbo-English Dictionary (1972), Echeruo's Igbo-English Dictionary (1998), and Igwe's Igbo-English Dictionary (1999). The spellchecker is a project that I am currently working on with Kevin Scannell (http://borel.slu.edu/nlp.html).

Each of the three dictionaries bears the imprint of the technological stage at which it was written. Williamson's dictionary was produced with the typewriter. Its legacy is the imprint it has left on Igbo orthography. A particular tone of the language, known as the 'downstep', was marked in her dictionary by placing a dash (macron) over the sound segments that incorporate the downstep. This means the following forms for the vowels without subdots: ā ē ō ī. The same was also done for the subdotted vowels. But, as was pointed out in section 2 above, this was achieved through the physical adjustment of the typewriter. Through such a method, Igbo texts can be properly tone-marked with the old typewriter.

The presentation of the Igbo characters in Echeruo's dictionary has been strongly influenced by the fonts available within the Microsoft operating system. The author simply used the German umlauted vowels (shown in the first column of Table 1 above) instead of the Igbo subdotted vowels. With this method, however, an Igbo word like ụ̀tọ́ 'sweet, sweetness', whose tone is indicated on the word itself, is written as ütö [LH]. This simply means marking the tone separately. Such a method, however, breaks down when one tries to use it to represent a fully tone-marked Igbo text. The method has not found much acceptance (Uchechukwu 2001), but the author has agreed to change it in line with the existing orthography (Anyanwu 2000). There is no doubt that this would become easier for him as a result of the growing improvements in computer technology, including the freely available Igbo keyboard and Unicode-based font programs.

Igwe's dictionary was written in line with the existing orthography; however, the author's effort involved using the 'Symbol' window within Microsoft Word to painstakingly click on the individual Igbo characters of his 850-page dictionary! The only mark the method left on his work can be seen in the combination of a vowel with a tone mark. Within the Microsoft 95/98 system used by the author, a lower-case vowel is automatically changed into an upper-case vowel through such a combination: combining a vowel with the high tone symbol (acute accent) or with the low tone symbol (grave accent) yields the corresponding upper-case vowel, regardless of whether the letter occurs at the beginning of a word, in the middle, or at the end. But with the current improvements in the different operating systems, as well as the available Igbo keyboard, such problems can now be completely addressed.

The spellchecker is still an ongoing project between Kevin Scannell and myself. For the time being, it is restricted to Aspell and to Igbo texts without tone marks. The development of a spellchecker for fully tone-marked Igbo texts shall be taken into consideration at the appropriate time.
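One way of bridging the gap in the meantime, sketched below in Python (this is an illustration of the idea, not part of the Aspell project itself), is to strip the combining tone marks from a tone-marked text while keeping the dots below, so that the resulting words can be checked against a word list without tone marks:

import unicodedata

# Strip tone marks (acute, grave, macron) from tone-marked Igbo text while
# keeping the dot below, so the words can be checked against a word list
# without tone marks (illustration only).
TONE_MARKS = {"\u0301", "\u0300", "\u0304"}   # acute, grave, macron

def strip_tones(text):
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

print(strip_tones("Ị̀ kà nọ̀rị̀rị̀ Óbìńnà"))   # -> Ị ka nọrịrị Obinna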

4. Prospects for Computer Linguistics

The situation of the Igbo language described above has not only highlighted the stages involved in the language's struggle with modern technologies, but also shown how this development can be enhanced.

Igbo texts without tone marks can be processed to some extent by some corpus processing or development systems. Such texts are of little research significance, however, since they cannot graphically represent the very phenomena that are of interest to both the ordinary linguist and the computational linguist. Compared with the situation just a few years back, it is already a great advancement to have some software that can handle Igbo texts without tone marks. But a further step towards laying a good foundation for future computer linguistics within the Nigerian setting requires that the various sophisticated corpus tools, acoustic-phonological systems, spellcheckers and so on be in a position to handle the scripts of the language, whether with or without tone marks.

The conclusion is that, through the development of the Igbo keyboard, as well as through the availability of freely downloadable Unicode-based fonts, the problem confronting the average Nigerian language is now solely a software problem and no longer the problem of a physical keyboard or an operating system. Two additional points support this conclusion. First of all, the same Keyman program can be used to write more keyboard maps for other Nigerian languages, thus making it unnecessary to invest in building a physical keyboard from scratch. All that users with different native languages within the Nigerian setting need to do is simply click on their language keyboard map. This solution is not likely to change, because the English keyboard has become part and parcel of computer hardware within the Nigerian setting. Thus, the production of a physical keyboard for a Nigerian language would definitely involve an expansion of the physical English keyboard. The second point is the present effort by Andrew Cunningham and Tavultesoft to port the Keyman program to the Linux operating system. This should make available to Linux users the same keyboard facility that the Keyman program has provided for the Windows operating system. The effect would be to have the appropriate software within the Linux OS as well. Both developments can only reinforce the point that the keyboard and font stages have been addressed. Computer linguistics within the Nigerian setting now faces the problem of developing the necessary programs that make use of the facilities presently available.

Finally, the picture of the Igbo language presented here shows that the current excitement about the new technology should not make us overlook the simple fact that the three necessary elements for the development of computational linguistics for any African language are: (1) an appropriate font program; (2) a good input system; and (3) compatible computer programs. Thus, the development of computational linguistics for an average African language depends on the extent to which these three aspects have been taken care of for the respective language.


References<br />

Adegbola, T. (2003). 2003 Annual report on the activities of African Languages<br />

Technology Initiative (Alt-I). http://alt-i.org/2003Report.doc.<br />

Ajayi, G.O. (2003). “NITDA and ICT in Nigeria.” Paper presented at the 2003 Round<br />

Table Talk on Developing Countries' Access to Scientific Knowledge (The Abdus Salam ICTP, Trieste, Italy, 23 October 2003).

Anyanwu, R.J. (2000). “Echeruo, Michael J.C. 1998. Igbo-English Dictionary: A

Comprehensive Dictionary of the Igbo Language, with an English-Igbo Index.”<br />

Frankfurter Afrikanistische Arbeitspapier. 12: 147-150.<br />

Echeruo, M.J.C. (1998). Igbo-English Dictionary: A Comprehensive Dictionary of the Igbo Language, with an English-Igbo Index. New Haven: Yale University Press.

Egbokhare, F.O. (2004). Breaking Barriers: ICT-Language Policy and Development.<br />

Dugbe, Ibadan: ODU’A Printing & Publishing Company Ltd.<br />

Jegede, P.O. & Owolabi, J.A. (2003). “Computer Education in Nigerian Secondary<br />

Schools: Gaps Between Policy and Practice.” Meridian: A Middle School Computer<br />

Technologies Journal. Raleigh, NC: NC State University, 6(2).<br />

http://www.ncsu.edu/meridian/sum2003/nigeria/print.html.<br />

Nigeria National Computer Policy (1988). Lagos: Federal Ministry of Education.<br />

Nigerian National Policy for Information Technology (IT) (2001).<br />

http://www.nitda.gov.ng/docs/policy/ngitpolicy.pdf<br />

Uchechukwu, C. (2001). “Echeruo na Eceruo, kedu nke ka mma…?” KWENU, 1(8), 16-<br />

22.<br />

Uchechukwu, C. (2004). “The Representation of Igbo with the Appropriate Keyboard.”


Paper presented at the International Workshop on Igbo Meta-Language. (University of<br />

Nigeria, Nsukka, 18 April 2004).<br />

Yusuf, M.O. (1998). “An Investigation into Teachers’ Competence in Implementing<br />

Computer Education in Nigerian Secondary Schools.” Journal of Science Teaching and<br />

Learning, 3(1 & 2), 54-63.<br />

Yusuf, M.O. (2005). “Information and Communication Technology and Education:<br />

Analysing the Nigerian National Policy for Information Technology.” International<br />

Education Journal, 6(3), 316-321.<br />



Annotation of Documents for Electronic<br />

Editing of Judeo-Spanish Texts: Problems and

Solutions<br />

Soufiane Rouissi and Ana Stulic

The result of an interdisciplinary process comprising Linguistics, Information and<br />

Computer Sciences, this contribution consists of modelling the annotated electronic<br />

editing of Judeo-Spanish 1 texts written in Hebrew characters, following the principle<br />

of document generation in a collaborative work environment. Our approach is based<br />

on the concept of digital document annotation that places mark-up at any text level,<br />

starting with the character resulting from the transcription. Our point of view is that the<br />

annotations of a ‘translated/interpreted’ document can have two different purposes:<br />

to interpret (to add new mark-up in order to propose a different interpretation from<br />

the one formulated at the beginning); and to comment (to remark on an interpretation made by another author). Our aim is to make it possible for the reader/

user to interact with the document by adding his own interpretation (translation)<br />

and/or comments on an interpretation made by another author. We present a model<br />

for the description of annotation in response to our problem.<br />

1. Introduction<br />

In this paper, we will explore the problem of digital document annotations in<br />

application to the very specific problem of building a Judeo-Spanish corpus. We will<br />

briefly present the interest of such an enterprise, together with some difficulties<br />

related to building corpora in general, as well as those specific to the Judeo-Spanish<br />

case. Considering the recent developments in information technology (IT), we will take<br />

into account the concepts of digital documents, automatic generation of documents,<br />

and production of digital documents in collaborative mode, and then, apply them to<br />

our problem. Finally, we will propose a prototype model for Judeo-Spanish corpus<br />

building in the context of a collaborative environment. This proposal offers some<br />

conceptual and methodological solutions based on existing technologies, but leaves<br />

open the question of technical realisation.<br />

1 The language of the Sephardic Jews who, after being expelled from Spain at the end of the 15th century, settled in the greater Mediterranean area.



2. Building a Judeo-Spanish Corpus<br />

2.1 The Research Value of a Judeo-Spanish Corpus<br />

Judeo-Spanish is the language spoken by the Sephardic Jews who, after being expelled from Spain at the end of the 15th century, settled in the greater Mediterranean area. It represents a variety of Spanish that has followed its own development path since the late 15th century (though not without contact with the Iberian Peninsula), and it is relatively well documented. Many original documents in Judeo-Spanish, such as manuscripts, books and other printed material, have been preserved. Linguistic fieldwork from the beginning of the 20th century has also yielded a certain number of oral transcriptions.

From a linguistic point of view, Judeo-Spanish is very interesting because it offers<br />

numerous possibilities for comparative and historical linguistic analysis, since peninsular Spanish itself is very well documented for the pre-expulsion period.

Equally, the original sources are of great value for historical and cultural research.<br />

Unfortunately, in many countries where it was kept alive for centuries, Judeo-<br />

Spanish has been in progressive decline since the beginning of the 20th century, and,<br />

at present, it is no longer spoken in many cities of the Balkan Peninsula (formerly the

centres of Judeo-Spanish culture). Therefore, the editing of original Judeo-Spanish<br />

sources can also contribute to the preservation of knowledge about this language.<br />

The approach adopted here in the treatment of Judeo-Spanish documents has been primarily oriented towards their use as a corpus for linguistic research, but it can be

extended to other uses as well.<br />

2.2 General Problems of Linguistic Corpora Editing

In the humanities (philology, history, literature and linguistics), the word ‘corpus’<br />

has traditionally referred to any body of texts that are used for research. In modern<br />

linguistics, it refers most often to relatively large collections of texts that represent<br />

a sample of a particular variety or use of language(s). Language corpora can take the form of manuscripts, printed material, sound recordings (spoken corpora), or machine-readable text. Nowadays, it has become common to think of corpora especially in this latter form, and not without reason. The development of computer-readable corpora has

enlarged the possibilities of linguistic research by simplifying search tasks, and making<br />

possible the use of large portions of texts.<br />

Compilations of electronic corpora, especially those with historical dimensions,<br />

rely upon existing written documents, which are frequently philological editions. In<br />



electronic corpora, regardless of whether the electronic text is made from a source<br />

document (such as a manuscript or original edition) or a philological edition, the<br />

authors of the corpus must develop annotation strategies in order to represent the<br />

source document from which the text is derived. The information commonly provided<br />

in annotations concerns metadata about the digital document itself (as well as the<br />

source document), but it can also deal with the linguistic properties of the text, such<br />

as parts of speech, lexemes, prosody, phonetic transcription, semantics, discourse<br />

organisation, co-reference, and so forth. Designed to be global (extending to the<br />

entire corpus) and universal in their validity, most types of linguistic annotations<br />

represent complex tasks that are economically very expensive and inconvenient for<br />

lesser-used language corpora that are designed principally for scientific research. On<br />

the other hand, de Haan’s proposal of problem-oriented tagging offers another point<br />

of view (de Haan 1984). In this approach, the users take a corpus, either annotated<br />

or unannotated, and add to it their own form of annotation, oriented to their own<br />

research goal. This type of annotation seems very promising in the context of specific<br />

research needs, provided that the annotation system in question is supplied with a

dynamic and interactive dimension.<br />

Although in language corpora building, emphasis is frequently placed on developing<br />

software search possibilities and linguistic annotations, one crucial question remains:<br />

Where do the texts come from? Though reproducing the raw text of a source document<br />

may seem like a simple task, it often isn’t, especially when dealing with ancient texts,<br />

or texts in writing systems other than the Latin alphabet. In current corpus-building

practices, the old philological problems are still of current interest. In the traditional<br />

philological paper editions, the text is determined by the editor (on the basis of the<br />

source document[s]), together with the critical apparatus and the writing system,<br />

and the reader/researcher can only accept the editor’s interpretation. In electronic<br />

corpora, the approach is similar: while the source document remains inaccessible for<br />

practical reasons, whatever the editorial choices of the corpus author are (including<br />

critical apparatus, writing system and annotations), the user cannot intervene or<br />

adapt them to his/her own purposes.<br />

On the other hand, source documents have never been more accessible in technical<br />

terms, albeit on condition that they are available in digital form, as an image or as<br />

a sound file. The possibility of consulting the digital image of a source document<br />

in parallel to its electronic transcription would enable the researcher to be critical<br />

of the editor, and would resolve some of the philological problems caused by the<br />

necessarily arbitrary choices one is forced to make in a philological edition.<br />



2.3 Specific Problems in the Judeo-Spanish Context

The most salient characteristic of Judeo-Spanish texts is the writing system in which they were composed; at the same time, this represents the greatest difficulty in their editing and computer processing.

The Judeo-Spanish documents produced in the post-expulsion period were<br />

commonly written in an adaptation of Hebrew script (in this context its distinctive<br />

Rashi version is frequently used, but the square Hebrew script is equally found; the<br />

difference between the two is only in the form of the letters). The practice of using<br />

Hebrew script for texts in Romance languages was already very common before the<br />

expulsion (see Minervini 1992).<br />

In the history of writing systems, the adaptation of a script originally designed for<br />

one language into the writing system of another language is a cultural phenomenon<br />

that has been frequently repeated; it leads to the development of new conventions<br />

adapted for the target language that involve the use of graphemes coming from the<br />

source language’s writing system.<br />

In its original form, the Hebrew script made no use of vocalic graphemes, because in most cases the contrasts realised by vowels were grammatical rather than lexical in character. In order to avoid certain ambiguities, some letters (yod, waw, he and aleph) progressively acquired vocalic value in certain contexts in Biblical Hebrew. The fully vocalised writing system of Hebrew was designed much later, and has been mostly reserved for the texts of the Bible (for more details, see Sampson

1997:123-129).<br />

As in the Hebrew writing tradition, in Spanish texts the fully vocalised script was reserved only for translations of the Bible and sacred texts; in other texts, the use of letters with vocalic value was extended to all contexts, with the particularity that two letters, yod and waw, could each denote two vowels, /e/ and /i/, and /o/ and /u/ respectively. Also, a diacritic sign (of different shape in different times and traditions) was introduced above certain letters to create new graphemes for consonants that had no counterpart in the Hebrew writing system, or that lacked phonological value in Hebrew (Sampson 1997:123-129).

Nevertheless, the history of this adaptation shows many variations in the application of conventions. One source of variation comes from the possibility of using different letters for the same phoneme (this kind of variation is found even within the same text). Although some of the basic principles of the adaptation of Hebrew script to the Spanish language were probably transmitted over generations, the reading and writing of Hebrew represented the only constant knowledge, so conventions applied

to Spanish could be updated at any time. On the other hand, Judeo-Spanish evolved<br />

phonologically, and this was also reflected in the writing system (Pascual Recuero<br />

1988).<br />

The main, but not the only, problem in editing Judeo-Spanish texts is the underspecified use of vowel graphemes. This becomes even more complex if we consider that the vowel system has undergone some modifications and that reconstruction on the basis of 15th-century Spanish cannot be completely reliable (especially if we take into account the fact that 15th-century Spanish is known through the different variations it presented, including in the vowel system).

Two approaches are possible: (1) conserving the original script conventions in<br />

every way, by transliteration of original documents, which means replacing each<br />

grapheme by another one; and, (2) interpreting vowel graphemes, by transcription,<br />

which means specifying the vowels where their presence is indicated. If carried out<br />

in the traditional sense of a philological edition, both have their advantages and<br />

drawbacks. The transliteration conforms to the source, so that the researcher can rely<br />

on its fidelity, but it doesn’t make the text more accessible in terms of intelligibility.<br />

The transcription is certainly more intelligible, but the choices of vowel interpretation made on the basis of reconstruction are determined (and fixed forever) by the transcriber, and fidelity to the source is lost 2 . The need for both transliteration and transcription as research tools has been recognised in the study of Judeo-Spanish texts; both are used in different contexts, and sometimes parallel versions of texts are even proposed, as in the edition of Jewish medieval texts from Castilla and Aragon by Laura Minervini (1992); a similar solution is proposed independently in Stulic (2002) for the editing of the 19th-century Judeo-Spanish newspaper El amigo del puevlo.
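The contrast between the two approaches can also be sketched programmatically. The following toy example is entirely hypothetical: the grapheme mappings are drastically simplified and the sample token is invented; it only shows how a transliteration fixes one Latin counterpart per grapheme, while a transcription must commit to one of several possible vowel readings of yod and waw.

TRANSLIT = {"\u05D1": "b", "\u05D9": "y", "\u05D5": "w", "\u05DF": "n"}  # bet, yod, waw, final nun

def transliterate(hebrew: str) -> str:
    """Replace each grapheme by a fixed Latin counterpart (faithful, less readable)."""
    return "".join(TRANSLIT.get(ch, ch) for ch in hebrew)

def transcribe(hebrew: str, vowel_choices: dict) -> str:
    """Interpret yod as e/i and waw as o/u at the given positions (readable, editor-dependent)."""
    out = []
    for i, ch in enumerate(hebrew):
        if ch == "\u05D9":                      # yod: /e/ or /i/
            out.append(vowel_choices.get(i, "i"))
        elif ch == "\u05D5":                    # waw: /o/ or /u/
            out.append(vowel_choices.get(i, "u"))
        else:
            out.append(TRANSLIT.get(ch, ch))
    return "".join(out)

word = "\u05D1\u05D9\u05DF"                     # bet-yod-(final) nun, an invented token
print(transliterate(word))                      # 'byn'  -- keeps the ambiguity of the source
print(transcribe(word, {1: "e"}))               # 'ben'  -- one editor's reading
print(transcribe(word, {1: "i"}))               # 'bin'  -- another possible reading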

In this paper, we wanted to model a solution for the electronic editing of such texts<br />

that could encompass both approaches, and maybe even offer something more as a<br />

research tool. Although we took as a starting point a very concrete Judeo-Spanish text<br />

from the 19th century, the problems we are endeavouring to solve could derive from

any text edition where a choice between the intelligibility of the text and the fidelity<br />

to the source is imposed.<br />

2 A text transcribed for syntactic analysis would not be useful, for example, for research on phonological issues or the history of the writing system.



3. The Digital Document and Annotation<br />

3.1 What is a Digital Document?<br />

The early 2000s have seen an increasing spread of online environments, facilitated by the use of databases for content storage, the automatic generation of digital documents, and the use of interactive 'fill-in' forms. It is this favourable situation, together with numerous developments in Web technologies, that leads us to the creation of a Judeo-Spanish corpus in a digital environment.

First of all, what is a digital document? Following the definitions in use among French<br />

scholars, it can be defined on three levels: as an object (material or immaterial), as<br />

a sign (meaningful element) and as a relation (communication vector). As an object,<br />

a digital document can be defined as “a data set organised in a stable structure<br />

associated with formatting rules to allow it to be read both by its designer and its<br />

readers” (Pédauque 2003). As a sign, a digital document is “a text whose elements can potentially be analysed by a knowledge system in view of its exploitation by a competent reader” (Pédauque 2003). From the social perspective, a digital document

can be viewed as “a trace of social relations reconstructed by computer systems.”<br />

(Pédauque 2003).<br />

For our approach, the perspective of a digital document as an object (material<br />

or immaterial) seems the most relevant. As such, the digital document raises three principal issues: storage, plasticity, and programmability (Rouissi 2004). It is this

latter characteristic that captures our attention here. Dealing with the annotation of<br />

electronic documents, our aim is to make use of solutions made possible by automatic<br />

generation.<br />

In a functional approach, a digital document can be considered as any other<br />

document. In documentalist theory, the definition of a document emphasised the<br />

function (something that serves as evidence) more than the actual physical form (paper, stone or antelope) (see Schürmeyer 1935; Briet 1951; Otlet 1990). The irrelevance of the physical medium became even more evident in the case of the digital document. Buckland

writes “The algorithm for generating logarithms, like a mechanical educational toy,<br />

can be seen as a dynamic kind of document unlike ordinary paper documents, but still<br />

consistent with the etymological origins of “docu-ment”, a means of teaching - or, in<br />

effect, evidence, something from which one learns” (Buckland 1998). In accordance<br />

with this point of view, and considering the digital document as a research tool, we<br />

would like to explore its programmable possibilities.<br />



3.2 Automatic Generation of Electronic Documents<br />

Current Web technology offers the possibility of creating a document upon request.<br />

In this context, the result of an execution of a computer program is a document: a<br />

Web page obtained on the basis of one or multiple resources (program, database,<br />

cascading style sheets, etc.). The automatic generation of Web documents is based<br />

on the use of scripting programming languages whose execution takes place on the<br />

server. These technologies speed up processing, making it possible to go beyond the limits of HTML (Hypertext Markup Language), which remains static and only permits control of the document's layout. Equally, a connection is established with the databases where the information to be supplied in a given context is stored. Thanks to these principles of operation, and with an appropriate editing program, such tools have increasingly been appropriated by users who are not Information Technology professionals. The ease

with which it is possible to reuse documentary resources existing on the Web only<br />

contributes to an even greater development of digital document production. The<br />

correction, modification and adding possibilities facilitate digital document production<br />

in an autonomous mode. Production in autonomous mode is defined by a user, who<br />

elaborates the content and defines its layout for his personal use, or for that of other<br />

users. This production is realised with the help of suitable IT material and programs.<br />

The autonomous mode means that the user has all the creative freedom (choice of<br />

layout, colours, fonts, format, file names, diffusion and storage hosting, etc.), but<br />

also presupposes that he has all the technical competence needed. On the other hand,<br />

by conserving the autonomy of production while facilitating the exchange, the semi-<br />

autonomous mode can be applied in a collaborative environment where production<br />

rules need to be followed. It is in this context that we wish to situate our work on<br />

digital document annotation. Our aim is to put at the user’s disposal (in this case,<br />

scholars working on Judeo-Spanish documents) a tool modelled on the principle of<br />

semi-autonomous mode production of digital documents.<br />
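As a minimal sketch of what such automatic generation looks like (the table name document_text and its fields are hypothetical and not part of the model presented below), a server-side script can build the page the reader sees on request from a database, rather than serving a static file:

import sqlite3
from string import Template

PAGE = Template("<html><body><h1>$title</h1><div>$body</div></body></html>")

def generate_page(db_path: str, document_num: int) -> str:
    """Build the HTML view of one stored document on request."""
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT title, body FROM document_text WHERE document_num = ?",
            (document_num,),
        ).fetchone()
    if row is None:
        return PAGE.substitute(title="Not found", body="")
    title, body = row
    return PAGE.substitute(title=title, body=body)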

In this context, the rules are defined at technical level (layout rules and data<br />

structuration) but the decision to produce and to publish is made by the user (Rouissi<br />

2005). This implies of course that he can be identified by the system, and that the<br />

maintenance and technical assistance service is provided. The user doesn’t produce<br />

in an isolated manner, but in a collaborative work environment, which somewhat<br />

constrains his production, but, on the other hand, offers a conception of the whole<br />

and facilitates the integration of individual work.<br />



The emergence of online environments, and within them, the possibility of<br />

producing and uploading digital documents, has changed the role of the user from<br />

passive user/reader to active user/author. In semi-autonomous mode, what is defined<br />

in advance concerns the common vocabulary, the visual aspect, and the structure of<br />

digital documents 3 .<br />

The principal advantages of production in semi-autonomous mode are:<br />

• the durability of the system and the possibility of its evolution (the<br />

contents can evolve more easily);<br />

• the autonomy of handling (the use of ‘fill-in’ forms allows for the handling<br />

of the data by the users themselves);<br />

• the minimal technical competence that is required (the systems remain<br />

intuitive and easy to handle); and,<br />

• the common vocabulary.<br />

It is in terms of these principles that we envisage the development of the digital<br />

document annotation model for Judeo-Spanish corpus editing.

3.3 Production in a Collaborative Mode<br />

Production in a collaborative mode, already widely present in different forms as<br />

collective websites nourished by individual contributions (forums, blogs, wikis, etc.),<br />

seems particularly suited to the collaborative work of specialists in the documents under discussion. In this sense, annotation can play an important role in the evaluation,

interpretation and production of a document, which thereby becomes itself dynamic<br />

and subject to evolution. The final (and collective) document obtained is the result<br />

of the contribution of individual fragments (but not necessarily the sum of the<br />

contributions), and it can also result from choices the user has made.<br />

Annotation is something added to the document. It can be a remark, a comment,<br />

or, in our case, even a proposed interpretation. Already in 1945, Vannevar Bush

envisaged for the Memex (a device which was supposed to create links between related<br />

topics in different research papers) that the owner could add his own comments (Bush<br />

1945). More recently, numerous developments in annotation management systems<br />

appeared with real promise in the direction of sharing and exchanging information.<br />

Several widely used office programs (some versions of MS Word) and the W3C project Annotea in the Web domain (Annotea 2005) are just some examples

of applications aimed at sharing annotations. A more exhaustive list can be found in<br />

Perry (2005).<br />

3 We are not dealing here with the application of XML (eXtensible Markup Language), which plays an important role in data structuring and data exchange.



There are two types of annotations: semantic annotations (in the sense of<br />

standardised metadata Web annotations) and free annotations. The former are<br />

linked to current work on the Semantic Web, and are based on the development

of metadata and/or ontologies used in the description of the document, with the<br />

purpose of facilitating their localization, identification and automatic recognition.<br />

Without neglecting this important issue, we will focus here on free annotations,<br />

because they are used – in philological and linguistic analysis – to interpret and<br />

comment upon documents, and we therefore consider that they can constitute an<br />

important factor in the development of collaborative digital production, improving,<br />

at the same time, communication among the specialists of the domain in question.<br />

From our point of view, annotation can have two purposes. The first concerns the<br />

interpretation of the original document (how to translate or read it). The second<br />

adds comments to one part of the document and/or to an interpretation already made. The annotation can be placed at several levels. The global level concerns annotation made on the whole of the document under discussion. This annotation can consist of free comments on the entire document or can represent a reaction to a global annotation already made. We consider it useful, for the sake of analysis, that the zone of annotation can be freely marked in the document. The smallest mark-up unit is the character; therefore, part of a word, a word, a line, several lines, a paragraph, or several paragraphs can also be the target of the mark-up.
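A small illustration (not from the paper; the sample line and offsets are invented) of how a zone of annotation can be addressed purely by character positions, with the begin = end = 0 convention of the model in section 4.2 standing for an annotation on the whole document:

from dataclasses import dataclass

@dataclass
class Annotation:
    begin: int      # offset of the first character of the annotated zone
    end: int        # offset just after the zone; begin == end == 0 marks a global annotation
    comment: str

def annotated_zone(text: str, ann: Annotation) -> str:
    """Return the span of text an annotation points at (the whole text for a global annotation)."""
    if ann.begin == 0 and ann.end == 0:
        return text
    return text[ann.begin:ann.end]

transcription = "en akel tiempo"                     # invented sample line
note = Annotation(begin=3, end=7, comment="alternative reading proposed")
print(annotated_zone(transcription, note))           # 'akel'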

Collaborative work is situated in the context of semi-autonomous production. Every<br />

member of this collaborative community participates in a responsible way, benefits<br />

from the result of the work of the community, and receives feedback for his work.<br />

Two models of work where everyone’s autonomy can be expressed are: cooperative<br />

work (wherein everyone accomplishes a part of the work and shares it with others) and collaborative work (wherein several autonomous individuals work together in order to produce collectively). We will see how the concept of a digital document and its

production can help in our case.<br />

4. Prototype Model for the Description of Annotations

4.1 General Properties<br />

Considering the problems related to the treatment of Judeo-Spanish texts and<br />

to the building of a corpus, and taking into account the theoretical approach to the<br />

digital document (particularly from the point of view of collaborative work), we<br />

propose a model that can respond to the needs we have identified. Our work focuses<br />

on the definition of needs without making choices that could constrain future program<br />



implementation. In this sense, our contribution is situated on the analytic level prior<br />

to the concrete realisation of the project.

One of the first preoccupations is the constitution of a documentary corpus. In order

to achieve this, the model must be conceived as a digital repository of documents<br />

that are described with specifications that are sufficiently fine-grained, but open<br />

to interoperability. The collaborative dimension must take into consideration the<br />

management of users. Our intention is to describe annotations and to build a typology<br />

of annotations that will appear as the system begins to function.<br />

We briefly summarise here the requirements that the model should satisfy:

• the source document should be accessible, as a transliterated version, or,<br />

ideally, as a collection of image files;<br />

• the transcribed version is given as a starting point of discussion/<br />

analysis;<br />

• the metadata annotations (according to the widely accepted standard,<br />

Dublin Core) are provided with the transcribed version and image files;<br />

• the authorised user can add free annotations on a global or any other text<br />

zone level, starting with the character; these may include a new interpretation (possibly including corrections) and/or comments; and,

• the authorised user can export the result of his or another’s work by<br />

making choices to use or not the annotations that he or another has made.<br />

The annotation management system has still to be developed; it will use the model presented here as its basis.

4.2 Data Description Model<br />

We have seen that the particularity of Judeo-Spanish texts serves as the source of<br />

many methodological and technical problems. Among these, we will concentrate on annotation, which represents the form of document production in collaborative mode.

The annotation can help to handle various interpretations, as well as the comments<br />

made by reader-users. Here, we see the possibility of developing real support for<br />

documentary information and communication: the document in question remains the<br />

carrier of different contributions, and represents, at the same time, the archive of<br />

the exchanges made, as well as the basis for different reading possibilities.<br />

We offer here a proposal of a data description model adapted to our needs in the<br />

study of Judeo-Spanish texts. We have chosen a representation based on

Codd’s relational model (Codd 1970), which presents the meaningful data in the form<br />

of relations, as grouped properties, and as a whole. The choice of representation<br />



inspired by the relational model is guided by the desire to preserve freedom in the definition of the necessary fields (type, size, etc.) in the future implementation. The relational model for the annotation of Judeo-Spanish documents consists of four relations. The primary keys are in bold and underlined; the foreign keys are in

bold and followed by the # sign.<br />

• ANNOTATION (annotation_num, annotation_date_creation,<br />

annotation_date_lastmodified, annotation_comment_title,

annotation_comment_text, annotation_language, annotation_position_begin,<br />

annotation_position_end, annotation_status, annotation_commented_num#,<br />

annotation_type_num#, document_num#, author_num#).<br />

The relation ANNOTATION, whose identifier annotation_num (primary key) has to be created automatically, keeps track of the date of its creation (annotation_date_creation) as well as the date of the last modification (annotation_date_lastmodified). An annotation is equally described by the language of the author, with the property annotation_language. The status carried by the property annotation_status allows the author to say whether the annotation is considered as active or

not: value 1 is for active and public (default value), 0 signifies inactive or private<br />

(i.e., reserved to its author, who considers it of no utility for the public while in draft<br />

status, or for some other reason). The annotation can be deactivated, because it<br />

evolves over time and has no permanent character (notion of duration of annotation).<br />

The contribution added by the identified author (primary key author_num) over the<br />

given document (document_num) has a short title (annotation_comment_title),<br />

which will be used for the publication of the lists of comments, and a text field<br />

(annotation_comment_text) of variable size. The position of the annotation in the<br />

given text/document is determined by the starting point (annotation_position_begin)<br />

and the ending point (annotation_position_end). In the case where the annotation<br />

concerns the whole document and not only one of its fragments, position is indicated<br />

in the following manner: annotation_position_begin = annotation_position_end = 0.<br />

The annotation_commented_num property allows for the formal identification<br />

of the annotation on which the comment is made. In the opposite case, where the<br />

annotation is not made over another annotation (without a link to another annotation),<br />

the value of annotation_commented_num is 0. The type of annotation can be specified<br />

(otherwise the value 0 is attributed) with the annotation_type_num property, which<br />

points to the common vocabulary shared by the members of the community.<br />

• ANNOTATION_TYPE (annotation_type_num, annotation_type_<br />

vocabulary, annotation_type_description, annotation_type_mode_edit)<br />



The property annotation_type_mode_edit indicates (with the value 0 or 1) whether<br />

the annotation proposes that a text be substituted for the one initially under discussion. This kind of annotation corresponds to the

editing action in a document.<br />

Some examples of annotation_type_vocabulary values would be: interpret, comment, refuse, confirm, accept, and so forth. This vocabulary can also be established a posteriori by observing users' practices, and with their help. Elements can be added to the table in the future on the basis of users' proposals.

The annotation_type_description allows for the inclusion of additional information,<br />

and for making the chosen vocabulary more precise.<br />

• AUTHOR (author_num, name, email, login, password).<br />

The AUTHOR relation describes the users that are authorised to interact with<br />

the document, to add annotations, or to modify existing ones (the author can

only modify his own annotations). Whoever wants to propose a contribution must<br />

be identified.<br />

• DOCUMENT (document_num, title, creator, keywords, description,<br />

publisher, contributor, date, resourcetype, format, identifier, source, language,<br />

relation, coverage, rightsmanagement).<br />

The DOCUMENT properties follow the recommendations of the Dublin Core<br />

Metadata Initiative (Dublin Core 2005). The identifier document_num can serve to name the different resources of a document (the original document can be presented as a collection of image files, but also as a transcription in ASCII format). The export formats envisaged here are HTML, XML or even PDF. Some difficulties are still to be overcome, since our annotation scheme in theory allows for overlapping mark-up.

The documents are uploaded by the administrator on the proposal of one of the<br />

members of the community.<br />
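One possible rendering of the four relations, sketched here as SQLite DDL purely for illustration (the paper deliberately leaves field types and sizes open, so the column types and constraints below are assumptions, not part of the model itself):

import sqlite3

SCHEMA = """
CREATE TABLE author (
    author_num   INTEGER PRIMARY KEY,
    name         TEXT,
    email        TEXT,
    login        TEXT,
    password     TEXT
);
CREATE TABLE document (
    document_num INTEGER PRIMARY KEY,
    title TEXT, creator TEXT, keywords TEXT, description TEXT,
    publisher TEXT, contributor TEXT, date TEXT, resourcetype TEXT,
    format TEXT, identifier TEXT, source TEXT, language TEXT,
    relation TEXT, coverage TEXT, rightsmanagement TEXT
);
CREATE TABLE annotation_type (
    annotation_type_num         INTEGER PRIMARY KEY,
    annotation_type_vocabulary  TEXT,     -- e.g. interpret, comment, refuse, confirm, accept
    annotation_type_description TEXT,
    annotation_type_mode_edit   INTEGER   -- 1 = proposes a substitute text (editing action)
);
CREATE TABLE annotation (
    annotation_num               INTEGER PRIMARY KEY,
    annotation_date_creation     TEXT,
    annotation_date_lastmodified TEXT,
    annotation_comment_title     TEXT,
    annotation_comment_text      TEXT,
    annotation_language          TEXT,
    annotation_position_begin    INTEGER,  -- begin = end = 0: annotation on the whole document
    annotation_position_end      INTEGER,
    annotation_status            INTEGER,  -- 1 = active/public (default), 0 = inactive/private
    annotation_commented_num     INTEGER,  -- 0 = not a comment on another annotation
    annotation_type_num          INTEGER REFERENCES annotation_type(annotation_type_num),
    document_num                 INTEGER REFERENCES document(document_num),
    author_num                   INTEGER REFERENCES author(author_num)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)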

The program implementation must take into consideration the different

applications that are possible. The process of document annotation consists of two<br />

complementary phases. The first one comprises the contributive action of adding or<br />

modifying annotations (on the entire document or on one of its fragments). The second<br />

one concerns reading through the exploitation of existing annotations. The reading<br />

possibilities include the choice of exporting and saving files in different formats.<br />

5. Conclusion and Prospects<br />

The work on a documentary corpus as specific as Judeo-Spanish texts opens many<br />

questions concerning the design of the proposed electronic model.<br />



In the context of this particular type of document, there is a necessity to share<br />

the results of study, remarks, comments and interpretation among the members of a<br />

relatively small and geographically dispersed scientific community. A possible solution<br />

is to develop a tool for the management and rationalisation of individual work.<br />

In our approach to the problem, we have taken as a theoretical basis recently<br />

developed concepts related to digital documents, focusing chiefly on the programmable<br />

aspect of a digital document. Taking into account that the documents in question can<br />

be considered as digital documents (over which it is possible to act), we have worked<br />

on the modelling of contributions that can be added to these objects of study. This<br />

has led us to a model that describes the annotations made on documents collected in<br />

a digital repository.<br />

The program implementation, which has yet to be carried out, must follow a fully Web-based approach in order to satisfy the conditions of collaborative work and to remain easy to use from within a Web browser.

Some questions that will certainly appear in the implementation phase are not<br />

accounted for by the proposed model, such as how to apply a modification starting<br />

from one sequence (the annotation that proposes that one sequence be replaced by<br />

another) to the whole document, or whether all users should have the same profile and the same possibilities to act within the documents.

In this sense, the proposed model leaves many questions to be answered, but the<br />

direction in which we are pointing seems rather promising.<br />

References

“Annotea Project.” Online at http://www.w3.org/2001/Annotea.

Briet, S. (1951). Qu’est-ce que la documentation? Paris: EDIT.

Buckland, M.K. (1998). “What is a ‘Digital Document’?” Document numérique 2, 221-<br />

230.<br />

Codd, E. F. (1970). “A Relational Model of Data for Large Shared Data Banks.”<br />

Communications of the ACM. June 1970, 13(6), 377-387.<br />

Dublin Core. Dublin Core Metadata Element Set, Version 1.1: Reference Description.

http://dublincore.org/documents/dces/.<br />

de Haan, P. (1984). “Problem-oriented Tagging of English Corpus Data.” Aarts, J. & W.<br />

Meijs (eds) (1984). Corpus Linguistics. Amsterdam: Rodopi, 123-139.<br />

Marshall, C.C. (1998). “Toward an Ecology of Hypertext Annotation.” Proceedings of<br />

‘Hypertext 98’. New York: ACM Press.<br />

http://www.csdl.tamu.edu/~marshall/ht98-final.pdf.<br />

Minervini, L. (1992). Testi giudeospagnoli medievali (Castiglia e Aragona). 2. Napoli:<br />

Liguori Editore.<br />

Otlet, P. (1990). International Organization and Dissemination of Knowledge: Selected<br />

essays. (FID 684). Amsterdam: Elsevier.

Pascual Recuero, P. (1988). Ortografía del ladino. Granada: Universidad de Granada,<br />

Departamento de los Estudios Semíticos.<br />

Perry, P. (2001). “Web Annotations.”<br />

http://www.paulperry.net/notes/annotations.asp.


Pédauque, R.T. (2003). “Document: Form, Sign and Medium, As Reformulated for<br />

Electronic Documents.” Version 3, July 8, 2003.<br />

http://archivesic.ccsd.cnrs.fr/documents/archives0/00/00/05/94/sic_00000594_02/sic_00000594.html.

Rouissi, S. (2005). “Production de document numérique en mode semi-autonome<br />

au service de la territorialité.“ Colloque Les systèmes d’information élaborée. Ile<br />

Rousse, juin 2005.<br />

Rouissi, S. (2004). Intelligence et normalisation dans la production des documents<br />

numériques. Cas de la communauté universitaire. PhD Thesis, Bordeaux 3 University.<br />

Sampson, G. (1997). Sistemas de escritura. Análisis lingüístico. Barcelona: Gedisa.<br />

First published in 1985. Writing Systems. London: Hutchinson.<br />

Stulic, A. (2002). “Recherches sur le judéo-espagnol des Balkans: l’exemple de la<br />

revue séfarade ‘El amigo del puevlo’.“ (I, 1888, Belgrade). MS thesis, Bordeaux 3<br />

University.<br />

Schürmeyer, W. (1935). “Aufgaben und Methoden der Dokumentation.” Zentralblatt<br />

für Bibliothekswesen 52, 533-543. Reprinted in FRA 78, 385-397.<br />

Röscheisen, M., Mogensen, C. & Winograd, T. (1997). Shared Web Annotations As A<br />

Platform for Third-Party Value-Added Information Providers: Architecture, Protocols,<br />

and Usage Examples. Technical Report CSDTR/DLTR. http://dbpubs.stanford.edu:8091/diglib/pub/reports/commentor.html

Bush, V. (1945). “As We May Think.” The Atlantic Monthly. July 1945. http://www.theatlantic.com/doc/194507/bush.

Wynne, M. (2003). Writing a Corpus Cookbook.<br />

http://ahds.ac.uk/litlangling/linguistics/IRCS.htm.<br />



“W3-Corpora Project” (1996-1998). Online at http://www.essex.ac.uk/linguistics/clmt/w3c/corpus_ling/content/introduction.html.



Il ladino fra polinomia e standardizzazione:<br />

l’apporto della linguistica computazionale<br />

Evelyn Bortolotti, Sabrina Rasom 1<br />

Dolomite Ladin is a polynomic language: it is characterised by a rather large variety of<br />

local idioms that have been undergoing a process of normalisation and standardisation.

This process can be supported by the development of computer-based infrastructures<br />

and tools. The efforts of the major Ladin institutions and organisations have led to the<br />

creation of lexical and terminological databases, electronic dictionaries, concordancer<br />

tools for corpus analysis and, eventually, to the development of spell-checkers and

‘standard adapters/converters’.<br />

1. Introduzione<br />

Il ladino delle Dolomiti (Italia) è caratterizzato da una grande varietà interna, che<br />

ha reso necessario un intervento di normazione e standardizzazione, nel rispetto del<br />

carattere polinomico della lingua stessa.<br />

Nelle cinque valli ladine dolomitiche si vanno formando lingue di scrittura, o<br />

standard di valle. Alcuni idiomi di valle sono piuttosto unitari ed è stato sufficiente<br />

codificarli, ma in Val Badia (con Marebbe) e in Val di Fassa la loro varietà ha portato<br />

alla proposta di una normazione che si sovrapponesse alle sottovarianti di paese. Ad<br />

esempio il badiot unitar, basato principalmente sull’idioma centrale (San Martin),<br />

ma aperto anche a elementi provenienti da idiomi di altri paesi, e similmente<br />

il fascian standard, orientato verso l’idioma cazet, la cui scelta come variante<br />

standard è giustificata anche dal fatto che questo idioma è molto più vicino nelle sue<br />

caratteristiche linguistiche agli altri idiomi dolomitici.<br />

Infine si è sentito il bisogno di elaborare un livello ancora più alto di standardizzazione<br />

valido per l’intera Ladinia, sulle orme del Rumantsch Grischun, dando il via<br />

all’elaborazione del Ladin Dolomitan, o Ladin Standard (LS).<br />

Dal punto di vista della polinomia quindi, da una situazione linguistica molto<br />

differenziata, si è passati prima a un livello più alto di normazione che consente,<br />

prendendo la valle come unità di riferimento, di raccogliere più varietà in una norma<br />

unica. A seguire si è raggiunto un terzo livello che permette di avere a disposizione<br />

1 I paragrafi “Introduzione” e “Risorse e infrastrutture linguistiche e lessicali” sono stati scritti da Evelyn<br />

Bortolotti; il paragrafo “Correttori ortografici con adattamento morfologico” è stato scritto da Sabrina<br />

Rasom.<br />



un unico idioma di riferimento, una norma o lingua standard per tutte e cinque le<br />

vallate.<br />

La standardizzazione ha riguardato in un primo tempo la forma grafica: si è cercato<br />

di adottare una grafia il più possibile comune alle diverse varianti ladine dolomitiche,<br />

per garantire, nella diversità, il riconoscimento dell’appartenenza alla stessa famiglia<br />

linguistica e un maggiore grado di coesione e uniformità del sistema.<br />

L’utilizzo della lingua ladina scritta nelle scuole, nelle pubbliche amministrazioni,<br />

nella stampa ecc. comporta, a seconda del grado di standardizzazione del ladino<br />

utilizzato, un più o meno marcato sforzo di avvicinamento alla norma da parte<br />

dello scrivente e richiede una grande consapevolezza delle differenze fra la propria<br />

sottovarietà e lo standard utilizzato nello scrivere.<br />

Di fondamentale importanza in questo processo di standardizzazione è stato<br />

e continua a essere l’apporto della linguistica computazionale. La diffusione<br />

della tecnologia informatica permette infatti la creazione e lo sviluppo di risorse<br />

linguistiche e di infrastrutture di supporto al trattamento automatico della lingua,<br />

soprattutto nell’ambito della lessicografia moderna e tradizionale e della terminologia<br />

settoriale basate su corpora e della standardizzazione linguistica. Inoltre favorisce la<br />

realizzazione di strumenti di aiuto alla scrittura che facilitino il passaggio verso la<br />

norma standard.<br />

Nel caso del ladino delle valli dolomitiche, i vari progetti relativi all’informatizzazione<br />

delle risorse lessicali e allo sviluppo di strumenti per il trattamento automatico sono<br />

stati portati avanti attenendosi al principio di conservazione e valorizzazione della<br />

ricchezza e della varietà in una visione unitaria. Questo principio deriva dalla riflessione<br />

teorica del linguista còrso Jean-Baptiste Marcellesi, che per primo ha introdotto il<br />

concetto di “lingue polinomiche” (Langues Polynomiques) [Chiorboli 1990].<br />

Le principali istituzioni coinvolte nei progetti di modernizzazione e di trattamento<br />

automatico del ladino promossi o realizzati in collaborazione con l’Istitut Cultural<br />

Ladin “Majon di Fascegn” sono l’Union Generela di Ladins dles Dolomites e l’Istitut<br />

Ladin “Micurà de Rü”.<br />

I principali obiettivi perseguiti in campo linguistico computazionale sono:<br />

• l’informatizzazione del patrimonio lessicale ladino con la creazione di<br />

una banca dati generale lessicale ladina (BLAD), di banche dati strutturate delle<br />

varietà locali e di una banca dati centrale dello standard;<br />

• l’elaborazione di dizionari degli standard di valle (per il fassano standard:<br />

DILF “Dizionario Italiano – Ladino Fassano / Dizionèr talian-ladin fascian”, per<br />

il badiotto unitario: Giovanni Mischì, Wörterbuch Deutsch – Gadertalisch /



Vocabolar todësch – ladin (Val Badia); per il gardenese: Marco Forni, Wörterbuch<br />

Deutsch – Grödner-ladinisch / Vocabuler tudësch – ladin de Gherdëina) e del ladino<br />

standard (DLS “Dizionar dl Ladin Standard”) anche in versione elettronica e alcuni<br />

consultabili online;<br />

• la raccolta di glossari terminologici, parzialmente consultabili online<br />

(glossari di ambiente, botanica, materie giuridico-amministrative, medicina,<br />

architettura e costruzioni, pedagogia, musica e trasporto turistico);<br />

• la creazione di corpora elettronici analizzabili tramite un’apposita<br />

interfaccia: il web-concordancer;<br />

• la realizzazione di strumenti informatici per facilitare l’uso e<br />

l’apprendimento delle varianti standard: dizionario elettronico, e-learning,<br />

correttori ortografici e adattatori per il fassano standard e per il Ladin Standard.<br />

2. Risorse e infrastrutture linguistiche e lessicali<br />

2.1 BLAD: Banca dac Lessicala Ladina<br />

La banca dati BLAD consente l’accesso:<br />

• allo SPELL base, il database che raccoglie circa 15.000 schede con LS e idiomi<br />

di valle (lessico prevalentemente moderno), da cui è stato elaborato il DLS;<br />

• alle banche dati locali di lessico tradizionale, per un totale di circa 90.000<br />

schede, in cui sono confluiti i dati raccolti dai dizionari e dai database di<br />

lessico patrimoniale (per il fassano: Dell’Antonio 1972, Mazzel 1995, De Rossi<br />

1999; per il badiotto: Pizzinini-Plangg 1966; per il gardenese: Lardschneider-<br />

Ciampac 1933 e 1992, Martini 1953; per il fodom: Masarei in stampa; per<br />

l’ampezzano: Comitato 1997) (descrittivi);<br />

• alle banche dati dei dizionari moderni (normativi), per un totale di circa<br />

250.000 schede (DILF, Mischì, Forni);<br />

• alle banche dati terminologiche elaborate nell’ambito del progetto TERM-LeS,<br />

in cui sono raccolte circa 16.000 schede.<br />



Fig. 1: L’interfaccia di ricerca della banca dati BLAD: la ricerca può essere effettuata in<br />

italiano, tedesco, LS e negli idiomi di valle.<br />

Fig. 2: Esempio di scheda: dal pannello “Idioms”, in cui accanto al lemma in LS vengono<br />

riportate la traduzione italiana e tedesca e le forme corrispondenti negli idiomi di valle, si<br />

ha anche accesso alle singole banche dati locali.<br />

2.2 I dizionari normativi: le versioni elettroniche online del DILF e del DLS<br />

Il DILF e il DLS sono strumenti linguistici la cui accessibilità e semplicità d’uso<br />

consentono la facile consultazione di risorse lessicali di grande importanza per le<br />



persone che si trovano a dover scrivere in fassano standard o in ladino standard.<br />

Nel DILF (Dizionario Italiano – Ladino Fassano / Dizionèr talian-ladin fascian) il<br />

repertorio lessicale tradizionale registrato nei dizionari descrittivi è stato integrato<br />

con un’ampia selezione di voci moderne il cui uso è ampiamente documentato nella<br />

produzione linguistica contemporanea. Questa versione elettronica, corrispondente<br />

alla seconda edizione cartacea (2001), è stata realizzata con la collaborazione dell’ITC-<br />

IRST di Trento, avviata nell’ambito di progetti relativi al trattamento automatico e<br />

allo sviluppo di infrastrutture informatiche per il ladino (progetto “TALES”, iniziato<br />

nel 1999).<br />

Fig. 3: DILF online: esempio di ricerca dal ladino fassano all’italiano, con visualizzazione<br />

dei risultati.<br />



Fig 4: Esempio di scheda di lemma con traducenti ladini e fraseologia<br />

Accanto al DILF, è disponibile anche la versione online del Dizionar dl Ladin<br />

Standard (DLS). A differenza dei consueti dizionari bilingui, il DLS registra i lemmi in<br />

ladino standard con accanto i termini corrispondenti negli idiomi di valle, dai quali<br />

la forma standard è stata ricavata secondo un articolato complesso di criteri. Inoltre<br />

viene riportato il traducente sia in italiano che in tedesco, lingue di adstrato delle<br />

valli ladine dolomitiche.<br />

Fig. 5: Interfaccia di ricerca: la parola può essere ricercata in ognuna delle varianti<br />

registrate nel dizionario e nei traducenti italiani e tedeschi<br />



Fig. 6: Esempio di scheda di lemma LS con traduzione italiana<br />

e tedesca e forme locali corrispondenti<br />

2.3 Il progetto TERM-LeS: Standardizzazione lessicale e terminologia per le<br />

lingue ladina e sarda 2<br />

Il progetto, condotto negli anni 2001-2003, ha previsto l’elaborazione di terminologia<br />

moderna e la creazione di banche dati terminologiche in ladino standard nei settori in<br />

cui l’uso della lingua ladina è obbligatorio (amministrazione) e in altri settori rilevanti<br />

per la realtà territoriale (architettura e costruzioni, ambiente, medicina, botanica,<br />

musica, pedagogia, trasporto turistico). Alcuni di questi glossari sono stati realizzati nel<br />

quadro del progetto Linmiter, promosso dalla Direzione Terminologia e Industrie della<br />

Lingua (DTIL) dell’Unione Latina, in coordinamento con altre minoranze linguistiche<br />

europee neolatine.<br />

2 Il ladino e il sardo sono le lingue oggetto dello stesso progetto di standardizzazione terminologica e<br />

lessicale finanziato dalla Comunità Europea tra il 2001 e il 2003.<br />



Fig. 7: Esempio di scheda terminologica: l’interfaccia di lavoro permette una visione<br />

sinottica sullo standard e sulle varianti. Da essa è inoltre possibile accedere direttamente<br />

alle banche dati degli idiomi di valle, ai dizionari moderni e ai corpora testuali.<br />

Tanto nell’elaborazione lessicografica quanto in quella terminologica in ladino<br />

standard, la polinomia, la varietà e la diversità degli idiomi ladini, è la base di<br />

partenza per la standardizzazione; la lingua standard attinge quindi dalle varianti<br />

locali riassumendole in una norma comune, mirando nel contempo a essere non<br />

uno strumento per soffocare le differenze, ma al contrario un tetto, un ombrello di<br />

protezione contro gli influssi e le interferenze esterne, e un punto di collegamento fra<br />

i diversi idiomi per permetterne uno sviluppo parallelo e armonico. La struttura delle<br />

banche dati e l’interfaccia di lavoro tengono quindi conto dell’esigenza di avere facile<br />

e immediato accesso a tutte le risorse linguistiche utili e necessarie.<br />



2.4 I corpora elettronici<br />

Nell’ambito del progetto TALES sul trattamento automatico della lingua ladina<br />

sono state create delle raccolte organiche di testi ladini, sia nel ladino standard che<br />

nei singoli idiomi. I corpora raccolti (fassano, gardenese, badiotto e ampezzano)<br />

contengono complessivamente circa 6.500.000 parole. I testi selezionati coprono<br />

un periodo che va dal XIX secolo fino ai giorni nostri, con preponderanza di testi<br />

appartenenti alla seconda metà del XX secolo. Per garantire un certo equilibrio fra i<br />

vari generi, sono stati inseriti sia testi letterari (prosa, poesia, teatro, memorialistica,<br />

testi sul folclore e le tradizioni, libri di preghiere), sia testi non letterari (testi giuridici<br />

e amministrativi, modulistica, testi di informazione giornalistica e pragmatici, testi<br />

di divulgazione scientifica e culturale, testi scolastici). Attualmente il corpus fassano<br />

è quello nella fase più avanzata di elaborazione. La sua strutturazione, che fornisce<br />

per ogni testo informazioni rilevanti (data, luogo di provenienza, tipologia testuale,<br />

autore), permette di affinare la ricerca secondo una serie di criteri predeterminati.<br />

I corpora sono consultabili tramite il concordancer, uno strumento elaborato ad<br />

hoc e rivolto anzitutto al linguista e allo studioso del ladino: esso permette l’analisi<br />

dei testi attraverso la ricerca di concordanze, collocazioni e frequenze secondo la<br />

modalità KWIC (Keyword In Context), ossia un sistema che permette di visualizzare la<br />

parola oggetto della ricerca con il suo contesto a corredo.<br />
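A titolo puramente illustrativo (non si tratta del software descritto, ma di uno schizzo minimo del principio KWIC), una ricerca di questo tipo può essere abbozzata in poche righe di Python: la parola cercata viene mostrata con un numero di parole di contesto deciso dall'utente.

import re

def kwic(testo: str, parola: str, contesto: int = 4):
    """Restituisce le occorrenze di `parola` con `contesto` parole a sinistra e a destra."""
    tokens = re.findall(r"\w+", testo, flags=re.UNICODE)
    righe = []
    for i, tok in enumerate(tokens):
        if tok.lower() == parola.lower():
            sinistra = " ".join(tokens[max(0, i - contesto):i])
            destra = " ".join(tokens[i + 1:i + 1 + contesto])
            righe.append(f"{sinistra:>40} [{tok}] {destra}")
    return righe

# esempio d'uso su una frase inventata
for riga in kwic("La parola ladina e la parola italiana stanno in contesti diversi", "parola", contesto=3):
    print(riga)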

Una sezione del concordancer è dedicata ai corpora amministrativi bi- e trilingui<br />

allineati: questa raccolta è di particolare utilità nel lavoro di realizzazione di glossari<br />

settoriali.<br />

Il lavoro preliminare per lo sviluppo dello strumento di analisi di corpora è<br />

consistito nella creazione di corpora testuali: i testi selezionati sono stati acquisiti<br />

elettronicamente oppure manualmente e sono stati elaborati rispettando precisi<br />

criteri di archiviazione. In seguito sono stati classificati in base alla loro appartenenza<br />

diatopica (individuazione della variante in cui sono scritti) e diacronica (dalle prime<br />

testimonianze scritte in ladino sino ai testi contemporanei) e alla tipologia testuale<br />

(testi letterari e non letterari con individuazione del genere specifico). Per ogni testo<br />

è stato creato un frontespizio elettronico che riassume tutte queste informazioni:<br />

periodo, autore, genere, nome del file, titolo originale, numero di parole, variante.<br />

Il frontespizio è stato linkato al testo corrispondente, cosicché le informazioni in esso<br />

contenute possano essere utilizzate per circoscrivere la ricerca.<br />

I corpora consultabili attraverso il concordancer si rivelano una risorsa di<br />

fondamentale importanza per diversi campi di applicazione: per lo studio del lessico,<br />

della sintassi e della morfologia, per l’elaborazione di strumenti normativi e didattici,<br />



per le operazioni di corpus planning, per i progetti relativi alla standardizzazione<br />

della lingua e per l’elaborazione di banche dati lessicografiche e di terminologia<br />

multilingue.<br />

Fig. 8: Esempio di ricerca nel concordancer: la parola cercata viene visualizzata in un<br />

breve contesto e in rosso per essere facilmente riconosciuta. Anche la parola che la<br />

precede o segue può essere evidenziata in un colore diverso. L’interfaccia di ricerca<br />

permette all’utente di decidere quante parole devono apparire nel contesto.<br />

3. Correttori ortografici con adattamento morfologico<br />

Nell’ambito del progetto SPELL-TALES, nell’anno 2002, l’Istituto Culturale Ladino<br />

ha realizzato il correttore ortografico del ladino fassano in collaborazione con l’ITC-<br />

IRST di Trento e col sostegno finanziario dell’Unione Europea, del Comprensorio Ladino<br />

di Fassa C11 e della Regione Trentino Alto-Adige. L’Istituto Culturale Ladino ha curato<br />

la parte linguistica del progetto riguardante la creazione delle regole morfologiche,<br />

mentre la parte informatica è stata seguita dall’ITC, nella persona del dott. Claudio<br />

Giuliano, che ha elaborato e applicato il programma di generazione delle forme. La<br />

realizzazione del software è poi stata affidata alla ditta Expert System di Modena.<br />

Nel corso del 2003 è stato messo a punto anche il correttore ortografico del ladino<br />

standard – SPELL-checker –, elaborato con le stesse modalità del correttore fassano.<br />

I due software di correzione sono realizzati in ambiente Windows e Macintosh<br />

per tutti gli applicativi Consumer della suite Microsoft Office e sono corredati di<br />



installazione automatica e di guida e assistenza all’installazione. Le funzionalità<br />

previste da questi due strumenti, similmente ai correttori ortografici disponibili per<br />

le lingue maggioritarie, prevedono la correzione di errori di digitazione, di ortografia<br />

e di morfologia direttamente durante la redazione di un testo, oppure in un secondo<br />

momento, sottoponendo a verifica un testo già scritto. I correttori ortografici in<br />

questione si basano su forme ricavate dai dizionari di riferimento, rispettivamente<br />

il DILF per il fassano standard e il DLS per il ladino standard; il formario di base<br />

fassano è poi stato implementato con forme ottenute dallo spoglio di alcuni testi<br />

amministrativi e giornalistici (Usc di Ladins) esportati tramite il concordancer e con i<br />

dizionari personalizzati realizzati dagli utenti che hanno usato il correttore per circa<br />

un anno nell’ambito dell’amministrazione. Per quanto riguarda il formario del ladino<br />

standard l’implementazione è avvenuta attraverso il dizionario personalizzato creato<br />

dai redattori del sito Noeles.net e da export delle banche terminologiche.<br />

L’Istituto Culturale Ladino “Majon di Fascegn”, in collaborazione con l’Istituto<br />

Ladino “Micurà de Rü” e con il supporto tecnico-informatico della Ditta Open Lab di<br />

Firenze, sta ora lavorando a una seconda generazione di correttori ortografici delle<br />

varietà dolomitiche (badiotto, fassano, gardenese) e del ladino standard, non più<br />

ancorata alla Suite Office di Microsoft. Si tratta di una scelta all’avanguardia che<br />

prevede la realizzazione di software e sistemi aperti (open source) disponibili in rete<br />

e non più dipendenti da programmi specifici.<br />

Il motore alla base dei correttori delle diverse varianti sarà uno e il sistema totalmente<br />

internazionalizzato: l’interfaccia d’uso a lingua multipla permetterà di scegliere la<br />

lingua stessa di interfaccia e la lingua di correzione all’atto della configurazione. Le<br />

novità pratiche più importanti di questi strumenti stanno in un’accurata ricerca delle<br />

corrispondenze interne al formario, che non si presenterà più come una semplice<br />

lista di forme non ancorate fra loro, bensì avrà una sua coerenza interna, riconoscerà<br />

la categoria grammaticale a cui appartiene ogni forma, la rispettiva forma base<br />

di riferimento, la coniugazione o declinazione e la marca d’uso, per poi suggerire<br />

l’eventuale forma corretta. Inoltre, nel processo di sofisticazione delle opzioni di<br />

correzione che verranno fornite, le varietà ladine inserite nel correttore saranno<br />

corredate da uno specifico algoritmo fonetico – soundslike – che non sarà più quello

Metaphone classico dell’inglese (usato fra l’altro dalla maggior parte dei correttori<br />

ortografici esistenti), ma verrà elaborato sui soundslike specifici delle varietà in<br />

questione, permettendo quindi opzioni di correzione più precise.<br />



Fig. 9: Interfaccia del nuovo correttore ortografico open source<br />

accessibile direttamente da internet.<br />

Nel progetto di elaborazione di questa nuova tipologia di strumenti di correzione<br />

l’Istituto Culturale Ladino “Majon di Fascegn” sta sperimentando un’ulteriore<br />

funzione nell’ambito del correttore ortografico open source per l’assistenza a chi<br />

scrive in ladino fassano e in ladino standard. Si tratta di una funzione di adattamento<br />

morfologico che permetterà di passare “automaticamente” dalla variante locale<br />

fassana (cazet, brach, moenat) alla variante fassana standard, oppure dalle varietà<br />

standard di valle (fassano standard, badiotto unificato e gardenese) al ladino standard<br />

durante la digitazione di un testo.<br />

I nuovi strumenti di correzione si rendono quanto mai utili nel momento in cui<br />

una lingua polinomica viene riconosciuta come lingua ufficiale e si ritrova quindi a<br />

dover far fronte alle esigenze della comunicazione in ambito pubblico-amministrativo<br />

e nella scuola. Come è stato già osservato, l’apporto della linguistica computazionale<br />

nel processo di standardizzazione si è rivelato di primaria importanza per facilitare il<br />

passaggio dalla sottovarietà dello scrivente (impiegato, insegnante, studente o semplice<br />

appassionato) a una lingua standard ufficiale e unificata. I correttori ortografici sono<br />

quindi un passaggio fondamentale verso la realizzazione di strumenti ausiliari sempre<br />

più sofisticati per coloro che lavorano ogni giorno con la lingua ladina.<br />




Bibliografia<br />

Bortolotti, E. & Rasom, S. (2003). “Linguistic Resources and Infrastructures for the<br />

Automatic Treatment of Ladin Language.” Proceedings of TALN 2003. RECITAL 2003.<br />

Tome 2. Batz-sur-Mer, 253-263.<br />

Chiocchetti, N. & Iori, V. (2002). Gramatica del Ladin Fascian. Vigo di Fassa: Istitut<br />

Cultural Ladin “majon di fascegn”.<br />

Chiorboli, J. (ed.) (1991). Corti 90: actes du Colloque international des langues<br />

polynomique. PULA n° 3/4, Université de Corse.<br />

Comitato del Vocabolario delle Regole d’Ampezzo (1997). Vocabolario Italiano -<br />

Ampezzano. Cortina d’Ampezzo: Regole d’Ampezzo e Cassa Rurale ed Artigiana di<br />

Cortina d’Ampezzo e delle Dolomiti.<br />

Dell’Antonio, G. (1972). Vocabolario ladino moenese – italiano. Trento: Grop Ladin da<br />

Moena.<br />

De Rossi, H. (1999). Ladinisches Wörterbuch: vocabolario ladino (brach)-tedesco. A<br />

cura di Kindl, U. e Chiocchetti, F. Vigo di Fassa: Istitut Cultural Ladin “majon di

fascegn”/Universität Innsbruck.<br />

Forni, M. (2002). Wörterbuch Deutsch - Grödner-Ladinisch. Vocabuler Tudësch – Ladin<br />

de Gherdëina. San Martin de Tor: Istitut Ladin “Micurà de Rü”.<br />

Giuliano, C. (2002). “A tool box for lexicographers.” Proceedings of EURALEX 2002.<br />

Copenhagen: Center for Sprogteknologi (CST), 113-118.<br />

Istitut Cultural Ladin “majon di fascegn”/SPELL (2001). DILF: Dizionario italiano -<br />

ladino fassano con indice ladino - italiano = Dizioner talian-ladin fascian con indesc


ladin-talian. Dizionèr talian – ladin fascian. II ed., 1. rist. Vigo di Fassa: Istitut Cultural<br />

Ladin “majon di fascegn”/SPELL.<br />

Lardschneider-Ciampac, A. (1933). Wörterbuch der Grödner Mundart. (Schlern-<br />

Schriften ; 23). Innsbruck: Wagner.<br />

Lardschneider-Ciampac, A. (1992). Vocabulér dl ladin de Gherdëina: Gherdëina -<br />

Tudësch. Übera. von Mussner, M. & Craffonara, L. San Martin de Tor: Istitut Ladin<br />

“Micurà de Rü”.<br />

Martini, G.S. (1953). Vocabolarietto gardenese – Italiano. Firenze: Francolini.<br />

Masarei, S. (2005). Dizionar Fodom – Talián - Todésch. Colle Santa Lucia: Istitut Cultural<br />

Ladin “Cesa de Jan” - SPELL.<br />

Mazzel, M. (1995). Dizionario Ladino fassano (cazet) – Italiano: con indice italiano-

ladino. 5. ed. riv. e aggiornata (prima ed. 1976). Vigo di Fassa: Istitut Cultural Ladin<br />

“majon di fascegn”.<br />

Mischì, G. (2000). Wörterbuch Deutsch - Gadertalisch. Vocabolar Todësch – Ladin (Val<br />

Badia). San Martin de Tor: Istitut Ladin “Micurà de Rü”.<br />

Pizzinini, A. & Plangg, G. (1966). Parores Ladines. Vocabulare badiot – tudësk. ergänzt<br />

und überarbeitet von G. Plangg. Innsbruck: L.F. Universität Innsbruck.<br />

SPELL (2001). Gramatica dl Ladin Standard. Urtijëi: SPELL.<br />

SPELL (2002). DLS - Dizionar dl Ladin Standard. Urtijëi: SPELL.<br />

294


Schmid, H. (2000). Criteri per la formazione di una lingua scritta comune della ladinia<br />

dolomitica. San Martin de Tor/Vich: Istitut Ladin “Micurà de Rü”/Istitut Cultural Ladin<br />

“majon di fascegn”.<br />

Valentini, E. (2002). Ladin Standard. N lingaz scrit unitar per i ladins dles Dolomites.<br />

Urtijëi: SPELL.<br />

Videsott, P. (1997). “Der Wortschatz des Ladin Dolomitan: Probleme der<br />

Standardisierung.” Iliescu, Maria (Hrsg.) et al.: Ladinia et Romania. Festschrift für<br />

Guntram Plangg zum 65. Geburtstag. Vich/Vigo di Fassa: ICL 149-163. [Mondo Ladino,<br />

21].<br />



Il progetto “Zimbarbort” per il recupero del<br />

patrimonio linguistico cimbro<br />


Luca Panieri<br />

Some time ago, people living in the mountain territory between the rivers Adige and<br />

Brenta in northern Italy spoke a Germanic language usually known as ‘Cimbro.’ This<br />

language was brought into northern Italy by Bavarian colonists in the Middle Ages.<br />

Surrounded by Italian speakers, and isolated from the rest of the German-speaking<br />

world, Cimbro developed as an autonomous language, preserving many of its original<br />

old German features, but becoming strongly influenced by Italian lexis and syntax as<br />

well.<br />

Since this language is nowadays commonly spoken only in Luserna (a village south<br />

of Trento), the local township has set up a project (presented in this paper) for the<br />

creation of a database of Cimbro lexis.<br />

The main purpose of the project is to create a virtual memory of the Cimbrian language,<br />

where all known records of the Cimbrian language tradition can be stored. The first<br />

written records in Cimbro date back to around 1600, so the aim of the project is to<br />

give back to the Cimbrian language tradition its forgotten historical roots. We are sure<br />

that by looking into the deep historical layers of the language tradition, we will help<br />

the surviving Cimbrian community of Luserna to face the present.<br />

Premessa<br />

Con questo breve contributo si illustrano le linee guida di un progetto strategico<br />

finalizzato al recupero del patrimonio lessicale della tradizione linguistica cimbra.<br />

Tale progetto ha ottenuto l’approvazione del Comune di Luserna (l’isola linguistica<br />

cimbra più consistente), che ha erogato per l’anno in corso un primo finanziamento. Lo<br />

scrivente, membro del Comitato Scientifico dell’Istituto di Cultura Cimbra di Luserna<br />

è stato da esso designato Coordinatore del progetto.<br />

L’Istituto di Cultura Cimbra, mediante la presentazione del progetto al Convegno<br />

Eurac “Lesser Used Languages and Computer” ha inteso soprattutto mettere a<br />

conoscenza gli esperti di linguistica computazionale dell’esistenza di tale iniziativa,<br />

illustrandone i contenuti, le finalità e la sua struttura operativa, allo scopo di sollecitare<br />

eventuali proposte sulle modalità tecniche della sua realizzazione. In tal senso, grazie<br />

all’occasione d’incontro con gli specialisti fornita dal Convegno di Bolzano, i promotori


del progetto sono effettivamente riusciti a suscitare vivo interesse e concrete proposte<br />

di collaborazione per la realizzazione della banca dati lessicale.<br />

Si deve quindi premettere che il presente contributo non è che la trasposizione<br />

scritta della presentazione del progetto, inteso nei termini suddetti. Non si tratta<br />

quindi di un articolo specialistico di contenuto teorico o sperimentale, bensì della<br />

descrizione dell’iniziativa concreta che l’Istituto Cimbro intende promuovere per la<br />

salvaguardia del patrimonio lessicale della propria tradizione linguistica. Abbiamo<br />

demandato agli specialisti d’informatica il compito di indicarci le soluzioni tecnologiche<br />

più opportune alla sua realizzazione e gestione.<br />

Quanto detto sul carattere di questo contributo spiega anche la mancanza quasi<br />

totale di riferimenti bibliografici, che sono tuttavia presenti in misura modesta<br />

nella sola introduzione, essendo essa finalizzata a portare a conoscenza del lettore<br />

la particolare realtà linguistica cimbra. Il resto della trattazione, invece, come già<br />

evidenziato, consiste nella semplice esposizione delle linee guida del progetto.<br />

1. Introduzione<br />

L’idea di questo progetto nasce dalla consapevolezza della situazione precaria in<br />

cui versano le tre isole linguistiche cimbre sopravvissute nei secoli fino ai giorni nostri:<br />

Giazza (VR), Roana-Mezzaselva (VI) e Luserna (TN). In particolare, la condizione<br />

relativamente rosea in cui fortunatamente ancora si trova la varietà cimbra di<br />

Luserna impone l’attuazione di ogni possibile strategia di difesa e consolidamento<br />

del patrimonio linguistico cimbro, essendo diventata Luserna l’ultima roccaforte di<br />

un gruppo etnico un tempo disseminato in tutto il territorio prealpino tra l’Adige<br />

e il Brenta. 1 Tale tradizione fu un tempo capace di trovare originale espressione<br />

letteraria e politico-amministrativa, in particolar modo sull’Altopiano d’Asiago, dove<br />

la Reggenza dei Sette Comuni riuscì a conservare la propria autonomia di governo<br />

locale per molti secoli, sopravvivendo all’avvicendarsi delle potenti signorie dell’Italia<br />

settentrionale e mantenendo una propria fisionomia linguistica e culturale anche nei<br />

confronti del vasto mondo di lingua tedesca, tanto geograficamente vicino. 2<br />

Ai nostri giorni, quando ormai l’area linguistica cimbra si è drasticamente ridotta,<br />

soppiantata quasi ovunque dal dialetto veneto o dalla lingua italiana, ed è rimasta vitale<br />

soltanto a Luserna, insorge la necessità di evitare che il patrimonio lessicale espresso<br />

dalla civiltà cimbra nel corso dei secoli cada per sempre nell’oblìo. Non consideriamo<br />

1 Tra i vari testi consultabili sulla questione dell’origine degli insediamenti “cimbri” e sulla loro lingua<br />

rimane tuttora fondamentale lo studio del grande dialettologo bavarese Johann Andreas Schmeller<br />

(1985).<br />

2 Per una sintesi efficace sulla storia istituzionale della comunità cimbra dei Sette Comuni dell’Altopiano<br />

d’Asiago, basata sulla documentazione, si veda anche Antonio Broglio (2000).<br />



ciò solamente un’operazione dettata dal rispetto per la memoria storica di una<br />

civiltà, ma soprattutto un intervento preventivo di rilevante importanza strategica<br />

e finalizzato a salvaguardare la tradizione linguistica cimbra. Oggigiorno infatti la<br />

comunità di Luserna si trova in una situazione di bilinguismo nettamente sbilanciato,<br />

in cui la lingua italiana predomina come mezzo di comunicazione atto a esprimere il<br />

panorama concettuale astratto della cultura moderna, mentre il cimbro è soprattutto<br />

la lingua materna della sfera affettiva, quella che esprime con genuina spontaneità<br />

i moti dell’animo, il sentimento di appartenenza alla comunità e al suo territorio<br />

naturale. Per quanto questa ripartizione complementare dell’uso delle due lingue<br />

possa apparire accettabile, se non addirittura comoda, essa pone il cimbro in posizione<br />

debole nei confronti dell’italiano. I continui stimoli e cambiamenti socio-economici<br />

e culturali del mondo moderno e il loro influsso globalizzante scardinano la coesione<br />

tradizionale delle “piccole patrie” di un tempo e ne catapultano gli appartenenti in un<br />

contesto socio-culturale del tutto diverso e di più ampie dimensioni, il cui baricentro<br />

è al di fuori della stessa comunità che ne subisce l’influenza. Questo mondo si<br />

esprime soprattutto mediante le lingue nazionali della scolarizzazione di massa, come<br />

appunto l’italiano o il tedesco. La lingua cimbra rimane quindi legata e, purtroppo,<br />

confinata all’ambito delle relazioni socio-economiche e dei valori tradizionali della<br />

piccola comunità di un tempo. Ma con gli inevitabili e troppo repentini mutamenti di<br />

prospettiva dovuti alla modernizzazione, la lingua connaturata alla tradizione locale<br />

cede il passo a quella delle relazioni esterne, della cultura tecnologica, scientifica e<br />

amministrativa, sempre più preponderanti.<br />

La sopravvivenza della tradizione linguistica cimbra dipende quindi dalla sua<br />

capacità di rinnovarsi ed espandere il proprio dominio espressivo agli ambiti concettuali<br />

tipici della cultura moderna.<br />

2. Motivazioni strategiche e obiettivi<br />

In considerazione di quanto sopra si è evidenziato, riteniamo necessario intervenire<br />

a tutela della lingua cimbra con un’operazione di consolidamento delle fondamenta<br />

storiche della stessa tradizione linguistica, mediante la realizzazione di una banca<br />

dati globale del patrimonio lessicale cimbro. In essa dovranno confluire i dati lessicali<br />

estrapolati da tutte le fonti scritte disponibili, a partire dalle prime attestazioni<br />

storiche di testi letterari quali il Catechismo cimbro del ‘600 fino ad arrivare alla<br />

lingua cimbra di oggi. L’idea di fondo è quella di creare una sorta di luogo virtuale della<br />

memoria linguistica collettiva della civiltà cimbra, che accolga il maggior numero di<br />

lemmi possibile, derivanti da tutte le varietà storiche del cimbro, oggi rappresentate<br />

dalle tre note isole linguistiche di Giazza, Roana-Mezzaselva e Luserna.<br />



Oltre all’indubbio valore storico-documentario, tale operazione, sul piano<br />

strategico, consente di fornire alla lingua cimbra ancora in uso degli utili strumenti<br />

lessicologici per far fronte alla minaccia contingente di progressiva erosione del<br />

vocabolario originario. S’intende con ciò favorire il recupero delle risorse espressive<br />

della tradizione linguistica cimbra nel suo complesso, vedendo in essa il più valido<br />

punto di riferimento per consolidare la lingua di Luserna. Anche in relazione alla<br />

questione attuale della necessità di elaborare un lessico cimbro capace di esprimersi<br />

oltre l’ambito familiare e tradizionale, la sperimentazione di neologismi deve in prima<br />

istanza fare riferimento alla propria tradizione linguistica, sia pure intesa in senso<br />

lato, ancor prima che si faccia ricorso al modello italiano o tedesco. Entrambi sono da<br />

adottare solo se è accertata la mancanza di risorse linguistiche interne.<br />

A tal riguardo si obietterà che attingere dal lessico storico cimbro per supplire<br />

alle deficienze semantiche della parlata attuale negli ambiti concettuali più astratti<br />

dell’espressione linguistica moderna potrebbe sembrare paradossale: come trovare<br />

nell’inventario lessicale del passato soluzioni adeguate a esprimere concetti che in<br />

molti casi non erano stati ancora immaginati da nessuno? Ad esempio, nell’ambito<br />

della tecnologia o in certi nuovi campi del sapere scientifico? Ovviamente non ci<br />

aspetteremo di “ritrovare” nel lessico storico cimbro la parola esatta per ‘computer’<br />

o per ‘ecologia’, ma sicuramente non sarà difficile rendere tali concetti partendo dalle<br />

radici lessicali che per approssimazione semantica e/o per analogia strutturale meglio<br />

si prestano a descriverne il valore. Rimanendo negli esempi citati, considereremo<br />

il computer un ‘calcolatore’, perché tale è la sua funzione preminente, tale la sua<br />

prima denominazione italiana e tale il significato letterale del termine inglese preso<br />

in prestito. Si potrà proporre quindi di designarlo con un termine cimbro derivato<br />

dalla radice verbale tradizionale che indica il concetto di ‘calcolare’. Per quanto<br />

riguarda il concetto di ‘ecologia’ occorrerà partire dalla sua possibile trasposizione in<br />

parole di uso comune che rendano chiaro il concetto, come ‘scienza dell’ambiente’,<br />

‘scienza della natura’. A questo punto avremo riportato il termine “moderno”<br />

negli ambiti concettuali già noti alla tradizione linguistica cimbra di ‘sapere’ e<br />

‘natura’. Ovviamente non si tratterà di imporre con l’autorità le soluzioni teoriche<br />

che si andranno proponendo, esse infatti si potranno realmente affermare nell’uso<br />

quotidiano solo se la comunità linguistica le avvertirà come utili alla comunicazione<br />

spontanea e in armonia con la percezione che ogni parlante nativo ha delle proprie<br />

radici linguistiche.<br />

Certamente, rispetto alle più vaste comunità linguistiche nazionali, quella cimbra<br />

di Luserna, a fronte di tanti svantaggi, presenta almeno il vantaggio di una maggiore<br />

coesione tra le istituzioni e la cittadinanza. Di per sé ciò favorisce l’affermazione di<br />



ogni iniziativa intrapresa dalle istituzioni locali, nelle quali il cittadino si rispecchia<br />

direttamente, in un clima di compartecipazione costruttiva. Ciò quindi gioca a favore<br />

anche degli interventi mirati di politica linguistica patrocinati dalle locali istituzioni.<br />

3. Struttura operativa<br />

La realizzazione della banca dati globale del lessico cimbro (progetto Zimbarbort)<br />

si articola essenzialmente nella fase di raccolta delle fonti primarie in lingua cimbra<br />

e nella fase di estrapolazione e inserimento dei singoli dati lessicali nel supporto<br />

informatico della banca dati stessa.<br />

3.1 Raccolta delle fonti<br />

In questa fase si procede al reperimento di ogni tipo di testimonianza linguistica<br />

del cimbro. Pur essendo questa fase logicamente preliminare rispetto a quella<br />

dell’estrapolazione e dell’inserimento dei dati nella banca virtuale, essa sarà<br />

destinata a protrarsi nel tempo fino all’esaurimento delle attestazioni storiche sulla<br />

lingua cimbra e continuerà seguendo a mano a mano gli eventuali sviluppi linguistici<br />

che si producono nel momento attuale. Poiché tale fase costituisce il momento di<br />

acquisizione alla “memoria virtuale collettiva” di ogni espressione lessicale integrata<br />

nella tradizione linguistica cimbra, essa sarà destinata ad arricchirsi progressivamente<br />

di ogni futuro neologismo che eventualmente si affermi nell’uso comune.<br />

A prescindere dall’epoca a cui risalgono, le attestazioni della lingua cimbra si<br />

possono ripartire in due categorie, distinte dal diverso supporto in cui sono state<br />

registrate e tramandate ai giorni nostri:<br />

• Fonti scritte<br />

In quest’ambito rientra la moltitudine di attestazioni scritte in cimbro (interamente<br />

o parzialmente) nell’intero corso della storia, fino al tempo presente. Si tratta di<br />

testi scritti di tipologia e di epoca varia, che comprendono opere letterarie, quali<br />

poesie, racconti popolari o testi liturgici, scritti ad uso privato, quali le epistole, e<br />

opere finalizzate allo studio della lingua cimbra, quali grammatiche, glossari, studi<br />

toponomastici, ecc.<br />

Ai fini del presente progetto si tratterà di individuare e raccogliere tutte le fonti<br />

scritte di cui si ha conoscenza per radunarle fisicamente in originale o almeno in<br />

copia fedele e inventariarle in modo ragionato, onde agevolarne la consultazione.<br />

Tra i criteri di catalogazione figureranno sicuramente il genere (poesia, grammatica,<br />

racconto popolare, epistola, ecc.) e il periodo storico.<br />

• Fonti orali<br />



Questo tipo di attestazioni comprendono tutte le registrazioni della voce viva dei<br />

parlanti nativi. Tali fonti sono della massima importanza per lo studio della fonologia<br />

e di tutti i fenomeni caratteristici del linguaggio parlato.<br />

La disponibilità di questo genere di attestazioni si deve al progresso tecnologico<br />

avvenuto negli ultimi cento anni, in cui è andata progressivamente migliorando la<br />

qualità della riproduzione della voce viva, così come sono cambiati e si sono moltiplicati<br />

i supporti di registrazione (supporto radiofonico, magnetico, digitale, ecc.).<br />

Anche in questo caso si tratterà di fare una ricognizione del materiale registrato<br />

esistente e di raccoglierlo in originale o in copia fedele. Esso sarà poi opportunamente<br />

inventariato con criteri che ne favoriscano la consultazione. In questo caso la tipologia<br />

delle attestazioni è però molto più omogenea, sia per epoca (durante l’ultimo secolo<br />

di storia) che per genere (per lo più interviste).<br />

3.2 Estrapolazione e inserimento dei dati nella banca dati<br />

Questa fase può avere inizio dal momento in cui un primo contingente di<br />

attestazioni, scritte e/o orali, sia stato raccolto e inventariato; a seguire la fase di<br />

raccolta e quella di estrapolazione e inserimento dei dati potranno proseguire anche<br />

in contemporanea.<br />

Prima di dare avvio a questa fase è però indispensabile aver stabilito il formato in cui<br />

ogni dato sarà inserito nella banca dati virtuale. Con un termine tecnico chiameremo<br />

record ogni dato lessicale inserito con il suo corredo informativo (fonte di provenienza,<br />

significato in italiano, note grammaticali, fraseologiche, area semantica, riferimenti<br />

incrociati, ecc.).<br />

• Scelta del formato e della struttura del record<br />

Si dovrà porre particolare attenzione alla definizione preliminare dei parametri del<br />

corredo informativo che accompagnerà ogni dato inserito, poiché la scelta influenzerà<br />

la struttura globale della banca dati.<br />

In linea di principio occorre tener presente il maggior numero possibile di<br />

informazioni attribuibile a un elemento lessicale. Dato che la rubricazione all’interno<br />

di ogni record assumerà, nel contesto informatico, la veste di ‘campi’, converrà<br />

attribuire a ogni categoria concettuale potenzialmente rilevante ai fini informativi<br />

un proprio campo. Il record esemplare sarà quello in cui tutti i campi informativi<br />

verranno compilati, ben sapendo che in numerosi casi non saranno disponibili tutti<br />

i dati. Se infatti, ad esempio, tra i parametri informativi accludiamo la trascrizione<br />

fonetica del dato lessicale, il campo destinato a questo parametro rimarrà certamente<br />

vuoto per tutte le voci del lessico cimbro risalenti a periodi storici molto antichi, non<br />



essendo possibile stabilire con sufficiente sicurezza l’esatta pronuncia della lingua<br />

dell’epoca.<br />
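A puro titolo esemplificativo, e fermo restando che la definizione dei campi spetta alla fase di progettazione, la struttura di un record potrebbe essere abbozzata così (i nomi dei campi sono ipotetici e riprendono i parametri citati sopra):

// Richiede: using System.Collections.Generic;

// Bozza ipotetica di record lessicale con il corredo informativo citato nel testo;
// i campi non compilabili (ad es. per le fonti antiche) restano semplicemente vuoti.
public class RecordLessicale
{
    public string Lemma;                 // forma lessicale estrapolata dalla fonte
    public string FonteDiProvenienza;    // testo scritto o registrazione orale
    public string SignificatoItaliano;
    public string NoteGrammaticali;      // categoria, genere, flessione (se ricavabili)
    public string NoteFraseologiche;
    public string AreaSemantica;
    public string TrascrizioneFonetica;  // vuota per i periodi storici più antichi
    public List<string> RiferimentiIncrociati = new List<string>();
}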

• Estrapolazione dei dati lessicali<br />

L’operazione di acquisizione dei dati lessicali sarà più o meno complessa, a seconda<br />

della natura delle fonti esaminate. Ciò si rifletterà sul grado d’impegno lavorativo e<br />

sulle diverse competenze richieste allo svolgimento del compito.<br />

Il caso più semplice è quello dello spoglio di un glossario, presentando già la fonte<br />

scritta di partenza i dati lessicali in forma di voci di entrata, con relativa traduzione<br />

e commento informativo. In questo caso l’inserimento dei dati lessicali nella banca<br />

dati può avvenire pressoché contemporaneamente alla loro estrapolazione dal testo<br />

in cui sono stati reperiti. Inoltre, sarà il testo stesso a fornirci importanti informazioni<br />

grammaticali e sul significato del lemma.<br />

Ben più complessa sarà invece l’estrapolazione di dati lessicali derivanti da fonti<br />

orali registrate. Qui l’operazione sarà particolarmente difficile nel caso di registrazioni<br />

di qualità scadente e/o di provenienza dialettale diversa da Luserna. In questo caso<br />

il gruppo di lavoro dovrà cimentarsi nella comprensione di varianti del cimbro ormai<br />

vicine all’estinzione ed essere in grado di individuare, dal contesto di un discorso<br />

parlato, i singoli costituenti lessicali riconoscendone la loro reciproca relazione<br />

grammaticale. Gli operatori dovranno poi trasporre la propria interpretazione dei dati<br />

lessicali in forma scritta, operando una scelta ragionata sulla loro rappresentazione<br />

grafica, e da qui procedere al loro inserimento nella banca dati.<br />

• Inserimento dei dati lessicali nella banca dati<br />

L’operazione d’inserimento dei dati, come abbiamo già evidenziato, presuppone<br />

la creazione di un formato uniforme per tutti i record della banca dati. Per ogni<br />

dato lessicale (lemma) estrapolato sarà creato un record specifico all’interno del<br />

quale il dato sarà corredato di varie annotazioni informative ripartite nei rispettivi<br />

campi. L’operatore, inserito il lemma nel suo record, dovrà riempire i campi con le<br />

informazioni di cui dispone al momento, lasciando vuoti gli altri campi. Ad esempio,<br />

si potrebbe presentare il caso in cui l’operatore inserisca un lemma risalente a una<br />

fonte antica dal cui contesto non sia possibile risalire al genere grammaticale. In tale<br />

circostanza lascerà vuoto il campo relativo all’informazione grammaticale sul genere<br />

dei sostantivi.<br />

Questa procedura lascia aperta la possibilità di successive revisioni dei record,<br />

finalizzate a integrare il corredo informativo dei lemmi ogniqualvolta emergano<br />

nuove informazioni sui medesimi. Per rimanere nell’esempio citato, può darsi il<br />



caso che, successivamente, lo spoglio di altre fonti porti alla conoscenza del genere<br />

grammaticale di quello stesso lemma.<br />

Naturalmente la continua acquisizione di fonti da sottoporre ad analisi porta spesso<br />

a estrapolare dati lessicali già noti da altre attestazioni precedentemente esaminate.<br />

La ricorrenza multipla di uno stesso lemma porta automaticamente alla revisione del<br />

record in cui è stato inizialmente inserito, aggiungendovi via via le nuove informazioni<br />

desunte dal contesto della fonte.<br />

Oltre a questa revisione “automatica” in corso d’opera, è tuttavia raccomandabile<br />

affiancare all’operatore che al momento svolge il lavoro d’inserimento dei dati<br />

lessicali un revisore che controlli nell’immediato la compilazione dei record, poiché<br />

in molti casi il grado di completamento delle note informative sui lemmi dipende,<br />

oltre che dal contesto in cui sono stati reperiti, anche dalla competenza specialistica<br />

di chi svolge il compito.<br />




Bibliografia<br />

Broglio, A. (2000). La proprietà collettiva nei Sette Comuni. Aspetti storico-normativi.<br />

Roana: Istituto di Cultura Cimbra.<br />

Schmeller, J.A. (1985). Über die sogenannten Cimbern der VII und XIII Communen<br />

auf den Venedischen Alpen und ihre Sprache, 1811, 1838, 1852, 1855†, Curatorium<br />

Cimbricum Bavarense, Landshut.


Stealth Learning with an Online Dog

(Web-based Word Games for Welsh)<br />

Gruffudd Prys and Ambrose Choy<br />

This paper describes issues surrounding developing web-based word games in a<br />

minority language setting, and is based on experience gained from the development of<br />

a project designed to improve the language skills of fluent Welsh speakers undertaken<br />

at Canolfan Bedwyr at the University of Wales, Bangor.<br />

This project was conceived by the BBC as an entertaining way of improving the<br />

language skills of fluent Welsh-speakers, especially those in the 18-40 age range.<br />

Funded by ELWa, the body responsible for post-16 education and training outside<br />

higher education in Wales, it was to form part of BBC Wales’ “Learn Welsh” website.<br />

The BBC’s Welsh language web pages are immensely popular, attracting a high<br />

proportion of younger Welsh-speakers. A survey conducted by the BBC in April and May<br />

2003 revealed that 43% of users of the BBC Welsh-language online news service “Cymru’r Byd” belonged to the 15-34 age group, with a high level of workplace usage, peaking at lunchtimes. The project was to provide this audience with word games, a self-marking

set of language improvement exercises, and an online answering service dealing with<br />

grammatical and other language problems. In order to appeal to the target audience,<br />

it was important that they be entertaining and attractive in addition to being<br />

educational. It was also intended that the project should emphasise progressive youth<br />

culture rather than old-fashioned Celtic themes, and this would be incorporated into<br />

the design and feel of the games.<br />

This paper will concentrate specifically on the development of the interactive online<br />

games and puzzles, showing how digital language resources originally created for<br />

previous digital language projects were adapted and recycled, allowing the e-Welsh<br />

team at the University of Wales, Bangor, to produce a working website within a few<br />

short months. It will also detail some of the new innovations created as part of the<br />

project, with a view of building a modularized set of components that will provide a<br />

versatile resource bank for future projects.<br />



1. The Ieithgi Name<br />

Welsh has a peculiar word for people intensely interested in language. It is ieithgi,<br />

the literal translation of which would be ‘language dog.’ Perhaps ‘language terrier’<br />

would be a meaningful image for English speakers, as it denotes someone who,

having got hold of a particularly tasty bone to gnaw, is unwilling to let it go. It may be<br />

a question of some obscure Welsh grammar rule, or the origin of some Welsh place-<br />

name, but the ieithgi will not let the subject drop without knowing the answer.<br />

By coincidence, a project aimed at Welsh learners was using an animated dog,<br />

called Cumberland, and his owner, Colin, to introduce Welsh to new audiences. In the<br />

Colin and Cumberland storyline, Colin has no Welsh, whereas his dog Cumberland is a fluent, knowledgeable and slightly pompous Welsh speaker. As Colin and Cumberland was aimed at the same demographic age group as the Ieithgi project, and possessed a design that was modern, contemporary and attractive, it was a short step

for Cumberland, the know-all dog in the animated cartoons, to become the namesake<br />

and mascot of the Ieithgi project, on hand to answer questions on Welsh grammar as<br />

well as guide users through the games and exercises.<br />

2. Macromedia Flash and XML<br />

The brief received from the BBC specified that the games were to be created using<br />

Macromedia Flash. Flash is a multimedia authoring program that creates files that can<br />

be played on any computer, Mac or PC, where Flash Player is installed (Macromedia<br />

claim a coverage of 98% of all desktops worldwide).<br />

Flash can combine vector and raster graphics, and uses a native scripting language<br />

called actionscript, which is similar to JavaScript. It can communicate with external XML

files and databases, and, when used intelligently, produces small files which are quick<br />

to download. Flash also allows easy collaboration between a software engineer and<br />

a designer.<br />



Figure 1: Colin and Cumberland – The BBC Cartoon for Learners<br />

3. Technical Challenges<br />

The main technical challenge posed by the games was the need to adapt game<br />

formulas already existing in English to work with the characteristics of the Welsh<br />

language. This meant that new code specific to the needs of Welsh had to be created.<br />

The lack of ready-made Welsh language components available to form the building<br />

blocks needed to create the word games was a significant disadvantage when compared<br />

with developers creating similar games in a major language. These building blocks for<br />

Welsh had to be created as part of the project.<br />

In order to keep down costs, the project hoped to reuse resources developed<br />

originally for previous digital language projects undertaken by Canolfan Bedwyr. This<br />

is one way that a minority language such as Welsh can keep costs down and make<br />

frugal use of existing components in an attempt to keep pace with better-resourced

languages.<br />

4. Resource Audit<br />

Over the years, as part of its mission to address the needs of the Welsh language in

a digital environment, Canolfan Bedwyr has built up a library of language resources,<br />

including digital dictionaries, spelling and grammar checkers as well as the assorted<br />

components such as lexicons and lemmatizers that combine to create such tools. Many<br />

of these resources are either useful or essential when attempting to create games<br />



such as Ieithgi; although seemingly quite different, digital dictionaries share many<br />

prerequisites with word games.<br />

As the Ieithgi project was a low-budget, tight-deadline project, it was imperative that we make as much use as possible of our existing resources, as opposed to reinventing the wheel. However, we also recognised that new tools for manipulating the Welsh language would have to be forged in order for some aspects of Welsh

to function properly in a digital online setting.<br />

Below is a list of the relevant resources available to Canolfan Bedwyr and the<br />

games in which they would be used:<br />

• Lexicon: To be used in Cybolfa (conundrum) and Dic Penderyn<br />

(hangman)<br />

• Place-name databases (AMR and Enwau Cymru): To be used in Rhoi Cymru<br />

yn ei lle (locate and identify place-names);<br />

• Proverb database: To be used in Diarhebol (guess the proverb);<br />

• Alphabet order sorter: To be used in Cybolfa, Diarhebol, Dic Penderyn,<br />

Pos croeseiriau (crossword) and Ystyrlon (identify the correct meaning).<br />

5. The Games<br />

Six games were to be produced for the Ieithgi project. Of these six, three were<br />

to be open-ended games. These games draw randomly from a large list of words or<br />

phrases each time the game is played, giving the user a fresh challenge every time<br />

they start a new game, and ensuring that the games have enormous replay value.<br />

Each instance of a closed game, on the other hand, must be created manually by a<br />

games designer, and this means in practice that there are fewer unique instances of<br />

closed games than of open games. Conversely, the content of closed games

can be more complex, as they do not need to be designed to conform to such tight<br />

technical constraints.<br />

Below is a list of the games divided by category:<br />

• Open Ended

Dic Penderyn (hangman)
Cybolfa (conundrums)
Diarhebol (guess the proverb)

• Closed

Pos croeseiriau (crosswords)
Rhoi Cymru yn ei Lle (locate and identify place-names)
Ystyrlon (identify the correct meaning)

6. Open Ended Games<br />

From a technical point of view, the open-ended games posed the greatest challenge.<br />

Cybolfa, Dic Penderyn and Diarhebol all make use of XML word lists that are used<br />

to supply the games with random words or phrases that test the player’s language<br />

skills.<br />

6.1 Dic Penderyn<br />

Dic Penderyn, named after a Welsh folk hero, is our version of the popular Hangman<br />

game. Drawing a word at random from an XML file, Dic Penderyn gives the person<br />

playing the game ten attempts to guess the word before a set of gallows are built and<br />

a caricature of Colin, Cumberland’s owner, is hung, signalling ‘Game Over.’<br />

From an educational point of view, Dic Penderyn nurtures spelling ability by having<br />

the player think in terms of the letter patterns present in the language in order to<br />

correctly identify the game word. The game also increases the player’s vocabulary<br />

by sometimes suggesting unfamiliar words (as the word list contains words of varying<br />

degrees of familiarity).<br />

The XML wordlist was drawn from the lexicon compiled for the BBC’s Learn Welsh<br />

dictionary, which had been created previously for the BBC by Canolfan Bedwyr. This<br />

had the bonus of making it possible to link each of the words in the wordlist to a<br />

definition on the dictionary’s Webpage. The link would appear each time the player<br />

failed to identify the word, increasing the educational value of the game by providing<br />

definitions of words that had proved unfamiliar.<br />

The lexicon itself included words taken from Corpws Electroneg o Gymraeg (CEG),<br />

the tagged 1 million word Welsh language corpus developed at the University of Wales<br />

Bangor in the early nineties.<br />

Having a part-of-speech tagged lexicon proved extremely useful, as it enabled a game designer to tweak the content of the word list created from it.

After some initial playtesting, it was decided that conjugated verbs would be excluded<br />

from the wordlist. These are sometimes included in English versions of Hangman,<br />

as English has limited conjugation possibilities. In Welsh, however, as in Romance<br />



languages, most verbs follow a regular pattern of conjugation, with separate but<br />

regular conjugations for the different persons as well as tenses.<br />

Table 1: Conjugation of rhedeg (to run)

Present     | Imperfect | Past      | Pluperfect | Subjunctive | Imperfect Subjunctive | Imperative
rhedaf      | rhedwn    | rhedais   | rhedaswn   | rhedwyf     | rhedwn                | rhed, rheda
rhedi       | rhedit    | rhedaist  | rhedasit   | rhedych     | rhedit                | rheded
rhed, rheda | rhedai    | rhedodd   | rhedasai   | rhedo       | rhedai                | rhedwn
rhedwn      | rhedem    | rhedasom  | rhedasem   | rhedom      | rhedem                | rhedwch
rhedwch     | rhedech   | rhedasoch | rhedasech  | rhedoch     | rhedech               | rhedent
rhedant     | rhedent   | rhedasant | rhedasent  | rhedont     | rhedent               | rheder
rhedir      | rhedid    | rhedwyd   | rhedasid   | rheder      | rhedid                |

As is apparent from Table 1, many of the verb forms above are far too similar for<br />

a player to differentiate between them in a game of hangman. Coupled with the fact<br />

that conjugated verbs seem unfamiliar outside the context of a sentence, this meant<br />

that their inclusion would have made the game too difficult and unrewarding from a<br />

playability perspective.<br />

6.1.1 Mutation<br />

The lemmatizer also allowed us to prevent the initial consonant mutation that is<br />

a feature of Welsh words within sentences from making its way from sentences in the<br />

corpus to words in the word list.<br />

A word such as ci (dog) can have the following mutations:<br />

ci   nghi   gi   chi

For example:

Fy nghi – my dog
Dy gi – your dog
Ei chi – her dog
Eu ci – their dog


Mutations never occur in words when they appear in isolation (as they do in Dic<br />

Penderyn). It was therefore inappropriate to have them included in the XML word<br />

list.<br />
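As a minimal illustration of the idea (this is not the Canolfan Bedwyr lemmatizer, only the regular initial mutations are covered, and the candidates it produces would still be checked against the lexicon), a mutated form can be traced back to possible radical forms by reversing the mutation of its first letter:

// Requires: using System.Collections.Generic;

// Mutated word-initial letters mapped back to their radical initial.
// Note: the soft mutation of g- (which simply disappears) is not handled here,
// and F- may come from either B- or M-.
static readonly Dictionary<string, string> MutatedInitials = new Dictionary<string, string>
{
    // soft mutation
    { "G", "C" }, { "B", "P" }, { "D", "T" }, { "F", "B" },
    { "DD", "D" }, { "L", "LL" }, { "R", "RH" },
    // nasal mutation
    { "NGH", "C" }, { "MH", "P" }, { "NH", "T" },
    { "NG", "G" }, { "M", "B" }, { "N", "D" },
    // aspirate mutation
    { "CH", "C" }, { "PH", "P" }, { "TH", "T" }
};

// Returns candidate radical (dictionary) forms for a possibly mutated word,
// e.g. "nghi", "gi" and "chi" all yield "ci" among their candidates.
static IEnumerable<string> PossibleRadicals(string word)
{
    word = word.ToUpper();
    yield return word;                       // the word may already be unmutated
    foreach (var pair in MutatedInitials)
        if (word.StartsWith(pair.Key))
            yield return pair.Value + word.Substring(pair.Key.Length);
}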

6.1.2 Dic Penderyn XML Word List Example

Sample entries from the XML list of 3,000+ six-letter words:

gormes
gormod
goroer

6.1.3 Digraphs

The Welsh alphabet contains the following digraphs:

ch  dd  ff  ng  ll  ph  rh  th

These digraphs count as single letters rather than a combination of two separate<br />

letters. This means that Welsh, unlike most other languages that use the Roman<br />

alphabet, has two-character letters in addition to single-character letters.<br />

Take for example the word llefrith (milk), which has eight characters:<br />

L, L, E, F, R, I, T, H<br />

But six letters:<br />

LL, E, F, R, I, TH<br />

6.1.4 Digraph Problems<br />

The existence of digraphs in Welsh creates a number of problems:<br />

• Simple character count functions can’t be used to count letters.<br />

Due to the existence of digraphs, a function that simply counts the number of<br />

characters in a word will not be able to accurately count the number of letters in a<br />

Welsh word. Using a simple character count function to create the XML six-letter word list would not have included words such as llefrith, which have six letters but more than six characters, and would erroneously have included words with fewer than six letters but six characters. In order to count the letters correctly, a digraph filter was created.

Every word in the lexicon had to be passed through the filter. The filter identifies the characters that form the Welsh digraphs (ch, dd, ff, ng, ll, ph, rh, th) and treats them as single letters. The number of letters in a word can then be counted correctly, so that only six-letter words are added to our XML list of six-letter words, whether they contain digraphs or not.

Here is an example of the code used:<br />

// Requires: using System; using System.Collections;

// Splits 'word' into Welsh letters, treating the digraphs ch, dd, ff, ng, ll,
// ph, rh and th as single letters (the same checks as in the actionscript
// welshFilter shown later).
public static int welshCharSplit(string word, ArrayList charArray)
{
    charArray.Clear();
    word = word.ToUpper();
    bool previousMerged = false;

    for (int x = 0; x < word.Length; x++)
    {
        string digraff = String.Empty;

        if (x > 0 && !previousMerged)
        {
            // Check for Ch, Ph, Rh, Th
            if (word[x] == 'H' &&
                (word[x-1] == 'C' || word[x-1] == 'P' ||
                 word[x-1] == 'R' || word[x-1] == 'T'))
            {
                digraff = word[x-1].ToString() + word[x].ToString();
            }

            // Check for Ng
            if (word[x] == 'G' && word[x-1] == 'N')
            {
                digraff = word[x-1].ToString() + word[x].ToString();
            }

            // Check for Dd, Ff, Ll
            if ((word[x] == 'D' || word[x] == 'F' || word[x] == 'L') &&
                word[x-1] == word[x])
            {
                digraff = word[x-1].ToString() + word[x].ToString();
            }
        }

        if (digraff != String.Empty)
        {
            // Replace the single character added on the previous pass
            // with the two-character digraph.
            charArray.RemoveAt(charArray.Count - 1);
            charArray.Add(digraff);
            previousMerged = true;
        }
        else
        {
            charArray.Add(word[x].ToString());
            previousMerged = false;
        }
    }

    int letterCount = charArray.Count;
    return letterCount;
}
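For example, a hypothetical call (assuming the method above and a using directive for System.Collections are in scope):

ArrayList letters = new ArrayList();
int n = welshCharSplit("llefrith", letters);
// n == 6; letters now holds LL, E, F, R, I, TH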



This creates a word list containing six-letter words, including digraphs.<br />

• Some combinations of characters can be a digraph or two separate<br />

letters.<br />

Bangor (a compound of ban + côr), pronounced ‘n-g’
angor (meaning ‘anchor’), pronounced ‘ng’

Fortunately, digraph look-up lists had previously been developed at Canolfan<br />

Bedwyr in order to correctly sort dictionaries according to the Welsh alphabet (ng<br />

follows g in the Welsh alphabet, so ng and n are sorted quite differently). These<br />

look-up lists could then be used to prevent confusion between digraphs and similar<br />

character combinations.<br />
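The way such a look-up list might be consulted can be sketched as follows (the entry and the method name are hypothetical, not those of the Canolfan Bedwyr lists; welshCharSplit is the function shown earlier):

// Requires: using System.Collections; using System.Collections.Generic;

// Hypothetical exception list: words in which n + g is two letters rather
// than the digraph ng, stored with their correct letter split.
static readonly Dictionary<string, string[]> DigraphExceptions =
    new Dictionary<string, string[]>
{
    { "BANGOR", new[] { "B", "A", "N", "G", "O", "R" } }   // ban + côr
};

// Splits a word into Welsh letters, consulting the exception list first
// and falling back to the general digraph filter otherwise.
static int SplitWithExceptions(string word, ArrayList charArray)
{
    string key = word.ToUpper();
    if (DigraphExceptions.ContainsKey(key))
    {
        charArray.Clear();
        charArray.AddRange(DigraphExceptions[key]);
        return charArray.Count;
    }
    return welshCharSplit(word, charArray);
}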

• Inputting letters using the keyboard becomes more complicated.<br />

Designing an online interface that can differentiate elegantly between an inputted<br />

d and an inputted dd is a challenge. There is no support for the Welsh digraphs as single characters in the UTF-8 character set, and no specific Welsh keyboard that has digraph keys (UK English

QWERTY keyboards are generally used). In practice this makes a keyboard-based<br />

approach to inputting awkward, especially when playing against the clock as in many<br />

of the Ieithgi games.<br />

It was decided that a visual interface would be devised in order to allow the user to<br />

input these characters quickly and efficiently. This came in the form of an on-screen<br />

keyboard featuring all the letters of the Welsh alphabet. Although the keyboard takes<br />

up some of the game’s screen space, it gives the player valuable feedback such as<br />

which letters have been chosen and which letters remain, as well as serving as a<br />

visual reminder to users more familiar with the English alphabet that Welsh considers<br />

digraphs to be single letters.<br />

The on-screen keyboard is used in Cybolfa, Diarhebol and Pos Croeseiriau in addition<br />

to Dic Penderyn, shown below:<br />



Figure 2: Screenshot showing Dic Penderyn after an Unsuccessful Attempt.<br />

6.2 Cybolfa<br />

The digraphs pose another problem when generating the Welsh words for Cybolfa,<br />

a game where the player must attempt to create words from a jumbled set of letters.<br />

Cybolfa uses the same six-letter XML word list as Dic Penderyn to supply the main<br />

six-letter game word. However, Cybolfa must then scramble the word so that it is<br />

difficult for the player to recognise. In English, this could be done fairly quickly by<br />

scrambling each individual character in a word. The same method cannot be applied to Welsh words because of the existence of digraphs; the word must therefore be passed through a filter that identifies the Welsh alphabetical letters in the word before scrambling it. To return to the earlier example of llefrith, an actionscript digraph filter within the Flash file identifies the digraphs as distinct letters, so that all six letters can be identified (LL, E, F, R, I, TH).

Below is the function welshFilter, written in actionscript 2.0 in Flash MX. It receives the word in the form of an array of characters, checks for the existence of any digraphs, and returns the length of the word in letters. If a digraph is found, it merges the two characters into a single element within the array.

_global.welshFilter = function (WordArray) {
	for (var x = 1; x < WordArray.length; x++) {
		// Check for Ch, Ph, Rh, Th
		if (WordArray[x] == "H") {
			if ((WordArray[x-1] == "R") || (WordArray[x-1] == "T") ||
			    (WordArray[x-1] == "P") || (WordArray[x-1] == "C")) {
				WordArray[x-1] = WordArray[x-1] + WordArray[x];
				WordArray.splice(x, 1);
			}
		}
		// Check for Ng
		if (WordArray[x] == "G") {
			if (WordArray[x-1] == "N") {
				WordArray[x-1] = WordArray[x-1] + WordArray[x];
				WordArray.splice(x, 1);
			}
		}
		// Check for Dd, Ff, Ll
		if ((WordArray[x] == "D") || (WordArray[x] == "F") || (WordArray[x] == "L")) {
			if (WordArray[x-1] == WordArray[x]) {
				WordArray[x-1] = WordArray[x-1] + WordArray[x];
				WordArray.splice(x, 1);
			}
		}
	}
	return (WordArray.length);
};

Once both digraphs and single-character letters have been identified as single<br />

elements, the word can be scrambled and displayed to the player in an unfamiliar<br />

letter order whilst still retaining the digraph integrity (TH, F, R, LL, I, E).<br />
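The letter-level scramble itself can be sketched as follows (an illustrative C# version rather than the project's actionscript; the array is assumed to have been produced by welshCharSplit):

// Requires: using System; using System.Collections;

// Shuffles the Welsh letters in place (Fisher-Yates), so that digraphs
// such as LL and TH move around as whole letters and are never broken up.
static void ShuffleLetters(ArrayList letters, Random rng)
{
    for (int i = letters.Count - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1);        // 0 <= j <= i
        object tmp = letters[i];
        letters[i] = letters[j];
        letters[j] = tmp;
    }
}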

6.3 Anagram Maker<br />

As described previously, the word list for the Cybolfa games is derived from the Dic<br />

Penderyn word list. Each time Cybolfa is played, a random six-letter word is drawn<br />

from the list and an anagram maker within the actionscript code generates a list of<br />

all possible anagrams for that word. This is achieved by cross-referencing Canolfan<br />

Bedwyr’s Welsh spellchecker list with the original word’s possible letter combinations.<br />



In programming terms, this is done by a one-to-one mapping of letter values to prime<br />

numbers, allowing words to be represented as composite numbers by multiplying<br />

together the primes that map each letter in the word. Words formed from the same<br />

letters, regardless of order, will then map to the same composite number. Therefore,<br />

if a word’s number divides exactly into another word’s, the first word’s letters must<br />

all appear in the second word.<br />
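A minimal sketch of this prime-number technique, written here in C# with illustrative names (it is not the project's own implementation); the letters are assumed to have been split with welshCharSplit so that digraphs count as single letters:

// Requires: using System; using System.Collections.Generic;

// The 28 letters of the traditional Welsh alphabet, digraphs included,
// each mapped to a distinct prime number.
static readonly string[] WelshAlphabet = {
    "A","B","C","CH","D","DD","E","F","FF","G","NG","H","I","L",
    "LL","M","N","O","P","PH","R","RH","S","T","TH","U","W","Y"
};
static readonly int[] Primes = {
    2,3,5,7,11,13,17,19,23,29,31,37,41,43,
    47,53,59,61,67,71,73,79,83,89,97,101,103,107
};

// Maps a word, given as a sequence of Welsh letters, to the product of the
// primes assigned to those letters (a six-letter word fits easily in a long).
static long WordKey(IEnumerable<string> letters)
{
    long key = 1;
    foreach (string letter in letters)
    {
        int i = Array.IndexOf(WelshAlphabet, letter.ToUpper());
        if (i >= 0) key *= Primes[i];
    }
    return key;
}

// True if every letter of the candidate word (with its repetitions)
// also occurs in the game word.
static bool IsSubAnagram(long gameWordKey, long candidateKey)
{
    return candidateKey != 0 && gameWordKey % candidateKey == 0;
}

Every word in the spellchecker list whose key divides the game word's key can then be offered as a valid answer.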

For example, take the six-letter game word gwelwi. By looking up the spellchecker list and using the anagram checker function, the following list of words is generated and stored in the XML file:

GWELWI
GLIW
ELI
ELW
EWIG
GWELW
GLEW
GWIWI
IGLW

This list is then passed to Cybolfa and used as one instance of the game. Below is a screenshot of a completed game in which polisi was the six-letter game word.



Figure 3: Screenshot of Cybolfa Showing all Possible Anagrams.

6.4 Diarhebol

Diarhebol is in essence very similar to Dic Penderyn, the main difference being<br />

that rather than guessing a random six-letter word, the player must attempt to guess<br />

a Welsh proverb. Once again, players have a limited number of chances to achieve<br />

their objective before the game ends. If needed, a clue is provided in the form of an<br />

English translation of the proverb, and, whilst guessing a whole proverb may at first<br />

seem daunting, the higher probability of a sentence as opposed to a word containing<br />

a specific letter ensures that the game is of a similar level of difficulty.<br />

An XML proverb list replaces the XML word list used by both Dic Penderyn and<br />

Cybolfa, and an example is shown below.<br />

Yr afal mwyaf yw’r pydraf ei galon – The biggest apple has the rottenest heart
Yr euog a ffy heb neb yn ei erlid – The guilty flees when no-one chases him
Yr hen a wyr, yr ieuanc a dybia – The old know, the young suppose
A fo’n ddigwilydd a fo’n ddigolled – The shameless will be without loss

Figure 4: Screenshot of Successful Attempt at Diarhebol

7. Closed Games

Unlike the open-ended games, which draw their content from a list, the content of the closed games must be created manually in advance because of its more involved nature.

7.1 Pos Croeseiriau<br />

Pos Croeseiriau is an online Welsh crossword puzzle. Crossword puzzles have been<br />

popular for some time in Welsh language publications such as local papers, where<br />

the custom of representing digraphs as a single letter within a single square has<br />

long been established. Due to the complexity of creating crosswords, both the clues<br />

and the answers have been hardcoded into the code. However, Cysgeir, Canolfan<br />

Bedwyr’s electronic dictionary, can be used to aid in the creation of crosswords, as it

can suggest words that contain specific letters in specific positions within the word.<br />

Take for instance a situation where the crossword designer has decided on the two<br />



words cyfarth and cynffon to form the answers for 1 and 2 across, and needs a word<br />

that will fit in the space for 1 down:<br />

Figure 5: Crossword fragment with cyfarth (1 across) and cynffon (2 across); 1 down is still empty.

¹C Y F A R TH
²C Y N FF O N

The designer only has to type ??F??N??? into Cysgeir (where ‘?’ represents an

empty square) to be provided with a list of compatible words. In this case Cysgeir<br />

provides 32 different words that fulfil our requirements, of which we choose elfennol<br />

‘elementary.’<br />
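The kind of pattern look-up this relies on can be sketched as follows (an illustration only, not the Cysgeir implementation; it reuses the welshCharSplit function shown earlier, and the method name is hypothetical):

// Requires: using System.Collections; using System.Collections.Generic;

// Finds the words in a lexicon that fit a crossword pattern such as "??F??N???",
// where each '?' stands for exactly one Welsh letter and digraphs occupy one square.
static List<string> MatchPattern(string pattern, IEnumerable<string> lexicon)
{
    ArrayList patternLetters = new ArrayList();
    welshCharSplit(pattern, patternLetters);      // '?' passes through as a single "letter"

    List<string> matches = new List<string>();
    foreach (string candidate in lexicon)
    {
        ArrayList wordLetters = new ArrayList();
        if (welshCharSplit(candidate, wordLetters) != patternLetters.Count)
            continue;                             // wrong number of squares

        bool fits = true;
        for (int i = 0; i < patternLetters.Count; i++)
        {
            string p = (string)patternLetters[i];
            if (p != "?" && p != (string)wordLetters[i]) { fits = false; break; }
        }
        if (fits) matches.Add(candidate);
    }
    return matches;
}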

Figure 6: The same fragment with elfennol filled in as 1 down.

¹E
 L
¹C Y F A R TH
 E
 N
²C Y N FF O N
 O
 L

Pos Croeseiriau used a slightly modified version of the on-screen keyboard found<br />

in the open-ended games to give players the ability to delete unwanted letters, and<br />

the resulting interface is simple and easy to use despite the complications caused by<br />

the Welsh digraphs.<br />



Figure 7: Pos Croeseiriau Screenshot Showing a Completed Game.<br />

7.2 Rhoi Cymru yn ei Lle

Rhoi Cymru yn ei Lle was designed as a game that would educate people as to the geographical location of Welsh place-names. Players must attempt to drag a place-name to its correct position on a map, with themed clues relevant to each place providing some assistance. There are various themes, including sport, religion, culture, and history, so that players learn a little about different aspects of their country as they play, and gain satisfaction from being able to locate an unfamiliar place on a map.

When creating the content, Cronfa Archif Melville Richards and Enwau Cymru (developed by Canolfan Bedwyr) were invaluable in aiding the identification and placement of place-names and their associated clues. Cronfa Archif Melville Richards is a fully searchable online database of historic Welsh place-name forms that contains location information and grid references, whilst Enwau Cymru is an online database of modern Welsh place-names dealing in particular with bilingual place-names and again giving location information. As with Pos Croeseiriau, due to its complexity, the game content is coded into the game itself.



7.3 Ystyrlon

Figure 8: Ystyrlon Screenshot Showing a Game in Progress

Ystyrlon is similar to the popular game Call my Bluff in that the player is given an uncommon word (one that is hopefully unfamiliar), and is then asked to guess the correct definition from a choice of three. From a technical viewpoint, this is a very simple game; the hard work lies in the creation of original content, choosing the unfamiliar words, and creating humorous and misleading definitions that will entertain those who play the game and keep them on their toes.

As the content, once created, is quite simple, it is stored as an XML file that is then referenced by the Flash game file. This aids the production of new games, as it enables the creation of new content without having to use or understand the Flash programming application.



Figure 9: Screenshot of Ystyrlon Following an Incorrect Guess

8. Results

Usually, academic establishments do not undertake commercial projects such as Ieithgi, concentrating instead on research that can then be exploited and taken forward by the private sector. However, in a minority language situation, the technical expertise and experience needed to create such language-specific products may not exist in the private sector, or the financial returns may not be high enough to justify the investment of time and money. In such a situation, centres such as Canolfan Bedwyr that see their goal as catering to the needs of a modern, living minority language must sometimes fulfil both roles if the language is ever to see such products.

The successful realisation of such a product has been one positive result of this venture.

The Ieithgi project has also led to the creation of new digital resources, including a Welsh anagram maker and digraph filter, as well as a process for integrating resources through XML into Flash; these add to and enhance the resources available to Canolfan Bedwyr for future projects.
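As a rough indication of how such an anagram maker and digraph filter might fit together, the sketch below (our own illustration, not the actual Canolfan Bedwyr resource) treats two words as anagrams when they are built from the same multiset of Welsh letters, with digraphs counted as single letters; the scrambled forms in the test list are invented purely for the demonstration.

# Illustrative sketch only (not the Canolfan Bedwyr tools): a digraph-aware
# anagram test for Welsh, where digraphs such as 'ff' and 'll' count as one letter.
from collections import Counter

DIGRAPHS = ("ch", "dd", "ff", "ng", "ll", "ph", "rh", "th")

def welsh_letters(word):
    """Split a word into Welsh letters, keeping digraphs as single units."""
    word = word.lower()
    letters, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            letters.append(word[i:i + 2])
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters

def anagrams(target, wordlist):
    """Return the words in wordlist built from the same Welsh letters as target."""
    key = Counter(welsh_letters(target))
    return [w for w in wordlist if w != target and Counter(welsh_letters(w)) == key]

print(welsh_letters("cynffon"))                               # ['c', 'y', 'n', 'ff', 'o', 'n']
print(anagrams("cynffon", ["noffync", "cynfon", "ffonnyc"]))  # ['noffync', 'ffonnyc']

A word list filtered in this way could then feed games such as Cybolfa, which displays all possible anagrams of a target word.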

The need to repackage existing digital resources to facilitate their reuse as part of future projects has also been identified, leading to a new programme of modularization of lexical components.

A sure sign of a successful product is that it results in further commissions, and the success of Ieithgi has resulted in a further commission to develop a similar set of stealth educational Welsh language online games targeted at adults with below average literacy.

Canolfan Bedwyr hopes that the Ieithgi project will serve as an example of how to make a little go a long way, and that building up language resources and corpora can benefit a minority language in more ways than by producing dictionaries and spellcheckers, allowing existing resources to stretch further.



References

“Archif Melville Richards Historical Place-name Database.” Online at http://www.bangor.ac.uk/amr.

“Canolfan Bedwyr Website.” Online at http://www.bangor.ac.uk/ar/cb/.

“Colin and Cumberland Website.” Online at http://www.bbc.co.uk/colinandcumberland/.

Corpws Electronig o Gymraeg (CEG). “A 1 Million Word Lexical Database and Frequency Count for Welsh.” Online at http://www.bangor.ac.uk/ar/cb/ceg/ceg_eng.html.

“Cysgeir Electronic Dictionary Information Website.” Online at http://www.bangor.ac.uk/ar/cb/meddalwedd_cysgair.php.

Davies, G. (2005). “Beginnings, New Media and the Welsh Language.” North American Journal of Welsh Studies, 5(1).

“Enwau Cymru Modern Place-name Database.” Online at http://www.e-gymraeg.org/enwaucymru.

Hicks, W.J. (2004). “Welsh Proofing Tools: Making a Little NLP go a Long Way.” Proceedings of the 1st Workshop on International Proofing Tools and Language Technologies. Greece: University of Patras.

“Learn Welsh-The BBC’s Website for Welsh Learners.” Online at http://www.bbc.co.uk/wales/learnwelsh/.

Prys, D. & Morgan, M. (2000). “E-Celtic Language Tools.” The Information Age, Celtic Languages and the New Millennium. Ireland: University of Limerick.

“The Ieithgi Website.” Online at http://www.bbc.co.uk/cymru/lieithgi/.



Alphabetical list of authors & titles with keywords

Victoria Arranz (& Elisabet Comelles, David Farwell)
Speech-to-Speech Translation for Catalan
Keywords: Catalan, Multilingualism, Speech-to-Speech Translation, Interlingua.

Ermenegildo Bidese (& Cecilia Poletto, Alessandra Tomaselli)
The relevance of lesser used languages for theoretical linguistics: the case of Cimbrian and the support of the TITUS corpus
Keywords: Cimbrian, clitics, Wackernagelposition, Agreement, TITUS.

Evelyn Bortolotti (& Sabrina Rasom)
Il ladino fra polinomia e standardizzazione: l’apporto della linguistica computazionale
Keywords: lessicografia, terminologia, corpus testuale, correttore ortografico, strumenti per la standardizzazione.

Sonja E. Bosch (& Elsabé Taljard)
A Comparison of Approaches towards Word Class Tagging: Disjunctively vs Conjunctively Written Bantu Languages
Keywords: word class tagging, Bantu languages, disjunctive writing system, conjunctive writing system, morphological analyser, disambiguation rules, tagsets.

Ambrose Choy (& Gruffudd Prys)
Stealth Learning with an on-line dog: Web-based Word Games for Welsh
Keywords: Stealth learning, Welsh, on-line games.

Elisabet Comelles (& Victoria Arranz, David Farwell)
Speech-to-Speech Translation for Catalan
Keywords: Catalan, Multilingualism, Speech-to-Speech Translation, Interlingua.

David Farwell (& Elisabet Comelles, Victoria Arranz)
Speech-to-Speech Translation for Catalan
Keywords: Catalan, Multilingualism, Speech-to-Speech Translation, Interlingua.

Olya Gurevich
Computing Non-Concatenative Morphology: the Case of Georgian
Keywords: computational linguistics, morphology, Georgian, non-concatenative, construction grammar.

Ulrich Heid (& Danie Prinsloo)
Creating word class tagged corpora for Northern Sotho by linguistically informed bootstrapping
Keywords: POS-tagger, Bantu-languages, taggerlexicon, tagging reference cp.

Dewi Jones (& Delyth Prys)
The Welsh National On-line Terminology Database
Keywords: terminology standardization, Welsh, termbases, terminology markup framework.

Luca Panieri
Il progetto “Zimbarbort” per il recupero del patrimonio linguistico cimbro
Keywords: cimbro, lessico, patrimonio linguistico.

Cecilia Poletto (& Ermenegildo Bidese, Alessandra Tomaselli)
The relevance of lesser used languages for theoretical linguistics: the case of Cimbrian and the support of the TITUS corpus
Keywords: Cimbrian, clitics, Wackernagelposition, Agreement, TITUS.

Danie Prinsloo (& Ulrich Heid)
Creating word class tagged corpora for Northern Sotho by linguistically informed bootstrapping
Keywords: POS-tagger, Bantu-languages, taggerlexicon, tagging reference cp.

Delyth Prys (& Dewi Jones)
The Welsh National On-line Terminology Database
Keywords: terminology standardization, Welsh, termbases, terminology markup framework.

Gruffudd Prys (& Ambrose Choy)
Stealth Learning with an on-line dog: Web-based Word Games for Welsh
Keywords: Stealth learning, Welsh, on-line games.

Nicoletta Puddu
Un corpus per il sardo: problemi e perspettive
Keywords: corpus planning, corpus design, Sardinian, non-standardized languages, XML.

Sabrina Rasom (& Evelyn Bortolotti)
Il ladino fra polinomia e standardizzazione: l’apporto della linguistica computazionale
Keywords: lessicografia, terminologia, corpus testuale, correttore ortografico, strumenti per la standardizzazione.

Soufiane Rouissi (& Ana Stulic)
Annotation of Documents for Electronic Edition of Judeo-Spanish Texts: Problems and Solutions
Keywords: electronic corpus, Judeo-Spanish, collaborative production, digital document.

Clau Solèr
Spracherneuerung im Rätoromanischen: Linguistische, soziale und politische Aspekte


Oliver Streiter
Implementing NLP-Projects for Small Languages: Instructions for Funding Bodies, Strategies for Developers

Oliver Streiter (& Mathias Stuflesser)
XNLRDF, A Framework for the Description of Natural Language Resources. A proposal and first implementation
Keywords: XNLRDF, metadata, writing system, Unicode, encoding.

Mathias Stuflesser (& Oliver Streiter)
XNLRDF, A Framework for the Description of Natural Language Resources. A proposal and first implementation
Keywords: XNLRDF, metadata, writing system, Unicode, encoding.

Ana Stulic (& Soufiane Rouissi)
Annotation of Documents for Electronic Edition of Judeo-Spanish Texts: Problems and Solutions
Keywords: electronic corpus, Judeo-Spanish, collaborative production, digital document.

Elsabé Taljard (& Sonja E. Bosch)
A Comparison of Approaches towards Word Class Tagging: Disjunctively vs Conjunctively Written Bantu Languages
Keywords: word class tagging, Bantu languages, disjunctive writing system, conjunctive writing system, morphological analyser, disambiguation rules, tagsets.

Alessandra Tomaselli (& Ermenegildo Bidese, Cecilia Poletto)
The relevance of lesser used languages for theoretical linguistics: the case of Cimbrian and the support of the TITUS corpus
Keywords: Cimbrian, clitics, Wackernagelposition, Agreement, TITUS.

Trond Trosterud
Grammar-based language technology for the Sámi languages
Keywords: Sámi, transducers, disambiguation, language technology, minority languages.

Chinedu Uchechukwu
The Igbo Language and Computer Linguistics: Problems and Prospects
Keywords: language technology, lexicography, computer linguistics, linguistic tools.

Ivan Uemlianin
SpeechCluster: a speech database builder’s multitool
Keywords: annotation, speech data, Welsh, Irish, open-source.


Alphabetical list of contributors & contact addresses

Victoria Arranz
ELDA-Evaluation and Language Resources Distribution Agency
arranz@elda.org

Ermenegildo Bidese
Università di Verona / Philosophisch-Theologische Hochschule Brixen
ebidese@lingue.univr.it

Evelyn Bortolotti
Istitut Cultural Ladin “majon di fascegn”
rep.ling@istladin.net

Sonja E. Bosch
University of South Africa
boschse@unisa.ac.za

Ambrose Choy
Canolfan Bedwyr
University of Wales
a.choy@bangor.ac.uk

Elisabet Comelles
TALP-Centre de Tecnologies i Aplicacions del Llenguatge i la Parla
Universitat Politècnica de Catalunya
comelles@lsi.upc.edu

David Farwell
Institució Catalana de Recerca i Estudis Avançats
TALP-Centre de Tecnologies i Aplicacions del Llenguatge i la Parla
Universitat Politècnica de Catalunya
farwell@lsi.upc.edu

Olya Gurevich
UC Berkeley
olya@berkeley.edu

Ulrich Heid
IMS-CL, Institut für maschinelle Sprachverarbeitung
Universität Stuttgart
uli@ims.uni-stuttgart.de

Dewi Jones
Language Technologies
Canolfan Bedwyr
University of Wales, Bangor
d.b.jones@bangor.ac.uk

Luca Panieri
Istituto Cimbro di Luserna
luca.panieri@fastwebnet.it

Cecilia Poletto
Padova-CNR
cecilia.poletto@unipd.it

Danie Prinsloo
Department of African Languages
University of Pretoria
danie.prinsloo@up.ac.za

Delyth Prys
Canolfan Bedwyr
University of Wales
d.prys@bangor.ac.uk

Gruffudd Prys
Language Technologies
Canolfan Bedwyr
University of Wales, Bangor
g.prys@bangor.ac.uk

Nicoletta Puddu
University of Pavia
attel76@hotmail.com

Sabrina Rasom
Istitut Cultural Ladin “majon di fascegn” (ICL)
lengaz@istladin.net

Soufiane Rouissi
University of Bordeaux 3 CEMIC-GRESIC
Soufiane.Rouissi@u-bordeaux3.fr

Clau Soler
Universität Genf
clau.soler@bluewin.ch

Oliver Streiter
National University of Kaohsiung
ostreiter@nuk.edu.tw

Mathias Stuflesser
European Academy of Bolzano
mstuflesser@eurac.edu

Ana Stulic
University of Bordeaux 3 AMERIBER
etchevers@tele2.fr

Elsabé Taljard
University of Pretoria
elsabe.taljard@up.ac.za

Alessandra Tomaselli
Università di Verona
alessandra.tomaselli@univr.it

Trond Trosterud
Universitetet i Tromsø
trond.trosterud@hum.uit.no

Chinedu Uchechukwu
Universität Bamberg, Germany
neduchi@netscape.net

Ivan Uemlianin
Language Technologies
Canolfan Bedwyr
University of Wales, Bangor
i.uemliani@bangor.ac.uk
