LULCL 2005
Proceedings of the Lesser Used Languages and Computer Linguistics Conference
Bolzano, 27th-28th October 2005
Isabella Ties (Ed.)
2006
The proceedings are co-financed by the European Union through the Interreg IIIA Italy-Switzerland Programme
Bestellungen bei:
Europäische Akademie Bozen
Viale Druso, 1
39100 Bozen - Italien
Tel. +39 0471 055055
Fax +39 0471 055099
E-mail: press@eurac.edu
Nachdruck und fotomechanische Wiedergabe – auch auszugsweise – nur unter Angabe der Quelle (Herausgeber und Titel) gestattet.
Verantwortlicher Direktor: Stephan Ortner
Redaktion: Isabella Ties
Koordination: Isabella Ties
Graphik und Umschlag: Marco Polenta
Druck: Fotolito Longo
Per ordinazioni:
Accademia Europea Bolzano
Drususallee, 1
39100 Bolzano - Italia
Tel. +39 0471 055055
Fax +39 0471 055099
E-mail: press@eurac.edu
Riproduzione parziale o totale del contenuto autorizzata soltanto con la citazione della fonte (titolo ed edizione).
ISBN 88-88906-24-X
Direttore responsabile: Stephan Ortner
Redazione: Isabella Ties
Coordinazione: Isabella Ties
Grafica e copertina: Marco Polenta
Stampa: Fotolito Longo
Index
Preface ............................................................................................ 7
Spracherneuerung im Rätoromanischen: Linguistische, soziale und
politische Aspekte ..............................................................................11
Clau Solèr
Implementing NLP-Projects for Small Languages:
Instructions for Funding Bodies, Strategies for Developers ..............................29
Oliver Streiter
Un corpus per il sardo: problemi e prospettive ............................................45
Nicoletta Puddu
The Relevance of Lesser-Used Languages for Theoretical Linguistics:
The Case of Cimbrian and the Support of the TITUS Corpus ...............................77
Ermenegildo Bidese, Cecilia Poletto and Alessandra Tomaselli
Creating Word Class Tagged Corpora for Northern Sotho
by Linguistically Informed Bootstrapping ...................................................97
Danie J. Prinsloo and Ulrich Heid
A Comparison of Approaches to Word Class Tagging:
Disjunctively Versus Conjunctively Written Bantu Languages .......................... 117
Elsabé Taljard and Sonja E. Bosch
Grammar-based Language Technology for the Sámi Languages ....................... 133
Trond Trosterud
The Welsh National Online Terminology Database ....................................... 149
Dewi Bryn Jones and Delyth Prys
SpeechCluster: A Speech Data Multitool................................................... 171
Ivan A. Uemlianin
XNLRDF: The Open Source Framework for Multilingual Computing ................... 189
Oliver Streiter and Mathias Stuflesser
Speech-to-Speech Translation for Catalan ................................................ 209
Victoria Arranz, Elisabet Comelles and David Farwell
Computing Non-Concatenative Morphology: The Case of Georgian ................... 225
Olga Gurevich
The Igbo Language and Computer Linguistics: Problems and Prospects .............. 247
Chinedu Uchechukwu
Annotation of Documents for Electronic Editing of Judeo-Spanish Texts:
Problems and Solutions ...................................................................... 265
Soufiane Roussi and Ana Stulic
Il ladino fra polinomia e standardizzazione:
l’apporto della linguistica computazionale ............................................... 281
Evelyn Bortolotti, Sabrina Rasom
Il progetto “Zimbarbort” per il recupero del patrimonio linguistico cimbro ........ 297
Luca Panieri
Stealth Learning with an Online Dog (Web-based Word Games for Welsh) .......... 307
Gruffudd Prys and Ambrose Choy
Alphabetical list of authors & titles with keywords ..................................... 329
Alphabetical list of contributors & contact addresses .................................. 335
Preface
On behalf of the programme committee for the ‘Lesser Used Languages and Computer Linguistics’ conference (LULCL 2005), we are pleased to present the proceedings, which contain the papers accepted for presentation at the Bolzano meeting on 27th-28th October 2005. The contributions published in this volume deal with the main aspects of lesser used languages and their support through computer linguistics, ranging from lexicography to terminology for lesser used languages, and from computational linguistic applications in general to more specific resources such as corpora. Some papers deal specifically with Translation Memory Systems, online dictionaries, Internet Computer Assisted Language Learning (CALL) or Language for Specific Purposes (LSP).
The conference theme was chosen with the ambition of giving lesser used languages an opportunity for visibility, considering not the official number of speakers but rather the range of technological resources available for each language. Even though some of these languages have a considerable number of speakers, technological support may be almost nonexistent. It is therefore remarkable how much has been done in the last decades for languages with few speakers. The Zimbar speakers form the smallest community represented at the conference, numbering about 2,230 speakers living in Luserna, Roana, Mezzaselva and Giazza. Despite the small number of native speakers, there are major projects running on this Germanic language. The first project described here (cf. Bidese et al.) foresees the storage of Cimbrian textual material in the TITUS Corpus (‘Thesaurus of Indo-European Textual and Linguistic Materials’), while the second one provides the guidelines for the Zimbarbort. The latter is a new project on the preservation of the Zimbar language, during which a database of lexical entries will be created (cf. Panieri). Both projects represent a substantial contribution to the preservation of the language: through the recovery and storage of textual data they enable researchers to carry out linguistic analyses from several points of view.
Sparseness of data is one of the main characteristics that many lesser used languages share with Zimbar. It influences both the choice of methodology and, of course, the results. Clau Solèr’s keynote contribution illustrates very well what happens when, for example, specialised terminology has to be elaborated for a small language. The lack of native terminology for many LSPs and the influence of bigger official languages, German in this particular case, are just some of the problems Rumantsch and other small languages have to face in order to propose acceptable terminology and preserve the language at the same time. The project on the ‘Welsh National Terminology Database’ reflects the need to strike a balance between the accepted terminology standards used for bigger languages (the ISO 704 and ISO 860 norms) and language preservation. This project takes advantage of the similarities between terminology and lexicography, as existing lexicographical resources and applications are used to enrich the terminology database.
Another central topic that lesser used languages have in common is the usability of available data. On the one hand we find the contribution on Judeo-Spanish, where Roussi & Stulic describe how to transliterate and annotate texts written in Hebrew characters while, at the same time, allowing users to add their own interpretations and comments. On the other hand, Uchechukwu explains in his contribution the problems related to appropriate fonts and software compatibility. Taking the Igbo language as his basis, he describes what happens when the amount of data is considerable but not usable, owing to the obstacle of accepted formats.
Issues of data sparseness and usability shape linguistic research, especially during the phases of data pre-processing, and determine the amount of time linguists must invest before they can address their actual research questions. Uemlianin proposes SpeechCluster as a way to ensure that linguists can concentrate on linguistic analyses rather than disperse their efforts on formatting or other time-consuming manual processing.
Trosterud emphasises the importance of open-source technology for projects on lesser used languages, so that time and technology need not be reinvented from scratch for every small language. The same point of view is taken by Stuflesser and Streiter as they present their intention to use XNLRDF, a free software package for NLP. Their contribution introduces the existing prototype and outlines future strategies.
A similar aim is pursued by the invited keynote speaker Oliver Streiter, who focuses on this topic, providing a detailed overview of available resources and underlining the importance of mutual support within the research community through data sharing in standard formats, so as to make data usable and accessible to everybody. One of the instruments cited and used most often for data sharing is the Internet, as it allows online storage of data such as dictionaries, language games or terminology databases (Jones & Prys). This medium is used by Canolfan Bedwyr to publish the web-based word games for Welsh, as well as by the Ladin institutions to disseminate their online dictionaries (cf. Bortolotti & Rasom), meant to improve the language skills of native speakers.
Several authors contribute papers on tools for the elaboration and storage of language material, for text analysis, and for the processing of texts with a view to the development of corpora. Puddu points out the importance of corpora in supporting the development of lesser used languages and the main problems connected to corpus design, text collection, storage and annotation for a lesser used language like Sardinian (cf. Puddu). Sardinian, like any other lesser used language, has to cope with problems related to the retrieval of written text and, in this specific case, also with a second problem: the absence of a standard orthography. The application of a homogeneous tag system, as well as the use of storage standards such as the rules elaborated by the EAGLES group (XCES), is suggested.
Prinsloo and Heid describe methodologies such as the bootstrapping of resources for elaborating language documentation and annotation. They describe the development of different tools for bootstrapping tagging resources for Northern Sotho, and of resources used to identify verbs and nouns and to disambiguate closed-class items. The Bantu languages and their characteristics are also discussed in the contribution by Taljard and Bosch, who present the problems encountered when dealing with languages with different writing systems, in this case Northern Sotho and Zulu. The authors describe the distinct approaches to word class tagging required by the different writing systems.
Examples of knowledge extraction and knowledge engineering are discussed in the paper on the FAME project, an interlingual speech-to-speech machine translation system for Catalan, English and Spanish developed to assist users in making hotel reservations. The project includes tools for the documentation of data and for the elaboration of the standard Interchange Format (IF).
It is clear from these contributions that a variety of approaches and scientific methodologies are nowadays adopted in research on lesser used languages, showing the vitality of research in this specific area.
Thanks to the authors, who cover a wide variety of projects and technologies, the volume provides an overview of the state of the art in research on lesser used languages, especially as regards projects involving computational linguistics in Europe and worldwide. Central to the conference are both methodological issues, prompted by the strategies described for an efficient support of lesser used languages, and the problems encountered when theoretical approaches developed for major languages are applied to lesser used languages.
The contributions underline the significance of computational linguistics, the methodologies and strategies followed, and their application to lesser used languages. It becomes evident how important decisions on international standards are and what consequences they have for the standardisation of tools.
This conference would not have been possible without the energy of many people. First of all we would like to thank the authors, who have provided superb contributions. Our gratitude goes also to the reviewers and to the scientific committee for their detailed and inspiring reviews.

Isabella Ties
Spracherneuerung im Rätoromanischen:
Linguistische, soziale und politische Aspekte

Clau Solèr
In Graubünden, the minority language Romansch has to assert itself in an environment of bilingualism with German on the one hand, while constantly keeping pace with the changing needs of its speakers on the other. To fulfill this task, terminological precision must continually come to terms with both the spoken language and the existing syntax. Romansch must be able to express a frame of mind that is influenced by the Germanic element, and neologisms must also adapt to the regional varieties for the speakers to be able to identify with them.
Due to the limited political and economic importance of the language, as well as instruction that partly takes place in German only, Romansch is currently lacking the necessary channels for an efficient diffusion of neologisms.
1. Einleitung
Jede Sprache dient im Alltag als Werkzeug und passt sich ihrer Sprachgemeinschaft an; dies im Unterschied zu nur historischen oder kultischen Sprachen. Dabei darf sie sich aber nicht veräußern, nur um modern oder aktuell zu sein. Neben einer spontanen, gelegentlich unerwünschten Erneuerung – die übliche, langfristig ablaufende Sprachentwicklung steht hier nicht zur Sprache – unterliegt die Sprache Eingriffen aus unterschiedlichsten Richtungen, von unterschiedlichen Kräften und aus verschiedenen Gründen. Wie geschieht das, was entsteht daraus, wem nützt das, und wird die Sprache besser oder schlechter? Diese Fragen möchte ich besonders aus der praktischen Erfahrung heraus zu beantworten versuchen und einige Überlegungen dazu anstellen. Vorerst muss ich das Rätoromanische in Graubünden und dessen Stellung im Hinblick auf die Sprachanpassung kurz umreißen. Ich wähle bewusst den Ausdruck Anpassung, um keine Wertung wie Erneuerung, Modernisierung, Einschränkung oder Uminterpretation vorwegzunehmen.
Das Bündnerromanische ist eine eigenständige neolateinische Sprache auf vorrömischer Grundlage. Seit über 1000 Jahren steht es in vielfältigem Kontakt mit dem Deutschen und während mehrerer Jahrhunderte auch mit dem Italienischen (im Engadin besonders wirtschaftlich und in den katholischen Gegenden religiös bedingt). Nach dem Anschluss an die schweizerische Eidgenossenschaft 1803 ist die gelebte und relativ ausgeglichene Dreisprachigkeit der drei Bünde durch das Deutsche als fast unumschränkte Verkehrs- und Verwaltungssprache ersetzt worden. Die rätoromanischen Ortsdialekte, die in fünf regionalen Schriftidiomen in geografisch und konfessionell mehr oder weniger getrennten Gebieten mit einer ursprünglich traditionellen, heute mehrheitlich touristischen Wirtschaft verwendet werden, haben sich unterdessen zu einer primär gesprochenen Varietät gewandelt. 35.095 Personen nannten bei der Volkszählung 2000 (RÄTOROMANISCH 2004:24) das Romanische als ihre Hauptsprache, und insgesamt 60.816 verwenden es im Alltag oder bei der Arbeit, wobei nur zwei Drittel davon in Graubünden leben und der Rest in der schweizerischen Diaspora. Gemäß EUROMOSAIC (1996:34) braucht eine Sprachgruppe mindestens 300.000 Mitglieder für ihre Selbständigkeit, und so sind die Aussichten des Romanischen eher düster. In den Gemeinden mit mehr als 20% Romanischsprechern besuchen zwei Drittel der Schüler eine maximal vierjährige romanische Grundschule mit anschließender Einführung ins Deutsche, das in den drei letzten Jahren Unterrichtssprache wird, neben immerhin bis zu 6 Stunden Romanisch als Fach. Die Mittelschulen in Chur und Samedan ermöglichen einen zweisprachigen Maturitätsabschluss. In der Pädagogischen Fachhochschule, die im Unterschied zum bisherigen Lehrerseminar nicht mehr sprachbezogen ist, fehlt eine entsprechende Unterstützung, wie es auch in den nur deutschsprachigen beruflichen Fachschulen der Fall ist.
Es fehlt noch der linguistische Zustandsbericht. Die traditionelle Einsprachigkeit mit wenigstens einer Fremdsprache gibt es nur noch bei wenigen älteren Personen in entlegenen Ortschaften mit geringer Zuwanderung. Sonst leben die Bündnerromanen in einer funktionalen, domänenorientierten und personengesteuerten Mehrsprachigkeit mit jeweils unterschiedlichen Kodes: romanische Ortsmundart gesprochen, teilweise gelesen, aber selten geschrieben; Schweizerdeutsch gesprochen; Standarddeutsch als Schriftsprache und teilweise gehört. Man wählt die Sprache relativ wertfrei, und die Phase, als Rätoromanisch stigmatisiert war und man daher an einem Minderwertigkeitskomplex litt, ist heute mehrheitlich überwunden, und zwar in erster Linie wegen der hohen Deutschkompetenz der Romanischsprecher, ihrer besseren Integration in die deutschsprachige Gesellschaft und letztlich auch wegen der vielen Zuzügler mit noch selteneren Sprachen.
2. Terminologische Anpassung
Ich wähle bewusst den Begriff Terminologie, der die Neologie und die Uminterpretation vorhandener Begriffe einschließt. Dabei hat man sich weniger umfangreiche und breit abgestützte Prozesse vorzustellen, sondern eher zufällige und gelegentlich chaotische Vorschläge, die nach Möglichkeit gesammelt und verbreitet werden. Viele Einträge des Pledari Grond in Rumantsch grischun stammen aus allerlei Übersetzungen und Anfragen von Sprachverwendern; der Rest stammt aus den Regionalwörterbüchern und aus der systematischen Neologie.
3. Das Bedürfnis nach terminologischer Anpassung
Es ist zwar unbestritten, dass keine Sprache von sich aus eine Anpassung braucht, denn Sprachen handeln nun einmal nicht. Trotzdem hat man seit dem Ende des 19. Jh. das Romanische immer wieder als klagende, leidende oder verschupfte Sprache anthropomorphisiert, damit den Rätoromanen ins Gewissen geredet und zugegebenermaßen einiges erreicht. Vieles ist aber auch verdorben worden (Coray 1993). Es ist die Sprachgemeinschaft mit ihren Anwendern, die eine Sprache den Bedürfnissen nach gesicherter und rascher Kommunikation anpasst. Als Sprachverwender gelten grundsätzlich die Sprechenden in ihrem sozialen, wirtschaftlichen und geistigen Umfeld, solange sie sich nicht ausschließlich als Parteivertreter der Sprache verhalten. Puristische oder spracherhaltende Gründe sind politisch und gesellschaftlich begründet und werden von den Sprachverwendern nur beschränkt getragen. Sie lehnen diese von der Sprachverwaltung vertretene künstliche Erhaltung ab, wie ihr Verhalten u. a. gegenüber dem Rumantsch grischun zeigt.
Eine wirkliche Alternative, sich sprachlich anzupassen, besteht für Sprachverwender von Minderheitensprachen mit einem asymmetrischen Bilinguismus in einem Sprachwechsel, der meistens mehrstufig verläuft, auch wenn dieser Sprachwechsel gerne verschwiegen wird (Solèr 1986:299). Das Englische in bestimmten Wirtschaftsbereichen gilt heute als direkter Weg, wenn es nicht aus Ermangelung einer gemeinsamen Sprache gewählt wird.
4. Methoden der terminologischen Anpassung
In der Vergangenheit hat sich das Romanische den Bedürfnissen mehr schlecht als recht angepasst und ist auch deshalb minorisiert worden. Erst im Zuge der spätromantischen Nationalbewegung, also seit mehr als hundert Jahren, bemüht man sich bewusst und systematisch um eine lexikalische Erneuerung. Heute ist das Romanische terminologisch sehr stark ausgebaut, verglichen mit dem Zustand vor 150 Jahren, als „tausenderlei Gegenstände und Thätigkeiten der gebildeten Welt unbekannt oder doch fremd geblieben [waren]“ (CARISCH 1848:X). Auch die Syntax hat sich erneuert und ist eigenständig(er) geworden. In dieser Zeit veränderten sich die Gesellschaft und Wirtschaft grundlegend. Die obligatorische Volksschule erreichte erstmals eine ganze Bevölkerungsschicht und konnte die Sprache direkt beeinflussen, indem alte örtliche Formen verschwanden, wie Jaberg/Jud (1928) bedauerten.
Nun gilt es zu erklären, wie die Sprache terminologisch angepasst wird. Neben der Verwendung einer phonetisch und morphologisch integrierten fremden Bezeichnung oder gleichzeitig dazu liefert die eigensprachliche Um- und Beschreibung (Periphrase) die wichtigste spontane terminologische Anpassung. Dieses Vorgehen passt auch stilistisch fremde oder unverständliche Terminologie an die Umgangssprache an und steht logischerweise im Widerspruch zur Systematisierung der Fachsprache. Weiterhin gilt die typisch analytische Parataxe einer Volkssprache, wie es das Romanische im Grunde genommen ist. Der Vorteil der hohen Verständlichkeit muss mit Variabilität erkauft werden. Hierhin gehören auch die zufälligen, spielerischen Volksbildungen mit üblicherweise nur regionaler und kurzzeitiger Gültigkeit; erwähnt seien vallader: chasperets für ‘Scheibenwischer’, eigentlich ‘Kasperlefigur’, oder sursilvan: cutgna für ‘Surfbrett, Snowboard’, eigentlich ‘Schwarte (vom Holz oder vom Speck)’.
Systemkonform ist auch die professionelle Terminologie: Sie ist oft periphrastisch anstatt derivativ, und daraus entstehen, je nach Definitionsgrad, linguistische Ungetüme wie ovs da giaglinas allevadas a terra für ganz gewöhnliche ‘Eier (von Hühnern) aus Bodenhaltung’ oder chapisch da la rullera d’alver da manvella für ‘Kurbelwellenlagerdeckel’, das freilich auch nicht verständlicher ist und als einzelner Baustein noch kompliziertere Sätze bilden muss.
Ein typischer und traditioneller Terminologieprozess ist die Analogie. Heute weicht diese endolinguale Bildung zugunsten der exolingualen, sich an das Deutsche anlehnenden zurück, die wegen ihrer Nähe zur Denkstruktur der Romanischsprecher mehr Erfolg verspricht als eine Herleitung aus dem Französischen, das als Fremdsprache kaum mehr unterrichtet wird, oder aus dem Italienischen, das zwar (noch) einen festen Platz in den Bündner Schulen hat, aber nur eine geringe Bedeutung im Alltag genießt.
Analogien zu romanischen Sprachen liegen in den folgenden Beispielen vor 1 (vgl. auch Decurtins 1993, 235-254 passim): Als Alltagsbegriff gilt schambun (oit., frz.) ‘Schinken’. Der Begriff vl: levatriza ‘Hebamme’ scheiterte als undurchsichtige Bezeichnung für eine einsetzende Professionalisierung und wurde deshalb periphrastisch zu vl: duonna da part ‘Geburtsfrau’, rg: spendrera, eigentlich ‘Rettende’, dunna da part. Auch purtantina ‘Tragbahre’ ist kaum verständlich und konnte bara trotz der Homonymie zu ‘Leiche’ nicht ersetzen. Die Ausdrücke guid ‘(Reise-)Führer’ und guidar ‘führen’ sind seltener als manader ‘Führer, Lenker’ und manar ‘lenken, leiten, führen’. Einsichtig sind giraplattas ‘Plattenspieler’ und, modernisiert, giradiscs ‘Diskettenlaufwerk’, das aber schon durch ‘CD-Player’ internationalisiert wurde. 2 Die Bezeichnung telefonin für ‘Funktelefon’ konnte sich gegen natel als Produktname und besonders handy nicht durchsetzen, und das westschweizerische portable ist geographisch und mental schon zu weit entfernt.
Besonders die ersten grundlegenden Wörterbücher in der ersten Hälfte des 20. Jh. wählten die Analogie. Ein Teil ihrer Vorschläge konnte sich dank der Verbreitung in der damals sprachprägenden Schule sowie dem hohen Ansehen des Französischen und Italienischen durchsetzen und viele Germanismen ersetzen (Solèr 2005).

1 Die romanischen Beispiele sind in Rumantsch grischun (rg); die Regionalformen werden bezeichnet als sr = sursilvan, st = sutsilvan, sm = surmiran, pt = puter, vl = vallader; Französisch = frz., Italienisch = it., Oberitalienisch = oit., Rätoromanisch = rtr.
Wohl immer beeinflusste der Purismus sowohl außer- wie auch innersprachlich die terminologische Anpassung. Zu Beginn des 20. Jh. fielen besonders im Engadin wegen des Irredentismus viele Italianismen trotz ihrer linguistischen Nähe zum Rätoromanischen. Andererseits besteht das Dilemma zwischen neolateinischen Begriffen wie aspiratur ‘Staubsauger’ und mochetta ‘Spannteppich’, die aber weniger transparent sind als die transkodischen tschitschapulvra ‘Staubsauger’ und tarpun stendì, eig. ‘gespannter Teppich’. Und genau diese Nähe schafft viele neue Begriffe, die erst rückübersetzt, also deutsch gedacht, verstanden werden: maisa da mezdi, wörtlich ‘Mittagstisch’, für ‘gemeinsames Mittagessen für ältere Personen’ anstatt gentar cuminaivel.
Zu den produktiven endolingualen Prozessen gehört die Morphemableitung für die verschiedenen Kategorien. Trotz ihrer grundlegenden Systematik erkennt man zeittypische Vorlieben. Deverbale Agensbegriffe auf -ader, -atur, -adur sind häufig, während Formen auf -ari, z.B. sr: attentari ‘Attentäter’, teilweise mit lateinischen -ARIU-Formen zusammenfallen; splanari ‘Hobelbank’ ist insofern eine Falschbildung, als es kein Agens ist und auch nicht zu -ARIU gehört. Die -ist-Formen wie schurnalist ‘Journalist’ sind nur dann erfolgreich, wenn die Variante -cher nicht durch ein deutsches Analogon gestützt wird. Sonst gilt -ist als puristische High-Variante wie musicist ‘Musiker’, das mit musicher eine Low-Variante erzeugt.
Die Prozesse und teilweise deren Resultate werden gebildet mit -ziun wie furniziun ‘Lieferung’, allontanaziun ‘Entfernung’ und exemziun ‘Befreiung, Entbindung’ auf ganz unterschiedlicher romanischer Basis oder mit -ada wie zavrada ‘Schafscheide, Aussonderung’, scuntrada ‘Treffen, Zusammenkunft’ und, ziemlich heterogen, auzada ‘Stockwerk’, genauer ‘Anhebung’.
Auch andere Suffixe sind mehrwertig, so -al in fossal ‘Baugrube, Stadtgraben’, plazzal ‘Baustelle’, aber auch runal ‘Schlepplift’, ohne die -ALIS-Adjektive zu berücksichtigen. Allgemein bevorzugt das Romanische Periphrasen anstatt der stilistisch markierten Adjektive auf -abel, -ibel, -aivel, -ar, -ari, -ic und unterscheidet sich damit stark vom Französischen und Italienischen. 3

2 Als Abkürzung gilt mehrheitlich „CD“ m/f, während disc cumpact im romanischen Radio recht geläufig ist.
Mehrdeutig ist auch das Morphem -et: als Verkleinerung vegliet ‘(kleiner) Alter’, als Spezifikation furtget ‘Gabler’, rg: buffet ‘Blasebalg’, sr: suflet analog frz. „soufflet“, it. „soffietto“, sr: stizzet ‘Löschhorn’, rg: durmigliet ‘Siebenschläfer’ als Lehnbildung bzw. Calque für die kaum verständlichen Formen sr: glis, vl: glira aus lat. GLIS.
Interessant sind die Bildungen auf -era. Während die vom Verb abgeleiteten durchsichtig und verständlich sind, wie ardera 4 ‘Verbrennungsanlage’, mulschera ‘Melkstand’, cuera ‘Brutkasten’, erweisen sich die vom Nomen gebildeten als sehr undurchsichtig, wie balestrera ‘Schießscharte’, das primär mit sr: ballistrar ‘zappeln, störrisch sein, hapern’ assoziiert wird, oder sie wirken ambivalent wie sutgera ‘Sesselbahn’, bobera ‘Bobbahn’, cruschera sr: ‘Drehkreuz, Kreuz im Kreuzworträtsel’, rg: ‘Fadenkreuz’ und nicht beispielsweise ‘Kreuzung’, das cruschada heißt und homonymisch ist mit ‘Kreuzzug’. Hier erzeugten die verschiedenen Idiome trotz der sogenannten avischinaziun miaivla, der ‘sanften Annäherung’ der 60er Jahre, unterschiedliche Formen, die man zwar gegenseitig verstand, die aber nicht zu einer einheitlichen Sprachform beigetragen haben.
Grundsätzlich kann jede Entlehnung als Basiselement dienen, wobei sie mehr an psychologische als an linguistische Grenzen stößt. Anstatt neue Verben direkt mit dem Morphem -ar an fremde, meistens deutsche Stämme zu binden wie bremsar ‘bremsen’, spizzar ‘mit dem Spitzeisen ausschlagen’, cliccar ‘klicken’, checkar ‘merken’ (über das Deutsche aus dem Englischen), chiffar ‘kiffen’, die die früheren Verben auf -egiar/-iar ersetzen, bevorzugt man das analytische far il + deutscher Infinitiv: far il clichen ‘(den) Klick machen’.
Asyndetische Bildungen sind durchsichtig und treffend wie tirastruvas ‘Schraubenzieher’ und muntastgala ‘Treppenaufzug’, während tilavent ‘Düse’ in Richtung Wetter weist. ‘Mutterkuh’ vatga-mamma drückt auch in der veränderten Abfolge von Bestimmtem und Bestimmendem (Determinat-Determinant) das undefinierte Verhältnis aus. Obgleich analog zu biomassa ‘Biomasse’, ist biopur ‘Biobauer’ gewöhnungsbedürftig, aber nötig, weil pur da bio wie pur da latg ‘Milchbauer’ zuerst auf das Material oder die Herkunft verweist. Regelmäßige Bildungen wie telecumandar, microcirquit bleiben elitär.

3 Auf -ebel lautet einzig debel „schwach“. Formen wie frz. „grippe aviaire“ und it. „influenza aviaria“ für ‘Vogelgrippe’ sind im Romanischen fast unmöglich, und uaulic ‘den Wald betreffend’, selvus ‘waldig’ wirken exotisch.
4 Dazu „muss“ das Pledari Grond eine Periphrase implant per arder rument liefern; die Idiome verwenden zudem sr: barschera, vl: bruschaduoir.
Auch eine Aktualisierung durch Um- und Neudefinition bestehender, nicht mehr<br />
gebrauchter Begriffe ist möglich, trotz der unsicheren Übergangszeit mit Homonymie:<br />
Noch heute wird zavrar nur auf ‘Schafe scheiden’ beschränkt, trotz zavrader<br />
‘Sortwort’ und zavrar ‘sortieren’; man verwendet sortar oder das ungenaue separar<br />
‘trennen’. Eine wirkliche „Herausforderung“ bedeutet eben dessen Bezeichnung: Das<br />
surselvische provocaziun ist vermutlich zu nahe an die deutsche „Provokation“, so dass<br />
man heute vermehrt sfida, eine italienische Entlehnung im Engadinischen, verwendet,<br />
obwohl sfida, und ganz besonders sfidar in Rheinischbünden näher an sfidar, disfidar<br />
‘misstrauen’ liegt. Eine Erweiterung erfuhr das Verb sunar ‘musizieren’, im Engadin<br />
noch ‘Glocken läuten’, durch die Unterdifferenzierung von ‘spielen’ und ‘abspielen’:<br />
sunar ina platta, in(a) CD anstatt tadlar, far ir ina platta, in(a) CD. Dem Biologiebegriff<br />
tessì ‘Gewebe’ fehlt das typische Fadenmuster eines gewobenen Tuches, und er ist<br />
deshalb nicht alltagstauglich; stattdessen verwendet man konkret pel ‘Haut’, charn<br />
‘Fleisch’ bis zu material ‘Material’. Auch der Fachausdruck petroli für ‘Erdöl’ wird<br />
nur in der engeren Bedeutung von Lampenbrennstoff ‘Petrol’ wahrgenommen und<br />
erfordert infolgedessen ein Calque ieli (mineral) ‘Mineralöl’.<br />
Die gesamtromanische Standardisierung, angestrebt in Rumantsch grischun, zeigt<br />
im Alltag ihre Grenzen wegen einer hohen Heteronymie. Entweder verwendet man<br />
beide Ausdrücke wie taglia/imposta ‘Steuer’, buis/schluppet ‘Gewehr’, entschaiver/<br />
cumenzar ‘beginnen’ oder man vereinfacht unzulässig, indem man glisch für ‘Lampe’<br />
anstatt cazzola in der Surselva verwendet, wo glisch nur ‘Licht’ bedeutet, weil<br />
lampa aus puristischen Gründen ausfällt. Manchmal wird der ursprüngliche Begriff<br />
missverstanden und das Resultat ist unbrauchbar wie plimatsch ‘Kissen’ in bischla da<br />
plimatsch ‘Lagerhülse’ als Umdeutung eines horizontal beweglichen Holzes auf dem<br />
Wagen für eine rotierende Drehbewegung, der rullera ‘Rolllager’, cullanera ‘Kugellager’<br />
entsprechen. Heute schmunzelt man über die Pionierbezeichnung sr: tabla spurteglia<br />
(Gadola 1956:79) für eine ‘elektrische Schalt(er)tafel’ mit Unterdifferenzierung von<br />
‘elektrischer Schalter’ und ‘Schalterfenster’, die inzwischen zu tavla da distribuziun,<br />
cumond berichtigt wurde. Der ganze Bereich der Elektrizität mit ‘Strom’, ‘Spannung’,<br />
‘Hochspannung’ und ‘Starkstrom’ usw. wurde erst nach 1990 für das Pledari Grond<br />
terminologisch aufgearbeitet; umgesetzt ist es kaum, schließlich ist es ziemlich<br />
abstrakt. 5<br />
5. Auswirkungen der terminologischen Anpassung<br />
Außer in offiziellen Bereichen mit einem vorgeschriebenen Sprachgebrauch wie<br />
der dreisprachigen Kantonsverwaltung, der Gesetzgebung und der Herstellung von<br />
Schulbüchern, ist die terminologische Anpassung ein Zusammenspiel von glücklichen,<br />
5 Die ersten Fachvorschläge wurden 1917 im Chalender ladin veröffentlicht: Davart l’electricited. Terms<br />
romauntschs per l’electricited, acceptos dalla Commissiun Linguistica, 70-71.<br />
überzeugenden Vorschlägen auf der einen Seite und einer erfolgreichen Vermarktung<br />
auf der anderen Seite. Zuerst zur linguistischen Komponente:<br />
6. Linguistische Identifizierung<br />
Seit der Einführung des Rumantsch grischun 1982 bedeutet Terminologie nicht<br />
nur eine lexikalische Erweiterung, teilweise in einer Diglossie, sondern auch einen<br />
Paradigmenwechsel hin zum Einheitsstandard. Neben psychologischen und politischen<br />
Hindernissen bestehen auch syntaktisch-semantische Unterschiede. Außer bei<br />
Gesprächen in sektoriellen Sprachen zwischen Fachleuten sind die betroffenen<br />
Endanwender Laien, die Romanisch praktisch nur sprechen, und deshalb muss die<br />
Fachterminologie Folgendes beachten:<br />
• der Begriff muss durchsichtig, transparent sein sowohl<br />
elementar (wörtlich) als auch in der Bedeutung (inhaltlich):<br />
sufflafain ‘Heugebläse’, tirastapun ‘Zapfenzieher’, pendiculara ‘Seilbahn’,<br />
autpledader ‘Lautsprecher’, portasperanza ‘Hoffnungsträger’; problematisch ist<br />
sr: sclausanetg, rg: strasarsa und, trotz des Calques, cirquit curt ‘Kurzschluss’;<br />
camiun-tractor erkennt die ländliche Bevölkerung als ‘Ackertraktor’ und nicht als<br />
modernen ‘Sattelschlepper’, ‘LKW’;<br />
• er muss sich regional und idiomatisch anpassen:<br />
rg: tilastruvas, vl: tirascrauvs, sr: tilastrubas konnte in der Lumnezia zu<br />
tre(r)strubas angepasst werden. Schnell verliert sich aber der Grundbegriff,<br />
so für ‘Scheibenwischer’ mit der Vermischung von ‘wischen’, ‘waschen’ und<br />
‘trocknen’ rg: fruschavaiders, fruschets, sr: schubregiaveiders, furschets,<br />
st: furbaveders, sm: siaintaveders, pt: süjaintavaiders, terdschins, vl:<br />
süaintavaiders, terdschins und die schon erwähnten chasperets;<br />
rg: biancaria ‘Weißwäsche’ ist unverständlich im Romanischen mit nur alv als<br />
Benennung für ‘weiß’; üblich sind konkrete Begriffe wie sr: resti da letg, vl: linzöls<br />
‘Bettwäsche’, sr: resti suten ‘Unterwäsche’;<br />
• er sollte weder zur Homonymie noch zur Heteronymie führen:<br />
schluppet/buis ‘Gewehr’ sind regional so verankert, dass keiner<br />
davon sich durchsetzen kann; die ungenügende Unterscheidung<br />
von fittar ‘mieten’ und affittar ‘mieten, vermieten’ erfordert eine<br />
Periphrase prender/dar a fit ‘in Pacht nehmen/geben’;<br />
rg: taglia ‘Steuer’ ist bevorzugt worden, obwohl imposta produktiver<br />
wäre: *impostabel, *impostar, das im PG nur als Part. Perf. contribuziun<br />
imposta ‘auferlegte Leistung’ steht und kaum von rg: impostar<br />
‘aufgeben, einfächern’ als Buchwörter bedrängt würde;<br />
vl: cumischiun sindicatoria ‘Geschäftsprüfungskommission’ ist neben sr, rg:<br />
cumissiun da gestiun lediglich eine Scheinopposition, weil gestiun überall ‘Geschäft’<br />
bedeutet, aber trotzdem identifiziert sich die Bevölkerung zunehmend mit solchen<br />
Schibboleths als Gegenreaktion zu einer drohenden Vereinheitlichung;<br />
• er darf im Romanischen dem deutschen Diskurs und Geist 6 nicht zuwiderlaufen.<br />
Begriffe wie denticulara ‘Zahnradbahn’ sind offenbar zu wenig einsichtig und<br />
brauchen eine Periphrase viafier a roda dentada, um sich von dentera ‘Zahnspange’,<br />
dentadira ‘Gebiss, Zahnung’ und dentigliun ‘Bartenplatte (beim Wal)’ abzusetzen,<br />
weil viele Morpheme zu schwach und deshalb nicht produktiv sind. Trotzdem<br />
vermochten sich auch eigenständige Begriffe durchzusetzen: runal ‘Schlepplift’,<br />
sutgera ‘Sessellift’; rentier ‘Rentner’ ist umstritten wegen des deutschen Synonyms<br />
‘Ren, Rentier’ und vl: golier vermag den üblichen goli ‘Goali, Torhüter’ kaum zu<br />
vertreiben.<br />
Fast unüberwindliche Hindernisse für eine Standardisierung stellen die idiomatisch<br />
ausgeprägten Bereiche der Speisen und der häuslichen Tierwelt dar. Der Ersatz animal<br />
für sr: tier ‘Tier’ wird besonders im Engadin pejorativ als ‘Viech’ verstanden;<br />
dessen bes-cha wird wiederum mit sr: bestga und besonders bestia ‘Raubtier, Bestie’<br />
gleichgesetzt, denn biestga gilt dort nur für ‘Vieh, Großvieh’ und entspricht nicht vl:<br />
besch ‘Schaf’. Die exemplarische Vielfalt belegt die Bezeichnung der Körperteile beim<br />
Menschen (PLED RUMANTSCH 3 1984).<br />
Bei bilingualen Sprechern mit einer stark interferierten Sprache betrifft die<br />
linguistische Identifizierung nicht nur das definierte Romanisch als postulierte<br />
reine Sprache, sondern das gesamte Repertoire (Deutsch und andere Sprachen).<br />
Die romanische Form wirkt oft puristisch mit entsprechendem Registerwechsel und<br />
verletzt den oft einzigen verfügbaren tieferen Stil des Sprachverwenders; es entsteht<br />
ein neues Register. In der geläufigen Jugendsprache wirken magiel ‘Glas’ und gervosa<br />
‘Bier’ stilistisch fremd neben glas und pier, und sa participar ‘sich beteiligen’<br />
entspricht aus sozialkommunikativen Gründen nicht far cun ‘mitmachen’, das man<br />
ersetzen will (Solèr 2002:261).<br />
7. Linguistische Bereicherung und Unsicherheit<br />
Die Terminologie will eine umfassendere Verwendbarkeit der Sprache mit neuen<br />
Domänen erreichen, aber sie soll auch die linguistische Ausdrucksmöglichkeit<br />
erweitern und so das Romanische als Fachsprache fördern. Wohl sind die derivativen<br />
Prozesse linguistisch geeigneter als die analytischen, aber diese werden wegen der<br />
6 Gemäss Ascoli (1880-83:407) „materia romana in spirito tedesco“ und Solèr (2002:261) „mentale<br />
Symbiose“.<br />
höheren Transparenz und der Nähe zum Deutschen bevorzugt; auch psychologische<br />
Gründe scheinen eine Hürde darzustellen. Spontane und spielerische Bildungen sind<br />
Einzelfälle ohne Wirkung, so idear ‘die Idee haben’, impulsar ‘den Impuls geben’ oder<br />
praulistic ‘märchenhaft’. Besonders die Zeitungsleute des 19. Jh. mussten mehr oder<br />
weniger eine neue Sprache für die sich stark verändernde Umwelt erschaffen, weil<br />
bis anhin nur eine religiöse und juristische Fachsprache bestand und Deutsch keine<br />
Alternative war. Noch heute sind die Medien Pioniere, denken wir an ‘Seebeben’,<br />
‘SARS’, ‘Herdenschutzhund’ und ‘Vogelgrippe’, aber gelegentlich verhindern<br />
notdürftige abstrakte Stelzenbegriffe eine genaue und kohärente Terminologie: chaun<br />
da protecziun ‘Herdenschutzhund’ anstatt chaun-pastur, chaun pertgirader; forzas<br />
da segirezza ‘Sicherheitskräfte’ anstatt eines konkreten Begriffs armada, polizia;<br />
bains da provediment ‘Versorgungsgüter’ für victualias, provediment oder effectiv,<br />
populaziun da peschs ‘Fischbestand, -population’, das romanisch als ‘Fischbevölkerung’<br />
verstanden wird anstatt (ils) peschs als Kollektiv.<br />
Einzelelemente lassen sich problemlos austauschen, während mehrgliedrige Begriffe<br />
die bestehende Syntax überfordern, indem sie sie verändern oder eine systemfremde<br />
Syntax übernehmen:<br />
• Verben mit Präposition im abstrakten Sinn:<br />
metter enturn ideas ‘Ideen umsetzen’ verstanden als ‘Ideen umlegen, töten’<br />
anstatt realisar, concretisar ideas; sr: fatg en lavur cumina priu ora la lavur da<br />
professiun ‘in Gemeinwerk gemacht ausgenommen die Facharbeit’ anstatt auter<br />
che, cun excepziun da, danor ‘anders als, mit Ausnahme von, außer’;<br />
• Nominalisierung und Nominalketten:<br />
sm: La discussiun cun la donna ò savia neir exequeida sainza grond disturbi da<br />
canera ‘das Gespräch mit der Frau konnte ohne größere Belästigung durch Lärm<br />
durchgeführt werden’ anstatt ins ò savia discorrer cun la donna senza neir disturbo<br />
‘man konnte mit der Frau sprechen, ohne gestört zu werden’;<br />
• Stelzensätze und Leerformulierungen:<br />
far adiever dals meds publics da transport en Engiadina Bassa ‘Gebrauch machen<br />
von den öffentlichen Verkehrsmitteln im Unterengadin’ für ir cun il tren ed auto<br />
da posta; exequir lavurs da surfatscha ‘Oberflächenarbeiten ausführen’ anstatt far<br />
la cuvrida ‘die Abdeckung (der Straße) machen’.<br />
Mit diesen transkodischen Bildungen könnte man sich linguistisch allenfalls<br />
abfinden, wenn das Romanische damit nicht noch die Identität verlieren würde.<br />
Komplexe Begriffe widersprechen zwar der Sprachgewohnheit, der Tradition der<br />
Romanischsprecher, aber die abstrakte, sperrige, styroporartige Syntax hat sich<br />
vom traditionellen Romanisch so weit entfernt, dass man es nur über das Deutsche<br />
versteht, also aus der Rückübersetzung.<br />
8. Sozialpsychologische Aspekte<br />
Das Romanische wird in dörflichen Sprachgemeinschaften und teilweise in<br />
den Regionalzentren verwendet; es schafft dort eine lokale Identifikation unter<br />
den Romanischsprechern, ganz besonders den Einheimischen, und steht für das<br />
Überschaubare gegenüber dem Fremden. Wenn man aber beruflich oder mit einem<br />
nichtromanischen Partner eine andere Sprache verwendet, so tut man das emotionslos.<br />
Und wenn manche wegen ihrer fehlenden romanischen Fachkompetenz Deutsch<br />
verwenden, dann ist das eher ein Reflex der Sprachpolitik, als dass man sich schämt. Es<br />
ist zudem eher selten, dass man bewusst neue romanische Ausdrücke sucht, denn allzu<br />
oft vergessen besonders die Sprachverwalter und Sprachpfleger, dass das Romanische<br />
oft nur informell gesprochen wird und endgültig eine Ko-Sprache des Deutschen ist.<br />
9. Politisch-wirtschaftliche Aspekte<br />
Jede Sprache kann zwar materiell (Terminologie, Neologie) erfolgreich, sozusagen<br />
im Labor erneuert werden, aber deren Verbreitung, Implementierung, kann nur die<br />
Anwendungsseite (Produkte, Sprachträger usw.) bewirken. Beim Romanischen hingegen<br />
sind die anwendungsorientierten Bedingungen überhaupt nicht oder nur schwach<br />
erfüllt und auch der technisch-linguistische Bereich ist nicht eindeutig bestimmt<br />
(Entscheidungskompetenz, Verbindlichkeit, Verbreitung). Die enge und fast intime<br />
Sprachgemeinschaft fordert vom linguistischen Bearbeiter, der zugleich selber betroffen<br />
ist, eine technisch-linguistische Spracherneuerung, die einerseits systematisch ist und<br />
andererseits auch eine sichere Triviallösung liefert. Diese Individualisierung beeinflusst<br />
trotzdem die Spracherneuerung weniger als andere Rahmenbedingungen, nämlich das<br />
sprachliche Umfeld, die Nützlichkeit und die Kleinräumigkeit.<br />
Im gemeinsamen Wirtschafts-, Verkehrs-, Ausbildungs- und Kommunikationsraum<br />
mit der deutschen Schweiz fehlt dem Romanischen die konkrete, durchgehende<br />
Anwendung, die Kommerzialisierung der Sprache, außer in den gesteuerten Bereichen<br />
der Verwaltung und Volksschule, in denen sie ohne direkte Konkurrenz ist.<br />
10. Verbreitung und Nachhaltigkeit<br />
Im ganzen Anpassungsprozess erweist sich – bei einer minorisierten Sprache<br />
nicht unerwartet – ausgerechnet die wichtigste Phase, nämlich die Verbreitung und<br />
systematische Anwendung, als schwächstes Glied. Die Anpassung dringt nicht direkt<br />
zum Anwender im Berufsalltag vor, sondern er muss sie bewusst holen und auch bereit sein,<br />
sie zu verwenden; gezwungen wird er kaum und wenn, dann nur in einzelnen Bereichen<br />
von befohlener Mehrsprachigkeit. Zudem verstreicht häufig so viel Zeit zwischen dem<br />
Vorschlag und der Anwendung beim Endverbraucher, dass der Begriff im technischen<br />
Bereich entweder schon veraltet ist oder dass die entlehnte Erstbezeichnung oder<br />
ein Trivialbegriff sich eingebürgert hat. Häufig überrumpelt die Entwicklung aber die<br />
Sprache regelrecht, so z. B. im Informatikbereich.<br />
Beim Start des Rumantsch grischun 1982 war auch die Informatik ein relativ neues<br />
und unbekanntes Werkzeug, so dass in dieser Phase auch die romanischen Begriffe<br />
dafür geschaffen werden konnten. In der anschließenden rasanten Verbreitung<br />
der Informatik sind diese aber durch die internationalen bedrängt oder verdrängt<br />
worden, so: ordinatur ‘Rechner, Computer’ > computer; platta fixa ‘Festplatte’ ><br />
HD; arcunar ‘speichern’ > segirar ‘sichern’; datas ‘Daten’, datoteca, ‘Datenfile’ ><br />
file; actualisaziun, cumplettaziun ‘Update’ > update, palpader ‘Scanner’ > scanner.<br />
‘Laptop’ hat man direkt übernommen ohne portabel vorzuschlagen.<br />
Die Zeitungsredaktoren des 19. Jh. konnten ihre neuen, wenig systematischen<br />
Begriffe unmittelbar den Lesern konkurrenzlos vermitteln; aber sie verliefen oft im<br />
Sand, weil sie nicht systematisch gesammelt und weiter verbreitet wurden. Diese<br />
Schwächen versuchte die für das Engadin 1919 begonnene themenorientierte Reihe<br />
„S-CHET RUMANTSCH“ in der Zeitung und später in Buchform zu überwinden. Ich<br />
möchte es nicht unterlassen, einige phantasievolle Verbreitungsarten wenigstens zu<br />
erwähnen:<br />
• mit Metzgereibegriffen bedruckte Papiertüten<br />
• Beschriftungen der Produkte in den Auslagen<br />
• Sportterminologie auf Tafeln in den Turnhallen<br />
• zweisprachige Rechnungsformulare für Autowerkstätten<br />
• Beschreibung und Gebrauchsanweisung auf Produktepackungen 7<br />
Diese direkten Anwendungen wurden durch sekundäre Listen ergänzt und noch<br />
heute veröffentlichen einzelne Zeitungen regelmäßig kleine Wortlisten.<br />
Die wohl erfolgreichste Verbreitung brachte die Schule bis zu den großen<br />
Strukturänderungen der 70er Jahre des letzten Jahrhunderts, die eine noch<br />
mehrheitlich nur-romanische und ländliche Bevölkerung inhaltlich und sprachlich<br />
in eine neue Welt einführten. Diese Periode dauerte so lange, dass eine Schulbuchreihe noch<br />
über zwei Schulgenerationen reichte und die gelernten Neuerungen fast lebenslänglich<br />
galten. In diese Zeit fallen auch die ersten systematischen Wörterbücher.<br />
7 Eines der wenigen Beispiele ist die „Lataria Engiadinaisa SA, CH-7502 Bever“; in den 90er Jahren waren<br />
einige Tierarzneien romanisch beschriftet; die Anschrift Adina Coca Cola blieb ein Werbegag der 90er<br />
Jahre.<br />
Wörterbücher und Lexikographie sind unverzichtbare Hilfsmittel für jede Sprache,<br />
aber wenig wirkungsvoll für den Sprachverwender, weil er bewusst und außerhalb<br />
des Gesprächs, sozusagen metakommunikativ, auf sie zugreifen muss. Sie sind im<br />
Romanischen zudem nur referenziell und liefern eine Ersatzbezeichnung für schon<br />
bekannte – zwar deutsche – Ausdrücke, aber trotzdem verfügbar im Zeicheninventar<br />
der Romanischsprecher (Reiter 1984:289).<br />
Während die hochspezialisierte Terminologie in keiner Sprache zum allgemeinen<br />
Wortschatz gehört, sollten die neuen Begriffe des modernen Alltags wie Verkehr,<br />
Kommunikation, Unterhaltung, Lifestyle, aber auch der neueren Verwaltung<br />
umgesetzt werden. Für eine umfangreichere Durchsetzung, Implementierung, fehlt<br />
das romanische Umfeld sowie der Terminologiediskurs. Die bestehenden Medien<br />
erfüllen lediglich eine lokale und emotionale Rolle gegenüber einem umfassenden<br />
deutschsprachigen Angebot, und so entwickelt sich auch kaum eine Sprachnorm.<br />
Die Verwendung des Romanischen allgemein, und einer offiziellen Sprachform<br />
im besonderen, anstatt des Deutschen oder des Englischen ist nur ausnahmsweise<br />
bei Kulturtouristen und Heimwehromanen ein kommerzieller Faktor; sonst kann es<br />
sogar hinderlich sein, wie die Reaktionen der Bevölkerung auf jegliche Anpassung<br />
eindrücklich belegen. Das Romanische besitzt kein geschütztes Sprachgebiet und seine<br />
Verwendung kann gesetzlich kaum oder nicht durchgesetzt werden wie beispielsweise<br />
in Frankreich.<br />
Die kantonale Verwaltung verwendet die drei offiziellen Kantonssprachen in den<br />
Veröffentlichungen und im Internet (Erklärungen, Berichte, Anleitungen, Hinweise,<br />
Abstimmungen usw.). In der Verwaltungstätigkeit hingegen ist das Romanische<br />
gegenüber dem Deutschen besonders im Fachbereich eingeschränkt: Die romanische<br />
Steuererklärung gibt es nicht digital, verschiedene amtliche Formulare können online<br />
nur deutsch und gelegentlich italienisch ausgefüllt werden. Offensichtlich trifft<br />
Folgendes für die regionalen Organisationen zu, die direkt mit der Bevölkerung arbeiten:<br />
Nur Romanisch ist selten, Zweisprachiges häufiger, eher plakativ, und mehrheitlich gilt<br />
Deutsch. Das ist auch eine Folge des ‘polykephalen’, sprich regionalisierten<br />
Romanischen als Teil einer deutschen Umwelt, und es verunmöglicht eine einheitliche<br />
Fachterminologie und ihre einheitliche umgangssprachliche Umsetzung. So bestätigt<br />
sich die Feststellung von Haarmann (1993:108) „Hier liegt ein prinzipielles Problem<br />
des Minderheitenschutzes. Eine indominante Sprache hat zwar grundsätzlich bessere<br />
Chancen zu überleben, wenn ihre Verwendung in Bereichen des öffentlichen Lebens<br />
garantiert wird, es besteht aber keine automatische Wechselbeziehung zwischen<br />
23
einer sprachpolitischen Förderung und der Erhaltung dieses Kommunikationsmediums<br />
als Mutter- und Heimsprache“.<br />
Die gewinnorientierte Wirtschaft wählt dementsprechend die beste Sprache.<br />
Romanisch verwendet sie identifikatorisch und emotional in den rtr. Regionen, aber nicht<br />
als durchgehende Plattform (Banken, Versicherungen). „Unique Selling Proposition“ ist<br />
ein Schlagwort und wird bestenfalls im Mäzenatentum eingelöst. Ohne die operative<br />
Bedeutung passt sich keine Fachsprache an, oder sie wird nicht systematisch und<br />
einheitlich verwendet, sondern als lokale und stilistische (diglossische) Variante,<br />
banalisiert als Trivialterminologie. Dann ist auch die domänenspezifische Verwendung<br />
des Romanischen und deren Aktualisierung weitgehend illusorisch, und auch die<br />
bescheidene berufliche Aus- und Weiterbildung dient bestenfalls für romanische<br />
Infrastrukturbetriebe (Lia Rumantscha, Radio, Fernsehen und die Schulunterstufe).<br />
Das Romanische passt sich zwar den neuen Erfordernissen dauernd an, aber weil<br />
diese Entwicklung eher spontan als geordnet erfolgt, und weil sie eher die gesprochene<br />
Sprache mit einer Trivialterminologie betrifft, fördert sie die zweisprachige<br />
Diglossie mit dem Schriftdeutschen in allen Außenbeziehungen und sogar unter<br />
Romanischverwendern.<br />
11. Ausblick – aber kaum die Lösung<br />
Das klingt nach einer Bankrotterklärung. Das ist es nicht, aber man muss sich<br />
auf die Grundlagen zurückbesinnen und in erster Linie die Randbedingungen, die<br />
soziolinguistischen, politischen und wirtschaftlichen Voraussetzungen ernst(-er)<br />
nehmen.<br />
Zum Ersten die Terminologie: Anstatt der akademischen und direkt kopierten,<br />
sterilen Erneuerungen muss man sich um assoziative – und überschreite sie auch die<br />
Einzelsprache – einsichtige oder sogar spielerische, aber praxistaugliche Benennungen<br />
bemühen, die lebensnah sind und genaue Inhalte sprachlich sinnvoll und kulturell<br />
verträglich umsetzen können.<br />
Die Hauptschwierigkeit ist und bleibt die Verbreitung. Wenn eine Sprache wie<br />
das Romanische mehr kulturell, ideell und politisch, als wirtschaftlich begründet ist,<br />
erweist sich deren Anpassung (Modernisierung und Standardisierung) umso weniger<br />
durchsetzbar.<br />
Psychologischer Druck oder die Drohung eines Sprachniedergangs wirken vielleicht<br />
kurzfristig, erwecken Hoffnungen, aber sie wirken niemals nachhaltig.<br />
Dass sich der Riesenaufwand für die Romanisierung des ganzen MS-Office mit<br />
der Orthografiekontrolle (Spell-Checker) nicht lohnt, ist leicht vorauszusagen; das<br />
24
Produkt spricht eine zu kleine Gruppe an und das Bedürfnis nach romanischen Texten<br />
kann man nicht künstlich erzeugen. 8 Mit Sicherheit hilfreich und seit bald 15 Jahren<br />
nützlich erwies sich die Terminologiearbeit im Pledari Grond der Lia Rumantscha; sie<br />
ist zwar bescheidener, dafür praxisbezogen, dient zudem als Hilfsbrücke zu<br />
den Idiomen und sollte somit Spannungen abbauen.<br />
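Der Kern einer solchen Orthografiekontrolle lässt sich in wenigen Zeilen andeuten. Die folgende Skizze ist ein hypothetisches Minimalbeispiel (Wortliste, Funktionsnamen und Beispielwörter sind frei gewählt, keine echte rumantsche Ressource): Prüfung gegen eine Wortliste und einfache Korrekturvorschläge über die Editierdistanz.

```python
# Skizze: wortlistenbasierte Orthografiekontrolle mit einfachen Vorschlaegen.
# WORDLIST ist ein frei erfundenes Beispiel, keine reale Sprachressource.

WORDLIST = {"chasa", "dunna", "bun", "di", "grond"}

def check(word):
    """True, wenn das (kleingeschriebene) Wort in der Wortliste steht."""
    return word.lower() in WORDLIST

def suggest(word, max_dist=1):
    """Vorschlaege: Wortlisteneintraege mit Levenshtein-Distanz <= max_dist."""
    def dist(a, b):
        # klassische Levenshtein-Distanz per dynamischer Programmierung
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]
    return sorted(w for w in WORDLIST if dist(word.lower(), w) <= max_dist)

print(check("chasa"))    # True
print(suggest("gronda"))  # ['grond']
```

Die eigentliche Schwierigkeit liegt, wie im Text argumentiert, nicht in dieser Technik, sondern im Aufbau und in der Pflege der Wortliste selbst.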
Für eine isolierte Kleinsprache ist es aber unabdingbar, die Sprachverwender<br />
schnell, unkompliziert und konkret zu unterstützen. Die privaten und kollektiven<br />
Sprachverwalter wie die Lia Rumantscha und der Kanton mit seiner umfassenden<br />
Tätigkeit können die Sprachverwender am ehesten überzeugen: mit gebrauchsfertigen<br />
Vorlagen, schnellen Übersetzungen, gefälligen Texten und mit einem umsichtigen,<br />
engen Coaching bei der Sprachverwendung; so wären auch die Empfänger eingebunden.<br />
Für diese Aufgaben braucht es Terminologiearbeit. Das ist ein guter Anfang und ist<br />
auch zu bewältigen. Die folgenden ebenso notwendigen Schritte müssen zuallererst<br />
die Sprachverwender tun.<br />
8 Versuche der LR um 1990, digitales Material für Handwerksbetriebe herzustellen und zu vertreiben,<br />
scheiterten an den einzelbetrieblichen „Branchenlösungen“, die miteinander unverträglich sind, an der<br />
Einheitsform Rumantsch grischun, an der gewohnten deutschen Berufssprache sowie der Einstellung<br />
gegenüber der deutschsprachigen Kundschaft.<br />
Bibliographie<br />
Ascoli, G.I. (1880-1883). “Annotazioni sistematiche al Barlaam e Giosafat soprasilvano.”<br />
Archivio glottologico italiano. Roma: Loescher, 7:365-612.<br />
Carisch, O. (1848). Taschen-Wörterbuch der Rhætoromanischen Sprache in<br />
Graubünden. Chur: Wassali.<br />
Coray, R. (1993). “La mumma romontscha: in mitos.” ISCHI 77, 4, 146-151.<br />
Decurtins, A. (1993). “Wortschatz und Wortbildung – Beobachtungen im Lichte der<br />
bündnerromanischen Zeitungssprache des 19./20. Jahrhunderts.” Rätoromanisch,<br />
Aufsätze zur Sprach-, Kulturgeschichte und zur Kulturpolitik. Romanica Rætica 8,<br />
Chur: Società Retorumantscha, 235-254.<br />
EUROMOSAIC (1996). Produktion und Reproduktion der Minderheitensprachgemeinschaften<br />
in der Europäischen Union. Brüssel/Luxemburg: Amt für amtliche<br />
Veröffentlichungen der EG.<br />
Gadola, G. (1956). “Contribuziun alla sligiaziun dil problem puril muntagnard.” Igl<br />
Ischi, 42, 33-93.<br />
Haarmann, H. (1993). Die Sprachenwelt Europas. Geschichte und Zukunft der<br />
Sprachnationen zwischen Atlantik und Ural. Frankfurt: Campus.<br />
Jaberg, K. & Jud, J. (1928). Der Sprachatlas als Forschungsinstrument. Halle:<br />
Niemeyer.<br />
Pledari Grond (2003). deutsch-rumantsch, rumantsch-deutsch, cun conjugaziuns dals<br />
verbs rumantschs. Cuira: Lia rumantscha [CD-ROM].<br />
PLED RUMANTSCH/PLAID ROMONTSCH 3 (1984). Biologia. Cuira: Lia rumantscha.
RÄTOROMANISCH (2004). Facts & Figures. Cuira: Lia rumantscha.<br />
Reiter, N. (1984). Gruppe, Sprache, Nation. Wiesbaden: Harrassowitz.<br />
S-CHET RUMANTSCH (1917-1963). Fögls per cumbatter la caricatura nella lingua<br />
ladina. Scuol: Uniun dals Grischs.<br />
Solèr, C. (1986). “Ist das Domleschg zweisprachig?” Bündner Monatsblatt, 11/12, 283-<br />
300.<br />
Solèr, C. (2002). “Spracherhaltung – trotz oder wegen des Purismus. Etappen des<br />
Rätoromanischen.” Bündner Monatsblatt, 4, 251-264.<br />
Solèr, C. (2005). “Co e cura che la scrittira emprenda rumantsch. Cudeschs da scola per<br />
la Surselva.” Annalas da la Societad Retorumantscha. Cuira: Societad retorumantscha,<br />
7-32.<br />
Implementing NLP-Projects for Small<br />
Languages: Instructions for Funding Bodies,<br />
Strategies for Developers<br />
Oliver Streiter<br />
This research starts from the assumption that the conditions under which ‘Small<br />
Language’ Projects (SLPs) and ‘Big Language’ Projects (BLPs) are conducted are<br />
different. These differences have far-reaching consequences that go beyond the<br />
material conditions of projects. We will therefore try to identify strategies or<br />
techniques that aim to handle these problems. A central idea we put forward is<br />
pooling the resources to be developed with other similar Open Source resources. We<br />
will elaborate the expected advantages of this approach, and suggest that it is of such<br />
crucial importance that funding organisations should put it as condicio sine qua non<br />
into the project contract.<br />
1. Introduction: Small Language & ‘Big Language’ Projects - An Analysis of<br />
their Differences<br />
Implementing NLP-projects for Small Languages: Is this an issue that requires<br />
special attention? Are Small Language Projects (SLPs) different from ‘Big Language’<br />
Projects (BLPs)? What might happen if SLPs are handled in the same way as BLPs? What<br />
are the risks? How can they be reduced? Can we formulate general guidelines so that<br />
such projects might be conducted more safely? Although the processing of minority<br />
languages and other Small Languages has been the subject of a series of workshops, the<br />
topic itself has barely been tackled as such. While most contributions discuss specific<br />
achievements (e.g. an implementation or transfer of a technique from Big Languages<br />
to Small Languages), only a few articles rise to a higher level of reflection on<br />
how Small Language Projects might be conducted in general.<br />
In this contribution, we will compare SLPs and BLPs at the abstract schematic<br />
level. This comparison reveals differences that affect, among other things, the status<br />
of the researcher, the research paradigm to be chosen, the attractiveness of the<br />
research for young researchers, as well as the persistence and availability of the<br />
elaborated data - all to the disadvantage of Small Languages. We will advance one far-<br />
reaching solution that overcomes some of these problems inherent to SLPs, that is,<br />
to pool the developed resources with other similar Open Source resources and make
them freely available. We will discuss, step by step, the possible advantages of this<br />
strategy, and suggest that this strategy is so promising and so crucial for the survival<br />
of the elaborated data that funding organisations should put it as condicio sine qua<br />
non into their project contract.<br />
Let us start with the comparison of BLPs and SLPs.<br />
• Competition in Big Languages: Big Languages are processed in more than<br />
one research centre. Within one research centre more than one group might work on<br />
different aspects of this single language. The different centres or groups compete<br />
for funding, and thus strive for scientific reputation (publications, membership in<br />
exclusive circles, membership in decision-making bodies) and try to influence the<br />
decision-making processes of funding bodies.<br />
• Niches for Small Languages: Small Languages are studied by individual<br />
persons, small research centres or cultural organisations. Small Languages create a<br />
niche that protects the research and the researcher. Direct competition is unusual.<br />
This, without doubt, is positive. On the negative side, however, we notice that<br />
methodological decisions, approaches and evaluations are not challenged by<br />
competitive research. This might lead to a self-protecting attitude that ignores<br />
inspiration coming from successful comparable language projects.<br />
• Big Languages Promise Money: There is commercial demand for BLPs<br />
as can be seen from the funding that companies like Google or Microsoft provide<br />
for NLP projects. As these companies try to obtain a relative advantage over their<br />
competitors, language data, algorithms, and so forth are kept secret.<br />
• There is No Money in Small Languages: Those organisations that<br />
fund BLPs are not interested in SLPs. If a Small Language wants to integrate its<br />
spellchecker in Microsoft Word, the SLP has to provide the linguistic data with no<br />
or little remuneration from Microsoft.<br />
• Big Languages Hide Data: Language resources for Big Languages are and<br />
have been produced many times in different variants before they find their way<br />
into an application, or before they are publicly released. Since research centres<br />
for Big Languages compete for funding, recognition and commercialisation, every<br />
centre hopes to obtain a relative advantage over its competitors by keeping<br />
developed resources inaccessible to others. 1<br />
1 That this secretiveness might have any advantages at all can be called into question. Compare, for<br />
example, the respective advantages Netscape or Sun had from handing over their resources to the Open<br />
Source Community. Netscape-based browsers by far outperform their previous competitors such as<br />
Internet Explorer or Opera and the data handling in Open Office is going to be copied by the competitor<br />
Microsoft Office. As for the scientific reputation, people cited frequently are those who make available<br />
their resources including dictionaries and corpora (e.g. Eric Brill, Henry Kucera, W. Nelson Francis,<br />
• Small Languages Shouldn't Do So: For Small Languages, such a waste of<br />
time and energy is unreasonable. Resources that have been built once should be<br />
made freely available so that new, related projects can build on top of them, even<br />
if they are conducted elsewhere. Without direct competition, a research centre<br />
should have no disadvantage from making its resources publicly available. Reasons for<br />
not distributing the developed resources are most likely due to the misconception<br />
that sharing the data equals losing the copyright on the data.<br />
However, under the General Public License (a license that might be used in SLPs),<br />
the distribution of resources requires that copies must contain the appropriate<br />
copyright notice (so that the rights remain with the author of the resources). In<br />
addition, it has to contain the disclaimer of warranty, so that the author is not liable<br />
for any problems others have with the data or programs. Somebody modifying the<br />
data or programs cannot distribute the modification unless the source code is also<br />
made available, so that everybody, including the author, can take over the resources<br />
for further improvements.<br />
The consequence of not sharing the data (i.e., keeping the data on the researcher’s<br />
hard disk) is that the data will almost certainly be lost within ten years of its last<br />
modification. 2<br />
• BLPs Overlap in Time: Successive BLPs create a research continuum. In this research<br />
continuum, researchers and resources can develop and adapt to new paradigms<br />
(defined as “exemplary instances of scientific research”, Kuhn 1996/1962) or new<br />
research guidelines. In fact, a large part of a project is concerned with tying<br />
past and future projects together. Data is re-worked, re-modelled and thus<br />
kept in shape for the future.<br />
• SLPs are Discontinuous: There might be temporal gaps between one SLP<br />
and the next one. This threatens the continuity of the research, forces researchers<br />
to leave the research body, or might endanger the persistence of the elaborated<br />
2 Reasons for the physical loss of data are: Personal mobility (e.g. after the retirement of a collaborator,<br />
nobody knows that the data exists, or how it can be accessed or used). Changes in software formats<br />
(e.g. the format of backup programs, or changes in the SCSI controller make the data unreadable).<br />
Changes in the physical nature of external memories (punch card, soft floppy disk, hard floppy disk,<br />
micro floppy, CD-ROM, magnetic tape, external hard disk, or USB-stick) and the devices that can read<br />
them. Hard disk failure (caused by firmware corruption, electronic failure, mechanical failure, logical<br />
failure, or bad sectors). The lifetime of storage media is limited: tapes (2 years), magnetic media (5-<br />
10 years) and optical media (10-30 years), depending very much on the conditions of usage and storage<br />
(temperature, light and humidity).<br />
data. The data is unlikely to be re-worked, or ported to new platforms or formats,<br />
and thus it risks becoming obsolete or unreadable.<br />
• BLPs Rely on Specialists: The bigger the group in a BLP, the more<br />
specialists in programming languages, databases, linguistic theories, parsing, and<br />
so forth it will integrate. Specialists make the BLP autonomous, since specific<br />
solutions can be fabricated when needed.<br />
• All-rounders at Work: Specialisation is less likely to be found in SLPs,<br />
where one person has to cover a wider range of activities, theories, tools, and<br />
so forth, in addition to administrative tasks. Thus, SLP projects cannot operate<br />
autonomously. They largely depend on toolkits, integrated software packages, and<br />
so forth. Choosing the right toolkit is not an easy task. It not only decides the<br />
success or failure of the project, but will also influence the course of the research<br />
more than the genius of the researcher. If a standard program is simply chosen<br />
because the research group is acquainted with it, a rapid project start might be<br />
bought at the price of a troublesome project history, data that is difficult to port<br />
or upgrade, or data that does not match the linguistic reality it should describe.<br />
• BLPs Play with Research Paradigms: BLPs can freely choose their<br />
research paradigm and therefore frequently follow the most recent trends in<br />
research. Although different research paradigms offer different solutions and have<br />
different constraints, BLPs are not so sensitive to these constraints and can cope<br />
successfully with any of them. BLPs must not only be capable of handling new<br />
research paradigms (otherwise, the new research paradigms could not survive);<br />
they are even expected to explore new research paradigms, as they are the only<br />
ones having the gross scientific product that can cope with fruitless attempts and<br />
time-consuming explorations. Indeed, we observe that BLPs frequently turn to the<br />
latest research paradigm to gain visibility and reputation. Shifts in the research<br />
paradigm might make it necessary to recreate language resources in another<br />
format or another logical structure.<br />
• SLPs Depend on the Right Research Paradigm: SLPs do not have at their disposal<br />
rich and manifold resources (dictionaries, tagged corpora, grammars, tag-sets, and<br />
taggers) in the same way as BLPs do. The research paradigm should thus be chosen<br />
according to the nature and quality of the available resources, and not according<br />
to the latest fashion in research. This might imply the usage of a) example-<br />
based methods, since they require less annotated data (cf. Streiter & de Luca<br />
[2003]), b) unsupervised learning algorithms, if no annotations are available, or c)<br />
hybrid bootstrapping methods (e.g. Prinsloo & Heid, this volume), which are almost<br />
impossible to evaluate. Young researchers may experience these restrictions as a<br />
conflict. On one hand, they have to promote their research, ideally in the most<br />
fashionable research paradigm; on the other hand, they have to find approaches<br />
compatible with the available resources. Fortunately, after the dust of a new<br />
research trend has settled, 3 new research paradigms are looked at in a less mystified<br />
light, and it is perfectly acceptable for SLPs to stick to an older research paradigm,<br />
if it conforms to the overall requirements. 4<br />
• Model Research in Big Languages: Research on Big Languages is frequently<br />
presented as research on that respective language and, in addition, as research on<br />
Language in general. The same piece of research might thus be sold twice. From<br />
this, BLPs derive not only a greater reputation and better project funding, but also<br />
an increased attractiveness of this research for young researchers. Big Languages,<br />
as a consequence, are those languages for which, by virtue of general linguistic<br />
accounts, documentary and pedagogic resources are developed. Students are<br />
trained in and with these languages in the most fashionable methods, which they<br />
learn to consider as superior.<br />
• SLPs Represent Applied Research - at Best!: SLPs are less likely to sell<br />
their research as research on Language in general. In fact, little else but research<br />
on English counts as research on Language; research on other languages is considered<br />
research on a specific language at best. 5 The less general the scope of research, the less likely it is to be<br />
3 I have taken this term from Harold Somers (1998).<br />
4 Although Big Language research centres are free to choose their research paradigm, they more often<br />
than not are committed to a specific research paradigm (i.e., the one they have been following for<br />
years or in which they play a leading role). This specialisation of research centres in a research paradigm<br />
is partially desirable, as only specialists can advance the respective paradigm. However, when they do<br />
research on Small Languages, either to extend the scope of the paradigm or to access alternative<br />
funding, striking mismatches can be observed between paradigm and resources. Such mismatches are<br />
of no concern to the Big Language research centre, which, after all, is doing an academic exercise,<br />
but they should be closely watched in SLPs, where such mismatches will cause the complete failure of<br />
the project. For example, Machine Translation has two main approaches: rule-based approaches, where<br />
linguists write the translation rules; and corpus-based approaches, where the computer derives the<br />
translation rules from parallel corpora. Corpus-based approaches can be statistics-based or example-based.<br />
Recently, RWTH Aachen University, known for its cutting-edge research in statistical Machine<br />
Translation, proposed a statistical approach to sign language translation (Bungeroth & Ney 2004).<br />
One year later Morrissey & Way (2005) from Dublin City University, a leading agent in Example-based<br />
Machine Translation, proposed “An Example-Based Approach to Translating Sign Languages.” The fact,<br />
however, that parallel corpora involving at least one sign language are extremely rare and extremely<br />
small is glossed over in both papers, as if it did not affect the research. In other words, the research<br />
builds on a type of resource that does not actually exist, just to please the paradigm.<br />
5 In a Round Table discussion at the 1st SIGHAN Workshop on Chinese Language Processing, hosted by<br />
ACL in Hong Kong, 2000, a leading researcher in Computational Linguistics vehemently expressed his<br />
dissatisfaction in being considered only a specialist in Chinese Language Processing, while his colleagues<br />
working in English are considered specialists in Language Processing. Nobody would call Chomsky a<br />
specialist in American English! Working on a Small Language thus offers a niche at the price of a stigma<br />
that prevents researchers from ascending to the Olympus of science.<br />
taught at university. Students then implicitly learn what valuable research is, that<br />
is, research on Big Languages and recent research paradigms.<br />
To sum up, we observed that BLPs are conducted in a competitive and sometimes<br />
commercialised environment. Competition is a main factor that shapes the way in<br />
which BLPs are conducted. In such an environment, it is quite natural for research<br />
to overlap and to repeatedly produce similar resources. Not sharing the developed<br />
resources is seen as enhancing the competitiveness of the research centre. It is not<br />
considered to be an obstacle to the overall advance of the research field: similar<br />
resources are available elsewhere in any case. Different research paradigms can be<br />
freely explored in BLPs, with an obvious preference for the latest research paradigm,<br />
or for the one to which the research centre is committed. Gaining visibility, funding<br />
and eternal fame are not subordinated to the goal of producing working language<br />
resources.<br />
The situation of SLPs is much more critical. SLPs have to account for the persistence<br />
and portability of their data beyond the lifespan of the project, beyond the involvement<br />
of a specific researcher, and beyond the lifespan of a format or specific memory<br />
device. As Small Languages are not that much involved in the transition of paradigms,<br />
data cannot be re-worked, especially if research is discontinuous. Re-funding a project<br />
because of a shift in research paradigms, or because data has been lost or become unreadable, is unthinkable.<br />
With few or no external competitors, most inspiration for SLPs comes from BLPs.<br />
However, the motivation of BLPs in choosing a research paradigm and their capacity<br />
to handle research paradigms (given by definition) is not that of an SLP. For talented<br />
young researchers, SLPs are not attractive. As students, they have been trained in<br />
BLPs and share with the research community a system of values according to which<br />
other languages and other research paradigms are preferred.<br />
2. Improving the Situation: Free Software Pools<br />
Although most readers might agree with the description of SLPs I have<br />
given above, few have turned to the solutions I am about to sketch below. The main<br />
reason for this might be misconceptions or unsubstantiated fears. Let us<br />
start with what seems to be the most puzzling question; that is: how can projects and<br />
researchers guarantee the existence of their data beyond the direct influence of the<br />
researcher him/herself? To give a hypothetical example: you develop a spellchecker<br />
for a language of two hundred speakers, all above the age of seventy (including<br />
yourself), and none of them having a computer (except for you). How can you ensure<br />
that the data survives? The answer is: Pool your data with other data of the same<br />
form and function and let the community take care of YOUR data. If you make your<br />
research results available as free software, other people will take care of your data<br />
and upgrade it into new formats, whenever needed. ‘But,’ you might wonder now,<br />
‘why should someone take care of my data on an unimportant and probably dying<br />
language?’ The answer lies in the pool: even if those people do not care about your data<br />
per se, they care about the pool, and once they transform resources for new versions<br />
they transform all resources of the pool, well knowing that the attractiveness of the<br />
pool comes from the number of different language modules within it. In addition,<br />
all language modules have the same format and function and if one module can be<br />
transformed automatically, all others can be automatically transformed as well. 6<br />
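This pooling argument can be made concrete with a small sketch: because every language module in a pool shares one format, a single converter can upgrade all modules at once. The directory layout and the trivial "upgrade" function below are hypothetical illustrations of the principle, not the tooling of any actual pool.<br />

```python
from pathlib import Path

def upgrade_module(text):
    # Hypothetical stand-in for a real format migration
    # (e.g. re-encoding a word list, or rewriting entries in a new tag-set).
    return text.upper()

def upgrade_pool(pool_dir):
    """Apply one and the same transformation to every language module
    in the pool; no single language, however small, needs to be
    treated as a special case."""
    for module in sorted(Path(pool_dir).glob("*.txt")):
        module.write_text(upgrade_module(module.read_text()))
```

Whether the maintainers of a given pool actually work this way varies from project to project, but the shared format is what makes such batch maintenance possible at all.<br />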
But which pools exist that could possibly integrate and maintain your data? Below,<br />
you find an overview of some popular and useful pools. This list might also be read<br />
as a suggestion for possible and interesting language projects, or as a check-list of<br />
components of your language that still need to be developed to be on a par with other<br />
languages. Frequently, the same linguistic resources are made available to different<br />
pools (e.g. in ISPELL, ASPELL and MYSPELL). This enlarges the range of applications for<br />
a language resource, increases the visibility, and supports data persistence.<br />
• Spelling, Office, Etc:<br />
ISPELL (lgs. > 50); spelling dictionary + rules:<br />
A spellchecker, used standalone or integrated into smaller applications.<br />
(AbiWord, flyspell, WBOSS). (http://www.gnu.org/software/ispell/)<br />
ASPELL (lgs. > 70); spelling dictionary + rules:<br />
An advanced spellchecker, used standalone or integrated into smaller<br />
applications. (emacs, AbiWord, WBOSS)(http://aspell.sourceforge.net/)<br />
MYSPELL (lgs. > 40); spelling dictionaries + rules:<br />
A spellchecker for Open Office. (http://lingucomponent.openoffice.org/)<br />
OpenOffice Grammar Checking (lgs. > 5); syntax checker:<br />
A heterogeneous set of grammar checkers for Open Office.<br />
OpenOffice Hyphenation (lgs. > 30); hyphenation dictionary:<br />
A hyphenation dictionary for use with Open Office, but used also in LaTeX, GNU<br />
Troff, Scribus, and Apache FOP.<br />
6 I do not know how much of an unmotivated over-generalisation this is. In the Fink project (http://fink.<br />
sourceforge.net), for example, there is one maintainer for each resource and not for each pool and, as<br />
a consequence, not all ispell modules are available. In DEBIAN (http://www.debian.org) we find again<br />
one maintainer for each resource, but orphaned packages, that is, packages without a maintainer, are<br />
taken over by the DEBIAN QA group.<br />
OpenOffice Thesaurus (lgs. > 12); thesaurus:<br />
A thesaurus for use with Open Office.<br />
(http://lingucomponent.openoffice.org/)<br />
STYLE and DICTION (lgs. = 2); style checking:<br />
Help to improve wording and readability.<br />
(http://www.gnu.org/software/diction/diction.html)<br />
HUNSPELL (lgs. > 10); spelling dictionary + rules:<br />
An advanced spellchecker for morphologically rich languages that can be<br />
turned into a morphological analyser. (http://hunspell.sourceforge.net/).<br />
• Dictionaries:<br />
FREEDICT (lgs. > 50); translation dictionary:<br />
Simple, bilingual translation dictionaries, optionally with definitions and API as<br />
binary and in XML. (http://sourceforge.net/projects/freedict/).<br />
Papillon (lgs. > 8); multilingual dictionaries:<br />
Multilingual dictionaries structured according to Mel’čuk’s Meaning-Text<br />
Theory. (http://www.papillon-dictionary.org/Home.po)<br />
JMDict (lgs. > 5); multilingual dictionaries:<br />
Multilingual translation dictionaries in XML, based on word senses.<br />
(http://www.csse.monash.edu.au/~jwb/j_jmdict.html)<br />
• Corpora:<br />
Universal Declaration of Human Rights (lgs. > 300); parallel corpus:<br />
The Universal Declaration of Human Rights has been translated into many<br />
languages and can be easily aligned with other languages. (http://www.unhchr.<br />
ch/udhr/navigate/alpha.htm)<br />
Multext-East (lgs. > 9); corpora and morpho-syntactic dictionaries:<br />
Parallel corpora of Orwell’s 1984 annotated in CES with morpho-syntactic<br />
information in ten Central and Eastern European languages. (http://nl.ijs.si/<br />
ME/V2/)<br />
• Analysis:<br />
Delphin (lgs. > 5); HPSG-grammars:<br />
HPSG-Grammars for NLP-applications, in addition various tools for running and<br />
developing HPSG resources. (http://www.delph-in.net/)<br />
AGFL (lgs.> 4); parser and grammars:<br />
A description of Natural Languages with context-free grammars. (http://www.<br />
cs.ru.nl/agfl/)<br />
• Generation:<br />
KPML (lgs.> 10); text generation system:<br />
Systemic-functional grammars for natural language generation.<br />
(http://purl.org/net/kpml)<br />
• Machine Translation:<br />
OpenLogos (lgs. > 4); Machine Translation software and data:<br />
An Open Source version of the Logos Machine Translation System, to which new<br />
language pairs can be added. (http://logos-os.dfki.de/)<br />
3. Strategies and Recommendations for Developers<br />
If there is no pool of free software data that matches your own data, you should<br />
try the following: 1) Convert your data into free software so that you have a greater<br />
chance that others will copy and take care of it; and, 2) Modify your data so that it<br />
can be pooled with other data. This might imply only a minor change in the format of<br />
the data that can be done automatically by a script. Alternatively, create a community<br />
that will, in the long term, create a pool. In general, this implies that you separate the<br />
procedural components (tagger, spelling checker, parser, etc.) from the static linguistic<br />
data; make the procedural components freely available; and, describe the format of<br />
the static linguistic data. An example might well be Kevin Scannell’s CRUBADAN, a<br />
web-crawler for the construction of word lists for ISPELL. The author succeeded in<br />
creating a community around his tool that develops spellcheckers for more than thirty<br />
Small Languages (cf. http://borel.slu.edu/crubadan). Through this split of declarative<br />
(linguistic) components on the one hand, and procedural components (programs) on<br />
the other, many pools come with adequate tools to create and maintain the data.<br />
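As a concrete illustration of such a format adaptation, the sketch below (my own, purely hypothetical example) turns a plain word list into the .dic format shared by MYSPELL and HUNSPELL, whose first line states the number of entries. Affix flags, which real dictionaries attach after a slash, are omitted for simplicity.<br />

```python
def wordlist_to_dic(words):
    """Build a minimal MYSPELL/HUNSPELL .dic file body: the entry count
    on the first line, then one entry per line, with duplicates and
    blank lines removed."""
    entries = sorted({w.strip() for w in words if w.strip()})
    return "\n".join([str(len(entries))] + entries) + "\n"

# Example: four raw lines with a duplicate and a blank become two entries.
print(wordlist_to_dic(["limba", "limba", "", "sardu"]))
```

For the sample input, the function returns the entry count 2 followed by the entries limba and sardu, one per line; the same declarative word list could equally be exported to the ISPELL or ASPELL formats by other small scripts.<br />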
Pooling of corpora is not as frequent as, for example, the pooling of dictionaries.<br />
The main reason for this may be that corpora are very specific, and document a<br />
cultural heritage. Pooling them with corpora of different languages, subject areas,<br />
registers, and so forth is only of limited use. Nevertheless, there are some computer-<br />
linguistic pools that integrate corpora for computational purposes, and that may,<br />
therefore, integrate your corpora and maintain them for you. A description of these<br />
mostly very complex pools is beyond the scope of this paper, but the interested reader<br />
might check the following projects:<br />
• GATE (http://gate.ac.uk);<br />
• Natural Language Toolkit (http://nltk.sourceforge.net); and,<br />
• XNLRDF (http://140.127.211.213/xnlrdf).<br />
Projects targeting language documentation may also host your corpora (e.g. the<br />
TITUS Project [http://titus.uni-frankfurt.de/]). In addition, LDC (http://www.ldc.<br />
upenn.edu) and ELRA (http://www.elra.info) are hosting and distributing corpora<br />
(and dictionaries) so that your institute might profit financially from sold copies of the<br />
corpus you created.<br />
Once you decide to create your own free software (including corpora, dictionaries,<br />
etc.), you have to think about the license and the format of the data. From the great<br />
number of possible licenses you might use for your project (cf. http://www.gnu.org/<br />
philosophy/license-list.html for a commented list of licenses) you should consider the<br />
GNU General Public License, as this license, through the notion of Copyleft, does not<br />
give a unilateral advantage to someone who copies and modifies your software.<br />
Copyleft refers to the obligation that:<br />
…anyone who redistributes the software, with or without changes,<br />
must pass along the freedom to further copy and change it. (...)<br />
Copyleft also provides an incentive for other programmers to add to<br />
free software.<br />
(http://www.gnu.org/copyleft/copyleft.htm)<br />
With Copyleft, modifications have to be made freely available under the same<br />
conditions as you originally distributed your data, and if the modifications are of<br />
general concern, you can integrate them back into your software. The quality of<br />
your resources improves, as others can find and point out mistakes or shortcomings<br />
in the resources. They will report to you as long as you remain the distributor. In<br />
addition, you may ask people to cite your publication on the resource whenever using<br />
the resource for one of their publications. Without Copyleft, important language<br />
data would already have been lost (e.g. the CEDICT dictionary, after the developer<br />
disappeared from the Internet).<br />
After putting your resources with the chosen license onto a webpage, you should<br />
try to integrate your resource into larger distributions such as DEBIAN (http://www.<br />
debian.org) so that, in the long term, these organisations will manage your data. To<br />
do this, your resources have to conform to some formal requirements that, although<br />
seeming tedious, will certainly contribute to the quality and maintainability of<br />
your resources (cf. http://www.debian.org/doc/debian-policy, for an example of<br />
the requirements of integration in DEBIAN). From DEBIAN, your resources might be<br />
migrated, without further work on your part, into other distributions (REDHAT, SUSE, etc.) and into<br />
other modules, perhaps embedded into nice GUIs.<br />
4. Instructions for Funding Organisations<br />
A sponsoring organisation that is not interested in sponsoring a specific researcher<br />
or research institute, but which tries to promote a Small Language in electronic<br />
applications, should insist on putting the resources to be developed under Copyleft,<br />
and make this an explicit condition in the contract. Only this will guarantee that<br />
the resources will be continually developed even after the lifetime of the project.<br />
‘Copylefting’ thus allows for the sustainable development of language resources from<br />
one-off or discontinuous research activities. Only Copylefting guarantees that the<br />
most advanced version is available to everybody who might need it. In fact, a funding<br />
organisation that does not care about the way data can be accessed and distributed<br />
after the project’s end is, in my eyes, grossly negligent. Too many<br />
resources have been created in the past only to be lost on old computers, tapes, or<br />
simply forgotten. Adding resources to this invisible pile is of no use.<br />
In addition, the funding organisations may require the sources to be bundled with<br />
a pool of Free software resources in order to guarantee the physical preservation of<br />
the data and its widest accessibility. Copylefting alone only provides the legal grounds<br />
for the survival of the data; handing over the resources to a pool will make them<br />
available in many copies worldwide and independent from the survival of one or the<br />
other hard disk. Copylefting without providing free data access is like eating without<br />
swallowing.<br />
5. Free Software for SLPs: Benefits and Unsolved Problems<br />
Admittedly, it would be naive to assume that releasing project results as free<br />
software would solve all problems inherent in SLPs. This step might solve the most<br />
important problems of data maintenance and storage, and embed the project into a<br />
scientific context. But can it have more positive effects than this? Which problems<br />
remain? Let us return to our original list of critical points in SLPs to see how they are<br />
affected by such a step.<br />
• Open Source pools create a platform for research and data maintenance<br />
that gives SLPs a niche of their own without exposing them to situations of<br />
competition;<br />
• Data is made freely available for future modifications and improvements.<br />
If the data is useful it will be handed over from generation to generation;<br />
• The physical storage of the data is possible in many of the listed pools,<br />
and does not depend on the survival of the researcher’s hard disk;<br />
• The pools frequently provide specific, sophisticated tools for the<br />
production of resources. These tools are a cornerstone of a successful project;<br />
• In addition, through working with these tools, researchers acquire<br />
knowledge and skills that are relevant for the entire area of NLP;<br />
• Working with these tools will lead to ideas for improvement. Suggesting<br />
such improvements will not only help the SLP to leave the niche, but will finally<br />
lead to better tools. For young researchers, this allows them to work on their Small<br />
Language, and, at the same time, to be connected with a wider community for<br />
which their research might be relevant; and,<br />
• Through the generality of the tools (i.e., their usage for many languages) the content<br />
of SLPs might become more appropriate for university curricula in computational<br />
linguistics, terminology, corpus linguistics, and so forth. Some problems, however,<br />
remain, for which other solutions have to be found.<br />
These are:<br />
• Discontinuous research, if funding depends on project acquisition;<br />
• Dependence on the research paradigm. Corpus-based approaches can be<br />
used only when corpora are available, and rule-based approaches only when formally<br />
trained linguists participate in the project. To overcome most of these limitations,<br />
research centres and funding bodies should continuously work on the improvement<br />
of the necessary infrastructure for language technology (cf. Sarasola 2000); and,<br />
• Attracting and retaining researchers. As the success of a project depends<br />
to a large extent on the researchers' engagement and skills, attracting and retaining<br />
researchers is a sensitive topic, for which soccer clubs provide an illustrative<br />
model. Can a SLP attract top players, or is an SLP just a playground for a talented<br />
young researcher who will sooner or later transfer to a BLP? Or can the SLP count<br />
on local players only? A policy for building a home for researchers is thus another<br />
sensitive issue for which research centres and funding bodies should try to find a<br />
solution.<br />
6. Conclusions<br />
Although the ideas outlined in this paper are very much based on ‘sofa-research’ and<br />
intuition, schematic and simplistic thinking, informal personal communications,<br />
and my personal experience, I hope to have provided clear and convincing evidence that<br />
Small Language Projects profited, profit and will profit from joining the Open Source<br />
community. For those who want to follow this direction, the first and fundamental<br />
step is to study possible licenses (http://www.gnu.org/philosophy/license-list.html)<br />
and to understand their implications for the problems of SLPs, such as the storage<br />
and survival of data, their improvement through a large community, and so forth. This<br />
article lists some problems against which the licenses can be checked.<br />
Emotional reactions like “I do not want others to fumble in my data,” or “I do not<br />
want others to make money with my work” should be openly pronounced and discussed.<br />
What are the advantages of others having my data? What are the disadvantages? How<br />
can people make money with Open Source data? As said before, misconceptions, and<br />
thus unsubstantiated fears, more often lead to a rejection of the Open Source idea than<br />
does a well-founded argument. This is how humans function, but not how we advance Small<br />
Languages.<br />
7. Acknowledgments<br />
This paper would not have been written if I had not met with people like Mathias,<br />
Isabella, Christian, Judith, Daniel and Kevin. As a consequence of these encounters,<br />
the paper is much more a systematic summary than my original thinking. For mistakes,<br />
gaps and flawed arguments, however, the author alone is responsible.<br />
References<br />
Bungeroth, J. & Ney, H. (2004). “Statistical Sign Language Translation.” Streiter,<br />
O. & Vettori, C. (eds) (2004). Proceedings of the Workshop on Representation and<br />
Processing of Sign Languages, 4th International Conference on Language Resources<br />
and Evaluation, Lisbon, Portugal, May 2004, 105-108.<br />
Kuhn, T.S. (1996/1962). The Structure of Scientific Revolutions. Chicago: University<br />
of Chicago Press, 3rd edition.<br />
Morrissey, S. & Way, A. (2005). “An Example-Based Approach to Translating Sign<br />
Languages.” Way, A. & Carl, M. (eds) (2005). Proceedings of the “Workshop on<br />
Example-Based Machine Translation”, MT-Summit X, Phuket, Thailand, September<br />
2005, 109-116.<br />
Prinsloo, D. & Heid, U. (this volume). “Creating Word Class Tagged Corpora for Northern<br />
Sotho by Linguistically Informed Bootstrapping”, 97-115.<br />
Sarasola, K. (2000). “Language Engineering Resources for Minority Languages”<br />
Proceedings of the Workshop “Developing Language Resources for Minority Languages:<br />
Re-usability and Strategic Priorities.” Second International Conference on Language<br />
Resources and Evaluation, Athens, Greece, May 2000.<br />
Scannell, K. (2003). “Automatic Thesaurus Generation for Minority Languages: an<br />
Irish Example.” Proceedings of the Workshop “Traitement automatique des langues<br />
minoritaires et des petites langues. 10ème conférence TALN, Batz-sur-Mer, France,<br />
June 2003, 203-212.<br />
Somers, H. (1998). “New paradigms.” MT: The State of Play now that the Dust has<br />
Settled. Proceedings of the “Workshop on Machine Translation”, 10th European<br />
Summer School in Logic, Language and Information, Saarbrücken, August 1998, 22-<br />
33.<br />
Streiter, O. & De Luca, E.W. (2003). “Example-based NLP for Minority Languages:<br />
Tasks, Resources and Tools.” Streiter, O. (ed) (2003) Proceedings of the Workshop
“Traitement automatique des langues minoritaires et des petites langues”, 10ème<br />
conférence TALN, Batz-sur-Mer, France, June 2003, 233-242.<br />
A corpus for Sardinian:
problems and prospects
Nicoletta Puddu<br />
Creating a corpus for a minority language provides an interesting tool both to study and to preserve the language (see, for example, the DoBeS project at MPI Nijmegen). Sardinian, as an endangered language, could certainly profit from a well-designed corpus. The first digital collection of Sardinian texts was the Sardinian Text Database; however, it cannot be considered a corpus: it is not normalized, and the user can only search for exact matches. In this paper, I discuss the main problems in designing and developing a corpus for Sardinian.

Kennedy (1998) identifies three main stages in compiling a corpus: (1) corpus design; (2) text collection and capture; and (3) text encoding or mark-up. As for the first stage, I propose that a Sardinian corpus should be mixed, monolingual, synchronic, balanced, and annotated, and I discuss the reasons for these choices throughout the paper. Text collection seems to be a minor problem in the case of Sardinian: both written and spoken texts are available, and the number of speakers is still large enough to collect a sufficient amount of data. The major problems arise at the third stage. Sardinian is fragmented into different varieties and lacks a standard variety (and even a standard orthography). Several proposals for standardization have been made recently, but without success (see the discussion in Calaresu 2002; Puddu 2003). First of all, I suggest using a standard orthography that allows us to group Sardinian dialects into macro-varieties. It will then be possible to articulate the corpus into sub-corpora representative of each variety. The creation of an adequate morphological tag system will be fundamental: with a homogeneous tag system, it will be possible to perform searches throughout the corpus and to study linguistic phenomena both in a single macro-variety and in the language as a whole.

Finally, I propose a morphological tag system and a first tagged pilot corpus of Sardinian based on written samples, following the EAGLES and XCES standards.
1. Why create corpora for minority languages

Corpus linguistics (henceforth CL) is of particular interest, especially for those who adopt a functionalist approach, since it "studies language as it is actually used, by concrete speakers in real communicative situations" (Spina 2001:53). As is well known, corpora can be put to many uses: from studies on the lexicon (the creation of lexicons and frequency dictionaries) to studies on syntax, up to language teaching and translation. For standardized languages, the use of corpora is growing rapidly. However, corpus creation can prove particularly useful for endangered languages as well. Beyond the motivations they share with standardized languages, creating a corpus can be a valid way of preserving a record of the language, in the unfortunate event that it becomes extinct. If a corpus is defined as "a structured collection of texts in electronic format, assumed to be representative of a given language or of one of its subsets, aimed at linguistic analysis" (Spina 2001:65), it is evident that a corpus can also serve as a "mirror" of a language at a given stage. In this sense a corpus can act as a tool complementary to linguistic atlases and targeted surveys, photographing a representative sample of the language.

The availability of corpora greatly facilitates the analysis of linguistic phenomena. A corpus certainly does not eliminate traditional methods of data collection, but it provides a valid instrument for testing a hypothesis, even for scholars who cannot access speakers directly. Moreover, a corpus makes it possible to carry out frequency studies, which are evidently very difficult to conduct with traditional tools.

Having shown the usefulness of corpora for minority languages as well, it is necessary to highlight the particular issues that corpus linguistics faces in the case of non-standardized languages. In this study we take the case of Sardinian as an example, in order to point out the possible problems (and the solutions that might be adopted).

In the case of Sardinian, as with many other endangered minority languages that are only now opening up to corpus linguistics, we face two fundamental issues.

On the one hand, a corpus must be created as soon as possible, in order to preserve the current state of the language. In the case of Sardinian, which is under massive influence from Italian, some varieties risk rapid extinction, and it would be highly desirable to collect, as soon as possible and with homogeneous criteria, spoken-language data that can be included in a corpus.

On the other hand, besides planning the corpus and collecting the data, all the encoding and annotation standards must be established, and this, as we shall see, creates quite a few problems for non-standardized and fragmented languages such as Sardinian.
2. An experimental project: the Sardinian Digital Corpus

Sardinian is, as is well known, divided into several varieties and not standardized, and it is in constant decline. Several grammars and dictionaries exist, and the Sardinian Text Database (http://www.lingrom.fu-berlin.de/sardu/textos.html), a collection of Sardinian texts maintained by the University of Cologne, is available on the internet. It is an interesting initiative, but it does not meet the criteria of representativeness, sampling, and balance: the texts are entered by their various authors, and there is no uniformity in the encoding.
2.1 Planning

As is well known, the planning phase is fundamental: it is precisely in this phase that decisions are taken which determine the shape of the corpus and which, from a certain point on, can no longer be changed.

In describing the design phases of the SDC, we follow the fundamental taxonomy of Ball (year:page).

A first distinction concerns the medium. In designing a corpus of an endangered language, priority should obviously be given to samples of spoken language. However, where a written tradition exists, a mixed corpus type may also be chosen, so as to obtain as global a view of the language as possible. In the case of Sardinian, for instance, a mixed corpus would be feasible, since there is a certain amount of written production, both "traditional" (novels, short stories, poetry, newspaper articles) and "electronic" (mailing lists and websites in Sardinian).

The SDC should be a monolingual corpus, but one representative of the different varieties. The creation of multilingual or parallel corpora could be a later step, of particular interest both from the point of view of comparative linguistics and for language teaching.

For the same reasons listed above, the SDC is conceived as a synchronic corpus, since we have shown how urgent it is to document the current state of the language. Comparison with past stages of the language, and hence the creation of a diachronic corpus, will necessarily have to come later.

Given the total absence of other corpora for Sardinian, the SDC should be a reference corpus, based rigorously on the two criteria of sampling and balance.

At least in an initial phase, the corpus is conceived as an open corpus, continuously updatable with new acquisitions, always consistent with the initial plan.

Finally, the corpus must be annotated. In this respect, we also propose here a first annotation scheme for the SDC according to international standards.
2.2 Data acquisition

As far as data collection is concerned, there are no particular differences with respect to official, standardized languages; the usual precautions of fieldwork must be adopted.

Far more important problems arise with the encoding of the data. A graphic normalization must be reached, and for non-standardized texts this evidently implies a choice on the part of the encoder.

Sardinian is, from this point of view, an emblematic case. The differences between the varieties are numerous, above all on the phonetic level. In which variety should the texts be entered into the corpus? And to what extent is it possible to reduce the various varieties of Sardinian to a single macro-variety?

The standardization of Sardinian has been the object of heated controversy in recent years. In 2001 the Education Department (Assessorato alla Pubblica Istruzione) of the Region of Sardinia published a first standardization proposal called Limba sarda unificada (LSU). The proposal was drawn up by a dedicated commission; in substance it is a variety which, by the commission's own admission, although it aims at mediating between the varieties present on the island, is "representative of the varieties closest to the historical-evolutionary origins of the Sardinian language" (LSU: 5). In fact, the features chosen for the LSU are mostly Logudorese, and speakers perceived them as "local" rather than conservative features. This led to clear opposition to the standard, above all from Campidanese speakers; the motivations behind this reaction are analysed from a sociolinguistic point of view in Calaresu (2002) and Puddu (2003, 2005).

Recently, the Region of Sardinia has charged a new commission with developing a standard language for bureaucratic-administrative use only and, at the same time, with creating Sardinian orthographic norms "for all the linguistic varieties in use in the regional territory". 1

The best solution, in my opinion, is therefore to enter the texts into the different macro-varieties of Sardinian with an orthographic standardization only, based on the commission's proposals. In substance, this means performing only a graphic normalization on the texts, assigning them to macro-varieties and annotating any phonetic differences separately. Take an example: the original Latin intervocalic nasal receives different treatments in the varieties of Campidanese Sardinian. In some cases it is retained, in some it is geminated, in others it is reduced to nasalization of the preceding vowel, and in others still it is replaced by a glottal stop. The word for 'moon' can therefore be pronounced [ˈluna], [ˈlunːa], [ˈlũa] or [ˈluʔa]. My proposal is thus to transcribe the original form luna, annotating the phonetic transcription, where available, in a separate file.
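The normalization just proposed can be sketched in a few lines of Python. The corpus text stores the common orthographic form, while the local phonetic realization is kept in a separate, stand-off record. The four realizations of 'luna' are the ones described in the text; the record layout is a hypothetical illustration, not part of any XCES specification.

```python
# Attested Campidanese realizations of Latin intervocalic -n- in 'luna',
# keyed by the treatment described in the text.
PHONETIC_VARIANTS = {
    "retained":  "ˈluna",   # intervocalic nasal kept
    "geminate":  "ˈlunːa",  # nasal geminated
    "nasalized": "ˈlũa",    # nasalization of the preceding vowel
    "glottal":   "ˈluʔa",   # nasal replaced by a glottal stop
}

def normalize(ipa: str) -> str:
    """Map any of the recorded realizations back to the orthographic form."""
    return "luna" if ipa in PHONETIC_VARIANTS.values() else ipa

def phonetic_record(token_id: str, ipa: str) -> dict:
    """Stand-off record kept apart from the normalized base text."""
    return {"id": token_id, "orth": normalize(ipa), "ipa": ipa}
```

A token transcribed [ˈlũa] would thus appear as luna in the base text, with the IPA form still recoverable from the separate record.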
There is also the problem of including in the corpus varieties such as Gallurese and Sassarese, which are not unanimously recognized as dialects of Sardinian. One could, however, draw a distinction between 'nuclear Sardinian', made up of Campidanese and Logudorese, and 'core Sardinian', with Gallurese and Sassarese. The structure of the SDC should ultimately be represented as in Figure 1 in the appendix.
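The resulting architecture can be summed up in a small mapping. The group labels and variety names are taken from the text; the dictionary layout itself is only an illustrative sketch.

```python
# Sub-corpora of the SDC, following the division proposed in the text:
# 'nuclear Sardinian' (Campidanese, Logudorese) versus 'core Sardinian'
# (Gallurese, Sassarese).
SDC_STRUCTURE = {
    "nuclear Sardinian": ["campidanese", "logudorese"],
    "core Sardinian": ["gallurese", "sassarese"],
}

def varieties() -> list[str]:
    """Flat list of all macro-varieties covered by the corpus."""
    return [v for group in SDC_STRUCTURE.values() for v in group]
```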
2.3 Primary encoding

As stated, our proposal is that the SDC be annotated. The corpus will be encoded in XML, following the standard established by XCES (Corpus Encoding Standard for XML). The CES guidelines recommend in particular that the annotation be of the stand-off type, i.e. that it not be merged with the base text but kept in other XML files linked to it. Stand-off annotation is particularly important for the SDC, since it is conceived as an open corpus.
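The stand-off arrangement can be sketched with the Python standard library: the base document is never modified, and a second XML file carries the annotation, pointing back into the base text by reference. The element and attribute names below are simplified placeholders, not the exact XCES vocabulary.

```python
import xml.etree.ElementTree as ET

def build_standoff(base_uri: str, tags: list[tuple[str, str]]) -> ET.Element:
    """Build a separate annotation document whose <tok> elements point
    back into the base text by token id, leaving the base file intact."""
    root = ET.Element("annotation", {"base": base_uri})
    for tok_id, pos in tags:
        ET.SubElement(root, "tok", {"ref": f"{base_uri}#{tok_id}", "pos": pos})
    return root

ann = build_standoff("sdc_demo.xml", [("t1", "T"), ("t2", "Np")])
xml_text = ET.tostring(ann, encoding="unicode")
```

Adding a new annotation layer then means writing one more small file of this kind, which is exactly what makes the approach attractive for an open corpus.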
In corpus encoding, the CES guidelines recognize three main categories of relevant information:

• the documentation, which contains global information on the text, its content and its encoding;
• the primary data, which comprise both the "gross structure", i.e. the text enriched with information on its general structure (paragraphs, chapters, titles, footnotes, tables, figures, etc.), and elements appearing below the paragraph level;
• the linguistic annotation, which may be morphological, syntactic, prosodic, etc.

1 "The Commission will also have to define common orthographic norms for all the linguistic varieties in use in the regional territory. This will make it possible to promote the creation of word processors and spell checkers, as well as the use and dissemination of electronic tools to foster the correct use of the Sardinian language" (http://www.regionesardegna.it/j/v/25?s=3661&v=2&c=220&t=1).
By way of example, I will show below how a text taken from a newspaper article can be encoded. Suppose we have to encode the following text, published in a local newspaper.

Su Casteddu hat giogau ariseru in su stadiu Sant'Elia bincendi po quatru a zero contras a sa Juventus. Is festegiamentus funti sighius fintzas a mengianeddu. Cust'annu parit chi ddoi siant bonas possibilidadis de binci su scudetu.

'Cagliari played yesterday at the Sant'Elia stadium, winning four-nil against Juventus. The celebrations went on until dawn. This year there seem to be good chances of winning the championship.'

The documentation information will be contained in a header conforming to the CES guidelines. The minimal header for this document is the following.
sdc_header.xml

<?xml version="1.0"?>
<cesHeader version="4.0">
 <fileDesc>
  <titleStmt>
   <h.title>Sardinian Digital Corpus, demo</h.title>
  </titleStmt>
  <publicationStmt>
   <distributor>University of…</distributor>
   <pubAddress>via…</pubAddress>
   <availability>Free</availability>
   <pubDate>2005</pubDate>
  </publicationStmt>
  <sourceDesc>
   <biblStruct>
    <monogr>
     <h.title>La Gazzetta di Sardegna</h.title>
     <imprint>
      <publisher>…</publisher>
      <pubPlace>…</pubPlace>
      <pubDate>…</pubDate>
     </imprint>
    </monogr>
    <analytic>
     <h.title>Su scudetu in Casteddu</h.title>
     <h.author>Porru</h.author>
    </analytic>
   </biblStruct>
  </sourceDesc>
 </fileDesc>
 <encodingDesc>
  <projectDesc>The Sardinian Digital Corpus…</projectDesc>
  <editorialDecl>
   <conformance>CES Level 1</conformance>
   <quotation>Rendition attribute values on Q and QUOTE tags are adapted from ISOpub and ISOnum standard entity set names</quotation>
   <segmentation>Marked up to the level of sentence</segmentation>
  </editorialDecl>
 </encodingDesc>
 <profileDesc>
  <langUsage>
   <language>Campidanese Sardinian</language>
  </langUsage>
 </profileDesc>
</cesHeader>
The header will be external to the cesDoc. The various headers will be stored in a headerbase external to the document and referenced through an XPointer expression in the cesDoc. The cesDoc, which contains the second type of information, will look as follows:
sdc_demo.xml

<?xml version="1.0"?>
<cesDoc version="4.0">
 <text>
  <body>
   <p>
    <s id="s1">Su Casteddu hat giogau ariseru in su stadiu Sant'Elia bincendi po quatru a zero contras a sa Juventus.</s>
    <s id="s2">Is festegiamentus funti sighius fintzas a mengianeddu.</s>
    <s id="s3">Cust'annu parit chi ddoi siant bonas possibilidadis de binci su scudetu.</s>
   </p>
  </body>
 </text>
</cesDoc>
2.4 Annotation

As stated, the SDC is conceived as an annotated corpus. The annotations at the various levels will therefore refer to the XCES document via XPointer. The most widespread type of annotation is part-of-speech annotation. In the case of Sardinian, a part-of-speech annotated corpus could prove particularly useful for linguistic research, above all for the creation of a corpus-based descriptive grammar. Part-of-speech annotation will logically come first but, given the open character of the corpus, annotations at other levels may be added later.
2.5 The tagset

For Sardinian a dedicated tagset must be created; it can be partly borrowed from the tagsets for Italian and Spanish developed within the MULTEXT project according to the EAGLES recommendations. The varieties of Sardinian do not differ much from the morphosyntactic point of view, which means that a single tagset can be defined for all of them. The examples in this article are in Campidanese, but the tags can be applied practically without change to the other varieties of Sardinian as well.

The grammatical annotation of our corpus, compatible with the cesAna DTD, will consist of three levels:

• the base form (<base>);
• a morphosyntactic description following the EAGLES guidelines (<msd>);
• a corpus tag (<ctag>).

In accordance with the EAGLES proposal, we have a two-level description:

• the first, fine-grained level contains as accurate a description of the token as possible (the lexical description <msd>);
• the second, "coarse-grained" level is an under-determined version of the first description (the corpus tag <ctag>).

The two-level distinction proves particularly useful when an automatic tagging system is to be used. Some categories are in fact rather difficult to disambiguate automatically, and it is therefore advisable to have a coarser-grained tagging level. In the case of Sardinian, the creation or adaptation of automatic taggers may be a later step, but it seems useful to define, from the outset, a tagging system that is also suitable for future automatic use.
The <msd> tag is a character string structured as follows:

• position 0 holds the symbol encoding the part of speech;
• positions 1 to n hold the values of the attributes for person, gender, number, case, etc.;
• if an attribute does not apply, its position is filled by a hyphen.
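A minimal sketch of this positional scheme in Python: the attribute order follows the description above, and the msd-to-ctag table is only a fragment of Table 3, for illustration.

```python
def make_msd(pos: str, *values: str) -> str:
    """Build a positional tag: position 0 is the part-of-speech code,
    positions 1..n hold attribute values, and '-' marks an attribute
    that does not apply (signalled here by an empty string)."""
    return pos + "".join(v if v else "-" for v in values)

# Fragment of the fine-grained -> coarse-grained mapping (cf. Table 3).
MSD_TO_CTAG = {
    "Ncms-": "NMS",
    "Ncmp-": "NMP",
    "Ncfs-": "NFS",
    "Ncfp-": "NFP",
}

def to_ctag(msd: str) -> str:
    """Return the under-determined corpus tag for a full description,
    falling back to the bare part-of-speech code when no finer
    mapping is listed."""
    return MSD_TO_CTAG.get(msd, msd[0].upper())
```

For instance, make_msd("N", "c", "m", "s", "") yields Ncms-, the tag for a common masculine singular noun, which coarsens to NMS.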
As stated, the tagset proposed here is similar to the one proposed for Italian and Spanish within the MULTEXT project (Calzolari & Monachini 1996). First of all, we retain the classification into parts of speech proposed within MULTEXT (Table 1 in the appendix).

Let us now briefly examine the tags created for Sardinian. We will dwell only on the cases in which there are notable differences between Sardinian and the other two Romance languages, and on those in which different choices have been made.
Noun

For the category "noun", the features taken into account are exemplified in Table 2 and correspond to those considered for Italian in Calzolari and Monachini. Table 3 shows the possible combinations and their translation into ctags.
Verb

In the choice of the values for the verb, there are some changes with respect to both Italian and Spanish (Table 4). As for mood, the conditional is not included among the codes: in Sardinian it is a periphrastic form built with an auxiliary verb (Camp. hai 'to have', Log. depi 'must') plus the infinitive of the verb. Therefore, in line with what is done for periphrastic forms in the other languages, the two forms will be tagged independently.

The same holds for the future, formed in the various varieties by the conjugated verb 'to have', the preposition a, and the infinitive of the verb.

Sardinian also has no simple past (passato remoto): the non-durative past is expressed by the periphrastic form made up of 'to have' plus the past participle of the verb.
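The choice of tagging periphrastic forms as independent tokens can be illustrated as follows. The auxiliary tag comes from Table 5; the preposition tag and the infinitive tag are assumed by analogy with the scheme (cf. Vanp--- for the auxiliary infinitive), and the whitespace tokenizer is deliberately naive.

```python
def tag_periphrastic_future(phrase: str) -> list[tuple[str, str]]:
    """Tag the Campidanese periphrastic future ('hai' + a + infinitive)
    as three independent tokens rather than one synthetic future form."""
    aux, prep, inf = phrase.split()  # naive whitespace tokenizer
    return [
        (aux, "Vaip3s-"),   # auxiliary 'hai', 3sg present indicative (Table 5)
        (prep, "Sp"),       # preposition a (tag shape assumed)
        (inf, "Vmnp---"),   # lexical infinitive, assumed by analogy with Vanp---
    ]

tokens = tag_periphrastic_future("hat a papai")  # 'he/she will eat'
```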
The situation is rather complex as far as clitics are concerned. While for Italian a single code E is provided for all clitic types, for Spanish each clitic type and the possible combinations are specified with different codes, in the ctag as well. Spanish, however, lacks the so-called "adverbial clitics" (Italian ne and ci), which in Sardinian can attach to the verb and combine with other clitics. We have therefore kept the Spanish convention, adding codes for the adverbial clitics.

Table 5 shows all the possible combinations. Note that Sardinian has no present participle and that clitic forms of the pronoun can only attach to the gerund and the imperative.
Adjective

The Sardinian adjective (Tables 6 and 7) shows no particular differences with respect to Italian and Spanish. The comparative is normally formed with prus 'more' followed by the adjective.
Pronoun

For pronouns, by analogy with what was done for Spanish, the attribute case has also been taken into account (Tables 8 and 9).
Determiner

See Tables 10 and 11 in the appendix.

Article

See Tables 12 and 13 in the appendix.

Adverb

See Tables 14 and 15 in the appendix.
Punctuation

See Table 24 in the appendix.
2.6 The annotated example

At this point we are able to annotate our example according to the XCES conventions. For brevity, we provide the annotation of only part of the text.
sdc_annotation.xml

<?xml version="1.0"?>
<cesAna version="1.0">
 <chunkList>
  <chunk from="s1">
   <tok><orth>su</orth><lex><base>su</base><msd>Tdms-</msd><ctag>RMS</ctag></lex></tok>
   <tok><orth>Casteddu</orth><lex><base>Casteddu</base><msd>Np..-</msd><ctag>NP</ctag></lex></tok>
   <tok><orth>hat</orth><lex><base>hai</base><msd>Vaip3s-</msd><ctag>VAS3IP</ctag></lex></tok>
   <tok><orth>giogau</orth><lex><base>giogai</base><msd>Vmp--sm-</msd><ctag>VMPSM</ctag></lex></tok>
   <tok><orth>ariseru</orth><lex><base>ariseru</base><msd>R-p-</msd><ctag>B</ctag></lex></tok>
   <tok><orth>in</orth><lex><base>in</base><msd>Sp</msd><ctag>SP</ctag></lex></tok>
   <tok><orth>su</orth><lex><base>su</base><msd>Tdms-</msd><ctag>RMS</ctag></lex></tok>
   <tok><orth>stadiu</orth><lex><base>stadiu</base><msd>Ncms-</msd><ctag>NMS</ctag></lex></tok>
   <tok><orth>Sant&apos;</orth><lex><base>Santu</base><msd>A-pms-</msd><ctag>AMS</ctag></lex></tok>
   <tok><orth>Elia</orth><lex><base>Elia</base><msd>Np..-</msd><ctag>NP</ctag></lex></tok>
   .....
  </chunk>
  <chunk from="s2">
   <tok><orth>is</orth><lex><base>is</base><msd>Tdmp-</msd><ctag>RMP</ctag></lex></tok>
   <tok><orth>festegiamentus</orth><lex><base>festegiamentu</base><msd>Ncmp-</msd><ctag>NMP</ctag></lex></tok>
   .....
  </chunk>
 </chunkList>
</cesAna>
3. Conclusions

In this article we have shown, through a case study, the issues involved in applying corpus linguistics to minority, non-standardized languages. Some more general considerations can, I believe, be drawn that apply to most minority languages.

First of all, in cases such as Sardinian, work is needed at several levels, from corpus design to data collection and tagging. The scale of the undertaking is matched by the need to move quickly in the case of varieties at particularly high risk of extinction.

Secondly, for non-standardized varieties, the problem of data encoding arises, as we have seen. The choice made here, namely a corpus articulated into macro-variety sub-corpora, seems to me a good compromise between the need for standardization on the one hand and the preservation of the differences on the other. What is absolutely necessary, however, is the achievement of an orthographic standardization.

Finally, part-of-speech annotation can be particularly useful for the creation of usage-based grammars. In the case of Sardinian, for example, the presence of a dominant language such as Italian, with a rich literary and grammatical tradition, can influence speakers' grammaticality judgments, in some cases compromising the data collected.

On the basis of all these considerations, it seems to me that the design and development of corpora for minority languages must take on a priority role in projects of language preservation and language policy.

Acknowledgments

I thank Andrea Sansò for carefully reading the manuscript.
Bibliography
Ball, C. (1994). Concordances and corpora for classroom and research. Online at http://www.georgetown.edu/cball/corpora/tutorial.html.
Bel, N. & Aguilar, A. (1994). Proposal for Morphosyntactic encoding: Application to<br />
Spanish, Barcelona.<br />
Blasco Ferrer, E. (1986). La lingua sarda contemporanea. Cagliari: Edizioni della<br />
Torre.<br />
Calaresu, E. (2002). “Alcune riflessioni sulla LSU (Limba Sarda Unificada).” Orioles,<br />
V. (a cura di), La legislazione nazionale sulle minoranze linguistiche. Problemi,<br />
applicazioni, prospettive. Udine: Forum, 247-266.<br />
Calzolari, N. & Monachini, M. (1996). Multext. Common Specification and notations<br />
for Lexicon Encoding, Pisa: Istituto di Linguistica Computazionale.<br />
EAGLES (1996). Recommendations for the morphosyntactic annotation of corpora.<br />
EAG-TCWG-MAC/R, Pisa: Istituto di Linguistica Computazionale.<br />
Ide, N. (1998). “Corpus Encoding Standard. SGML Guidelines for Encoding Linguistic<br />
Corpora.” Proceedings of the First International Language Resources and Evaluation<br />
Conference, Paris: European Language Resources Association, 463-70.<br />
Ide, N., Bonhomme, P. & Romary, L. (2000). “XCES: An XML-based Standard for<br />
Linguistic Corpora.” Proceedings of the Second Language Resources and Evaluation<br />
Conference (LREC), Athens, 825-30.
Ide, N. (2004). “Preparation and Analysis of Linguistic Corpora.” Schreibman, S.,<br />
Siemens, R. & Unsworth, J. (a cura di), A Companion to Digital Humanities. London:<br />
Blackwell.<br />
Kennedy, G. (1998). An introduction to Corpus Linguistics. London: Longman.<br />
Leech, G. & Wilson, A. (1996). EAGLES recommendations for the morphosyntactic<br />
annotation of corpora. Pisa: Istituto di Linguistica Computazionale.<br />
Regione Autonoma della Sardegna (2001). “Limba sarda unificada. Sintesi delle norme<br />
di base: ortografia, fonetica, morfologia e lessico”, Cagliari.<br />
McEnery, T. & Wilson, A. (1996). Corpus Linguistics. Edinburgh: Edinburgh University<br />
Press.<br />
Mensching, G. & Grimaldi, L. (2000). Sardinian Text Database, http://www.lingrom.fu-berlin.de/sardu/textos.html.
Puddu, N. (2003). “In search of the “real Sardinian”: truth and representation.”<br />
Brincat, J., Boeder, W., Stolz, T. (a cura di), Purism in minor languages, endangered<br />
languages, regional languages, mixed languages. Bochum: Universitätsverlag Dr. N.<br />
Brockmeyer, 27-42.<br />
Puddu, N. (2005). “La nozione di purismo nel progetto di standardizzazione della<br />
lingua sarda.“ Carli, A., Calaresu, E. & Guardiano, C. (a cura di), Lingue, istituzioni,<br />
territori. Riflessioni teoriche, proposte metodologiche ed esperienze di politica<br />
linguistica. Roma: Bulzoni, 257-278.<br />
Spina, S. (2001). Fare i conti con le parole. Perugia: Guerra Edizioni.<br />
Una commissione tecnico-scientifica per un’indagine socio-linguistica sullo stato della lingua sarda. Online at www.regione.sardegna.it.
Figure 1: The structure of the Sardinian Digital Corpus

Appendix
Table 1: Codes for the parts of speech
Part of speech Code
Noun N
Verb V
Adjective A
Pronoun P
Determiner D
Article T
Adverb R
Adposition S
Conjunction C
Numeral M
Interjection I
Unique U
Residual X
Abbreviation Y
Table 2: Attribute-value pairs for the category “Noun” in Sardinian
Attribute Value Example Code
Type common libru c
proper Giuanni p
Gender masculine omini m
feminine femina f
common meri c
Number singular omini s
plural feminas p
Case /// /// ///
Table 3: <msd> and <ctag> for the category “Noun” in Sardinian
msd ctag example
Ncms- NMS liburu
Ncmp- NMP liburus
Ncmn- NN lunis (su/is)
Ncfs- NFS domu
Ncfp- NFP domus
Nccs- NNS meri (su/sa)
Nccp- NNP meris (is f.m.)
Np- NP Mariu, Maria, Puddu
Table 4: Attribute-value pairs for the category “Verb” in Sardinian
Attribute Value Example Code
Status lexical papai m
auxiliary hai/essi a
modal podi o
Mood indicative papat i
subjunctive papit s
imperative papa m
infinitive papai n
participle papau p
gerund papendi g
Tense present papu p
imperfect papasta i
Person first seu 1
second ses 2
third est 3
Number singular papat s
plural papant p
Gender masculine papau m
feminine papada f
Clitic accusative donaddu a
beninci r
dative donaddi d
adverbial donandi
acc+dat donasiddu t
dat+adv donasindi u
adv+dat donandeddi v
dat+adv+acc mandasinceddu z
Table 5: <msd> and <ctag> for the category “Verb” in Sardinian<br />
msd ctag example<br />
Vaip1s- VAS1IP hapu/seu<br />
Vaip2s- VAS2IP has/ses<br />
Vaip3s- VAS3IP hat/est<br />
Vaip1p- VAP1ICP eus/seus<br />
Vaip2p- VAP2IP eus/seis<br />
Vaip3p- VAP3IP hant/funt<br />
Vaii1s- VAS1II hia, femu<br />
Vaii2s- VAS2II hiast, fiast<br />
Vaii3s- VAS3II hiat, fiat<br />
Vaii1p- VAP1II emus, femus<br />
Vaii2p- VAP2II eis, festis<br />
Vaii3p- VAP3II iant, fiant<br />
Vasp1s- VASXCP apa, sia<br />
Vasp2s- VASXCP apas, sias<br />
Vasp3s- VASXCP apat, siat<br />
Vasp1p- VAP1ICP apaus, siaus<br />
Vasp2p- VAP2CMP apais, siais<br />
Vasp3p- VAP3CP apant, siant<br />
Vasi1s- VAS3CI hemu, fessi<br />
Vasi2s- VAS3CI essist, fessis<br />
Vasi3s- VAS3CI essit, fessit<br />
Vasi1p- VAP1CI essimus, festus<br />
Vasi2s- VAP2ICR essidis, festis<br />
Vasi3p- VAP3CI essint, fessint<br />
Vanp--- VAF hai, essi<br />
Vaps-sm VAMSPR apiu, stetiu<br />
Vaps-pm VAMPPR stetius<br />
Vaps-sf VAFSPR stetia<br />
Vaps-pf VAFPPR stetias<br />
Vmip1s- VMIP1S papu<br />
Vmip2s- VMIP2S papas<br />
Vmip3s- VMIP3S papat<br />
Vmip1p- VMIP1P papaus<br />
Vmip2p- VMIP2P papais<br />
Vmip3p- VMIP3P papant<br />
Vmsp1s- VMSP1S papi<br />
Vmsp2s- VMSP2S papis<br />
Vmsp3s- VMSP3S papi<br />
Vmsp1p- VMSP1P papeus<br />
Vmsp2p- VMSP2P papeis<br />
Vmsp3p- VMSP3P papint<br />
Vmii1s- VMII1S papemu<br />
Vmii2s- VMII2S papást<br />
Vmii3s- VMII3S papát<br />
Vmii1p- VMII1P papemus<br />
Vmii2p- VMII2P papestis<br />
Vmii3p- VMII3P papánt<br />
Vmsi1s- VMSI1S tenessi<br />
Vmsi2s- VMSI2S tenessis<br />
Vmsi3s- VMSI3S tenessit<br />
Vmsi1p- VMSI1P tenessimus<br />
Vmsi2p- VMSI2P tenestis<br />
Vmsi3p- VMSI3P tenessint<br />
Vmp--pf- VMPPF tentas<br />
Vmp--sf- VMPSF tenta<br />
Vmp--pm- VMPPM tentus<br />
Vmp--sm- VMPSM tentu<br />
Vmg----t VMGT tzerrienimiddas<br />
Vmg----t VMGT tzerriandimiddas<br />
Vmg----t VMGT tzerriendimidda<br />
Vmg----t VMGT tzerriendimiddus<br />
Vmg----t VMGT tzerriendimiddu<br />
Vmg----d VMGD tzerriendimì<br />
Vmg----t VMGT tzerriendididdas<br />
Vmg----t VMGT tzerriendididdas<br />
Vmg----t VMGT tzerriendididda<br />
Vmg----t VMGT tzerriendididdus<br />
Vmg----t VMGT tzerriendididdu<br />
Vmg----d VMGD tzerriendidì<br />
Vmg----t VMGT tzerriendisiddas<br />
Vmg----t VMGT tzerriendisidda<br />
Vmg----t VMGT tzerriendisiddus<br />
Vmg----t VMGT tzerriendisiddu<br />
Vmg----d VMGD tzerriendisì<br />
Vmg----t VMGT tzerriendisiddas<br />
Vmg----t VMGT tzerriendisidda<br />
Vmg----t VMGT tzerriendisiddus<br />
Vmg----t VMGT tzerriendisiddu<br />
Vmg----d VMGD tzerriendisì<br />
Vmg----t VMGT tzerriendisiddas<br />
Vmg----t VMGT tzerriendisidda<br />
Vmg----t VMGT tzerriendisiddus<br />
Vmg----t VMGT tzerriendisiddu<br />
Vmg----d VMGD tzerriendisì<br />
Vmg----a VMGA tzerriendimì<br />
Vmg----a VMGA tzerrienditì<br />
Vmg----a VMGA tzerriendiddas<br />
Vmg----a VMGA tzerriendidda<br />
Vmg----a VMGA tzerriendiddus<br />
Vmg----a VMGA tzerriendiddu<br />
Vmg----a VMGA tzerriendisì<br />
Vmg----a VMGA tzerriendisì<br />
Vmg----u VMGU mandendisindi<br />
Vmg----z VMGZ mandendisinceddu<br />
Vmg----- VMG tzerriendi<br />
Vmmp2sa VMM2SA mandaddu<br />
Vmmp2sd VMM2SD mandadì<br />
Vmmp2st VMM2ST mandadiddu<br />
Vmmp2su VMM2SU mandadindi<br />
Vmmp2sv VMM2SV mandandeddi<br />
Vmmp2sz VMM2SZ mandasinceddu<br />
Vmmp2pa VMM2PA mandaiddu<br />
Vmmp2pd VMM2PD mandaisì<br />
Vmmp2pt VMM2PT mandaisiddu<br />
Vmmp2pu VMM2PU mandaisndi<br />
Vmmp2pz VMM2PZ mandaisinceddu<br />
Vmmp2s- VMM2S manda<br />
Vmmp2p- VMM2P mandai<br />
Vmmp2pv VMM2PV mandaindeddi<br />
Tabella 6: Coppie attributo-valore per la categoria “Aggettivo” in sardo<br />
Attributo Valore Esempio Codice<br />
Tipo // // //<br />
Grado positivo bonu p<br />
comparativo mellus c<br />
superlativo mellus s<br />
Genere maschile bonu m<br />
femminile bona f<br />
l-spec druci c<br />
Numero singolare bonu s<br />
plurale bonus p<br />
Caso // // //<br />
Tabella 7: Msd e ctag per la categoria “Aggettivo” in sardo<br />
msd ctag esempio<br />
A-pms- AMS bonu<br />
A-pmp- AMP bonus<br />
A-pfs- AFS bella<br />
A-pfp- AFP bellas<br />
A-pcs- ANS druci<br />
A-pcp- ANP drucis<br />
A-sms- AMS bellissimu<br />
A-smp- AMP bellissimus<br />
A-sfs- AFS bellissima<br />
A-sfp- AFP bellissimas<br />
Tabella 8: Coppie attributo-valore per la categoria “Pronome” in sardo<br />
Attributo Valore Esempio Codice<br />
Tipo personale deu p<br />
dimostrativo cuddu d<br />
indefinito calincunu i<br />
possessivo miu m<br />
interrogativo chini t<br />
relativo chi r<br />
esclamativo cantu! e<br />
riflessivo si x<br />
Persona prima deu 1<br />
seconda tui 2<br />
terza issu 3<br />
Genere maschile issu m<br />
femminile issa f<br />
L-spec comune deu c<br />
Numero singolare custu s<br />
plurale custus p<br />
L-spec invariante chini n<br />
Caso nominativo deu n<br />
dativo ddi d<br />
accusativo ddu a<br />
obliquo mei o<br />
Tabella 9: Msd e ctag per la categoria “Pronome” in sardo<br />
msd ctag esempio<br />
Pd-ms-- PDMS cussu<br />
Pd-mp-- PDMP cuddus<br />
Pd-fs-- <strong>PDF</strong>S cudda<br />
Pd-fp-- <strong>PDF</strong>P cuddas<br />
Pi-ms-- PIMS dognunu<br />
Pi-mp-- PIMP calincunus<br />
Pi-fs-- PIFS dognuna<br />
Pi-fp-- PIFP calincunas<br />
Pi-cs-- PINS chinechisiat<br />
Ps1ms-- PPMS miu, nostru<br />
Ps1mp-- PPMP mius<br />
Ps1fs-- PPFS mia<br />
Ps1fp-- PPFP mias<br />
Ps2ms-- PPMS tuu<br />
Ps2mp-- PPMP tuus<br />
Ps2fs-- PPFS tua<br />
Ps2fp-- PPFP tuas<br />
Ps3ms-- PPMS suu<br />
Ps3mp-- PPMP suus<br />
Ps3fs-- PPFS sua<br />
Ps3fp-- PPFP suas<br />
Ps3cp-- PPNP insoru<br />
Pt-cs-- PWNS chini?<br />
Pt-cn-- PWNN ita?<br />
Pt-cs-- PWMS cantu?<br />
Pt-cp-- PWMP cantus?<br />
Pr-cs-- PWMS cantu<br />
Pr-cp-- PWNP cantus<br />
Pr-cs-- PWNS chini<br />
Pr-cp-- PWNP calis<br />
Pe-cs-- PWNS cantu!<br />
Pe-cp-- PWNP cantus!<br />
Pe-cn-- PWNN ita!<br />
Pp1csn- PP1SN deu<br />
Pp2cs-n PP2SN tui<br />
Pp3ms[no] PP3MS issu<br />
Pp3fs[no] PP3FS issa<br />
Pp1cp[no] PP1PN nosus<br />
Pp2cp[no] PP2PN bosatrus<br />
Pp3mp[no] PP3MP issus<br />
Pp3fp[no] PP3FP issas<br />
Pp1cso- PP1SO mei<br />
Pp2-so- PP2SO ti<br />
P[px]1cs[ad]- P1S mi<br />
P[px]2cs[ad]- P2S ti<br />
P[px]3cs[ad]- P3 si<br />
Pp3.pd- PP3PD ddis<br />
Pp3.sd- PP3SD ddi<br />
Pp3fpa- PP3FPA ddas<br />
Pp3fsa- PP3FSA dda<br />
Pp3mpa- PP3MPA ddus<br />
Pp3msa- PP3MSA ddu<br />
P[px]1cp[ad]- P1P si<br />
P[px]2cp[ad]- P2P si<br />
P..fp--- PFP mias, custas,<br />
P..fs--- PFS mia, custa, canta etc.<br />
P..mp--- PMP mius, custus, cantas etc.<br />
P..ms--- PMS miu, custu, cantu etc.<br />
Tabella 10: Coppie attributo-valore per la categoria “Determinante” in sardo<br />
Attributo Valore Esempio Codice<br />
Tipo dimostrativo cuddu d<br />
indefinito dogna i<br />
possessivo miu m<br />
interrogativo chini t<br />
relativo chi r<br />
esclamativo cantu! e<br />
Persona prima mia 1<br />
seconda tua 2<br />
terza sua 3<br />
Genere maschile custu m<br />
femminile custa f<br />
L-spec comune dogna c<br />
Numero singolare custu s<br />
plurale custus p<br />
L-spec invariante chini n<br />
Caso nominativo deu n<br />
Possessore singolare miu s<br />
plurale nostru p<br />
Tabella 11: Msd e ctag per la categoria “Determinante” in sardo<br />
msd ctag esempio<br />
Dd-ms-- DDMS cuddu<br />
Dd-mp-- DDMP cuddus<br />
Dd-fs-- DDFS cudda<br />
Dd-fp-- DDFP cuddas<br />
Di-ms-- DIMS nisciunu<br />
Di-mp-- DIMP unus cantu<br />
Di-fs-- DIFS nisciuna<br />
Di-fp-- DIFP unas cantu<br />
Di-cs-- DINS chinechisiat<br />
Di-cc-- DINC dogni<br />
Ds1ms-- DPMS miu, nostru<br />
Ds1mp-- DPMP mius<br />
Ds1fs-- DPFS mia<br />
Ds1fp-- DPFP mias<br />
Ds2ms-- DPMS tuu, vostru<br />
Ds2mp-- DPMP tuus<br />
Ds2fs-- DPFS tua<br />
Ds2fp-- DPFP tuas<br />
Ds3ms-- DPMS suu<br />
Ds3mp-- DPMP suus<br />
Ds3fs-- DPFS sua<br />
Ds3fp-- DPFP suas<br />
Ds3cp-- DPNP insoru<br />
Dr-cs-- DWNS cantu<br />
Dr-cp-- DWNP cantus<br />
Dt-cn-- DWNN cali<br />
Dt-cs-- DWNS cantu<br />
Dt-cp-- DWNP cantus<br />
De-cs-- DWMS cantu<br />
De-cp-- DWMP cantus<br />
D..fp--- DFP mias, custas, cantas ecc.<br />
D..fs--- DFS mia, custa, canta, ecc.<br />
D..mp--- DMP mius, custus, cantus, ecc.<br />
D..ms--- DMS miu, custu, cantu, ecc.<br />
Tabella 12: Coppie attributo-valore per la categoria “Articolo” in sardo<br />
Attributo Valore Esempio Codice<br />
Tipo definito su d<br />
indefinito unu i<br />
Genere maschile su m<br />
femminile sa f<br />
Numero singolare su s<br />
plurale is p<br />
Caso // // //<br />
Tabella 13: Msd e ctag per la categoria “Articolo” in sardo<br />
msd ctag esempio<br />
Tdms- RMS su<br />
Td[fm]p- RXP is<br />
Tdfs- RFS sa<br />
Tims- RIMS unu<br />
Tifs- RIFS una<br />
Tabella 14: Coppie attributo-valore per la categoria “Avverbio” in sardo<br />
Attributo Valore Esempio Codice<br />
Tipo _ _ _<br />
Grado positivo chitzi p<br />
superlativo malissimu s<br />
Tabella 15: Msd e ctag per la categoria “Avverbio” in sardo<br />
msd ctag esempio<br />
R-p B mali<br />
R-s BS malissimu<br />
Tabella 16: Coppie attributo-valore per la categoria “Preposizione” in sardo<br />
Attributo Valore Esempio Codice<br />
Tipo preposizione in, de p<br />
Tabella 17: Msd e ctag per la categoria “Preposizione” in sardo<br />
msd Ctag Esempio<br />
Sp SP in<br />
Tabella 18: Coppie attributo-valore per la categoria “Congiunzione” in sardo<br />
Attributo Valore Esempio Codice<br />
Tipo coordinativa ma c<br />
subordinativa poita s<br />
Tabella 19: Msd e ctag per la categoria “Congiunzione” in sardo<br />
msd ctag esempio<br />
Cc CC ma<br />
Cs CS poita<br />
Tabella 20: Coppie attributo-valore per la categoria “Numerale” in sardo<br />
Attributo Valore Esempio Codice<br />
Tipo cardinale centu c<br />
ordinale primu o<br />
Genere maschile primu m<br />
femminile prima f<br />
Numero singolare primu s<br />
plurale primus p<br />
Caso // // //<br />
Tabella 21: Msd e ctag per la categoria “Numerale” in sardo<br />
msd ctag esempio<br />
M.ms- NMS primu<br />
M.fs- NFS prima<br />
M.mp- NMP primus<br />
M.fp- NFP primas<br />
Mc--- N unu, centu<br />
Tabella 22: Msd e ctag per la categoria “Interiezione” in sardo<br />
msd ctag esempio<br />
I I ayo!<br />
Tabella 23: Ctag per la categoria “Residuale” in sardo<br />
ctag esempio<br />
X simboli ecc.<br />
Tabella 24: Ctag per la categoria “Punteggiatura” in sardo<br />
ctag esempio<br />
punct .,:!?…<br />
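The tables above pair each positional msd (morphosyntactic description) string with a compact ctag and an example form. As an illustration of how the positional encoding works, the adjective scheme of Tabella 6 can be decoded mechanically. This is only a sketch under the assumption that the slots after the category letter follow the attribute order Grado, Genere, Numero, Caso, with ‘-’ marking an unspecified attribute; the function name and the value dictionaries are illustrative, not part of the authors’ tagger.<br />

```python
# Sketch of decoding a positional msd string for adjectives (cf. Tabella 6).
# Assumption: after the category letter 'A' and the unused Tipo slot, the
# positions encode Grado, Genere, Numero and Caso, in that order.
ADJ_ATTRS = ["Grado", "Genere", "Numero", "Caso"]
ADJ_VALUES = {
    "Grado": {"p": "positivo", "c": "comparativo", "s": "superlativo"},
    "Genere": {"m": "maschile", "f": "femminile", "c": "l-spec"},
    "Numero": {"s": "singolare", "p": "plurale"},
}

def decode_adjective_msd(msd):
    """Turn an msd such as 'A-pms-' into attribute-value pairs."""
    if not msd.startswith("A"):
        raise ValueError("not an adjective msd")
    features = {}
    # msd[0] is the category; msd[1] is the (empty) Tipo slot.
    for attr, code in zip(ADJ_ATTRS, msd[2:]):
        if code != "-":
            features[attr] = ADJ_VALUES.get(attr, {}).get(code, code)
    return features
```

For instance, decoding “A-pms-” yields Grado=positivo, Genere=maschile, Numero=singolare, matching the row A-pms- → AMS bonu in Tabella 7.<br />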
The Relevance of Lesser-Used Languages for<br />
Theoretical Linguistics: The Case of Cimbrian<br />
and the Support of the TITUS Corpus<br />
Ermenegildo Bidese, Cecilia Poletto<br />
and Alessandra Tomaselli<br />
On the basis of the TITUS Project, the following contribution aims at showing the<br />
importance of a lesser-used language, such as Cimbrian, for the theory of grammar. In<br />
Chapter 1, we present the goals of TITUS and the possibilities it offers for analysing<br />
old Cimbrian writings. Building on these possibilities, the second chapter<br />
summarises some recent results of linguistic research on relevant aspects<br />
of Cimbrian grammar, in particular the syntax of verbal elements, of subject clitics,<br />
and of subject nominal phrases. Chapters 3 and 4 discuss the relevance that these results<br />
can have in the Generative framework, in particular with respect to a generalisation<br />
concerning syntactic change in contexts of isolation and language contact.*<br />
1. The TITUS Project (http://titus.uni-frankfurt.de)<br />
The TITUS Project was conceived in 1987 during the Eighth Conference of Indo-<br />
European Studies in Leiden, when some of the participants had the idea to link their<br />
work together in order to create a text database for the electronic storage of writings/<br />
sources relevant to their discipline. 1 The name of the project was “Thesaurus of<br />
Indo-European <strong>Text</strong>ual Materials on Data Media” (Thesaurus indogermanischer<br />
<strong>Text</strong>materialien auf Datenträgern). In the first phase, the project aimed at preparing<br />
a collection of textual materials from old Indo-European languages, such as Sanskrit,<br />
Old Iranian, Old Greek, Latin, as well as Hittite, Old High German and Old English.<br />
In the beginning of the ’90s, the rapid increase of electronic storage capacities<br />
in data processing led to a second phase of the project in 1994. During the Third<br />
Working Conference on the Employment of Data Processing in Historical and<br />
Comparative Linguistics, in Dresden, the newly-founded working group ‘Historisch-<br />
Vergleichende Sprachwissenschaft’ (Historic-Comparative Linguistics) of the Society<br />
for Computational Linguistics and Language Technology (Gesellschaft für Linguistische<br />
* The present contribution was written by the three authors in complete collaboration. For the formal<br />
attribution of scholarly responsibility, we declare that Ermenegildo Bidese drew up sections 1, 1.1,<br />
1.2, 2 and 2.1; Cecilia Poletto sections 2.2 and 2.3; and Alessandra Tomaselli sections 3 and 4. We would like<br />
to thank the staff of <strong>EURAC</strong> for the opportunity to present our research.<br />
1 Cf. Gippert (1995)<br />
Datenverarbeitung) decided on an extension of the objectives for the ‘Thesaurus’,<br />
including further text corpora from other Indo-European and neighbouring languages,<br />
and introduced the new name ‘Thesaurus of Indo-European <strong>Text</strong>ual and Linguistic<br />
Materials’, shortened to the acronym from the German designation: TITUS (Thesaurus<br />
indogermanischer <strong>Text</strong>- und Sprachmaterialien). The addition, ‘linguistic materials’,<br />
emphasizes that TITUS no longer sees itself only as a text database, but also as<br />
a ‘data pool’. 2 On the TITUS server, you can find materials and aids for the analysis of<br />
the texts, such as, among other things, an up-to-date bibliography<br />
with the newest publications in Indo-European studies, teaching materials, lexica,<br />
glossaries, language maps, audiovisual materials, software and fonts, and numerous<br />
helpful links. In fact, since 1995, owing to the above-mentioned conference, TITUS has<br />
been present on the World Wide Web with its own site at http://titus.uni-frankfurt.<br />
de. 3 The project is run by the Institut für Vergleichende Sprachwissenschaft<br />
at the Johann Wolfgang Goethe University in Frankfurt am Main, Germany (director:<br />
Professor Jost Gippert), in cooperation with other European universities.<br />
The third phase in the development of the TITUS Project coincides with the explosive<br />
expansion of the Internet, and the new possibilities that online communication and<br />
Web performance offer. The new target of TITUS is to replace static data<br />
retrieval with interactive retrieval. 4 This means that, in order to better comprehend and<br />
analyse the texts, further information about the writings is made available to the<br />
user, who can then interact with the text. Three issues are pursued:<br />
• a graphic documentation of the physical supports of the texts, usually<br />
manuscripts and inscriptions;<br />
• an automatic retrieval of word form correspondences in a single text<br />
or in an entire language corpus; and,<br />
• an automatic linguistic analysis of occurrences for the morphology of a<br />
word or for the basic forms of a verb. 5<br />
This interactive retrieval system is currently in development.<br />
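The word-form retrieval described in the second point amounts, at its core, to an inverted index from word forms to their occurrences and contexts. The following is a minimal sketch of that idea, with a deliberately naive tokeniser; the actual TITUS system additionally handles orthographic variants and whole-corpus search, which this toy does not.<br />

```python
from collections import defaultdict

def build_word_index(text):
    """Map each (lower-cased) word form to the (line number, line) pairs
    in which it occurs, yielding exact references plus context."""
    index = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        for token in line.lower().split():
            token = token.strip(".,;:!?")  # naive punctuation stripping
            if token:
                index[token].append((lineno, line))
    return index
```

Looking up a form in such an index then returns every passage in which it occurs, in the spirit of the double-click word search described in section 1.2.<br />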
1.1 The Cimbrian <strong>Text</strong>s in the TITUS Project<br />
The TITUS text database includes two Cimbrian texts provided by Jost Gippert,<br />
Oliver Baumann & Ermenegildo Bidese (1999). 6 They comprise the catechism of 1813<br />
2 Bunz (1998:12)<br />
3 Ibid.<br />
4 Cf. Gippert (2001)<br />
5 Cf. Ibid. Cf. the same for four illustrative examples.<br />
6 The direct links are: http://titus.uni-frankfurt.de/texte/etcs/germ/zimbr/kat1813d/ kat18.htm and<br />
http://titus.uni-frankfurt.de/texte/etcs/germ/zimbr/kat1842d/kat18. htm.<br />
(better known as the ‘short Cimbrian catechism’, written in the Cimbrian variety of<br />
the Seven Communities), and a new edition of the same text with slight alterations<br />
from 1842. 7 In fact, this catechism is a Cimbrian translation of the ‘Piccolo Catechismo<br />
ad uso del regno d’Italia’ (Short Catechism for the Italian Kingdom) of 1807. A critical<br />
edition of both the original Italian text and the two Cimbrian versions was provided<br />
by Wolfgang Meid. 8 Knowledge of Cimbrian at this time (with particular<br />
reference to the plateau of the Seven Communities) was still very good, even though<br />
the use of the local Romance variety – as the introduction to the text itself testifies –<br />
was spreading. 9 For this reason, and in view of the possibility<br />
of comparing this text with the first Cimbrian catechism of 1602 (which represents<br />
the oldest Cimbrian writing 10 ), the ‘short catechism’ of 1813 and its later version of<br />
1842 are essential sources for studying and analysing the diachronic development of<br />
the Cimbrian language. 11<br />
On the basis of the above-mentioned critical edition by Professor Meid, we digitised<br />
the text following Meid’s linearization of the original version. Moreover, we<br />
provided a first linguistic structuring of the text, marking, above all, the prefix<br />
of the perfect participle, pronominal clitics, personal pronouns, and the existential<br />
particle -da. 12<br />
1.2 Searching the Linguistic Content of the Cimbrian <strong>Text</strong>s<br />
The first way of accessing the content of the Cimbrian texts is to select, in the<br />
entry form in the right frame of the text’s start page, the levels (chapters, paragraphs,<br />
verses and lines) into which the text is subdivided. In this way, you can<br />
precisely locate any given passage of the Cimbrian text. 13<br />
7 Cimbrian is a German dialect commonly spoken today in the village of Lusern/Luserna in the region of<br />
Trentino in northern Italy. It is also found, albeit in widely dispersed pockets, in the Venetian communities<br />
of Mittoballe/Mezzaselva (Seven Communities) and Ljetzan/Giazza (Thirteen Communities), in the<br />
northeast of Italy. When the Cimbrian colonies were founded and where the colonists came from are<br />
still subjects of controversy, although the accepted historical explanation is that the Cimbrian colonies<br />
originated from a migration of people from Tyrol and Bavaria (Lechtal) at the beginning of the second<br />
millennium. For a general introduction about the Cimbrian question and this language, cf. Bidese<br />
(2004b).<br />
8 Cf. Meid (1985b)<br />
9 Cf. Cat.1813:17-21 in Meid (1985b:35)<br />
10 Cf. Meid (1985a). The first Cimbrian catechism is the translation of Cardinal Bellarmino’s ‘Dottrina<br />
cristiana breve’ (short Christian doctrine). In spite of the title, the text is considerably longer than the<br />
‘short catechism’ of 1813.<br />
11 Moreover, in TITUS, there is the first part of Remigius Geiser’s (1999) self-learning Cimbrian course (cf.<br />
http://titus.fkidg1.uni-frankfurt.de/didact/zimbr/cimbrian.htm).<br />
12 Cf. for the linguistically analysed texts following links: http://titus.uni-frankfurt.de/texte/etcs/germ/<br />
zimbr/kat1813s/kat18.htm and http://titus.uni-frankfurt.de/texte/etcs/germ/ zimbr/kat1842s/<br />
kat18.htm.<br />
13 Cf. for a detailed description of all these possibilities Gippert (2002).<br />
A second possibility for content searching is to use the TITUS word search<br />
engine. By double-clicking on a given word of the Cimbrian text, for example, you can<br />
automatically search for its occurrences in the text, for the exact text references, and<br />
for the context in which this word is used (including orthographic variants).<br />
A third way of content searching in the Cimbrian texts consists of using a search<br />
entry form that you can find when you open the link Switch to Word Index on the right<br />
frame of the start page of the text. In the box, you can enter a word and obtain its<br />
occurrences in the Cimbrian text.<br />
In conclusion, we can state that the TITUS Project, with all the above-mentioned<br />
possibilities (and including the Cimbrian texts with a first linguistic structuring), offers<br />
a good starting-point for research on the diachronic development of Cimbrian<br />
syntax.<br />
2. Some Relevant Aspects of Cimbrian Syntax<br />
In the last decade, three interrelated syntactic aspects of the Cimbrian dialects<br />
have become the subject of intensive descriptive studies, from both the diachronic<br />
and the synchronic point of view: a) the syntax of verbal elements; b) the syntax of<br />
subject clitics; and, c) the syntax of subject NPs. The theoretical relevance of these<br />
studies will be discussed in section 4.<br />
2.1 Verb Syntax<br />
As for the syntax of verbal elements, the following descriptive results can be taken<br />
for granted:<br />
i) Cimbrian is no longer characterised by the V2 restriction, which requires the<br />
finite verb to appear in second position in main declarative clauses. As the following<br />
examples show, the finite verb can be preceded by two or more constituents that are<br />
not rigidly ordered, as shown by the fact that both (1) (a and b) and (2) are grammatical.<br />
Similar cases of V3 (as in [1a]) or V4 (as in [1b]) are not acceptable either in Standard<br />
German (cf. 3) or in any other continental Germanic language: 14<br />
(1a) Gheistar in Giani hat gahakat iz holtz ime balje (/in balt) 15 (Giazza)<br />
Yesterday the G. has cut the wood in the forest<br />
(1b) De muotar gheistar kam Abato hat kost iz mel 16 (Giazza)<br />
The mother yesterday in Abato has bought the flour<br />
14 Cf. Scardoni (2000), Poletto & Tomaselli (2000), Tomaselli (2004), Bidese & Tomaselli (2005). In the<br />
catechism of 1602, there are few examples of V3 constructions, but this is probably due to the fact that<br />
there is no relevant context for the topic. Cf. for this problem Bidese and Tomaselli (2005:76ff.)<br />
15 Scardoni (2000:152)<br />
16 Ivi:157<br />
80
(2) Haute die Mome hat gekoaft die öala al mercà 17 (Luserna)<br />
Today the mother has bought the eggs at-the market<br />
(3) *Gestern die Mutter hat Mehl gekauft<br />
yesterday the mother has flour bought<br />
ii) A correlate of the V2 phenomenology is the reordering of subject and<br />
inflected verb: in the Germanic languages, 18 the subject can be found in main clauses<br />
to the right of the inflected verb (but still to the left of a past participle, if the<br />
sentence contains one) when another constituent is located in first position, yielding<br />
the ordering XP Vinfl Subject … (Vpast part.). In Cimbrian, the phenomenon of subject–<br />
(finite) verb inversion has been limited to subject clitics since the first written<br />
documents (i.e., the Cimbrian catechism of 1602, here abbreviated as Cat.1602) (cf.<br />
4), and survived the loss of the V2 word order restriction for quite a long time (cf. 5<br />
and 6). Nowadays, in Giazza, it is only optionally present, and only for some speakers<br />
(cf. 7 and 8), while it survives in Luserna (cf. 9 and 10): 19<br />
(4) [Mitt der Bizzonghe] saibar ghemostert zò bizzan den billen Gottez. 20<br />
Through knowledge are-we taught to know the will of God.<br />
(5) [Benne di andarn drai Lentar habent gahört asó], haben-se-sich manegiart … 21<br />
When the other three villages had heard this, had-they taken pains to …<br />
(6) [Am boute] [gan ljêtsen] hense getrust gien … 22<br />
Once in Ljetzan have-they got to go …<br />
(7) In sontaghe regatz / In sontaghe iz regat 23 (Giazza)<br />
On Sunday rains-it / On Sunday it rains<br />
(8) Haute er borkofart de oiar / Haute borkofartar de oiar 24 (Giazza)<br />
Today he sells the eggs/today sells-he the eggs<br />
17 Grewendorf & Poletto (2005:117)<br />
18 English has this possibility too, but it is restricted to main interrogatives, while in the other Germanic<br />
languages it is found also in declaratives.<br />
19 Bosco (1996) and (1999), Benincà & Renzi (2000), Scardoni (2000), Poletto & Tomaselli (2000), Tomaselli<br />
(2004), Bidese & Tomaselli (2005) and Grewendorf & Poletto (2005). That subject clitics continue<br />
to invert when nominal subjects cannot is a well-known generalisation confirmed in other language<br />
domains, such as Romance.<br />
20 Cat.1602:694–5 in Meid (1985a:87)<br />
21 Baragiola 1906:108<br />
22 Schweizer 1939:36<br />
23 Scardoni 2000:144<br />
24 Ivi:155<br />
81
(9) *Haüte geat dar Giani vort 25 (Luserna)<br />
Today goes the Gianni away<br />
(10) Haüte geatar vort (dar Gianni) 26 (Luserna)<br />
Today goes-he away (the John)<br />
This seems to indicate that the ‘core’ of the V2 phenomenon (i.e., the word order<br />
restriction) could be lost before one of its main correlates (i.e., pronominal subject<br />
inversion).<br />
iii) Germanic languages can be OV (German and Dutch) or VO (Scandinavian<br />
and Yiddish). In Cimbrian, the discontinuity of the verbal complex is limited to the<br />
intervention of pronominal elements, negation (cf. 12), monosyllabic adverbs/<br />
verbal prefixes, 27 and bare quantifiers 28 (cf. 13). In fact, from a typological point<br />
of view, Cimbrian belongs, without any doubt, to the group of VO languages:<br />
(11a) Haüte die Mome hat gebäscht di Piattn 29 (Luserna)<br />
Today the mother has washed the dishes<br />
(11b) *Haüte di Mome hat di Piattn gebäscht 30 (Luserna)<br />
(12) Sa hom khött ke dar Gianni hat net geböllt gian pit se 31 (Luserna)<br />
They have said that the G. has not wanted go with them<br />
(13a) I hon niamat gesek 32 (Luserna)<br />
I have nobody seen<br />
(13b) han-ich khoome gaseecht (Roana)<br />
have-I nobody seen<br />
iv) Residual word order asymmetries between main and subordinate clauses<br />
with respect to the position of the finite verb are determined by a) the syntax<br />
of some ‘light’ elements (cf. 14 and 15 for negation and pronominals); b) by the<br />
presence of clitics (cf. 14b and 15b versus 16 and 17); and, c) by the type of<br />
subordinate clause (cf. 14b and 15b versus 18 and 19):<br />
(14a) Biar zéteren nete 33<br />
We give in not<br />
25 Grewendorf & Poletto 2005:116<br />
26 Ibid.<br />
27 Cf. Bidese 2004a and Bidese & Tomaselli 2005<br />
28 Cf. Grewendorf & Poletto (2005)<br />
29 Ivi:117<br />
30 Ivi:121<br />
31 Ivi:122<br />
32 Ivi:123<br />
33 Baragiola 1906:108<br />
82
(14b) ’az se nette ghenan vüar 34<br />
that they don’t put forward<br />
(15a) Noch in de erste Lichte von deme Tage hevan-se-sich alle 35<br />
Even at the break of that day get-they all up<br />
(15b) ’az se sich legen in Kiete 36<br />
that they calm down<br />
(16) ’az de Consiliere ghen nette auf in de Sala 37<br />
that the advisers go not above into the room<br />
(17) ’az diese Loite richten-sich 38<br />
that these people arrange themselves<br />
(18) umbrume di andar Lentar saint net contente 39<br />
because the other villages are not glad<br />
(19) umbrume dear Afar has-sich gamachet groaz 40<br />
2.2 Clitic Syntax<br />
because the question has got great<br />
The Cimbrian dialect, contrary to other Germanic languages, which only admit weak<br />
object pronouns, is characterised by a richly structured set of pronominal clitics, like<br />
all northern Italian dialects. 41 One important piece of evidence that subject and<br />
object pronouns are indeed clitics is the phenomenon of clitic doubling, namely,<br />
the possibility of doubling a full pronoun or an NP with a clitic, already noted in the<br />
grammars:<br />
(20) az sai-der getant diar 42<br />
that it will be to you made to you<br />
(21) Hoite [de muutar] hat-se gakhoofet de ojar in merkaten (Roana)<br />
Today the mother has-she bought the eggs at-the market<br />
From a diachronic point of view, this phenomenon already appears for subject<br />
clitics in Cat.1813, but is limited to interrogative sentences, while in Baragiola (1906)<br />
it also appears in declarative sentences. The phenomenon is, nowadays, according to<br />
34 Ivi:111<br />
35 Ivi:109-110<br />
36 Ivi:114<br />
37 Ivi:110<br />
38 Ivi:108<br />
39 Ivi:105<br />
40 Ivi:113<br />
41 For an exhaustive description of the positions of clitics and pronouns in Cimbrian cf. Castagna (2005).<br />
42 Schweizer (1952:27)<br />
83
the research of Scardoni (2000), no longer productive in Giazza, optional/possible in<br />
Luserna, 43 but still frequent in Roana. 44<br />
In main clauses, subject clitics are usually found in enclisis to the finite verb (in<br />
Giazza, only as a vestige, cf. the above sentences [7] and [8]): 45<br />
(22) Bia hoas-to (de) (du)? (Luserna)<br />
How call-you?<br />
(23) Hasto gi khoaft in ğornal? 46 (Luserna)<br />
Have-you bought the newspaper?<br />
(24) Ghestar han-ich ghet an libar ame Pieren (Roana) 47<br />
Yesterday have-I given a book to P.<br />
In embedded clauses, subject clitics occur either in enclitic position to the finite<br />
verb or in enclitic position to the conjunction, depending on two main factors: i) the<br />
Cimbrian variety under consideration (and the ‘degree’ of V2 preservation); and, ii) the<br />
different types of subordinate clauses. According to our data, nowadays,<br />
enclisis to the finite verb seems to be the rule in Roana (25-8), but Schweizer’s grammar<br />
(Schweizer 1952) gives evidence for a different distribution of the subject clitics in<br />
subordinate clauses. He observes that subject clitics in the variety of Roana usually<br />
occur (or occurred) in Wackernagel’s position (WP), in enclisis to the subordinating<br />
conjunction (cf. 29-31; cf. the above sentences [14b] and [15b] as well): 48<br />
(25) Ist gant zoornig, ambrumme han-ich ghet an libarn ame Pieren (Roana)<br />
(He) has got angry, because have-I given a book to P.<br />
(26) Gianni hatt-ar-mi gaboorset, benne khimmas-to hoam (Roana)<br />
Gianni has-he-me asked, when come-you home<br />
(27) Haban-sa-mich gaboorset, ba ghe-ban haint (Roana)<br />
Have-they-me asked, where go-we today evening<br />
(28) Haban-sa-mich khött, habat-ar gabunnet Maria nach im beeck (Roana)<br />
Have-they-(to)me said, have-you met M. on the road<br />
(29) bas-er köt 49 (Roana)<br />
43 Cf. Vicentini (1993:149-51) and Castagna (2005)<br />
44 Our data suggest that there may be a difference between auxiliaries and main verbs: with the auxiliary<br />
‘have’, doubling seems mandatory, while this is not the case with main verbs.<br />
45 Some ambiguous forms can also appear in first position; we assume here that when occurring in first<br />
position, the pronominal forms are not real clitics, but, at most, weak forms.<br />
46 Vicentini (1993:44)<br />
47 In the variety of Roana, when the subject is definite and preverbal, there is always an enclitic<br />
pronoun.<br />
48 Cf. Castagna (2005) as well<br />
49 Schweizer (1952:27)<br />
84
what-he says<br />
(30) ben-ig-en nox vinne 50 (Roana)<br />
if-I-him still meet<br />
(31) ad-ix gea au 51 (Roana)<br />
if-EXPL.-I (az-da-ich) go above<br />
All the same, Schweizer (1952) underlines that there are many irregularities,<br />
whereby subject clitics in embedded clauses can appear in enclisis to<br />
the finite verb, or in both positions (clitic doubling). For Luserna, Schweizer notes that<br />
all the pronouns have to be cliticised to the complementiser. 52 But we found evidence<br />
for a construction (cf. 32) in which the subject clitic appears in enclisis to the finite<br />
verb, probably due to the presence of a constituent between the complementizer and<br />
the finite verb (a case of “residual” embedded V2). In this sentence, there is clitic<br />
doubling too:<br />
(32) Dar issese darzürnt obrom gestarn honne i get an libar in Peatar 53<br />
(Luserna)<br />
He has got angry because yesterday have-I I given a book to P.<br />
In main clauses, object clitics are always in enclisis to the inflected verb:<br />
(33a) Der Tatta hat-se gekoaft 54 (Luserna)<br />
The father has-her bought<br />
(33b) Der Tatta *se hat gekoaft 55 (Luserna)<br />
(34) De muutari hat-sei-se gasecht (Roana)<br />
The mother has-she-her seen<br />
(35) Gianni hatt-an-se gaseecht (Roana)<br />
Gianni has-he-her seen<br />
The same is true for embedded declarative clauses:<br />
(36a) I woas ke der Tatta hatse (net) gekoaft 56 (Luserna)<br />
I know that the father has-her (not) bought<br />
(36b) I woas ke der Tatta *se hat gekoaft 57 (Luserna)<br />
I know that the father her has bought<br />
50 Ibid.<br />
51 Ibid.<br />
52 Ibid. This analysis is confirmed in the data of Vicentini (1993)<br />
53 Grewendorf & Poletto (2005:121)<br />
54 Ivi:122<br />
55 Ibid.<br />
56 Ivi:123<br />
57 Ibid.<br />
85
(37) Gianni hatt-ar-mi gaboorset, bear hat-ar-dich telephonaart (Roana)<br />
Gianni has-he-me asked, who has-he-you called<br />
(38) kloob-ich Gianni hatt-ar-me ghet nicht ad ander (Roana)<br />
believe-I (that) Gianni has-he-(to)me given nothing else<br />
(39) biss-i net, Gianni hat-an-en ghakhoofet (Roana)<br />
know-I not, (if) Gianni has-he-him bought<br />
While in Roana, enclisis to the finite verb is the rule in all embedded clauses (including<br />
embedded interrogatives), in Luserna, in relative and embedded interrogative clauses,<br />
subject and object clitics are usually found in a position located to the immediate right<br />
of the complementiser (or the wh-item). 58 This corresponds to the Wackernagel position<br />
of the Germanic tradition, which usually hosts weak pronouns in the Germanic<br />
languages; these are rigidly ordered (contrary to DPs, which can scramble):<br />
(40) ’s baibe bo-da-r-en hat geet an liber 59 (Luserna)<br />
the woman who-EXPL.-he-(to) her has given a book<br />
(41) dar Mann bo dar en (er) hat geet an libar (Luserna)<br />
the man who-EXPL.-he-him (he) has given a book<br />
(42) Dar Giani hatmar gevorst zega ber (da)de hat o-gerüaft (Luserna)<br />
The G. has-me asked compl. who you has phoned<br />
(43) I boas net ber-me hat o-gerüaft (Luserna)<br />
I know not who us has phoned<br />
(44) I vorsmaar zega bar me mage hom o-gerüaf (Luserna)<br />
I wonder COMPL. who me could have phoned<br />
Summarising the data illustrated so far, we can state that:<br />
• Both subject and object clitics are always in enclisis to the finite verb in<br />
main clauses in all varieties;<br />
• Currently in Roana, both subject and object clitics always occur in enclisis<br />
to the finite verb in all embedded clauses; and,<br />
• In Luserna, clitics occur in enclisis in embedded declaratives and in WP in<br />
relative and embedded interrogatives.<br />
From this we conclude that:<br />
• Luserna displays a split between embedded wh-constructions on the one<br />
hand and embedded declaratives on the other, while Roana (at least nowadays)<br />
does not; and,<br />
58 This means that no element can intervene between the element located in CP and the pronoun(s).<br />
59 Grewendorf & Poletto (2005:121)<br />
• No cases of proclisis to the inflected verb are ever found in any Cimbrian<br />
variety.<br />
In general, although Cimbrian, contrary to other Germanic languages, has<br />
developed a class of clitic pronouns, it does not seem to have ‘copied’ the syntactic<br />
behaviour of subject and object clitics of neighbouring Romance dialects, which<br />
consistently realize proclisis to the inflected verb for object clitics in all sentence<br />
types, permit enclisis of subject clitics only in main interrogative clauses, and<br />
enclisis of object clitics only with infinitival verbal forms. 60 On the contrary, enclisis<br />
to the inflected verb seems to be the rule in Cimbrian. Proclisis to the inflected verb<br />
is not at all attested, and the only other position apart from enclisis is the Germanic<br />
WP position in some embedded clause types in the variety of Luserna.<br />
2.3 The Syntax of Subject NPs<br />
As regards the syntax of the subject NPs in Cimbrian, there is evidence of the<br />
following aspects:<br />
• Cimbrian is not a pro-drop language. As with standard German, English<br />
and French, it is characterised by: a) obligatory expression of the subject (cf.<br />
45); b) the use of the expletive pronoun iz (cf. 46); c) (contrary to standard<br />
German) a VO typology and the consequent adjacency of the verbal complex (cf.<br />
47); and d) a relatively free position of the finite verb: 61<br />
(45) i han gaarbat (/gaarbatat) ime balt / Haute hani gaarbatat ime balje 62<br />
(Giazza)<br />
Today I have worked in the forest / Today have-I worked in the forest<br />
(46) Haute iz regat / Haute regatz 63 (Giazza)<br />
Today it rains / Today rains-it<br />
(47) Gheistar in Giani hat gahakat iz holtz ime balje (/in balt) 64 (Giazza)<br />
Yesterday G. has cut the wood in the forest<br />
• Languages requiring mandatory expression of the subject, such as English<br />
or French, allow subject NPs to the right of the verbal<br />
60 Note that there are Romance dialects that have enclisis to the inflected verb, such as the variety of<br />
Borgomanero, studied by Tortora (1997), but this is a Piedmontese dialect, which cannot have been in<br />
contact with Cimbrian, so we can exclude that enclisis developed through language contact with<br />
Romance.<br />
61 Cf. Poletto & Tomaselli (2002) and Tomaselli (2004:543). Cf. Castagna (2005) as well.<br />
62 Scardoni (2000:155)<br />
63 Ivi:144<br />
64 Ivi:152<br />
complex only in very limited contexts. From this perspective, it is interesting to<br />
note that Cimbrian generally permits it (cf. 48 and 49), similarly to standard Italian<br />
(cf. 50), and in opposition to the neighbouring Romance dialects, in which the<br />
postverbal subject co-occurs with a subject pronoun in a preverbal position (cf. 51 and<br />
52):<br />
(48) Gheistar hat gessat dain manestar iz diarlja 65 (Giazza)<br />
Yesterday has eaten your soup the girl<br />
(49) Hat gahakat iz holtz dain vatar 66 (Giazza)<br />
Has cut the wood your father<br />
(50) Lo hanno comprato al mercato i miei genitori<br />
It have bought at the market my parents<br />
(51) Algéri l’à magnà la to minestra la buteleta 67<br />
Yesterday she has eaten your soup the girl<br />
(52) L’à taià la legna to papà 68<br />
He has cut the wood your father<br />
3. Cimbrian Data and the Generative Grammar Framework<br />
The results of the syntactic description of some aspects of Cimbrian grammar are<br />
relevant for any theoretical framework. In particular, within the Generative Grammar<br />
theoretical approach, the data discussed so far is relevant from both a synchronic and<br />
a diachronic point of view.<br />
Cimbrian, having been in a situation of language contact for centuries, offers a<br />
privileged point of view for determining how phenomena are lost and acquired. A<br />
number of interesting observations can be made concerning language change induced<br />
by language contact.<br />
First, Cimbrian shows that the ‘correlates’ of a given phenomenon (in our case<br />
V2) are lost after the loss of the phenomenon itself. More specifically, Cimbrian has<br />
maintained the possibility of inverting subject pronouns, while losing the V2 linear<br />
restriction. On the other hand, we can also state that the correlates can be acquired<br />
before the phenomenon itself: although Cimbrian has not developed a fully-fledged<br />
pro-drop system, it already admits free subject inversion of the Italian type (i.e., the<br />
subject inverts with the ‘whole’ verbal phrase).<br />
65 Ivi:165<br />
66 Ibid.<br />
67 Ibid.<br />
68 Ibid.<br />
Second, syntactic change does not proceed in parallel to the lexicon, where a word<br />
is simply borrowed and then adapted to the phonological system of the language. 69 The<br />
syntactic distribution of clitic elements in Cimbrian shows that they have maintained<br />
a Germanic syntax, allowing either enclisis to the verb or the complementizer (WP),<br />
but never proclisis to the inflected verb, as is the case for Romance. Therefore, even<br />
though Cimbrian might have developed (or rather ‘maintained’/’preserved’) a class<br />
of clitic elements due to language contact, it has not ‘copied’ the Romance syntax of<br />
clitics.<br />
Moreover, the study of Cimbrian also confirms two descriptive generalisations<br />
concerning the loss of the V2 phenomenology established on the basis of the evolution<br />
of Romance syntax: 70<br />
• Embedded wh-constructions constitute the sentence type that maintains the<br />
asymmetry with main clauses longest. This is shown in Cimbrian by the possibility<br />
of having clitics in WP only in embedded interrogatives and relatives in the variety<br />
of Luserna; and,<br />
• Inversion of NPs is lost before inversion of subject clitics, which persists<br />
for a longer period.<br />
More generally, Cimbrian also confirms the hypothesis first put forth by Lightfoot<br />
(1979), and mathematically developed by Clark & Roberts (1993), that the reanalysis<br />
made by bilingual speakers goes through ambiguous strings that have two possible<br />
structural analyses; the speaker tends to use the more economical one (in terms of<br />
movement) that is compatible with the set of data at his/her disposal.<br />
Also, from the synchronic point of view, Cimbrian is an interesting case study, at<br />
least as far as verb movement is concerned. In V2 languages, it is most probably an<br />
Agreement feature located in the C that attracts the finite verb (see Tomaselli 1990 for<br />
a detailed discussion of this hypothesis). Cimbrian seems to have lost this property, as<br />
neither the linear V2 restriction nor the NP subject inversion are possible at this time.<br />
On the other hand, it has not (yet) developed a ‘Romance’ syntax, because clitics are<br />
always enclitics in the main clause (both declarative and interrogative). It is a well-<br />
known fact (see, among others, Sportiche 1993 and Kayne 1991 & 1994) that in the<br />
higher portion of the IP layer, there is a (set of) position(s) for clitic elements, and<br />
that subject clitics are always located to the left of object clitics inside the template<br />
containing the various clitics.<br />
69 This hypothesis was already put forward by Brugmann (1917).<br />
70 See Benincà (2005) for the first generalization, Benincà (1984), Poletto (1998) and Roberts (1993), for<br />
the second.<br />
The position of the inflected verb in Cimbrian is neither the one found in V2 languages<br />
(within the CP domain), nor the lower one found in modern Romance (within the IP<br />
domain). The syntax of clitics suggests that, in Cimbrian, the inflected verb moves to<br />
a position inside the clitic layer in the high IP (corresponding to the traditional WP),<br />
and precisely to the left of clitic elements both in main and embedded declarative<br />
clauses. 71 If this theoretical description proves to be tenable, we are now in the<br />
position to speculate about a possible explanation.<br />
4. A New Theoretical Correlation ‘Visible’ in Cimbrian<br />
A further interesting field to explore has to do with the theoretical reason why<br />
Cimbrian could not develop a Romance clitic syntax. In other words, there must have<br />
been some restriction constraining the speakers to maintain enclisis.<br />
A striking difference between the neighbouring Romance dialects and Cimbrian is<br />
the past participle agreement phenomenon. Past participle agreement is mandatory<br />
(at least for some object clitics) in Northern Italian dialects (cf. 53), while it is<br />
completely absent in Cimbrian. The morphological structure of the Cimbrian past<br />
participle has simply preserved the invariant German model, that is, ge- … -t, (cf.<br />
54):<br />
(53) (A) so k’el papá li ga visti<br />
I know that the father them-has seen<br />
(54) I woas ke der Tatta hatze (net) gekoaft (Luserna)<br />
I know that the father has-her (not) bought<br />
The existence of past participle agreement is usually analysed in the relevant<br />
literature as involving an agreement projection (AgrOP) to which both the object<br />
clitic and the verb move; the configuration of spec-head agreement between the two<br />
triggers the ‘passage’ of the number and gender features of the clitic onto the verb<br />
yielding agreement on the past participle (see Kayne 1991 and 1993).<br />
We believe that it is the presence of this lower agreement projection that is related to<br />
the possibility of having proclisis in Romance, and its absence that constrains Cimbrian<br />
to enclisis to the inflected verb. In Cimbrian, the clitic element moves directly to the<br />
higher clitic position (within the IP domain), while in Romance, this movement is<br />
always in two steps, the first being movement to the lower AgrO projection. In favour<br />
of this assumption is the fact that Cimbrian, like all other Germanic varieties, never<br />
showed past participle agreement of the Romance type.<br />
71 As we have already noted, the same is true for embedded interrogatives in Roana, while in Luserna,<br />
the verb is probably located lower in embedded interrogatives and relative clauses, leaving the clitic<br />
in WP alone.<br />
Abbreviations<br />
Cat.1602 Cimbrian Catechism of 1602 (cf. Meid 1985a)<br />
Cat.1813 Cimbrian Catechism of 1813 (cf. Meid 1985b)<br />
DP Determiner Phrase<br />
NP Nominal Phrase<br />
Vinfl Inflected Verb<br />
Vpast part. Past Participle Verb<br />
Wh (interrogative element)<br />
XP X-phrase
References<br />
Baragiola, A. (1906). “Il tumulto delle donne di Roana per il ponte (nel dialetto di<br />
Camporovere, Sette Comuni)”. Padova: Tip. Fratelli Salmin, reprinted in Lobbia, N. &<br />
Bonato, S. (eds.) (1998). Il Ponte di Roana. Dez Dink vo’ der Prucka. Roana: Istituto<br />
di Cultura Cimbra.<br />
Benincà, P. (1984). “Un’ipotesi sulla sintassi delle lingue romanze medievali.“ Quaderni<br />
Patavini di Linguistica 4, 3-19.<br />
Benincà, P. (2005). “A Detailed Map of the Left Periphery of Medieval Romance.”<br />
Zanuttini, R. et al. (eds.) (2005). Negation, Tense and Clausal Architecture: Cross-<br />
linguistics Investigations. Georgetown University Press.<br />
Benincà, P. & Renzi, L. (2000). “La venetizzazione della sintassi nel dialetto cimbro.”<br />
Marcato, G. (ed.) (2000). Isole linguistiche? Per un’analisi dei sistemi in contatto.<br />
Atti del convegno di Sappada/Plodn (Belluno), 1–4 luglio 1999. Padova: Unipress,<br />
137–62.<br />
Bidese, E. (2004a). “Tracce di Nebensatzklammer nel cimbro settecomunigiano.”<br />
Marcato, G. (ed.) (2004). I dialetti e la montagna. Atti del convegno di Sappada/<br />
Plodn (Belluno), 2–6 luglio 2003, Padova: Unipress, 269–74.<br />
Bidese, E. (2004b). “Die Zimbern und ihre Sprache: Geographische, historische und<br />
sprachwissenschaftlich relevante Aspekte.” Stolz, T. (ed.) (2004). “Alte“ Sprachen.<br />
Beiträge zum Bremer Kolloquium über “Alte Sprachen und Sprachstufen” (Bremen,<br />
Sommersemester 2003). Bochum: Universitätsverlag Dr. N. Brockmeyer, 3–42.<br />
Bidese, E. & Tomaselli, A. (2005). “Formen der ‚Herausstellung’ und Verlust der V2-<br />
Restriktion in der Geschichte der zimbrischen Sprache.” Bidese, E., Dow, J.R. & Stolz,<br />
T. (eds.) (2005). Das Zimbrische zwischen Germanisch und Romanisch. Bochum:<br />
Universitätsverlag Dr. N. Brockmeyer, 71-92.<br />
Bosco, I. (1996). ’Christlike unt korze Dottrina’: un’analisi sintattica della lingua
cimbra del XVI secolo. Final essay for the degree “Laureat in Modern Languages and<br />
Literature.” Unpublished Essay, University of Verona.<br />
Bosco, I. (1999). “Christlike unt korze Dottrina’: un’analisi sintattica della lingua<br />
cimbra del XVI secolo.” Thune, E.M. & Tomaselli, A. (eds.) (1999). Tesi di linguistica<br />
tedesca. Padova: Unipress, 29–39.<br />
Brugmann, K. (1917). “Der Ursprung des Scheinsubjekts ‘es’ in den germanischen und<br />
den romanischen Sprachen.” Berichte über die Verhandlungen der Königl. Sächsischen<br />
Gesellschaft der Wissenschaften zu Leipzig, Philologisch-historische Klasse 69/5.<br />
Leipzig: Teubner, 1–57.<br />
Bunz, C.M. (1998). “Der Thesaurus indogermanischer Text- und Sprachmaterialien<br />
(TITUS) – ein Pionierprojekt der EDV in der Historisch-Vergleichenden<br />
Sprachwissenschaft.” Sprachen und Datenverarbeitung 1(98), 11-30.<br />
http://titus.uni-frankfurt.de/texte/sdv198.pdf.<br />
Castagna, A. (2005), “Personalpronomen und Klitika im Zimbrischen.” Bidese, E., Dow,<br />
J.R. & Stolz, T. (eds) (2005). Das Zimbrische zwischen Germanisch und Romanisch.<br />
Bochum: Universitätsverlag Dr. N. Brockmeyer, 93-113.<br />
Clark, R. & Roberts, I. (1993), “A Computational Model of Language Learnability and<br />
Language Change.” Linguistic Inquiry 24, 299-345.<br />
Geiser, R. (1999). “Grundkurs in klassischem Zimbrisch.” http://titus.fkidg1.uni-<br />
frankfurt.de/didact/zimbr/cimbrian.htm.<br />
Gippert, J. (1995). “TITUS. Das Projekt eines indogermanistischen Thesaurus.” LDV-<br />
Forum (Forum der Gesellschaft für Linguistische Datenverarbeitung) 12 (2), 35-47.<br />
http://titus.uni-frankfurt.de/texte/titusldv.htm.<br />
Gippert, J. (2001). Der TITUS-Server: Grundlagen eines multilingualen Online-<br />
Retrieval-Systems (aus dem Protokoll des 83. Kolloquiums über die Anwendung der<br />
Elektronischen Datenverarbeitung in den Geisteswissenschaften an der Universität<br />
Tübingen, 17. November 2001).<br />
http://www.zdv.uni-tuebingen.de/tustep/prot/prot831-titus.html.<br />
Gippert, J. (2002). The TITUS Text Retrieval Engine.<br />
http://titus.uni-frankfurt.de/texte/textex.htm.<br />
Grewendorf, G. & Poletto, C. (2005). “Von OV zu VO: ein Vergleich zwischen Zimbrisch<br />
und Plodarisch.” Bidese, E, Dow, J.R. & Stolz, T. (eds) (2005). Das Zimbrische zwischen<br />
Germanisch und Romanisch. Bochum: Universitätsverlag Dr. N. Brockmeyer, 114-128.<br />
Kayne, R.S. (1991). “Romance Clitics, Verb Movement, and PRO.” Linguistic Inquiry<br />
22, 647-686.<br />
Kayne, R.S. (1993). “Towards a Modular Theory of Auxiliary Selection.” Studia<br />
Linguistica 47, 3-31.<br />
Kayne, R.S. (1994). The Antisymmetry of Syntax. Cambridge, Mass.: MIT Press.<br />
Lightfoot, D. (1979). Principles of Diachronic Syntax. Cambridge, England: Cambridge<br />
University Press.<br />
Meid, W. (1985a). Der erste zimbrische Katechismus CHRISTLIKE UNT KORZE<br />
DOTTRINA. Die zimbrische Version aus dem Jahre 1602 der DOTTRINA CHRISTIANA<br />
BREVE des Kardinals Bellarmin in kritischer Ausgabe. Einleitung, italienischer und<br />
zimbrischer Text, Übersetzung, Kommentar, Reproduktionen. Innsbruck: Institut für<br />
Sprachwissenschaft der Universität Innsbruck.<br />
Meid, W. (1985b). Der zweite zimbrische Katechismus DAR KLÓANE CATECHISMO VOR<br />
DEZ BÉLOSELAND. Die zimbrische Version aus dem Jahre 1813 und 1842 des PICCOLO<br />
CATECHISMO AD USO DEL REGNO D’ITALIA von 1807 in kritischer Ausgabe. Einleitung,<br />
italienischer und zimbrischer Text, Übersetzung, Kommentar, Reproduktionen.<br />
Innsbruck: Institut für Sprachwissenschaft der Universität Innsbruck.<br />
http://titus.uni-frankfurt.de/texte/etcs/germ/zimbr/kat1813d/kat18.htm.<br />
Poletto, C. (1998). “L’inversione interrogativa come ‘verbo secondo residuo’: l’analisi<br />
sincronica proiettata nella diacronia.” Atti del XXX convegno SLI. Roma: Bulzoni,<br />
311-327.<br />
Poletto, C. & Tomaselli, A. (2000). “L’interazione tra germanico e romanzo in due<br />
‘isole linguistiche’. Cimbro e ladino centrale a confronto.” Marcato, G. (ed.) (2000).<br />
Isole linguistiche? Per un’analisi dei sistemi in contatto. Atti del convegno di Sappada/<br />
Plodn (Belluno), 1–4 luglio 1999. Padova: Unipress, 163–76.<br />
Poletto, C. & Tomaselli, A. (2002). “La sintassi del soggetto nullo nelle isole tedescofone<br />
del Veneto: cimbro e sappadino a confronto.” Marcato, G. (ed.) (2002). La dialettologia<br />
oltre il 2001. Atti del convegno di Sappada/Plodn (Belluno), 1–5 Luglio 2001. Padova:<br />
Unipress, 237–52.<br />
Roberts, I. (1993). Verbs and Diachronic Syntax: A Comparative History of English and<br />
French. Dordrecht: Kluwer.<br />
Scardoni, S. (2000). La sintassi del soggetto nel cimbro parlato a Giazza. Final essay<br />
for the degree “Laureat in Modern Languages and Literature.” Unpublished Essay,<br />
University of Verona.<br />
Schweizer, B. (1939). Zimbrische Sprachreste. Teil 1: Texte aus Giazza (Dreizehn<br />
Gemeinden ob Verona). Nach dem Volksmunde aufgenommen und mit deutscher<br />
Übersetzung herausgegeben. Halle/Saale: Max Niemeyer.<br />
Schweizer, B. (1952). Zimbrische Gesamtgrammatik. Band V.: Syntax der zimbrischen<br />
Dialekte in Oberitalien. Diessen am Ammersee. Unpublished typescript. Marburg/<br />
Lahn, Germany: Institut für die Forschung der Deutschen Sprache.<br />
Sportiche, D. (1993). “Clitic Constructions.” Rooryck, J. & Zaring, L. (eds) (1993).<br />
Phrase Structure and the Lexicon. Dordrecht: Kluwer, 213-276.<br />
Tomaselli, A. (1990). La sintassi del verbo finito nelle lingue germaniche. Padova:<br />
Unipress.<br />
Tomaselli, A. (2004). “Il cimbro come laboratorio d’analisi per la variazione linguistica<br />
in diacronia e sincronia.” Quaderni di lingue e letterature 28, Supplemento: Variis<br />
Linguis: Studi offerti a Elio Mosele in occasione del suo settantesimo compleanno,<br />
533–549.<br />
Tortora, C.M. (1997). “I Pronomi Interrogativi in Borgomanerese.” Benincà, P.<br />
& Poletto, C. (eds) (1997). Quaderni di Lavoro dell’ASIS (Atlante Sintattico Italia<br />
Settentrionale): Strutture Interrogative dell’Italia Settentrionale. Padova: Consiglio<br />
Nazionale delle Ricerche, 83-88.<br />
Vicentini, R. (1993). Il dialetto cimbro di Luserna: analisi di alcuni fenomeni linguistici.<br />
Final essay for the degree “Laureat in Modern Languages and Literature.” Unpublished<br />
Essay, University of Trento.<br />
Creating Word Class Tagged Corpora<br />
for Northern Sotho by Linguistically<br />
Informed Bootstrapping<br />
Danie J. Prinsloo and Ulrich Heid<br />
To bootstrap tagging resources (tagger lexicon and training corpus) for Northern<br />
Sotho, a tagset and a number of modular and reusable corpus processing tools are<br />
being developed. This article describes the tagset and routines for identifying verbs<br />
and nouns, and for disambiguating closed class items. All of these are based on<br />
morphological and morphosyntactic specificities of Northern Sotho.<br />
1. Introduction<br />
In this paper, we report on ongoing work towards the parallel creation of<br />
computational linguistic resources for Northern Sotho, on the basis of linguistic<br />
knowledge about the language. Northern Sotho is one of the eleven official languages<br />
of South Africa, spoken by about 4.2 million people in the northeastern part of the<br />
country. It belongs to the Sotho family of the Bantu languages (S32; Guthrie 1971).<br />
The three Sotho languages are closely related.<br />
The creation of Natural Language Processing (NLP) resources is part of an effort<br />
towards an infrastructure for corpus linguistics and computational lexicography and<br />
terminology for Northern Sotho, which is seen as an element of a broader action for<br />
the development of Human Language Technology (HLT) and NLP applications for the<br />
South African languages.<br />
Parallel resource creation has been attempted as part of our research and<br />
development agenda in order to speed up the resource building process, in the sense<br />
of rapid prototyping of a part-of-speech (=POS) tagset; a tagger lexicon and (manually<br />
corrected) reference corpus; and a statistical tagger. These constitute the first set of<br />
corpus linguistic tools to be developed (we report on the first three tools here). At the<br />
same time, we intend to verify to what extent ‘traditional’ corpus linguistic methods<br />
and tools (as used for European languages) can be applied to a Bantu language, an<br />
attempt that, to our knowledge, has not been made before.<br />
Two text corpora are used as input to the study. The first is a 43,000-token corpus,<br />
a selection from the Northern Sotho novel Tša ka Mafuri (Matsepe 1974), and the<br />
second is the Pretoria Sepedi Corpus (PSC) of 6 million tokens, a collection of 327<br />
Northern Sotho books and magazines. These are raw, unannotated corpora, compiled<br />
by means of optical character recognition (OCR), commonly known as ‘scanning’, with<br />
tokenization done per sentence. The PSC is still being cleaned of scanning errors.<br />
For details regarding the PSC and subsequent applications thereof,<br />
see sources such as Prinsloo (1991), De Schryver & Prinsloo (2000, 2000a & 2000b), and<br />
Prinsloo & De Schryver (2001).<br />
In this paper, we will discuss our task at both a specific and general level. We report<br />
about the specific task of creating resources for Northern Sotho, and our examples<br />
and illustrative material will be taken from this language. More generally, we also<br />
analyse the exercise in terms of methods and strategies for the joint bootstrapping<br />
of different resources for an ‘unresourced’ language, trying to abstract away from<br />
language-specific details.<br />
This article is organised as follows: in section 2, we give a brief overview of some of<br />
the language-specific phenomena we exploit in resource building; section 3 deals with<br />
the component elements of a corpus linguistic infrastructure for Northern Sotho that<br />
are presently being constructed, with the steps and procedures used in the process and<br />
the characteristics of the resulting resources; section 4 is a methodological conclusion<br />
(order of steps in resource creation, role of linguistic knowledge, etc.) and an analysis<br />
of the processes in terms of generalisability and portability to other Sotho languages,<br />
to other Bantu languages, and possibly to completely different languages.<br />
2. Northern Sotho Linguistics Informing Corpus Technology<br />
A prerequisite to successful interpretation of the criteria for and output of a POS-<br />
tagger for Northern Sotho is a brief outline of certain basic linguistic characteristics<br />
of the language, especially of nouns and verbs. See Lombard et al. (1985), Louwrens<br />
(1991), and Poulos & Louwrens (1994) for a detailed grammatical description of this<br />
language.<br />
2.1 Noun System: Classifiers and Concords<br />
Nouns in Bantu languages are grouped into different noun classes. Compare Table<br />
1 for Northern Sotho.<br />
Table 1: Noun Classes of Northern Sotho with Examples<br />
Class Prefix Example Translation<br />
1 mo- monna man<br />
2 ba- banna men<br />
1a Ø malome uncle<br />
2b bo+ bomalome uncles<br />
3 mo- monwana finger<br />
4 me- menwana fingers<br />
5 le- lesogana young man<br />
6 ma- masogana young men<br />
7 se- selepe axe<br />
8 di- dilepe axes<br />
9 N-/Ø nku sheep (sg.)<br />
10 di+ dinku sheep (pl.)<br />
11-13 (these classes do not exist in Northern Sotho)<br />
14 bo- bogobe porridge<br />
6 ma- magobe different kinds of porridge<br />
15 go go bona to see<br />
16 fa- fase below<br />
17 go- godimo above<br />
18 mo- morago behind<br />
Nouns are subdivided into different classes, each with its own prefix, and the<br />
prefixes of the first ten classes mark singular versus plural forms. Classes 11-13 do<br />
not exist in Northern Sotho. The prefixes also generate a number of concords and<br />
pronouns that are used to complete phrases and sentences. Consider the following<br />
example from Class 1, given in Table 2.<br />
Table 2: Example of a Sentence Consisting of a Noun, Verb, Pronoun and Concords<br />
Monna    yo                 o               a               di               rata<br />
noun     demonstrative      subject         present tense   object concord   verb stem<br />
Cl. 1    (pronoun) Cl. 1    concord Cl. 1   marker          Cl. 8/10<br />
Man      this               (he)            ()              them             loves<br />
This man loves them.<br />
There are a few hundred closed class items such as the subject concords, object<br />
concords, demonstratives (pronouns) and particles. Prime criteria for detecting and<br />
tagging nouns will naturally be based on class prefixes and nominal concords and to a<br />
limited extent on nominal suffixes such as the locative –ng.
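As a rough illustration (not the implementation used in this project), the prefix-based criteria just described can be sketched as a simple candidate detector; the prefix table follows Table 1, and the function names are ours:<br />

```python
# Sketch of a noun-candidate guesser for Northern Sotho, based on the class
# prefixes of Table 1 and the locative suffix -ng. A prefix match only marks
# a *candidate*: it must still be confirmed by concord context, since verbs
# and closed-class items can share these initial strings.
NOUN_CLASS_PREFIXES = {
    "mo": [1, 3, 18], "ba": [2], "bo": [2, 14], "me": [4],
    "le": [5], "ma": [6], "se": [7], "di": [8, 10],
    "go": [15, 17], "fa": [16],
}

def guess_noun_classes(token):
    """Return the noun classes whose prefix matches the token."""
    token = token.lower()
    classes = set()
    for prefix, cls in NOUN_CLASS_PREFIXES.items():
        if token.startswith(prefix) and len(token) > len(prefix):
            classes.update(cls)
    return sorted(classes)

def has_locative_suffix(token):
    """The locative suffix -ng is a weaker, secondary cue."""
    return token.lower().endswith("ng")

print(guess_noun_classes("monna"))   # mo- matches classes 1, 3, 18
print(guess_noun_classes("dilepe"))  # di- matches classes 8, 10
```

Any token passing such a test would still be handed to the concord-based disambiguation routines before a noun tag is committed.<br />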
2.2 Verb System: Productivity in Morphology<br />
In the case of verbs, numerous derivations of a single verb stem exist, consisting of<br />
the root, plus one or more prefix(es) and/or suffix(es), as is clearly indicated in Table<br />
3, which reflects a subsection (five out of eighteen modules, cf. Prinsloo [1994]) of<br />
the suffixes and combinations of suffixes for the verb stem reka ‘buy.’ The complexity<br />
of this layout is evident.<br />
Verbal derivations such as those in the rightmost column of Table 3 can all simply<br />
be tagged as verbs, or, alternatively, first be morphologically analysed (cf. Taljard &<br />
Bosch 2005) and then tagged in terms of their specific verbal suffixes, cf. column 2<br />
versus column 3 in Table 4 with respect to the suffixal cluster 02 ANA in Table 3.<br />
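As a hedged sketch of the second, analysed option (column 3 of Table 4), the four module-02 forms of reka can be segmented by a handful of string rules; the function and its rule order are ours, cover only this module, and are not a full morphological analyser:<br />

```python
# Segment module-02 (reciprocal) derivations of the stem reka into
# root + suffix morphemes, mirroring the four rows of Table 4.
def analyse_rec_form(form, root="rek"):
    if not form.startswith(root):
        return None
    rest = form[len(root):]
    morphs = [(root, "Vroot")]
    if rest.startswith("an"):      # reciprocal suffix
        morphs.append(("an", "Rec"))
        rest = rest[2:]
    if rest.startswith("w"):       # passive suffix
        morphs.append(("w", "Pas"))
        rest = rest[1:]
    if rest == "a":                # default final vowel (untagged in Table 4)
        morphs.append(("a", ""))
    elif rest == "e":              # perfect ending
        morphs.append(("e", "Per"))
    return morphs

for f in ["rekana", "rekane", "rekanwa", "rekanwe"]:
    print(f, analyse_rec_form(f))
```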
Table 3: Selection of Derivations of the Verb reka<br />
(Abbreviations: Per = Perfect tense; Pas = Passive)<br />
MODULE NUMBER AND MARKER | MODULE COMPOSITION | STEMS AND DERIVATIONS<br />
01 | root + standard modifications | VR reka, VRPer rekile, VRPas rekwa, VRPerPas rekilwe<br />
02 ANA | root + reciprocal + standard modifications | VRRec rekana, VRRecPer rekane, VRRecPas rekanwa, VRRecPerPas rekanwe<br />
03 ANTŠHA | root + reciprocal + causative + standard modifications | VRRecCau rekantšha, VRRecCauPer rekantšhitše, VRRecCauPas rekantšhwa, VRRecCauPerPas rekantšhitšwe<br />
04 ANYA | root + alternative causative + standard modifications | VRAlt-Cau rekanya, VRAlt-CauPer rekantše, VRAlt-CauPas rekanywa, VRAlt-CauPerPas rekantšwe<br />
05 EGA | root + neutro-passive + standard modifications | VRNeu-Pas rekega, VRNeu-PasPer rekegile
Table 4: Alternatives in Tagging the Verb reka<br />
02 ANA rekana ‘V’ rek ‘Vroot’ an ‘Rec’ a<br />
rekane ‘V’ rek ‘Vroot’ an ‘Rec’ e ‘Per’<br />
rekanwa ‘V’ rek ‘Vroot’ an ‘Rec’ w ‘Pas’ a<br />
rekanwe ‘V’ rek ‘Vroot’ an ‘Rec’ w ‘Pas’ e ‘Per’<br />
2.3 Quantitative Aspects of the Lexicon<br />
There are a few marked tendencies in the quantitative distribution of lexical items<br />
in Northern Sotho, especially with respect to the relationship between frequency of<br />
use and ambiguity.<br />
In our 43,000 word corpus sample, we counted types and tokens, distinguishing<br />
nouns, verbs and closed class items. In Northern Sotho, only nouns and verbs allow<br />
for productive word formation (i.e., are open word classes), whereas function words,<br />
adverbs and adjectives are listed (i.e., belong to closed classes). Note that we did<br />
not consider numerals at all; the figures given are to be taken as tendencies. We<br />
separately counted forms that can be unambiguously identified as nouns, verbs or<br />
elements of one of the closed classes, as opposed to ambiguous forms where more<br />
than one word class can be assigned, depending on the context.<br />
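The counting procedure can be pictured as follows; this is an illustrative sketch with toy data, not our counting script, and it assumes a lexicon that maps each word form to its set of possible tags (as in Table 5):<br />

```python
from collections import Counter

def ambiguity_profile(tokens, lexicon):
    """Split type and token counts into unambiguous (at most one tag)
    versus ambiguous (more than one tag) word forms."""
    counts = Counter(t.lower() for t in tokens)
    profile = {"unambiguous": [0, 0], "ambiguous": [0, 0]}  # [types, tokens]
    for form, freq in counts.items():
        tags = lexicon.get(form, set())  # unknown forms count as unambiguous here
        key = "ambiguous" if len(tags) > 1 else "unambiguous"
        profile[key][0] += 1
        profile[key][1] += freq
    return profile

toy_lexicon = {
    "monna": {"N1"},                      # unambiguously a class-1 noun
    "a": {"CS1", "CS6", "PRES", "QUE"},   # highly ambiguous closed-class item
}
print(ambiguity_profile(["a", "monna", "a", "a"], toy_lexicon))
```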
All three have many more unambiguous types than ambiguous ones. As is likely in<br />
most languages, however, high frequency items are also highly ambiguous (cf. Table 5<br />
below). Nevertheless, while only slightly more than half of the potential verb occurrences<br />
in the sample are unambiguous (ca. 5,000 tokens), the percentage of unambiguous<br />
occurrences of noun candidates is as high as 90% (5800 out of 6300 tokens). Ambiguity<br />
with nouns is restricted to rather infrequent items. For closed class items, however,<br />
the inverse situation is observed: only a little more than 20% of the occurrences of<br />
closed class items in our sample are unambiguous, and a small set of closed class item<br />
types (88 types), of an average frequency of two hundred or more, constitutes about<br />
40% of the total amount of word forms in the sample. We expect that this distribution<br />
will be more or less generalisable to larger data sets of Northern Sotho, and it has<br />
a bearing on our approach to the bootstrapping of linguistic resources for this<br />
language. Table 5 lists the most frequent (and at the same time most ambiguous)<br />
items from the 43,000 word corpus sample with their tags (according to the tagset<br />
described in section 3.2) and their absolute frequency in the sample.<br />
Table 5: Most Frequent and Most Ambiguous Items in the Sample<br />
Item Possible Tags Freq.<br />
a CDEM6:CO6:CS1:CS6:CPOSS1:CPOSS6:QUE:PRES 2261<br />
go CO2psg:CO15:CO17:CS15:CS17:CSindef:PALOC 2075<br />
ka CS1psg:PAINS:PATEMP:PALOC:POSSPRO1psg 1807<br />
le CDEM5:CO2ppl:CO5:CS2ppl:CS5:PACON:VCOP 1615<br />
ba AUX:CDEM2:CO2:CS2:CPOSS2:VCOP 1429<br />
o CO3:CS1:CS2psg:CS3 1192<br />
ke AUX:CS1psg:PAAGEN:PACOP 1107<br />
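The type/token distribution described above can be computed mechanically once word forms are mapped to their possible tag sets. A minimal sketch in Python: the entries for a and go follow Table 5, while monna and ngwala stand in as illustrative unambiguous noun and verb entries.

```python
# Illustrative tagger lexicon: word form -> set of possible tags.
# 'a' and 'go' follow Table 5; 'monna' and 'ngwala' are illustrative
# unambiguous entries, not taken from the project's actual lexicon.
LEXICON = {
    "a": {"CDEM6", "CO6", "CS1", "CS6", "CPOSS1", "CPOSS6", "QUE", "PRES"},
    "go": {"CO2psg", "CO15", "CO17", "CS15", "CS17", "CSindef", "PALOC"},
    "monna": {"N1"},
    "ngwala": {"V"},
}

def ambiguity_stats(tokens):
    """Count unambiguous, ambiguous and unknown tokens in a sample."""
    unamb = amb = unknown = 0
    for tok in tokens:
        tags = LEXICON.get(tok)
        if tags is None:
            unknown += 1
        elif len(tags) == 1:
            unamb += 1
        else:
            amb += 1
    return unamb, amb, unknown

print(ambiguity_stats(["monna", "a", "ngwala", "go", "monna"]))  # (3, 2, 0)
```

Run over a full corpus sample, such counts yield exactly the token-level figures reported above (ca. 90% unambiguous noun occurrences versus ca. 20% for closed class items).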
3. Elements of a Scenario for Resource Building for Northern Sotho<br />
3.1 Starting Point and Objectives<br />
For computational lexicography, a sufficiently large corpus is needed, annotated<br />
at least at the level of part-of-speech. For the development of automatic tools for<br />
syntactic analysis, a more detailed annotation is required. In this paper we concentrate<br />
on a step prior to both of these resources, that is, on the creation of smaller, but<br />
generic resources to enable part of speech tagging.<br />
Tagset design for Northern Sotho is based on distinctions in traditional Northern<br />
Sotho grammar; it is carried out with a view to the kinds of information that would<br />
be extracted from a corpus once it has been tagged. As statistical tagging can only<br />
be attempted when a sufficiently large training corpus is available, an adaptation<br />
of the tagset is likely to be needed when the automatic tagging is tested, since<br />
some distinctions from the grammar may not be identifiable in texts without deeper<br />
knowledge.<br />
In working towards an annotated training corpus, different procedures are possible<br />
in principle: one could manually annotate a significant amount of data, or one could<br />
opt for a mixed approach, where certain parts of the corpus would receive manual<br />
annotation, and others would be annotated in a semi-automatic fashion, where<br />
the results of an automatic pre-classification are manually corrected. Due to the<br />
morphological and distributional properties of Northern Sotho discussed in section 2,<br />
the following breakdown was chosen:<br />
• Closed class items, as well as other words of very high frequency, were<br />
introduced manually to the tagger lexicon, with a disjunctive tag annotation that<br />
indicates for each item all its possible tags (Table 5);<br />
• Nouns and verbs can be guessed in the text on the basis of their<br />
morphological properties; thus, separate rule-based guessers were developed, and<br />
their results were manually corrected in the training corpus; and,<br />
• The disambiguation of closed-class items in context is, to a considerable<br />
extent, possible on the basis of rules similar to ‘local grammars.’ A certain number<br />
of ambiguities in the training corpus have to be dealt with manually.<br />
In the remainder of this section, we report on tagset design (section 3.2); on an<br />
architecture for the creation of a tagger lexicon and a training corpus (section 3.3);<br />
and, on verb and noun guessing and the disambiguation of closed class items (sections<br />
3.4 to 3.6).<br />
3.2 Tagset Design<br />
The tagset designed for Northern Sotho is organised as a logical tagset (similar to<br />
a type hierarchy); this opens up the possibility to formulate underspecified queries to<br />
the corpus.<br />
The tagset mirrors some of the linguistic specificities of Northern Sotho, but is also<br />
conditioned by considerations of automatic processability with a statistical tagger.<br />
The tagset reflects properties of the nominal system of classes and concords: as they<br />
are (mostly) lexically distinct, we introduced class-based subtypes for nouns, pronouns<br />
and concords, as well as for adjectives: N, ADJ, C (for concord) and PRO (for pronoun)<br />
have such subtypes. As concords and pronouns have functionally and/or semantically<br />
defined subtypes, we apply the class-based subdivision in fact to the types listed in<br />
Table 6:<br />
Table 6: Nominal Categories that have Class-related Subtypes<br />
N Nouns CPOSS possessive concords<br />
ADJ adjectives EMPRO emphatic pronouns<br />
CS subject concords POSSPRO possessive pronouns<br />
CO object concords QUANTPRO quantifying pronouns<br />
CDEM demonstrative concords<br />
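Because each such tag decomposes into a category label plus a class number (e.g. CS1 = subject concord of class 1), underspecified corpus queries can be answered by pattern matching over tag strings. The following is a sketch of this idea only, not the actual CWB/CQP mechanism; the tag inventory is a small illustrative subset.

```python
import re

# Small illustrative subset of the logical tagset.
TAGS = ["N1", "N2", "CS1", "CS2", "CO6", "CPOSS6", "V", "PALOC"]

def match_tags(query, tags=TAGS):
    """Return all tags matched by an underspecified (regex) tag query."""
    pattern = re.compile(query + r"\Z")
    return [t for t in tags if pattern.match(t)]

print(match_tags(r"CS\d+"))  # all subject concords, any class
print(match_tags(r"C.*"))    # all concords of any kind
```

A query left underspecified for the class number thus retrieves, for instance, every subject concord regardless of noun class, which is precisely the kind of lexicographically useful query the logical tagset is meant to support.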
Given the complexity of the system of verbal derivation (cf. Table 3 above), an<br />
attempt to subclassify verbal forms accordingly would have led to a number of<br />
tags (i.e., of distinctions) that would not be manageable with a statistical tagger.<br />
Furthermore, as, according to Northern Sotho orthography conventions, concords,<br />
adjectives and pronouns are written separately from the nouns and verbs to which<br />
they are grammatically related (disjunctive writing), these elements receive their<br />
own tags. Since verbal derivation is written conjunctively (like word formation in<br />
European languages), a single ‘verb’ tag (V) proved sufficient (cf. Table 4). As with<br />
parts of tense morphology and with word formation in European languages, an analysis<br />
of Northern Sotho verbal derivations is left to a separate tool (e.g. to a morphological<br />
analyser; see the discussion in Taljard & Bosch 2005).<br />
Other tags cover invariable lexical items:<br />
• adverbs (ADV) and numerals (NUM);<br />
• tense/mood/aspect markers for present tense (PRES), future (FUT), and<br />
progressive (PROG);<br />
• auxiliaries (AUX) and copulative verbs (VCOP);<br />
• ideophones (IDEO); and,<br />
• different (semantically defined) kinds of particles that mark a hortative<br />
(HORT), questions (QUE), as well as agentive (PAAGEN), connective (PACON),<br />
copulative (PACOP), instrumental (PAINS), locative (PALOC) and temporal (PATEMP)<br />
constructs.<br />
In principle, our approach to the design of tagsets for nouns and verbs is similar to<br />
that of Van Rooy and Pretorius (2003) for Setswana, but it is much less complex.<br />
In the case of verbs we agree on the allocation of a single tag for verb stem plus<br />
suffix(es) as well as on separate tags for verbal prefixes:<br />
“[…] verbs are preceded by a number of prefixes, which are regarded as<br />
separate tokens for the purposes of tagging. The verb stem, containing the<br />
root and a number of suffixes (as well as the reflexive prefix) receives a single<br />
tag.“ (Van Rooy & Pretorius 2003:211)<br />
Likewise, for nouns, we are in agreement that at this stage in the development of<br />
tagsets, certain subclassifications such as the separate identification of deverbatives<br />
should be excluded (cf. Van Rooy & Pretorius 2003:210). Our approach differs from Van<br />
Rooy and Pretorius, among others, in that a much smaller tagset is compiled for both<br />
verbs and nouns. In the case of verbs, we do not consider modal categories, and in the<br />
case of nouns, we honour subclasses but not divisions in terms of relational nouns and<br />
proper names. Consider the following examples illustrating basic differences in terms<br />
of the approaches as well as of the complexity of the tags:<br />
(1) Nouns<br />
a) Mosadi ‘woman’<br />
Tswana (Van Rooy & Pretorius 2003:217):<br />
Tag category: Common noun, singular; Label: NC1; Intermediate tagset:<br />
N101001<br />
Northern Sotho: Noun; Tag: N1<br />
b) Bomalome ‘uncles’<br />
Tswana (Van Rooy & Pretorius 2003:217):<br />
Tag category: Relational noun, plural; Label: NR2; Intermediate tagset:<br />
N302001<br />
Northern Sotho: Noun; Tag: N2<br />
(2) Verbs<br />
Tswana: kwala/kwalwa/kwadile; Northern Sotho: ngwala/ngwalwa/ngwadile<br />
‘write/be written/wrote’<br />
Tswana (Van Rooy & Pretorius 2003:219):<br />
Tag category: Lexical verb, indicative, present, active; Label: Vl0PA;<br />
Intermediate tagset: V0001111102000 kwala<br />
Tag category: Lexical verb, indicative, present, passive; Label: Vl0PP;<br />
Intermediate tagset: V0001112102000 kwalwa<br />
Tag category: Lexical verb, indicative, past, active; Label: Vl0DA; Intermediate<br />
tagset: V0001141102000 kwadile<br />
Northern Sotho: verb; Tag: V<br />
3.3 An Architecture for Parallel Resource Building<br />
Since we opted, as far as POS-tagging is concerned, for an attempt to apply Schmid’s<br />
(1994) statistical TreeTagger to Northern Sotho, both a tagger lexicon and a reference<br />
corpus for training were needed. Schmid’s TreeTagger was chosen, because it needs<br />
much less manually annotated training material than other statistical taggers. For<br />
European languages (German, French, English, Dutch, and Italian) training corpora of<br />
40,000 to 100,000 words have proven sufficient to obtain the 96-97% tagging rate that<br />
is standard in current applications. Tagging quality of the TreeTagger also depends<br />
upon the number of different tags and on the size of the tagger lexicon. It thus seems<br />
obvious to bootstrap lexicon and corpus in parallel.<br />
Given the grammatical and distributional properties of Northern Sotho, we opted<br />
for the overall approach as sketched above in section 3.1: a list of closed class items<br />
and their possible tags is created manually, whereas nouns and verbs are guessed on<br />
the basis of morphological rules, and closed class item disambiguation is performed<br />
semi-automatically, based on rules, and possibly also on frequency-based heuristics.<br />
Figure 1 shows the strands of corpus annotation, where the (upper) strand leading<br />
to the training corpus is meant to be carried out once, whereas the general strand<br />
(below) can be repeated for each newly acquired corpus.<br />
Figure 1: Strands of Corpus Annotation<br />
The workflow involves a number of modular tools (developed in the course of the<br />
preparation of the training corpus) that can be reused with any additional Northern<br />
Sotho corpus. These include a sentence tokenizer; the tagger lexicon and a tool to<br />
project its contents (i.e., potentially ambiguous annotations for individual word<br />
forms) against the corpus words; guessers for nouns and verbs; and, disambiguation<br />
rules for closed class item disambiguation in context.<br />
The procedure sketched here, and depicted in Figure 1, is in fact a combination of<br />
rule-based symbolic tagging and statistical tagging, whereby a number of ambiguities<br />
are solved by the rule-based component before the statistical tagger is used. This<br />
setup is similar to Klatt’s (2005) work on a corpus processing suite for German.<br />
3.4 Verb Guesser<br />
In Table 3 above, a few selected examples of derived verb forms of Northern Sotho<br />
are given. Except for very frequent forms of a few verbs, most verb forms are marked<br />
by unambiguous derivational and inflectional affixes. For example, a word form found<br />
in a corpus that ends in -antšwe will almost inevitably be a verb form (cf. rekantšwe<br />
in Table 3).<br />
Consequently, many verb forms can be identified by simple pattern matching.<br />
Based on the grammatical system of verb affixation sketched in Prinsloo (1994),<br />
we developed a verb form guesser. It compares each candidate form with a list of<br />
unambiguous verbal affixes to distinguish verb forms from forms of other categories.<br />
Given the productivity of verbal derivation in Northern Sotho (cf. section 2 above),<br />
this guesser will be needed on any new corpus of Northern Sotho to be annotated.<br />
If required, the grammatical information encoded in the verbal affixes can be made<br />
explicit in the annotation (cf. Table 4 above).<br />
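As a sketch, suffix-based verb guessing amounts to matching word endings against a list of affix patterns. The suffix list below is a tiny invented subset for illustration, not the inventory actually derived from Prinsloo (1994).

```python
# Small illustrative set of verbal endings (invented subset; the real
# guesser uses a much larger list of unambiguous affix combinations).
VERB_SUFFIXES = ("antšwe", "ile", "išwa")

def guess_verb(word):
    """Return True if the word form ends in one of the verbal suffixes."""
    return word.endswith(VERB_SUFFIXES)

print(guess_verb("rekantšwe"))  # True (cf. rekantšwe in Table 3)
print(guess_verb("monna"))      # False
```

Because Northern Sotho verbal derivation is highly productive, even such a simple matcher identifies a substantial share of the verb tokens in a new corpus; the remaining candidates are then checked manually.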
3.5 Noun Guesser<br />
Suffixal derivation appears in nouns only to denote locatives, augmentatives/<br />
feminines and diminutives. Given the low frequency of these derivations, with the<br />
possible exception of the locative, a noun detection strategy based on pattern<br />
matching alone, in analogy to that of the verb guesser, will have low recall, even<br />
though its precision will be very high.<br />
But nouns are characterised by their class prefixes (cf. Table 1 above); prefixes of<br />
classes 1 to 10 indicate singular (classes 1,3,5,7 and 9) versus plural (classes 2,4,6,8<br />
and 10). The prefixes are not, however, unambiguous with respect to classes (mo-:<br />
class 1,3 and, less relevant, class 18; di-: class 8 and 10; etc.). Not all words starting<br />
with a syllable that can be a noun prefix are indeed nouns (cf. e.g. the verb form<br />
letetše ‘wait(ed) for’ where the first syllable le- is not the prefix of class 5).<br />
What is indeed a highly unambiguous indicator of a noun form is its syntagmatic<br />
environment, as well as the alternation pattern between singular and plural. Very<br />
often, nouns are accompanied by concords or adjectives, as illustrated by the example<br />
in Table 2, where the noun monna is followed by a demonstrative and a subject<br />
concord, both of which show agreement with the noun with respect to the class.<br />
Adjectives also show this agreement.<br />
We exploit this regularity in our noun form guesser as follows: to identify items<br />
of a given pair of singular/plural classes, we apply word sequence patterns to the<br />
corpus data, which rely on the presence of concords, pronouns, adjectives, and so<br />
forth in the neighbourhood of the noun candidates. We check for the existence of<br />
such patterns in parallel for singular forms and for their potential plural counterparts.<br />
The search is approximative, in so far as it checks the presence of agreement-bearing<br />
elements within a window of up to three words left or right of the noun candidate.<br />
The rules can, in principle, be triggered either by singular or by plural items (with the<br />
exception of class 9 versus class 10, where it is preferable to start from the plural).<br />
Table 7 contains an example of a noun guessing query (simplified, as many potential<br />
agreement-bearing indicator items are left out), formulated in the notation of the<br />
CQP corpus query language, which underlies the CWB Corpus WorkBench (Christ et<br />
al. 1999), used in our experiments as a corpus representation and infrastructure. We<br />
indicate (parts of) the queries that extract nouns of classes 7 (and 8).<br />
Table 7: Sample Query for the Identification of Noun Candidates of Classes 7 + 8<br />
(<br />
[word = ‘sego|selo|sebatakgomo|...|setšhaba|seatla|sello’]<br />
[]{0,2}<br />
[word = ‘sa|se|segolo|sekhwi|sengwe|seo|sona …’]<br />
) | (<br />
[word = ‘sa|se|segolo|...’]<br />
[]{0,2}<br />
[word = ‘sego|selo|sebatakgomo|...’]<br />
);<br />
First part of query: candidate se- words as a disjunction, followed in distance 0 to 2<br />
by class-7 indicators noted as a disjunction; or (second part of query): a choice of<br />
indicators followed in distance 0 to 2 by candidate words. An analogous procedure<br />
applies to noun candidates created by replacement of se- with di- (plus class 8<br />
concords).<br />
When applied to the 43,000 word corpus sample, the query yields,<br />
among others, the results displayed in Table 8.
Table 8: Sample Results of Noun Guessing for Classes 7 and 8<br />
Class 7 cands. Class 8 cands. N? Equivalent(s)<br />
selo dilo + thing, things<br />
setšhaba ditšhaba + nation, nations<br />
sello dillo + (out)cry, outcries<br />
sepetše *dipetše — walked<br />
sekelela dikelela — recommend, disappear<br />
The checking tool is robust towards nonexistent forms (cf. *dipetše) and towards<br />
forms that are not nominal, due to the context constraint on agreement-bearing<br />
items (cf. sekelela versus dikelela).<br />
A first qualitative evaluation of the noun guessing routines on all candidates from<br />
the 43,000 word corpus sample seems to suggest that the tool only fails on lexicalized<br />
irregular forms (e.g. mong - beng, ‘owner(s)’, instead of the hypothetical mong -<br />
*bang), and on nouns that, mostly due to semantic reasons, do not have both a singular<br />
and a plural form (such as Sepedi ‘Pedi language and culture’, or leboa ‘North’). As<br />
with the verb guesser, the noun guesser can be and has to be applied (for quantitative<br />
reasons) to any new corpus to be annotated.<br />
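The heart of the noun guessing strategy, checking both the singular candidate and its hypothetical plural for nearby agreement-bearing elements, can be sketched as follows. All indicator lists here are illustrative stand-ins; the real rules are the CQP queries of the kind shown in Table 7.

```python
# Illustrative agreement indicators for classes 7 and 8 (invented subsets).
CLASS7_IND = {"se", "sa", "segolo", "seo"}
CLASS8_IND = {"di", "tša", "dikgolo", "tšeo"}

def has_indicator(tokens, idx, indicators, window=3):
    """Look for an agreement indicator within `window` words of position idx."""
    lo, hi = max(0, idx - window), min(len(tokens), idx + window + 1)
    return any(tokens[i] in indicators for i in range(lo, hi) if i != idx)

def guess_class7_noun(word, sg_tokens, pl_tokens):
    """Accept a se- word as a class 7 noun candidate if both it and its
    hypothetical di- plural occur with class-appropriate indicators."""
    if not word.startswith("se"):
        return False
    plural = "di" + word[2:]
    ok_sg = (word in sg_tokens and
             has_indicator(sg_tokens, sg_tokens.index(word), CLASS7_IND))
    ok_pl = (plural in pl_tokens and
             has_indicator(pl_tokens, pl_tokens.index(plural), CLASS8_IND))
    return ok_sg and ok_pl

sg = ["selo", "se", "sengwe"]   # singular context with class 7 indicators
pl = ["dilo", "tša", "rena"]    # plural context with class 8 indicators
print(guess_class7_noun("selo", sg, pl))     # True
print(guess_class7_noun("sepetše", sg, pl))  # False: plural *dipetše unattested
```

Requiring both members of the singular/plural pair is what filters out verb forms such as sepetše, whose hypothetical plural never occurs with class 8 agreement.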
3.6 Rules for the Disambiguation of Closed Class Items<br />
Given the high degree of ambiguity in closed class items (see section 2.3), there is<br />
a major need for disambiguation strategies for these items. Even though a statistical<br />
tagger is designed for this type of disambiguation, a rule-based preprocessing, leading<br />
at least to a partial reduction of ambiguity, seems necessary.<br />
We use context-based disambiguation rules, in the spirit of Gross and Silberztein’s<br />
local grammars (Silberztein 1993) and of rule-based tagging. As with the noun guessing<br />
queries, disambiguation rules are implemented as queries in the format of the CQP<br />
language. Some extraction rules exclusively rely on lexical contexts (cf. the topmost<br />
part of Table 9), while others involve lexemes and word class tagged items (middle<br />
row), or a combination of lexical, categorical and morphological constraints (including,<br />
for example, the presence of certain affixes [cf. lower part of Table 9]). The examples<br />
in Table 9 all relate to the disambiguation of the form a, the most frequent and most<br />
ambiguous item in our sample (cf. Table 5).<br />
Table 9: Examples of Disambiguation Queries for the Form a<br />
Query 1: ‘o|O’ ‘be’ ‘a’ (sequence of o be a, ‘he/she was’)<br />
Hypothesis: a: CS1. Coverage: 109 instances. Precision: 109 (100%).<br />
Query 2: [pos = ‘N.{1,2}’] ‘a’ [pos = ‘N.{1,2}’]; (a between two nouns: ‘of’, possessive)<br />
Hypothesis: a: CPOSS6. Coverage: 42 instances. Precision: 42 (100%).<br />
Query 3: ‘a’ [pos = ‘V’ & word = ‘.*go’]; (a preceding a verb form ending in -go, relative marker)<br />
Hypothesis: a: CS1 or CS6 or CO6. Coverage: 75 instances;<br />
63 (80.8%) CS1; 9 (15.4%) CS6; 3 (3.8%) CO6.<br />
The examples show that some rules do not fully disambiguate, but leave a<br />
set of options. Since we use the rules as a preparatory step to statistical tagging<br />
(and to manual disambiguation in the preparation of the training corpus), partial<br />
disambiguation is still useful to reduce the effort needed at a subsequent stage (cf.<br />
the third example of Table 9, where the choice of eight tags for a is reduced to a<br />
four-way ambiguity).<br />
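In such a rule-based pre-processing step, each matching rule can be viewed as intersecting a token's tag set with the rule's hypothesis, so that even a rule leaving several options still shrinks the search space. Below is a sketch with one invented simplification of the possessive rule for a; the context test and rule format are illustrative, not the CQP implementation.

```python
def between_nouns(prev_tag, next_tag):
    # Context test: the neighbouring tokens are both tagged as nouns.
    return (prev_tag or "").startswith("N") and (next_tag or "").startswith("N")

# Each rule pairs a context test with a hypothesis (a set of tags).
RULES = [(between_nouns, {"CPOSS6"})]

def disambiguate(tags, prev_tag, next_tag):
    """Intersect the ambiguity set with the first matching rule's hypothesis."""
    for test, hypothesis in RULES:
        if test(prev_tag, next_tag):
            reduced = tags & hypothesis
            if reduced:
                return reduced
    return tags

a_tags = {"CDEM6", "CO6", "CS1", "CS6", "CPOSS1", "CPOSS6", "QUE", "PRES"}
print(disambiguate(a_tags, "N1", "N9"))  # reduced to {'CPOSS6'}
print(disambiguate(a_tags, None, "V"))   # unchanged: no rule matched
```

Because the operation is an intersection, applying rules can never add spurious tags; ambiguity only ever decreases, which is what makes partial disambiguation safe as a preparatory step.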
4. Methodological Considerations<br />
4.1 Sequencing of Processing Steps<br />
We use semi-automatic procedures to create tagging resources for Northern Sotho.<br />
As raw corpora are the only available input, a first step in the project is to define a<br />
tagset that underlies all subsequent work (cf. section 3.2).<br />
The creation of the tagger lexicon and the annotation of the training corpus<br />
mostly run in parallel. We classify word forms from the corpus, store their (possibly<br />
disjunctive) description in the lexicon, and annotate them at the same time in the<br />
upcoming training corpus. (We annotate each word form in the corpus with the<br />
respective entry from the tagger lexicon.) While the disjunctive annotations remain<br />
in the tagger lexicon, context-based rules are used to partly disambiguate the corpus<br />
occurrences (cf. section 3.4 and 3.5).<br />
To get the process started efficiently, we first manually annotated the thousand<br />
most frequent word forms in the corpus, aiming at a complete coverage of their<br />
potential word class features. This information can be provided easily on the basis of<br />
Northern Sotho grammar, as many of them are function words.<br />
Subsequently, we employed semi-automatic procedures (automatic pre-classification<br />
of data, followed by manual verification) that focus on high precision, allowing, at<br />
the same time, for efficient data production: we capitalised on unambiguous verb<br />
and noun forms, covering thereby more than one fourth of all corpus occurrences<br />
(tokens), and obtaining in parallel a stock of approximately 2800 additional entries of<br />
the tagger lexicon (word form types).<br />
Once nouns and verbs were annotated, disambiguation rules for closed class items<br />
were formulated (based on regularities of the Northern Sotho grammar) and applied;<br />
many of these contextual constraints involve verbs and nouns. The rules are ordered<br />
by specificity: as in many other NLP applications, the most specific cases are handled<br />
first; at the end of the cascade, more general rules are applied, which may also be<br />
less ‘safe’ and less effective, that is, have less precision and/or recall.<br />
In conclusion, the strategy may be characterised as ‘easy-first’ and ‘safety-first’: for<br />
example, as disambiguation rules cannot overwrite (previously verified) lexical data,<br />
the overall process is one of monotonic accumulation of information. A bootstrapping<br />
procedure proved most efficient, where the validated results of each of the above-<br />
mentioned steps are persistently represented in both corpus and lexicon, such that<br />
they are available as input for the subsequent steps.<br />
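The monotonicity constraint (validated lexical data is never overwritten by later automatic steps) can be expressed as a merge that only ever adds entries. A sketch with hypothetical entry names:

```python
def merge_validated(lexicon, new_entries):
    """Add validated entries to the tagger lexicon without overwriting
    previously verified ones; return the conflicting entries skipped."""
    skipped = {}
    for word, tags in new_entries.items():
        if word in lexicon:
            skipped[word] = set(tags)  # verified data wins; never overwrite
        else:
            lexicon[word] = set(tags)
    return skipped

lex = {"monna": {"N1"}}
conflicts = merge_validated(lex, {"monna": {"V"}, "ngwala": {"V"}})
print(lex)        # existing entry kept, new entry added
print(conflicts)  # the rejected overwrite is reported for manual inspection
```

Reporting rather than silently discarding conflicts keeps the bootstrapping loop auditable: every rejected overwrite can be inspected before the next iteration.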
4.2 Reusability of the Created Resources<br />
As mentioned in section 3, our verb and noun guessers can be applied to other<br />
Northern Sotho corpora, as can the tool projecting lexical descriptions onto corpus<br />
word forms. Given the productivity of verbal derivation and the number of nouns to<br />
be expected in larger corpora, we assume that both tools will prove useful in the<br />
preparation of an annotated version of the PSC. Moreover, even though statistical<br />
taggers are designed to both disambiguate in context and guess word class values<br />
for unknown words (i.e., those not contained in the system’s lexicon), reducing the<br />
number of the latter may improve overall output quality.<br />
Obviously, the parallel growth of lexicon and corpus will continue when larger<br />
corpora are treated. At a later stage, we envisage the parallel enhancement of<br />
both resources, not only in coverage, but also with respect to the degree of detail<br />
covered: some morphological details of nouns (locatives, feminines/augmentatives,<br />
and diminutives) and verbs (cf. Tables 3 and 4) can be identified, but are not yet<br />
accounted for in our resources. Thus, the tagger lexicon may become part of an NLP-<br />
oriented dictionary that would explicitly store such properties. As far as the corpus<br />
is concerned, a multilevel annotation would be more appropriate than the current<br />
monodimensional view: without changes to the current annotation, extra layers<br />
may be added for the above-mentioned features of nouns and verbs, but also for an<br />
appropriate treatment of fused forms (cf. dirang, ‘do what?’ from dira + eng) and of<br />
multiword items, for example, idiomatic expressions (cf. bona kgwedi ‘see the moon’<br />
i.e., ‘menstruate’). As Northern Sotho orthography is not yet fully standardised, a<br />
distinction between standard orthography and observed (possibly deviant) orthography<br />
may be introduced through additional layers.<br />
5. Conclusions and Future Work<br />
We reported on an ongoing research and development project for the creation<br />
of tagging resources for Northern Sotho. In this context, modular components of a<br />
two-layered architecture were created, which are needed in the first place for the<br />
preparation of a training corpus for statistical tagging, but which will prove equally<br />
useful, we hope, for the later development of larger corpora.<br />
We bootstrap the training corpus and the tagger lexicon in parallel, using semi-<br />
automatic procedures consisting of a rule-based automatic pre-classification and<br />
subsequent manual validation: the procedures concern the identification of verbal<br />
and nominal forms and the disambiguation of closed class items. These procedures are<br />
applied one after the other by order of their expected precision (‘easy-first’, ‘safety-<br />
first’), leading thereby to a partly disambiguated corpus. For the creation of the<br />
training corpus, the remaining ambiguities are removed manually, whereas this task is<br />
supposed to be left to the statistical tagger in the later creation of larger corpora.<br />
Linguistic knowledge about the language is extensively used in the definition of the<br />
automatic procedures: morphological and morpho-syntactic regularities in the local<br />
context provide the starting point for their formulation.<br />
Future work on the tools described in this paper will be devoted to the development<br />
of further disambiguation rules, to the finalisation of a fully disambiguated training<br />
corpus, and to tagger training and tests. This will allow us (i) to assess tagging quality<br />
as obtained by the use of the statistical tagger alone versus in a setup with our rule-based<br />
pre-processing, (ii) to stabilise the proposed tagset on the basis of experience with<br />
statistical tagging, and (iii) to undertake tagging of the PSC, which could then serve<br />
for lexicographic exploration.<br />
A well-designed POS-tagger for Northern Sotho would provide a flying start to the<br />
development of similar taggers for the other Sotho languages, the Nguni languages,<br />
and Bantu languages in general. It is expected that only minor adjustments will be<br />
required to adapt a POS-tagger for Northern Sotho to the other two Sotho languages<br />
(Tswana and Southern Sotho) because these languages are closely related. Provided<br />
that some morphological parsing becomes available for the Nguni languages (i.e., Zulu,<br />
Xhosa, Swazi and Ndebele), the tagger will be equally usable, since these languages do not differ<br />
structurally from the Sotho languages. It could finally be extended to other Bantu<br />
languages, since Bantu languages in general have a common structure.<br />
6. Acknowledgements<br />
This work was carried out as a joint project between the Department of<br />
African languages of the University of Pretoria and the Institut für maschinelle<br />
Sprachverarbeitung of Universität Stuttgart. We would like to thank Elsabé Taljard<br />
(Pretoria) who contributed to the design of the tagset and who cross-checked a large<br />
proportion of the output of our tools. Furthermore, we would like to thank Gertrud<br />
Faaß (Stuttgart) for her invaluable help with the implementation of the noun and verb<br />
guessers and of the tagging support tools.<br />
References<br />
Christ, O., Schulze, B.M. & König, E. (1999). Corpus Query Processor (CQP). User’s<br />
Manual. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, Stuttgart,<br />
Germany.<br />
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/.<br />
De Schryver, G.-M. & Prinsloo, D.J. (2000). “The Compilation of Electronic Corpora,<br />
with Special Reference to the African Languages.” Southern African Linguistics and<br />
Applied Language Studies 18(1-4), 89–106.<br />
De Schryver, G.-M. & Prinsloo, D.J. (2000a). “Electronic Corpora as a Basis for the<br />
Compilation of African-language Dictionaries, Part 1. The Macrostructure.” South<br />
African Journal of African Languages 20(4), 291–309.<br />
De Schryver, G.-M. & Prinsloo, D.J. (2000b). “Electronic Corpora as a Basis for the<br />
Compilation of African-language Dictionaries, Part 2: The Microstructure.” South<br />
African Journal of African Languages 20(4), 310–30.<br />
Guthrie, M. (1971). Comparative Bantu: an Introduction to the Comparative Linguistics<br />
and Prehistory of the Bantu Languages. Vol. 2: The Comparative Linguistics of the<br />
Bantu Languages, London: Gregg Press.<br />
Klatt, S. (2005). Textanalyseverfahren für die Korpusannotation und<br />
Informationsextraktion. Aachen: Shaker.<br />
Lombard, D.P., Van Wyk, E.B. & Mokgokong, P.C. (1985). Introduction to the Grammar<br />
of Northern Sotho. Pretoria: J.L. van Schaik.<br />
Louwrens, L.J. (1991). Aspects of Northern Sotho Grammar. Pretoria: Via Afrika<br />
Limited.<br />
Matsepe, O.K. (1974). Tša ka mafuri. Pretoria: Van Schaik.
Poulos, G. & Louwrens, L.J. (1994). A Linguistic Analysis of Northern Sotho. Pretoria:<br />
Via Afrika Limited.<br />
Prinsloo, D.J. (1991). “Towards Computer-assisted Word Frequency Studies in Northern<br />
Sotho.” SA Journal of African Languages 11(2).<br />
Prinsloo, D.J. (1994). “Lemmatization of Verbs in Northern Sotho.” SA Journal of<br />
African Languages 14(2), 93-102.<br />
Prinsloo, D.J. & de Schryver, G.-M. (2001). “Monitoring the Stability of a Growing<br />
Organic Corpus, with Special Reference to Sepedi and Xitsonga.” Dictionaries: Journal<br />
of The Dictionary Society of North America 22, 85–129.<br />
Schmid, H. (1994). “Probabilistic Part-of-Speech Tagging Using Decision Trees.”<br />
Proceedings of the International Conference on New Methods in Language Processing.<br />
Manchester, UK, 44-49.<br />
Silberztein, M. (1993). Dictionnaires électroniques et analyse automatique de textes:<br />
le système INTEX. Paris: Masson.<br />
Taljard, E. & Bosch, S.E. (this volume). “A Comparison of Approaches Towards Word<br />
Class Tagging: Disjunctively versus Conjunctively Written Bantu Languages”, 117-131.<br />
Van Rooy, B. & Pretorius, R. (2003). “A Word-Class Tagset for Setswana.” Southern<br />
African Linguistics and Applied Language Studies 21(4), 203-222.<br />
A Comparison of Approaches to Word Class<br />
Tagging: Disjunctively Versus Conjunctively<br />
Written Bantu Languages<br />
Elsabé Taljard and Sonja E. Bosch<br />
Northern Sotho and Zulu are two South African Bantu languages that make use of<br />
different writing systems, namely, a disjunctive and a conjunctive writing system,<br />
respectively. In this paper, it is argued that the different orthographic systems obscure<br />
the morphological similarities, and that these systems impact directly on word class<br />
tagging for the two languages. It is illustrated not only that different approaches are<br />
needed for word class tagging, but also that the sequencing of tasks is, to a large<br />
extent, determined by the difference in writing systems.<br />
1. Introduction<br />
The aim of this paper is to draw a comparison of approaches towards word class<br />
tagging in two orthographically distinct Bantu languages. The disjunctive versus<br />
conjunctive writing systems in the South African Bantu languages have direct<br />
implications for word class tagging. For the purposes of this discussion we selected<br />
Northern Sotho to represent the disjunctive writing system, and Zulu as an example<br />
of a conjunctively written language. These two languages, which belong to the South-<br />
Eastern zone of Bantu languages, are two of the eleven official languages of South<br />
Africa. Northern Sotho and Zulu are spoken by approximately 4.2 and 10.6 million<br />
mother-tongue speakers, respectively. Both these languages belong to a larger<br />
grouping of languages, that is, the Sotho and Nguni language groups, respectively.<br />
Languages belonging to the same language group are closely related, and to a large<br />
extent, mutually intelligible. Furthermore, since all three languages belonging to the<br />
Sotho group follow the disjunctive method of writing, the methodology utilised for<br />
part-of-speech tagging in Northern Sotho would to a large extent be applicable to the<br />
other two Sotho languages (Southern Sotho and Tswana) as well. The same holds true<br />
for Zulu with regard to the other Nguni languages (i.e., Xhosa, Swati and Ndebele),<br />
which are also conjunctively written languages. The South African Bantu languages<br />
are not yet fully standardised with regard to orthography, terminology and spelling<br />
rules, and, when compared to European languages, these languages cannot boast a<br />
wealth of linguistic resources. A limited number of grammar books and dictionaries<br />
are available for these languages, while computational resources are even scarcer. In<br />
terms of natural language processing, the Bantu languages, in general, undoubtedly<br />
belong to the lesser-studied languages of the world.<br />
In this paper, a concise overview is first given of the relevant Bantu morphology,<br />
and reference is made to the differing orthographical conventions. In the subsequent<br />
section, the available linguistic and computational resources for the two languages<br />
are compared, followed by a comparison between the approaches towards word class<br />
tagging for Northern Sotho and Zulu. In conclusion, future work regarding word class<br />
tagging for Bantu languages is discussed.<br />
2. Bantu Morphology and Orthography<br />
According to Poulos & Louwrens (1994:4), “there are numerous similarities that<br />
can be seen in the structure (i.e., morphology), as well as the syntax of words<br />
and word categories, in the various languages of this family.” These languages are<br />
basically agglutinating in nature, since prefixes and suffixes are used extensively in<br />
word formation.<br />
The focus in this concise discussion on aspects of Bantu morphology is on the<br />
two basic morphological systems: the noun class system, and the resulting system of<br />
concordial agreement.<br />
2.1 Noun Classes and Concordial Agreement System<br />
The noun class system classifies nouns into a number of noun classes, as signalled<br />
by prefixal morphemes, also known as noun prefixes. For ease of analysis, historical<br />
Bantu linguists divided these noun prefixes into numbered classes, a numbering system<br />
that is now internationally accepted. In general, noun<br />
prefixes indicate number, with the uneven class numbers designating singular and the<br />
corresponding even class numbers designating plural. The following are examples of<br />
Meinhof’s (1932:48) numbering system of some of the noun class prefixes:<br />
Table 1: Noun Class System: An Illustrative Excerpt<br />
Class # | Northern Sotho Prefix | Example | Zulu Prefix | Example<br />
1 (sg) | mo- | motho "person" | umu- | umuntu "person"<br />
2 (pl) | ba- | batho "persons" | aba- | abantu "persons"<br />
1a (sg) | Ø- | makgolo "grandmother" | u- | udokotela "doctor"<br />
2b (pl) | bo- | bomakgolo "grandmothers" | o- | odokotela "doctors"<br />
3 (sg) | mo- | mohlare "tree" | umu- | umuthi "tree"<br />
4 (pl) | me- | mehlare "trees" | imi- | imithi "trees"<br />
7 (sg) | se- | setulo "chair" | isi- | isitsha "dish"<br />
8 (pl) | di- | ditulo "chairs" | izi- | izitsha "dishes"<br />
14 | bo- | botho "humanity" | ubu- | ubuntu "humanity"<br />
The correspondence between singular and plural classes is not, however, perfectly<br />
regular, since some nouns in so-called plural classes do not have a singular form; in<br />
Zulu, class 11 nouns take their plurals from class 10, while a class such as 14 is not<br />
associated with number.<br />
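To make the class system concrete, the prefix inventory of Table 1 can be coded as a small lookup. This is only a sketch: class 1a has a zero prefix, mo- is shared by classes 1 and 3, and di- also marks class 10, so real class assignment needs more than prefix matching.

```python
# Toy Northern Sotho class-prefix lookup based on Table 1 (illustrative only;
# di- also marks class 10 nouns such as dipuku "books", and bo- also marks
# class 2b, so prefix matching alone remains ambiguous).
NS_PREFIXES = {
    'mo': [1, 3],   # motho "person" (cl 1), mohlare "tree" (cl 3)
    'ba': [2],
    'bo': [14],
    'me': [4],
    'se': [7],
    'di': [8],
}

def candidate_classes(noun):
    """Return the noun classes whose prefix matches the start of the noun."""
    return [cls for prefix, classes in NS_PREFIXES.items()
            if noun.startswith(prefix) for cls in classes]

print(candidate_classes('motho'))   # [1, 3]
print(candidate_classes('setulo'))  # [7]
```

The two-way ambiguity of mo- illustrated here is exactly the class-membership problem taken up again in the conclusion.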
The significance of noun prefixes is not limited to the role they play in indicating<br />
the classes to which the different nouns belong. In fact, noun prefixes play a further<br />
important role in the morphological structure of the Bantu languages, in that they<br />
link the noun to other words in the sentence. This linking is manifested in a system<br />
of concordial agreement, which is the pivotal constituent of the whole sentence<br />
structure, and governs grammatical agreement in verbs, adjectives, possessives,<br />
pronouns, and so forth. The concordial morphemes are derived from the noun prefixes<br />
and usually bear a close resemblance to the noun prefixes, as illustrated by the bold<br />
printed morphemes in the following Northern Sotho example:<br />
Figure 1: Concordial Agreement – Northern Sotho<br />
In this sentence, three structural relationships can be identified. The class 2 noun<br />
bašemane ‘boys’ governs the subject concord ba- in the verb ba ka bala ‘they may<br />
read’ (1), as well as the class prefix ba- in the adjective bagolo ‘big’ (2), and the<br />
demonstrative pronoun ba, preceding the adjective. The corresponding Zulu example<br />
would be as follows, where (1) indicates subject-verb agreement and (2) is agreement<br />
between the noun and the adjective concord aba- in the qualificative abakhulu.<br />
The class 10 noun izincwadi ‘books’ determines concordial agreement of the object<br />
concord -zi- in the verb (3).<br />
Figure 2: Concordial Agreement – Zulu<br />
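The agreement pattern just described can be sketched as a lookup from noun class to concord form. The concord forms below cover only the two classes in the Zulu example, and the generated form bazifunda is an assumed illustration, not a form taken from the text.

```python
# Zulu subject and object concords for the classes in the example above
# (classes 2 and 10 only); a full system would list all classes.
SUBJECT_CONCORD = {2: 'ba', 10: 'zi'}
OBJECT_CONCORD = {2: 'ba', 10: 'zi'}

def build_verb(subj_class, obj_class, root, ending='a'):
    """Assemble subject concord + object concord + verb root + ending."""
    return SUBJECT_CONCORD[subj_class] + OBJECT_CONCORD[obj_class] + root + ending

print(build_verb(2, 10, 'fund'))  # bazifunda
```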
The predominantly agglutinating nature of the Bantu languages is clearly illustrated<br />
in the above sentences, where each word consists of more than one morpheme. This<br />
complex morphological structure will be discussed very briefly by referring to two of<br />
the most complex word types, namely nouns and verbs.<br />
2.2 Morphology of Nouns<br />
Nouns as well as verbs in the Bantu languages are constructed by means of the two<br />
generally recognised types of morphemes, namely roots and affixes, with the latter<br />
subdivided into prefixes and suffixes. The majority of roots are bound morphemes,<br />
since they do not constitute words by themselves, but require one or more affixes to<br />
complete the word. The root is generally regarded to be “the core element of a word,<br />
the part which carries the basic meaning of a word” (Poulos & Msimang, 1996:170).<br />
For instance, in the Northern Sotho example dipuku ‘books’, the root that conveys<br />
the semantic significance of the word is -puku ‘book’, the morpheme di- being the<br />
class prefix of class 10. In the Zulu word izincwadi, the prefixes are i- and -zin-, with<br />
-ncwadi carrying the basic meaning ‘book’. By adding the suffixes -ng (Northern Sotho)<br />
and -ini (Zulu), and the prefix e- (in the case of Zulu) to the noun, a locative meaning<br />
is imparted:<br />
Northern Sotho: dipukung di-puku-ng ‘in the books’<br />
Zulu: ezincwadini e-(i)-zin-ncwadi-ini ‘in the books’<br />
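A minimal sketch of the locative formation above. The Northern Sotho rule simply suffixes -ng; the Zulu function models only the pattern of this single example (e- coalescing with the noun's initial vowel and -ini replacing its final vowel), so it is a toy under stated assumptions, not a general morphophonological rule.

```python
def ns_locative(noun):
    # Northern Sotho: the locative suffix -ng attaches directly.
    return noun + 'ng'

def zulu_locative(noun):
    # Zulu (toy rule for this one example): the locative prefix e- coalesces
    # with the noun's initial vowel, and -ini replaces the final vowel.
    stem = noun[1:] if noun[0] in 'aiu' else noun
    return 'e' + stem[:-1] + 'ini'

print(ns_locative('dipuku'))       # dipukung
print(zulu_locative('izincwadi'))  # ezincwadini
```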
2.3 Verbal Morphology<br />
In the case of the verb, the core element that expresses the basic meaning of the<br />
word is the verb root. The essential morphemes of a Bantu verb are a subject concord<br />
(except in the imperative and infinitive), a verb root, and an inflectional ending. Over<br />
and above the subject concord (s.c.), the form of which is determined by the class<br />
of the subject noun, a number of other morphemes may be prefixed to a verb root.<br />
These include morphemes such as object concords (o.c.), potential and progressive<br />
morphemes, as well as negative morphemes. Compare the following example in this<br />
regard:<br />
Table 2: Verbal Morphology - Northern Sotho & Zulu<br />
N.S. | ba ka di bala | ba | ka | di | bal- | -a<br />
Zulu | bangazifunda | ba- | -nga- | -zi- | -fund- | -a<br />
"they can read them" | | s.c. cl 2 | potential morpheme | o.c. cl 10 | verb root | inflectional ending<br />
It should be noted that whereas object concords also show concordial agreement<br />
with the class of the object noun, all other verbal affixes are class independent.<br />
Furthermore, verbal affixes have a fixed order in the construction of verb forms, with<br />
the object concord prefixed directly to the verb root.<br />
Derivational suffixes may be inserted between the verb root and the inflectional<br />
ending. In the following examples, it will be noted that the inflectional ending has<br />
changed to the negative -e/-i in accordance with the negative prefix ga-/a-, for<br />
example:<br />
Table 3: Verbal Derivation by Means of Suffixes<br />
N.S. | ga ba rekiše | ga | ba | rek- | -iš- | -e<br />
Zulu | abathengisi | a- | -ba- | -theng- | -is- | -i<br />
"they do not sell" | | negative morpheme | s.c. cl 2 | verb root | suffix | inflectional ending<br />
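Because the verbal affixes have a fixed order, a template-based segmenter can be sketched directly from that order. The pattern below hard-codes the tiny morpheme inventories of Tables 2 and 3 for illustration; it is not the morphological analyser cited later in the paper.

```python
import re

# Toy Zulu verb template: (negative) + subject concord + (potential) +
# (object concord) + root + (derivational suffix) + inflectional ending.
# The inventories cover only the forms appearing in Tables 2 and 3.
ZULU_VERB = re.compile(
    r'^(?P<neg>a)?'
    r'(?P<sc>ba|ngi)'
    r'(?P<pot>nga)?'
    r'(?P<oc>zi|ngi)?'
    r'(?P<root>fund|theng|bon)'
    r'(?P<deriv>is)?'
    r'(?P<ending>a|i)$'
)

def segment(verb):
    """Return the non-empty morpheme slots of a verb form, or None."""
    m = ZULU_VERB.match(verb)
    return {k: v for k, v in m.groupdict().items() if v} if m else None

print(segment('bangazifunda'))
print(segment('abathengisi'))
```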
3. Conjunctive Versus Disjunctive Writing Systems<br />
Following this explanation of the morphological structure of the Bantu languages,<br />
a few observations will be made regarding the different writing systems that are<br />
followed in the Bantu languages, with specific reference to Northern Sotho and Zulu.<br />
These different writing systems impact directly on POS-tagging, as will be explained<br />
below. The following example illustrates the difference in these writing systems:<br />
Table 4: Conjunctivism Versus Disjunctivism<br />
(second column: orthographical representation; remaining columns: morphological analysis)<br />
N.S. | ke a ba rata | ke | a | ba | rat- | -a<br />
Zulu | ngiyabathanda | ngi- | -ya- | -ba- | -thand- | -a<br />
"I like them" | | s.c. 1p.sg | PRES | o.c. cl 2 | verb root | inflectional ending<br />
The English translation ‘I like them’ consists of three orthographic words, each of<br />
which is also a linguistic word, belonging to a different word category. In the case of<br />
the Zulu sentence, where the conjunctive system of writing is adhered to, we observe<br />
one orthographic word that corresponds to one linguistic word, which is classified<br />
by Zulu linguists as a verb. The orthographic word ngiyabathanda is therefore also a<br />
linguistic word, belonging to a particular word category. This correspondence between<br />
orthographic and linguistic words is a characteristic feature of Zulu that distinguishes<br />
it from Northern Sotho. In the disjunctively written Northern Sotho sentence, four<br />
orthographic words constitute one linguistic word that is again classified as a verb.<br />
In other words, in the latter case, four orthographic elements making up one word<br />
category are written as separate orthographic entities.<br />
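In code, the contrast just described is simply a difference in whitespace tokenisation: the same linguistic word yields four orthographic tokens in Northern Sotho and one in Zulu.

```python
# One linguistic word, two orthographies (example from Table 4).
ns = 'ke a ba rata'    # Northern Sotho, disjunctive
zu = 'ngiyabathanda'   # Zulu, conjunctive

print(len(ns.split()), len(zu.split()))  # 4 1
```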
The use of different writing systems is based on both historical<br />
and phonological considerations. When Northern Sotho and Zulu were first reduced to<br />
writing, mainly by missionaries in the second half of the nineteenth century, these writers<br />
intuitively opted for disjunctivism when writing Northern Sotho, and conjunctivism<br />
when writing Zulu. Thus, an orthographic tradition was initiated that prevails even<br />
today. Although based on intuition, the decision to adopt either a conjunctive or<br />
a disjunctive writing system was probably guided by an underlying realisation that<br />
the phonological systems of the two languages necessitated different orthographical<br />
systems. As Wilkes (1985:149) points out, the presence of phonological processes<br />
such as vowel elision, vowel coalescence and consonantalization in Zulu makes a<br />
disjunctive writing system highly impractical: the disjunctive representation of the<br />
sentence Wayesezofika ekhaya ‘He would have arrived at home’ as W a ye s’ e zo fika<br />
ekhaya is almost impossible to read and/or to pronounce. In Northern Sotho, these<br />
phonological processes are much less prevalent, and, furthermore, most morphemes<br />
in this language are syllabic, and therefore pose no problems for disjunctive writing.<br />
What needs to be pointed out at this stage, however, is that there is indeed some<br />
overlap with regard to the orthographical systems used by the two languages, and<br />
that Northern Sotho and Zulu should rather be viewed as occupying different positions<br />
on a continuum ranging from complete conjunctivism to complete disjunctivism.<br />
The diagrams below illustrate the degree of overlap between the writing systems<br />
of the two languages (dashed lines indicate morphological units, solid lines indicate<br />
orthographical units). It can be observed that the disjunctive writing convention in<br />
Northern Sotho is mainly applicable to prefixes preceding the class prefix and prefixes<br />
preceding the verb root.<br />
Figure 3: Overlap Between Conjunctivism and Disjunctivism<br />
At this stage, it is important to note that the different writing systems utilised by<br />
the two languages actually obscure the underlying morphological similarities. These<br />
disjunctive versus conjunctive writing systems in the Bantu languages have direct<br />
implications for word class tagging, as will be demonstrated later in this paper. In<br />
the next section, the available computational resources for the two languages are<br />
compared.<br />
4. Computational Linguistic Resources<br />
Existing linguistic and computational resources should be exploited as far as<br />
possible in order to facilitate the task of word class tagging. Both languages have<br />
unannotated electronic corpora at their disposal – approximately 6.5 million tokens<br />
for Northern Sotho, and 5.2 million tokens for Zulu. These corpora were compiled<br />
in the Department of African Languages at the University of Pretoria and consist of<br />
a mixed genre of texts, including samples of most of the different literary genres,<br />
newspaper reports, academic texts, as well as Internet material. Since most of the<br />
texts incorporated in the corpora were not available electronically, OCR scanning was<br />
done, followed by manual cleaning of scanned material.<br />
The corpora have so far been utilised, among others, for the generation of frequency<br />
lists, which are of specific importance for the development and testing of word class<br />
tagging, especially in disjunctively written languages. In Northern Sotho, for instance,<br />
the top 10,000 types by frequency in the corpus represent approximately 90% of the<br />
tokens, whereas in Zulu the top 10,000 types represent only 62% of the tokens. This<br />
observation is directly related to the conjunctive versus disjunctive writing systems.<br />
Since frequency counts in an unannotated corpus are based on orthographical units,<br />
a large orthographic chunk such as ngiyabathanda found in Zulu would have a much<br />
lower frequency rate than the corresponding units ke, a, ba and rata in Northern Sotho.<br />
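The type/token coverage figures above can be computed directly from a frequency list. The sketch below uses a toy corpus standing in for the real ones:

```python
from collections import Counter

def coverage(tokens, n):
    """Fraction of all tokens accounted for by the n most frequent types."""
    counts = Counter(tokens)
    covered = sum(c for _, c in counts.most_common(n))
    return covered / len(tokens)

# Toy example: frequent closed-class items dominate a disjunctive corpus.
tokens = ['ke', 'a', 'ba', 'rata', 'ke', 'a', 'ba', 'bona', 'ke', 'a']
print(coverage(tokens, 3))  # 0.8
```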
This implies that the correct tagging of the top 10,000 types in Northern Sotho, be it<br />
manual, automatic, or a combination of both, results in a 90% correctly tagged corpus.<br />
The low type/token ratio in Zulu, however, yields a much smaller percentage, with<br />
only 62% of the corpus being tagged. It furthermore impacts<br />
directly on the methodology used for word class tagging in the two languages: the low<br />
type/token relationship in Zulu necessitates the use of an additional tool (such as a<br />
morphological analyser prototype as described in Pretorius & Bosch 2003) to achieve<br />
a higher percentage in the automatic tagging of the Zulu corpus. Let us look at the<br />
following examples, which have been analysed by the above-mentioned analyser:<br />
amanzi ‘water/that are wet’<br />
a[NPrePre6]ma[BPre6]nzi[NStem]<br />
a[RelConc6]manzi[RelStem]<br />
yimithi ‘they are trees’<br />
yi[CopPre]i[NPrePre4]mi[BPre4]thi[NStem]<br />
ngomsebenzi ‘with work’<br />
nga[AdvForm]u[NPrePre3]mu[BPre3]sebenzi[NStem]<br />
bangibona ‘they see me’<br />
ba[SC2]ngi[OC1ps]bon[VRoot]a[VerbTerm]<br />
abathunjwa ‘(they) who are taken captive/they are not taken captive’<br />
aba[RelConc2]thumb[VRoot]w[PassExt]a[VerbTerm4]<br />
a[NegPre]ba[SC2]thumb[VRoot]w[PassExt]a[VerbTerm4]<br />
Examples with more than one analysis exhibit morphological ambiguity that,<br />
in most cases, can only be resolved by contextual information. Nevertheless, a<br />
morphologically analysed corpus provides useful clues for determining word class<br />
tags, since the output of the morphological analysis is a rich source of significant<br />
information that facilitates the identification of word classes. For example, the above<br />
morphologically analysed words lead to the following information regarding further<br />
processing on word class level:<br />
Table 5: Zulu Morphological Analysis and Word Classes<br />
Output of morphological analysis | Word class | Examples<br />
[NPrePre] and/or [BPre] + [NStem] + … | NOUN | amanzi<br />
[CopPre] + [NStem] + … | COPULATIVE | yimithi<br />
[SC] + [VRoot] + … OR [NegPre] + [SC] + [VRoot] + … | VERB | bangibona; abathunjwa<br />
[RelConc] + … | QUALIFICATIVE | abathunjwa; amanzi<br />
[AdvForm] + … | ADVERB | ngomsebenzi
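Under the assumption that the analyser's output strings look exactly like the examples above, this mapping can be implemented as a handful of rules over the leading morpheme tags. The function below is a sketch covering only the five word classes shown:

```python
import re

def word_class(analysis):
    """Map a Zulu morphological analysis string (as in Table 5) to a word class."""
    tags = re.findall(r'\[([A-Za-z0-9]+)\]', analysis)
    first = tags[0]
    if first.startswith('NPrePre') or first.startswith('BPre'):
        return 'NOUN'
    if first == 'CopPre':
        return 'COPULATIVE'
    if first.startswith('SC') or (first == 'NegPre' and tags[1].startswith('SC')):
        return 'VERB'
    if first.startswith('RelConc'):
        return 'QUALIFICATIVE'
    if first == 'AdvForm':
        return 'ADVERB'
    return 'UNKNOWN'

print(word_class('a[NPrePre6]ma[BPre6]nzi[NStem]'))  # NOUN
print(word_class('a[RelConc6]manzi[RelStem]'))       # QUALIFICATIVE
```

Note that the two analyses of amanzi map to two different word classes, which is exactly the kind of ambiguity that, as stated above, only contextual information can resolve.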
Concerning the tags used in the above morphological analysis, it should be noted<br />
that “tags were devised that consist of intuitive mnemonic character strings that<br />
abbreviate the features they are associated with.” (Pretorius & Bosch 2003:208).<br />
The word class tagset for Zulu is based on the classification by Poulos and Msimang<br />
(1996:26). More will be said about this tagset further on in the paper. The features<br />
and tags concerned are as follows:<br />
Table 6: Zulu Tags – An Illustrative Excerpt<br />
Tag Feature<br />
[AdvForm] Adverbial formative<br />
[BPre6] Basic prefix class 6<br />
[CopPre] Copulative prefix<br />
[NegPre] Negative prefix<br />
[NPrePre6] Noun preprefix class 6<br />
[NStem] Noun stem<br />
[OC1ps] Object concord 1st pers singular<br />
[PassExt] Passive extension<br />
[RelStem] Relative stem<br />
[SC2] Subject concord class 2<br />
[VRoot] Verb root<br />
[VerbTerm] Verb terminative<br />
In this paper, it is argued that the difference in writing systems dictates the need<br />
for different architectures, specifically for a different sequencing of tasks for<br />
POS-tagging in Northern Sotho and Zulu. The approaches followed to implement word class<br />
taggers for Northern Sotho and Zulu will be presented in the following section.<br />
5. Comparison of Approaches Towards Word Class Tagging for Northern Sotho<br />
and Zulu<br />
With regard to Northern Sotho, the term POS-tagging is used in a slightly wider<br />
sense, following Voutilainen (Mitkov 2003:220) who states that POS-taggers usually<br />
produce more information than simply parts of speech. He indicates that the term<br />
‘POS-tagger’ is often regarded as being synonymous with ‘morphological tagger’,<br />
‘word class tagger’ or even ‘lexical tagger.’ POS-tagging for Northern Sotho results in<br />
a hybrid system, containing information on both morphological and syntactic aspects,<br />
although biased towards morphology. This approach is dictated, at least in part, by<br />
the disjunctive method of writing, in which bound morphemes such as verbal prefixes<br />
show up as orthographically distinct units. As a result, in Northern Sotho, orthographic<br />
words do not always correspond to linguistic words, which traditionally constitute word<br />
classes or parts of speech. Rather than to see this as a disadvantage, it was decided<br />
to make use of the morphological information already implicit in the orthography,<br />
thus doing morphological tagging in parallel to a more syntactically-oriented word<br />
class tagging. It is, therefore, not necessary to develop a tool for the separation<br />
of morphemes, since this is largely catered for by the disjunctive orthography of<br />
Northern Sotho. As a result, all verbal prefixes can, for example, be tagged by making<br />
use of standard tagging technology, even though they are actually bound morphemes<br />
belonging to a complex verb form. A further motivation for the tagging of these bound<br />
morphemes is the fact that they are grammatical words or function words belonging<br />
to closed classes that normally make up a large percentage of any Northern Sotho<br />
corpus. Tagging of these forms would therefore result in a large proportion of the<br />
corpus being tagged. The decision to tag all orthographically distinct surface forms,<br />
regardless of whether these are free or bound morphemes, resulted in a tagset that<br />
is somewhat larger than normal: even though only nine word classes are traditionally<br />
distinguished for Northern Sotho, the proposed tagset contains thirty-three tag types.<br />
This number is further increased by the distinction of class-based subtypes for some of<br />
these tag types: the category EMPRO (emphatic pronoun) for example, has seventeen<br />
subtypes in order to account for the pronouns of the first and second person, as well<br />
as those of the different noun classes. The total number of tags comes to 155. (For a<br />
full discussion of the tagset design, see Prinsloo & Heid in this volume.)<br />
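A minimal illustration of why standard token-based tagging works for the disjunctive orthography: the bound grammatical morphemes appear as separate tokens and can be tagged by lexicon lookup. The tag names and lexicon entries below are invented for illustration and ignore the pervasive ambiguity (ba, for instance, is also a demonstrative); they are not the 155-tag set described in the text.

```python
# Toy closed-class lexicon for Northern Sotho (illustrative tags only).
CLOSED_CLASS = {
    'ke': 'SC_1SG',    # subject concord, 1st person singular
    'a': 'PRES',       # present-tense morpheme
    'ba': 'SC_CL2',    # subject concord, class 2
    'ka': 'POT',       # potential morpheme
    'di': 'OC_CL10',   # object concord, class 10
    'ga': 'NEG',       # negative morpheme
}

def tag(sentence):
    """Tag each orthographic word; open-class items fall through as OPEN."""
    return [(w, CLOSED_CLASS.get(w, 'OPEN')) for w in sentence.split()]

print(tag('ba ka di bala'))
```

Because such function words make up a large share of any Northern Sotho corpus, even this naive lookup tags a substantial proportion of the running text.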
However, the existence of complex morphological units whose parts are not<br />
realised as surface forms necessitates a multi-level annotation. A separate tool such<br />
as a morphological analyser would be needed for the analysis of inter alia verbal<br />
derivations of Northern Sotho. Typical examples that would need to be analysed by<br />
such a tool would be verbal suffixes. Such a multi-level approach could be represented<br />
as follows:<br />
Figure 4: Multi-level Approach Towards Word Class Tagging<br />
It should be noted that there are cases where the object concord appears within<br />
the verbal structure, notably the object concord of the first person singular. This<br />
particular object concord distinguishes itself from other object concords in that it is<br />
phonologically and orthographically fused to the verbal root. All other object concords<br />
are written separately from the verbal root and are thus easily identifiable, except for<br />
the object concord of class 1 before verb stems commencing with b-, for example, mo<br />
+ bona > mmona ‘see him/her.’ A procedure similar to the one illustrated above would<br />
be needed for these cases.<br />
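The fused forms just mentioned are the exception that requires extra machinery. A toy version of the class 1 fusion rule (mo + b-initial stem > mm-), hedged as covering only this single pattern:

```python
def attach_oc_cl1(stem):
    """Attach the class 1 object concord mo- (toy rule: mo + b- > mm-)."""
    if stem.startswith('b'):
        return 'mm' + stem[1:]   # fused: mo + bona -> mmona
    return 'mo ' + stem          # otherwise written disjunctively

print(attach_oc_cl1('bona'))  # mmona
```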
In the case of Zulu, morphological aspects need not be included in the word class<br />
tagging, since these are already accounted for in the morphological analysis. This<br />
difference in approach to the tagsets can be mainly ascribed to the different writing<br />
systems. The word class tagset for Zulu used for purposes of illustration above is based<br />
on the classification by Poulos & Msimang (1996:26) according to which “words which<br />
have similar functions, structures and meanings (or significances) would tend to be<br />
classified together as members of the same word category […]” The tagset comprises<br />
the following: Noun, Pronoun, Demonstrative, Qualificative, Verb, Copulative, Adverb,<br />
Ideophone, Interjection, Conjunction, and Interrogative. It is well known that the<br />
degree of granularity of a tagset should be appropriate to the purposes of the tagged<br />
corpus (Allwood et al. 2003:230).<br />
The following diagram is a summary of the distinct approaches towards word class<br />
tagging as exemplified in the two Bantu languages, Northern Sotho and Zulu. The tasks<br />
that need to be performed are similar, but the approaches and sequencing of tasks<br />
differ significantly. It is noticeable that, in Northern Sotho, no dedicated tool is needed<br />
for the separation of morphemes, since this is already implicit in the disjunctive<br />
writing system. The tagger caters to a certain extent for morphophonological rules,<br />
but is especially significant for the second level, where morphosyntactic classification<br />
of morphemes takes place. Analysis of word formation rules would only need to be<br />
done on level II, for which a morphological analyser is needed.<br />
In the case of Zulu, the morphological analyser plays a significant role in levels I<br />
and II, where constituent roots and affixes are separated and identified by means of<br />
the modelling of two general linguistic components. The morphotactics component<br />
contains the word formation rules, which determine the construction of words from the<br />
inventory of morphemes (roots and affixes). This component includes the classification<br />
of morpheme sequences. The morphophonological alternations component describes<br />
the morphophonological changes between lexical and surface levels (cf. Pretorius &<br />
Bosch 2003:273-274). Finally, Northern Sotho and Zulu are on a par in level III, where<br />
the identification of word classes, associated with the assigning of tags, takes place.<br />
Figure 5: Task Sequencing in Northern Sotho and Zulu<br />
6. Conclusion and Future Work<br />
In this paper, a comparison of approaches towards word class tagging in two<br />
orthographically distinct Bantu languages, namely Northern Sotho and Zulu, was drawn.<br />
The disjunctive versus conjunctive writing systems in these two South African Bantu<br />
languages have direct implications for word class tagging. Northern Sotho on the one<br />
hand resorts to a hybrid system, which contains information on both morphological<br />
and syntactic aspects, although biased towards morphology. In the case of Zulu, on<br />
the other hand, morphological aspects need not be included in the word class tagging,<br />
since these are already accounted for in the morphological analysis. Word class tags<br />
for Zulu are associated with syntactic information. The work described in this paper is<br />
of crucial importance for pre-processing purposes, not only for automatic word class<br />
taggers of Northern Sotho and Zulu, but also for the other languages belonging to the<br />
Sotho and Nguni language groups.<br />
Regarding future work, two significant issues have been identified. First, cases of<br />
ambiguous annotation require the application of disambiguation rules based mainly<br />
on surrounding contexts. A typical example of ambiguity is that of class membership,<br />
due to the agreement system prevalent in these languages. For instance, in Northern<br />
Sotho as well as Zulu, the class prefix of class 1 nouns is morphologically similar<br />
to that of class 3 nouns, that is, mo- (N.S) and umu- (Z). This similarity makes it<br />
impossible to correctly assign class membership of words such as adjectives, which<br />
are in concordial agreement with nouns, without taking the context into account.<br />
Secondly, the standardisation of tagsets for use in automatic word class taggers of<br />
the Bantu languages needs serious attention. A word class tagset based on standards<br />
proposed by the Expert Advisory Group on Language Engineering Standards (EAGLES)<br />
was recently proposed for Tswana, a Bantu language belonging to the Sotho language<br />
group, by Van Rooy & Pretorius (2003). Similarly, Allwood et al. (2003) propose a<br />
tagset to be used on a corpus of spoken Xhosa, a member of the Nguni language group.<br />
In order to ensure standardisation, and therefore achieve reusability of linguistic<br />
resources such as word class tagsets, this initial research on the standardisation of<br />
tagsets needs to be extended to all the Bantu languages.<br />
7. Acknowledgements<br />
We would like to thank Uli Heid for unselfishly sharing his knowledge and expertise<br />
with us. His comments on an earlier version of this paper added immeasurable value<br />
to our effort.<br />
References<br />
Allwood, J., Grönqvist, L. & Hendrikse, A.P. (2003). “Developing a Tagset and Tagger<br />
for the African Languages of South Africa with Special Reference to Xhosa.” Southern<br />
African Linguistics and Applied Language Studies 21(4), 223-237.<br />
EAGLES. Online at: http://www.ilc.cnr.it/EAGLES/home.html.<br />
Meinhof, C. (1932). Introduction to the Phonology of the Bantu Languages. (trans. van<br />
Warmelo, N.). Berlin: Dietrich Reimer/Ernst Vohsen.<br />
Poulos, G. & Louwrens, L.J. (1994). A Linguistic Analysis of Northern Sotho. Pretoria:<br />
Via Afrika Limited.<br />
Poulos, G. & Msimang, T. (1996). A Linguistic Analysis of Zulu. Pretoria: Via Afrika<br />
Limited.<br />
Pretorius, L. & Bosch, S.E. (2003). “Computational Aids for Zulu Natural Language<br />
Processing.” Southern African Linguistics and Applied Language Studies 21(4), 267-282.<br />
Prinsloo, D.J. & Heid, U. (this volume). “Creating Word Class Tagged Corpora for<br />
Northern Sotho by Linguistically Informed Bootstrapping”, 97-115.<br />
Van Rooy, B. & Pretorius, R. (2003). “A Word-class Tagset for Setswana.” Southern<br />
African Linguistics and Applied Language Studies 21(4), 203-222.<br />
Voutilainen, A. (2003). “Part-of-Speech Tagging.” In Mitkov, R. (ed.). The Oxford<br />
Handbook of Computational Linguistics. Oxford: Oxford University Press, 219-232.<br />
Wilkes, A. (1985). “Words and Word Division: A Study of Some Orthographical Problems<br />
in the Writing Systems of the Nguni and Sotho Languages.” South African Journal of<br />
African Languages 5(4), 148-153.
Grammar-based Language Technology<br />
for the Sámi Languages<br />
Trond Trosterud<br />
Language technology projects are often either commercial (and hence closed to<br />
inspection) or small projects run with no explicit infrastructure. The present<br />
article presents the Sámi language technology project in some detail and is our<br />
contribution to a concrete discussion on how to run medium-scale, decentralised,<br />
open-source language technology projects for minority languages.<br />
1. Introduction<br />
This article presents a practical framework for grammar-based language<br />
technologies for minority languages. Such matters are seldom the topic of discussion;<br />
one usually goes directly to the scientific results. In order to obtain these results,<br />
however, a good project infrastructure is needed. Moreover, for minority languages,<br />
the bottleneck is often a lack of human expertise, that is, people with<br />
knowledge of the language, of linguistics and of language technology. In such situations,<br />
we need to organise work in order to facilitate cooperation and avoid duplication of<br />
effort. Although the model presented here can hardly be considered the ultimate one,<br />
it is the result of accumulated experience gained from different types of projects,<br />
commercial, academic and grass-roots open-source, and we hereby present it as a<br />
possible source of inspiration.<br />
The Sámi languages make up one of the seven sub-branches of the Uralic language<br />
family, Finnish and Hungarian being the most well-known members of two of the other<br />
sub-branches. From the point of view of typology, the Sámi languages have many<br />
properties in common with the other Uralic languages, but several non-segmental<br />
morphological processes have entered the languages as well. There are six Sámi<br />
literary languages: North, Lule, South, Kildin, Skolt and Inari Sámi. All of them are<br />
written with the Latin alphabet (including several additional letters), except Kildin<br />
Sámi, which uses the Cyrillic alphabet.<br />
Prior to our project, the main focus within Sámi computing was the localisation<br />
issue. Four of the six Sámi languages have letters that are not to be found in the Latin<br />
1 (or Latin 2) code table. At present, this issue is more or less resolved and North Sámi<br />
is the language with the fewest speakers that is nonetheless localised out of the<br />
box on all three major operating systems. No other language technology application<br />
existed prior to our work.<br />
2. Project Status Quo, Goals and Resources<br />
The work is organised in two projects, with slightly different goals. It started out<br />
as a university-based project, with the goal of building a morphological parser and<br />
disambiguator for North, Lule and South Sámi, in order to use it for scientific purposes<br />
(i.e., creating a tagged corpus with a Web-based graphical interface and using it for<br />
syntactic, morphological and lexical research, publishing reverse dictionaries, etc.).<br />
In 2003, the Norwegian Sámi parliament asked for advice on how to build a Sámi<br />
spellchecker. They considered the construction of this tool vital for the use of<br />
North Sámi as an administrative language. As a result of this, there are now three<br />
people working on the University project and four and a half people working on the Sámi<br />
parliament project. These projects will run with the present financing for another 2<br />
years.<br />
The status quo is that we have a parser with a recall of 80 - 93% on the grammatical<br />
analysis of words in running text (modulo genre) and we disambiguate the morphological<br />
output with a recall of 99% and a precision of 93%; the outcome is slightly worse<br />
for syntactic analysis. The parsers behind these results contain approximately 100<br />
morphophonological rules, 600 continuation lexica and 2000 disambiguation rules.<br />
The figures below show the output for the morphological parser for the sentence<br />
Mii háliidit muitalit dan birra “We would like to tell about it”.<br />
Figure 1: Morphological Analysis of a Sámi Sentence<br />
Figure 2 shows the same sentence in disambiguated mode. Here, all irrelevant<br />
morphological readings are removed, and in addition, syntactic information has been<br />
added on the basis of the information given by the morphological disambiguator.<br />
Figure 2: Disambiguated Version of the Same Sentence<br />
As for the speller project, there is an alpha version, made with the aspell utility.<br />
The parser has also been put to use in interactive pedagogical programs, and there<br />
are concrete plans for making a Sámi text-to-speech application.<br />
3. Choice of approach<br />
3.1 A Grammatical Versus Statistical Approach<br />
We use a grammar-based, rather than a statistical approach (proponents of the<br />
statistical approach often refer to this dichotomy as a choice between a ’symbolic’<br />
and a ’stochastic’ approach), which means that our parsers rely on a set of grammar-<br />
based, manually written rules, that can be inspected and edited by the user. There<br />
are several reasons for our choice:<br />
1. We think some of the prerequisites for good results with the statistical
approach are not present in the Sámi case;
2. We want our work to produce grammatical insight, not only functioning
programs; and,
3. On the whole, we think the grammatical approach is better for our
purposes.
Addendum to (1): Good achievements with a statistical approach require both<br />
large corpora and a relatively simple morphological structure (low wordform/lemma<br />
ratio), as is the case for English. Sámi, like many other languages, has
a rich morphological structure and a paucity of corpus resources, whereas the basic<br />
grammatical structure of the language is reasonably well understood.<br />
Addendum to (2): Our work is a joint academic and practical project. Work on<br />
minority languages will typically be carried out as cooperative projects between<br />
research institutions and in-group individuals or organisations devoted to the<br />
strengthening of the languages in question. Whereas private companies will look<br />
at the ratio of income to development cost and care less about the developmental<br />
philosophy, it is important for research institutions to work with systems that are<br />
not ‘black boxes’, but that are able to give insight into the language beyond merely<br />
producing a tagger or a synthetic voice.<br />
Addendum to (3): We are convinced that grammar-based approaches to both parsing<br />
and machine translation are superior to the statistical ones. Studies comparing the<br />
two approaches, such as Chanod & Tapanainen (1994), support this conclusion.<br />
This does not mean that we rule out statistical approaches. In many cases, the<br />
best results will be achieved by combining grammatical and statistical approaches.<br />
A particularly promising approach is the use of weighted automata, where frequency<br />
considerations are incorporated into the arcs of the transducers. We plan to apply<br />
standalone statistical methods after the grammatical analysis gives in. In other
words, the cooperation should be ruled by the motto: ‘Don’t guess if you know.’<br />
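The ‘Don’t guess if you know’ division of labour can be sketched as follows. This is a toy illustration only: the rule, the tag names and the frequency counts are invented, not taken from the Sámi grammar or corpus.

```python
# Toy illustration of combining grammatical and statistical disambiguation.
# Rule-based knowledge runs first; frequencies only decide among the
# readings the grammar could not resolve.

def rule_based(readings, context):
    # "Know": remove readings a hand-written rule can safely exclude;
    # never empty the reading list entirely.
    if context.get("after_pronoun"):
        readings = [r for r in readings if r != "Noun"] or readings
    return readings

FREQ = {"Verb": 120, "Noun": 80, "Adj": 15}  # invented corpus counts

def disambiguate(readings, context):
    readings = rule_based(readings, context)
    if len(readings) == 1:          # the grammar knew the answer
        return readings[0]
    # "Guess": fall back to frequency only when the rules give in.
    return max(readings, key=lambda r: FREQ.get(r, 0))

print(disambiguate(["Noun", "Verb"], {"after_pronoun": True}))  # Verb
print(disambiguate(["Noun", "Adj"], {}))                        # Noun
```

The control flow, not the rule content, is the point: the statistical component is a fallback, not the primary decision-maker.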
3.2 Choosing Between a ‘Top-down’ and a ‘Bottom-up’ Approach
Within grammatical approaches to parsing there are two main approaches, which<br />
we may brand ‘top-down’ and ‘bottom-up’. The top-down approach tries to map a<br />
possible sentence structure upon the sentence, as a possible outcome of applying<br />
generative rules on an initial S node. If successful, the result is a syntactic tree<br />
displaying the hierarchical structure of the sentence in question.<br />
The bottom-up approach, on the other hand, takes the incoming wordforms and
the set of their possible readings as input; multiple readings are then disambiguated
based upon context, and structures are built.
We chose a bottom-up approach, because it proved robust, was able to analyse any<br />
input and gave good results.<br />
4. Linguistic Tools<br />
4.1 The Tools Behind our Morphological Analyser<br />
For our morphological analyser, we build finite-state transducers and use the<br />
finite-state tools provided by Xerox (documented in Beesley & Karttunen 2003). For<br />
morphophonological analysis, we have the choice of using the parallel, two-level<br />
morphology model (dating back to Koskenniemi 1983) with twolc or the sequential<br />
model (presented in Karttunen et al. 1992) with xfst. Xerox’ advice is to use
the latter; we use the former, but we see this mainly as a matter of taste. The<br />
morphophonological and lexical tools are composed into a single transducer during<br />
compilation, as described in the literature (cf. the figure below).<br />
Figure 2: A Schematic Overview of the Lexicon and Morphophonology of the Parser.<br />
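As a toy illustration of this composition, the sketch below uses Python dicts as a stand-in for finite-state transducers: the lexicon maps an analysis to an intermediate lexical form, a morphophonological rule rewrites it to a surface form, and chaining the two yields a single generator/analyser mapping. The gradation marker `^WG` and the rule itself are simplified inventions, not the project’s actual source.

```python
# Toy stand-in for composing a lexc-style lexicon with a morphophonological
# rule into one analysis<->surface mapping. Real systems compose actual
# transducers; dicts only illustrate the idea.

# Lexicon: analysis -> intermediate form, with a weak-grade trigger ^WG
lexicon = {
    "guolli+N+Sg+Nom": "guolli",
    "guolli+N+Sg+Gen": "guolli^WG",   # genitive triggers the weak grade
}

def apply_morphophonology(form):
    # Illustrative rule: the weak grade shortens the geminate ll -> l
    if form.endswith("^WG"):
        form = form[:-3].replace("ll", "l")
    return form

# "Composition": chain lexicon and rule into one generator mapping,
# then invert it to obtain the analyser direction.
generator = {a: apply_morphophonology(f) for a, f in lexicon.items()}
analyser = {surface: analysis for analysis, surface in generator.items()}

print(generator["guolli+N+Sg+Gen"])  # guoli
```

In the real setup both directions come for free, since a transducer encodes a relation rather than a one-way function.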
A more serious question is the choice of Xerox tools versus Open Source tools. In our<br />
project, we have no wish to modify the source code of the rule compilers themselves,<br />
but we notice that all binary files compiled by the xfst, lexc and twolc compilers are<br />
copyrighted property of the Xerox Corporation. It is as if you were to write your own ‘C’<br />
program, but the compiled version of your program was copyright-owned by Kernighan<br />
and Ritchie, the authors of the C compiler. That said, it has been a pleasure working<br />
with Xerox: they have been very helpful, and as they see no commercial potential in<br />
Sámi, we notice no practical consequences of using proprietary compilers.<br />
4.2 The Tools Behind our Disambiguator<br />
For disambiguating the output of the morphological transducer we use constraint<br />
grammar. This is a framework dating back to Karlsson (1990); the leading idea
is that, for each wordform of the output, the disambiguator looks at the context
and removes any reading that does not fit it. The last reading can never be
removed and, in the successful case, only the appropriate reading is left. The Brill<br />
tagger can be seen as a machine-learning variety of the constraint grammar parser.<br />
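The leading idea can be sketched as follows. The rule and the tag names are invented for illustration; real constraint grammar rules (REMOVE/SELECT with rich contextual tests) are far more expressive. Only the control flow matters here: inspect the context, discard unfitting readings, and never remove the last reading.

```python
# Toy constraint-grammar-style disambiguator over "cohorts" of
# (wordform, possible readings) pairs.

def disambiguate_sentence(cohorts, rules):
    result = []
    for i, (word, readings) in enumerate(cohorts):
        # Context here is just the left neighbour's readings
        left = cohorts[i - 1][1] if i > 0 else []
        for rule in rules:
            kept = [r for r in readings if not rule(r, left)]
            if kept:                  # the last reading can never be removed
                readings = kept
        result.append((word, readings))
    return result

# Invented rule: no finite verb reading directly after a finite verb
def no_verb_after_verb(reading, left_readings):
    return reading == "V+Fin" and left_readings == ["V+Fin"]

# Readings are approximations for the example sentence from Section 2
cohorts = [("mii", ["Pron"]),
           ("háliidit", ["V+Fin"]),
           ("muitalit", ["V+Fin", "V+Inf"])]
print(disambiguate_sentence(cohorts, [no_verb_after_verb]))
```

With the invented rule, the ambiguous *muitalit* keeps only its infinitive reading, mirroring the disambiguated output in Figure 2.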
There are several versions of the constraint grammar compilers. The original one<br />
was written in Lisp by Fred Karlsson. Later, Pasi Tapanainen wrote a compiler in ‘C’,<br />
called CG-2; this version may be licensed from http://www.connexor.com. We use an<br />
open source version of this compiler, made by Eckhard Bick. It must be stressed that
the debugging facility of the Connexor compiler is superior to that of its competitors.
The optimal implementation would probably be to write the constraint grammar<br />
as a finite state transducer, as suggested in the Finite State Intersection Grammar<br />
framework. So far, nothing has come out of this work.<br />
4.3 One-base, Multi-purpose Parsers<br />
Working with minority languages, the lack of human resources is often as hard a<br />
problem as the lack of financial ones. With this in mind, avoiding duplicating work<br />
becomes crucial. The most time-consuming task in any linguistic project is building<br />
and maintaining a lexicon, be it in the form of a paper dictionary, a semantic wordnet,<br />
or the lexicon for a parser. The optimal solution is to keep only one version of the<br />
lexicon and extract relevant information from it, in order to automatically build paper<br />
and electronic dictionaries, orthographical wordlists or parsers. In our project, this
has not yet been implemented, but we are trying out prototype models
that should make it work for new languages. Our plan is to use XML as text storage,
and various scripts to extract the relevant lexicon versions.<br />
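The planned setup can be sketched as below. The schema (element and attribute names) is invented for illustration, not the project’s actual one; the point is that one XML lexicon feeds both the tolerant analyser and the strict speller described in the next paragraph.

```python
# Sketch: extracting application-specific lexica from one XML source.
# Element and attribute names are invented, not the project's schema.
import xml.etree.ElementTree as ET

XML = """
<lexicon>
  <entry lemma="guolli" pos="N" standard="yes"/>
  <entry lemma="guolle" pos="N" standard="no"/>
</lexicon>
"""

root = ET.fromstring(XML)

# Tolerant analyser lexicon: every attested form
analyser = [e.get("lemma") for e in root.iter("entry")]

# Strict speller lexicon: only forms codified in the accepted standard
speller = [e.get("lemma") for e in root.iter("entry")
           if e.get("standard") == "yes"]

print(analyser)  # ['guolli', 'guolle']
print(speller)   # ['guolli']
```

Paper dictionaries and wordlists would be further extraction targets over the same source.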
It goes without saying that we use only one source for the morphological transducers
used for linguistic analysis, pedagogical programs, spellers, and so forth. These applications
often need slightly different transducers, in which case we mark the source code so<br />
that it is possible to compile different transducers from the same source code. For<br />
the academic project we make a tolerant parser that analyses as much of the attested<br />
variation as possible. The spellchecker has a totally different goal: here we build a<br />
stricter version that only accepts the forms codified in the accepted standard. This<br />
approach is even more appropriate, as we are the only language technology project<br />
working on Sámi. Any further application will build upon our work, and our goal is to<br />
make it flexible enough to facilitate this.<br />
5. Infrastructure<br />
5.1 Computer Platform<br />
Our project is run on Linux and Mac OS X (Unix). The Xerox tools come in a Windows<br />
version as well, but the lack of a decent command-line environment and automatic<br />
routines for compiling makes it impractical. The cvs base is set up on a central Linux<br />
machine. We also use portable Macintoshes, both because they have a nice interface<br />
and because they offer programs that make it easier to work from different locations,<br />
such as the SubEthaEdit program mentioned below.<br />
5.2 Character Set and Encoding<br />
Most commercially interesting languages are covered by one of the 8-bit ISO<br />
standards. Many minority languages fall outside of this domain. It is our experience<br />
that it is both possible and desirable to use UTF-8 (multi-byte Unicode) in our source<br />
code (i.e., build the parser around the actual orthography of the language in question,<br />
rather than to construct some auxiliary ASCII representation). With the latest versions<br />
of the Linux and Unix operating systems and shells, we have access to tools that are
UTF-8 aware and, although it takes some extra effort to tune the development tools<br />
to multi-byte input, the advantage is a more readable source code (with correct<br />
letters instead of digraphs) and an easier input/output interface, as UTF-8 now is the<br />
de facto standard for digital publishing.<br />
There is one setting where one could consider using a transliteration, namely, for<br />
languages using syllabic scripts, such as Inuktitut and Cherokee. In a rule stating that<br />
a final vowel is changed in a certain environment, a syllabic script will not give any<br />
single vowel symbol to change; rather than changing, for instance, a to e in a certain<br />
context, the rule must change the syllabic symbol BA to BE, DA to DE, FA to FE, GA to<br />
GE, and so forth. It still may be better, however, to use the original orthography; each<br />
case requires its own evaluative process.<br />
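A small sketch of the difference follows; the syllable inventory is a stand-in, not actual Inuktitut or Cherokee. In an alphabetic orthography the rule rewrites one symbol, while in a syllabic orthography the same rule must enumerate a mapping per syllable.

```python
# Why "change final a to e" is awkward in a syllabic script: there is no
# 'a' symbol to rewrite, so the rule must enumerate whole syllables.

consonants = ["B", "D", "F", "G"]

# Alphabetic orthography: one rule suffices
def final_a_to_e_alpha(word):
    return word[:-1] + "e" if word.endswith("a") else word

# Syllabic orthography: one mapping per syllable (BA->BE, DA->DE, ...)
syllable_rule = {c + "A": c + "E" for c in consonants}

def final_a_to_e_syllabic(syllables):
    if syllables and syllables[-1] in syllable_rule:
        syllables = syllables[:-1] + [syllable_rule[syllables[-1]]]
    return syllables

print(final_a_to_e_alpha("nada"))           # nade
print(final_a_to_e_syllabic(["GA", "DA"]))  # ['GA', 'DE']
```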
5.3 Directory Structure<br />
We have put some effort into finding a good directory structure for our files. The
philosophy is as follows: different types of files are kept separate. (The source files<br />
have their own directory, and binary and developer files are kept separate.)<br />
Figure 3: Directory Structure<br />
5.4 Version Control
All our source and documentation files are under version control using cvs. This
means that the original files are stored on our central computer (with backup routines),<br />
and that each co-worker ‘checks out’ a local copy that becomes his or her version to<br />
work on. After editing, the changed files are then copied back, or ‘checked in’ to the<br />
central repository. For each check-in, a short note on what has been done is written.<br />
We also have set up a forwarding routine, so that all co-workers get a copy of all cvs<br />
log messages via email.<br />
Figure 4: Quote from cvs Log<br />
Using cvs (or some other version control system) is self-evident to any programmer,<br />
and it may seem quite embarrassing that such a trivial fact should even be
mentioned here. It is our experience, however, that the use of version control systems is by
no means standard within academic projects, and we strongly urge anyone not using
such tools to consider doing so. Backup routines become easier and, when expanding
from one-person projects to larger ones, version control is a prerequisite for several
co-workers collaborating on the same source files. We even recommend cvs for one-person
projects. Using cvs, it is easier to document what has been done earlier, and to go<br />
back to previous versions to find out when a particular error may have crept in.<br />
5.5 Building with ‘Make’<br />
Another self-evident programmer’s tool is the use of makefiles, via the program<br />
make. In its basic form, make functions like a script and saves the work of typing the<br />
same long set of compilation commands again and again. With several source files, it<br />
becomes important to know whether one should compile or not. The make program<br />
keeps track of the age of the different files, and compiles a new set of binary files<br />
only when the source files are newer than the target binary files. The picture shows
the dependencies between the different source and binary files.<br />
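Such a build step might look roughly as below; the file names and the compile command are invented placeholders, not the project’s actual targets.

```makefile
# Illustrative makefile rule for a transducer build: the binary is rebuilt
# only when a source file is newer than it. (The recipe line must be
# indented with a tab; "compile-transducer" is a placeholder command.)

sami.fst: lexicon.lexc rules.twolc
	compile-transducer lexicon.lexc rules.twolc -o sami.fst
```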
Figure 5: Dependencies in the Project’s Makefile
5.6 Tracing Bugs
As the project grows, so does the number of people working on it, and hence the number
of bugs and errors. We have set up an error database, in our case Bugzilla, which
keeps track of errors. The database can be found at http://giellatekno.uit.no/bugzilla/.
Interested persons may visit the URL. There is a requirement that
you log in with an e-mail account and (preferably) a name, but otherwise the bug<br />
database is open for inspection.<br />
5.7 Internal Communication in a Decentralised Project<br />
We have co-workers in Tromsø, Kautokeino and Helsinki. Crucial for the project’s<br />
progress is the possibility of coordinating our work. For that, we have the following<br />
means:<br />
• We have made a project-internal newsgroup. Discussion is carried out<br />
in this environment rather than in personal emails, since more than one person<br />
may have something to say on the issue, and since it is easier to go back to earlier<br />
discussions using the newsgroup format.<br />
• For simultaneous editing of the same document, be it source code or a<br />
meeting memo, we use a program called SubEthaEdit (http://www.codingmonkeys.de/subethaedit/, Mac OS X only). This program makes it possible for several
users to edit the same file at the same time. Combined with the use of the telephone<br />
(or voice chat!), we may discuss complicated matters on a common rule set while<br />
editing together, even though we sit in different countries.<br />
• For informal discussions, we use chat programs. The built-in Mac OS X<br />
chat application iChat also facilitates audio and video chats with decent to high<br />
quality of the video and sound (mainly restricted by the available bandwidth).<br />
• We have meetings over the phone; although we planned to conduct them<br />
using iChat (with up to ten participants in the same audio chat), technical problems
with a firewall have stopped us from doing this.
• The cvs version control and Bugzilla error database also facilitate working<br />
in several locations.<br />
5.8 Source Code and Documentation<br />
In our experience, a systematic approach to documentation is required even
when the project engages only one worker, and it is indispensable when the number of
workers grows beyond two. Working on the only Sámi language technology project in<br />
the world, we acknowledge that all future work will take our work as a starting point.<br />
We thus work in a one-hundred-year perspective, and write documentation so that
those who follow us will be able to read what we have done.<br />
We document:<br />
• The external tools we use (with links to the documentation provided by<br />
the manufacturer);<br />
• The infrastructure of our project; and,<br />
• Our source files and the linguistic decisions we have made.<br />
In an initial phase, we used to write documentation in HTML, which was available<br />
only internally on the project machines. We now write documentation in XML, and<br />
convert it to HTML via the XML publishing framework Forrest (http://forrest.apache.org/).
Documentation can be published in many ways, but it is our experience that it
is most convenient to read the documentation in a hypertext format such as HTML.
Since the documentation has grown, we also use a search engine (provided by Forrest)<br />
to find what we have written on a given topic.<br />
The internal documentation of our project is open for inspection at the website<br />
http://divvun.no/ (the proofing tools project), as well as http://giellatekno.uit.no<br />
(the academic disambiguator project). The technical documentation is in English, and<br />
can be found under the tab TechDoc.<br />
Our source code is open as well; it is downloadable via anonymous cvs from our
technical documentation. We believe that publishing the source code of projects like
this will lead to progress within the field, not only in general, but especially for<br />
minority language projects.<br />
By publishing the documentation and the source code, we make it easy to explain<br />
what we do; we hope that it will inspire others to perhaps give us some constructive<br />
feedback as well. The only possible drawback of this openness is that it exposes our<br />
weaknesses to the whole world. So far, we have not noticed any negative effects in<br />
this regard.<br />
6. Costs<br />
Except for the computers themselves and the operating system and applications<br />
that come with them, we have mostly used free or open-source software for all our<br />
tasks. In the few cases where we have paid for software, there are free or open-source<br />
alternatives. The notable exception is the set of compilers for morphophonological<br />
automata. For analysing running text and generating stray forms, the Xerox tools can<br />
be used in the versions found in Beesley & Karttunen (2003). For our academic project,<br />
these tools have proven good enough, but in order to generate larger paradigms, the<br />
commercial version of the tools is needed.<br />
As for the computers, the only really demanding task is compiling the transducers.<br />
If one is willing to wait five minutes for the compilation, any recent computer can do<br />
fine, but the top models cut compilation time to less than half of what the cheapest<br />
models can do. Macs turned out to be a good choice for our projects, and the cheapest<br />
Mac can be bought for roughly 500 USD/EUR. One good investment, though, is to buy<br />
more RAM than what can be found on the standard configuration.<br />
7. Summary<br />
When doing language technology for minority languages, we are constantly faced<br />
with the fact that there are few people working with each language, and that different<br />
language projects set off in different directions, often due to mere coincidence. Our<br />
answer to this challenge is to share both our experience and our infrastructure with<br />
others. By doing this, we hope that people will borrow from us, and comment upon<br />
what we do and how we do it. We also look forward to being confronted with other<br />
solutions and to borrowing improvements back.<br />
References<br />
Beesley, K.R. & Karttunen, L. (2003). Finite State Morphology. Stanford: CSLI<br />
Publications. http://www.fsmbook.com/.<br />
Bick, E. (2000). The Parsing System “Palavras”: Automatic Grammatical Analysis of<br />
Portuguese in a Constraint Grammar Framework. Dr. phil. thesis, Aarhus University<br />
Press.<br />
Brill, E. (1992). “A Simple Rule-based Part of Speech Tagger.” Proceedings of the Third<br />
Conference on Applied Natural Language Processing. ACL, Trento, Italy, 1992.<br />
“Bugzilla.” Online at http://giellatekno.uit.no/bugzilla/.
“Bures boahtin sámi giellateknologiija prošektii.” Online at http://giellatekno.uit.no.
Chanod, J.P. & Tapanainen, P. (1994). “Tagging French: Comparing a Statistical and<br />
a Constraint-based Method.” Seventh Conference of the European Chapter of the<br />
Association for Computational Linguistics, 149-156.<br />
“Connexor.” Online at http://www.connexor.com.
“Divvun - sámi korrektuvrareaiddut.” Online at http://divvun.no/.
“Forrest.” Online at http://forrest.apache.org/.
Jelinek, F. (2004). “Some of my best friends are linguists.” LREC 2004. http://www.lrec-conf.org/lrec2004/doc/jelinek.pdf.
Karlsson, F. (1990). “Constraint Grammar as a Framework For Parsing Running Text.” Karlgren,
H. (ed.) (1990). Papers presented to the 13th International Conference on Computational<br />
Linguistics, 3, 168–173, Helsinki, Finland, August. ICCL, Yliopistopaino, Helsinki.
Karttunen, L., Kaplan, R.M. & Zaenen, A. (1992). “Two-level morphology with<br />
composition.” COLING’92, Nantes, France, August 23-28, 141-148.<br />
Koskenniemi, K. (1983). Two-level Morphology: A General Computational Model for<br />
Word-form Production and Generation. Publications of the Department of General<br />
Linguistics, University of Helsinki.<br />
Samuelsson, C. & Voutilainen, A. (1997). “Comparing a linguistic and a stochastic<br />
tagger.” Proceedings of the 35th Annual Meeting of the Association for Computational<br />
Linguistics, 1997.<br />
“SubEthaEdit.” Online at http://www.codingmonkeys.de/subethaedit/.
Voutilainen, A., Heikkilä, J. & Anttila, A. (1992). Constraint Grammar of English: A
Performance-Oriented Introduction, 21. Helsinki: Department of General Linguistics,
University of Helsinki.<br />
The Welsh National Online
Terminology Database<br />
Dewi Bryn Jones and Delyth Prys<br />
Terminology standardisation work has been on-going for the Welsh language for<br />
many years. At an early date, a decision was taken to adopt international standards<br />
such as ISO 704 and ISO 860 for this work. It was also decided to store the terminologies<br />
in a standard format in electronic databases, even though the demand in the early<br />
years was for traditional paper-based dictionaries. Welsh is now reaping the benefits of<br />
those far-seeing early decisions. In 2004, work began on compiling a national database<br />
of bilingual (Welsh/English) standardised terminology. Funded by the Welsh Language<br />
Board, it will be made freely available on the World Wide Web. Electronic databases<br />
already in existence have been revisited and reused for this project, with a view to<br />
updating them to conform to an ISO terminology mark-up framework (TMF) standard.<br />
An additional requirement of this project is that the term lists should be packaged and<br />
made available in a compatible format for downloading into popular Termbase systems<br />
found in translation tool suites such as Trados, Déjà Vu and Wordfast. As far as we<br />
know, this is the first time that a terminology database has been developed to provide<br />
a freely available Termbase download utility at the same time as providing an online<br />
searchable facility. Parallel work of utilising an ISO lexical mark-up framework (LMF)<br />
compliant standard for another project, namely, the LEXICELT Welsh/Irish dictionary,<br />
has provided the opportunity to research similarities and differences between a<br />
terminological concept-based approach and a lexicographical lexeme-based one.<br />
Direct comparisons between TMF and LMF have been made, and both projects have<br />
gained from new insights into their strengths and weaknesses. This paper will present<br />
an overview of a simple implementation for the online database, and attempt to show<br />
how frugal reuse of existing resources and adherence to international standards both<br />
help to maximise sparse resources in a minority language situation.<br />
1. Introduction<br />
Terminology for Welsh has seen increased activity over the last ten years, in<br />
particular in government administration and the public sector, following the passing<br />
of the Welsh Language Act 1994 and the establishment of the National Assembly for<br />
Wales. Many bilingual Welsh/English dictionaries have been published by various<br />
organisations operating in Wales covering subject fields within secondary education,<br />
academia, health, social services and public administration. Welsh terminology is<br />
generally perceived as merely an aid to standardised translation for English terms (cf.<br />
Prys 2003).<br />
Depending on the organisation responsible for commissioning a dictionary,
dissemination to translators and the public at large is done via paper-based and/or
electronically-based means. As a result, however, provision of standardised terminology<br />
is fragmented and dispersed in nature. Translators have to keep and maintain their<br />
own collection of paper-based dictionaries, and/or keep track of where and how to<br />
access electronic versions. Finding a Welsh translation may involve laboriously looking<br />
in more than one dictionary.<br />
Meanwhile, the public at large, who would not have such a collection of<br />
terminology dictionaries, would not be part of a determinologization process, where<br />
specialised terms become incorporated into general language, thereby safeguarding<br />
or increasing the presence of Welsh in the commercial, printed media and popular<br />
culture sectors.<br />
Thus, the Welsh Language Board commissioned the e-Welsh team to develop the<br />
Welsh National Online Terminology Database in order to centralise and facilitate
efficient terminology dictionary dissemination.<br />
2. Requirements for the Welsh National Online Terminology Database
The Welsh National Online Terminology Database project requirements comprised two
parts. First, previously published dictionaries of standardised
terminology, in particular those commissioned by the Welsh Language Board, were<br />
compiled and stored into a new online terminology database system. This new online<br />
terminology database would constitute the second part of the requirements, wherein<br />
its purpose would be to provide a freely available central Web-based resource<br />
supporting the dissemination of Welsh standardised terminology via the greatest<br />
number of formats, channels and mechanisms.<br />
The system would support:<br />
• searching and presenting term translations across one or more dictionaries<br />
stored in the system;<br />
• downloading of complete dictionaries in various formats for offline
use and integration into the termbase tools of translators’ own Translation Memory
environments, such as Trados MultiTerm, Déjà Vu, Wordfast and a dictionary
product developed by e-Welsh called Cysgeir; and,
• presentation of its data as XML for possible incorporation with other<br />
online terminology systems.<br />
Since the database system would be a repository of published standardised<br />
terminology there were no requirements for wider-scoped terminology management<br />
functionalities such as editing and standardisation process support.<br />
3. Standards in Welsh Terminology<br />
Since an early date, ISO international standards have been adopted in Welsh<br />
terminology. In 1998, members of the e-Welsh team compiled a guideline document<br />
on the use of ISO standards (Prys & Jones 1998) for all terminology standardisation<br />
providers in Wales.<br />
The document mandated the use of principles and methods from ISO 704 and<br />
860. The guidelines helped to raise the discipline of terminology standardisation for<br />
Welsh above what might otherwise be typical of a lesser-used and resourced language<br />
where:<br />
• the work may be led by linguists with insufficient subject specialist<br />
knowledge, or subject specialists with insufficient linguistic expertise;<br />
• less technically competent subject specialists would independently
develop a paper-based dictionary in a word processing application; and,<br />
• new words and terms are too easily coined along with spelling reforms in<br />
a misguided attempt to widen the appeal of the language.<br />
The guidelines mandated the use of databases in accordance with ISO/TR<br />
12618:1994 for terminology. The guideline document advised on the structure of such<br />
databases, as well as on the fields (or data categories) to be populated for any Welsh<br />
terminology dictionaries.<br />
The development of dictionaries would be conducted in tabular format with columns<br />
to store fields such as terms, term plurals, disambiguators for homonyms and Welsh<br />
grammatical information such as parts-of-speech. Crucially, each row represented the<br />
concept level.<br />
Thus, by employing a single table in a simple database application, no special<br />
terminological tools are required. A consistent bidirectional dictionary is easily<br />
created, whilst report and macro functionality found in Office productivity software<br />
can be used to create printed versions. A single table is also simple to host on a<br />
website.<br />
Below is a typical example of a Welsh/English terminology dictionary entry:<br />
home help (of person) cynorthwyydd cartref eb cynorthwywyr cartref;<br />
home help (of service) cymorth cartref eg<br />
cynorthwyydd cartref eg cynorthwywyr cartref home help;<br />
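Entries like the ones above can be generated in both directions from a single concept-per-row table. A minimal sketch follows; the column names are invented for illustration, not taken from the guideline document.

```python
# One concept per row, as in the guidelines; both dictionary directions
# are generated from the same table. Column names are invented.

rows = [
    {"en": "home help (of person)", "cy": "cynorthwyydd cartref",
     "gender": "eb", "plural": "cynorthwywyr cartref"},
    {"en": "home help (of service)", "cy": "cymorth cartref",
     "gender": "eg", "plural": ""},
]

def en_to_cy(rows):
    # English headword, then Welsh term with gender and plural
    return [f'{r["en"]}  {r["cy"]} {r["gender"]} {r["plural"]}'.rstrip()
            for r in sorted(rows, key=lambda r: r["en"])]

def cy_to_en(rows):
    # Welsh headword with gender, then English equivalent
    return [f'{r["cy"]} {r["gender"]}  {r["en"]}'
            for r in sorted(rows, key=lambda r: r["cy"])]

print(en_to_cy(rows)[0])
print(cy_to_en(rows)[0])
```

Sorting each direction independently is what makes the resulting dictionary consistent and bidirectional without duplicating the data.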
In effect, this simple adoption of recommendations from the ISO standards made all<br />
previous terminology dictionaries commissioned in Wales ‘future proof’ and available<br />
to the Welsh National Online Terminology Database project electronically and already
in a database format.<br />
Over the course of numerous dictionary commissions, the guidelines have become<br />
outdated compared to the latest ISO standards. Further needs have been identified,<br />
including those to improve the interoperability of data and the reuse of software<br />
components. Weaknesses were identified in the guidelines both in the structure<br />
of terminological data and in the specification of data category selection. More<br />
specifically:<br />
Structure – a single database table is too rigid a structure. Columns were sometimes<br />
duplicated as a means of overcoming this inflexibility in order to store multiple terms<br />
for a single concept. This is bad database design and practice.<br />
Data Category Selection - although a data category selection was specified, the
actual field names for use in database tables were not. Therefore, across many dictionary
database tables, the field for containing, for example, the English term, has different<br />
names such as [English] and [Saesneg] and [Term Saesneg].<br />
Thus, the Welsh National Online Terminology Database project presented an
opportunity to update and extend our adoption of standards.<br />
4. Welsh National Online Terminology Database
4.1 Use of Standards for the Welsh National Online Terminology Database
The Welsh National Online Terminology Database would need an improved structure
in order to scale up to the number of terms and richness of data that terminological<br />
entry records may be expected to store in the future. The improved structure would<br />
come from ISO 16642, also known as the Terminological Markup Framework (TMF).
The TMF simply defines an abstract or meta-model for terminological entries. From
the meta-model, Terminological Markup Languages (TMLs) can be derived to facilitate
the representation and transfer of terminology data. Thus, adoption of the TML data
model, as illustrated in the following figure, can be seen to provide a much-improved
representation for all terminology entries in the Welsh National Online Terminology
Database.
Figure 1: ISO 16642 / TMF Meta-Model Structure for Terminological Entries<br />
The structure is concept-based: a hierarchical structure containing multiple
language sections, each containing one or more terms that represent the concept in
question in that language. At the conceptual level, conceptual or classification system
data can be added to increase the granularity of term classification beyond
the containing dictionary and the subject field implied by the dictionary title.
Language sections provide the means for storing terms from one, two or any number
of languages. This opens the potential for multilingual Welsh terminology dictionaries
incorporating languages that are related to Welsh or widely used by Welsh speakers,
such as Spanish and the other Celtic languages.
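The hierarchy described above might be modelled as follows. This is a minimal sketch in Python; the class names are our own shorthand, not names defined by ISO 16642.

```python
from dataclasses import dataclass, field
from typing import List

# A minimal sketch of the TMF meta-model hierarchy: one entry per
# concept, containing language sections, each containing one or more
# term sections. Class names are illustrative, not from ISO 16642.

@dataclass
class TermSection:
    term: str
    part_of_speech: str = ""

@dataclass
class LanguageSection:
    lang: str                                        # e.g. "en", "cy"
    terms: List[TermSection] = field(default_factory=list)

@dataclass
class TerminologicalEntry:
    concept_id: str
    language_sections: List[LanguageSection] = field(default_factory=list)

entry = TerminologicalEntry(
    concept_id="c-1",
    language_sections=[
        LanguageSection("en", [TermSection("home help")]),
        LanguageSection("cy", [TermSection("cynorthwyydd cartref", "eg")]),
    ])
```

The concept entry is the root; adding a third language is simply a matter of appending another language section.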
The term section provides an efficient means for associating many terms that<br />
represent a concept in a particular language. The TMF also specifies the mechanism<br />
by which its structure is populated with data categories selected from a data category<br />
register or catalogue as defined in ISO 12620:1999.<br />
The catalogue contains well-defined data categories and picklist values for
use within a TML structural representation. Essentially, ISO 12620:1999 provides
standardised names for data categories.
4.2 Implementing TMF: a Simple First Implementation<br />
a) TMF Structure Compliance<br />
A very simple implementation was completed that allowed us quite easily to use
and benefit from aspects of the TMF. We did not intend to derive our own terminology
representation from the TMF, preferring instead to use an already existing XML format.
The TMF XML implementation chosen was TBX [1]. TBX complies with the requirements of
the TMF meta-model. It is a flexible format that allows users to select data
categories from ISO 12620 and to specify their own data categories via
its eXtensible Constraint Specification (XCS).
A number of resources are available from the TBX home page at http://www.lisa.org
to aid in its adoption, in particular documentation describing the standard further. To
describe the structure, an XML Schema definition is provided. A collection of ISO
12620 data categories is also described and provided in a default eXtensible Constraints
Specification file (TBXDv04.xml).
There is limited tool support for TBX, but with a freely available XML Schema
Definition tool within the Microsoft .NET framework (xsd.exe [2]), we were able to
generate serializable C# classes. The generated C# code would be available to any
other code written for constructing TMF-compliant representations of terminological
entries for inclusion in the Welsh National Online Terminology Database system.
Construction simply involves creating object instances of the generated TBX
classes and assigning values to member variables. When such object instances are
serialized via the .NET framework's XML serializer, the resulting XML conforms to
the original TBX schema. The following shows the generated C# class for the top-level
TBX termEntry element.
[System.Xml.Serialization.XmlRootAttribute(Namespace="", IsNullable=false)]
public class termEntry
{
    /// <remarks/>
    public noteText_impIDLangTypTgtDtyp descrip;

    /// <remarks/>
    [System.Xml.Serialization.XmlElementAttribute]
    public descripGrp[] descripGrp;

    /// <remarks/>
    public noteText_impIDLangTypTgtDtyp admin;

    /// <remarks/>
    public adminGrp adminGrp;

    /// <remarks/>
    public transacGrp transacGrp;

    /// <remarks/>
    public noteText_impIDLang note;

    /// <remarks/>
    public @ref @ref;

    /// <remarks/>
    public xref xref;

    /// <remarks/>
    [System.Xml.Serialization.XmlElementAttribute("langSet")]
    public langSet[] Items;

    /// <remarks/>
    [System.Xml.Serialization.XmlAttributeAttribute(DataType="ID")]
    public string id;
}

[1] http://www.lisa.org/tbx
[2] http://msdn.microsoft.com/library/en-us/cptools/html/cpconXMLSchemaDefinitionToolXsdexe.asp
b) Data Category Selection<br />
Data categories, in particular TBX's eXtensible Constraints Specification support,
also need to be available for the construction of terminological entries. However,
since there are not yet a great number of data categories used by Welsh
dictionaries, we decided that it was simpler to hardcode the placement and setting of data
categories into the TBX structure with wrapper code around the generated C# TBX code.
Thus, the wrapper code provides easy access to the superset of all fields or data
categories from all imported dictionaries.
The default selection of data categories given by TBX corresponds to most fields
used in previous dictionaries. Newly utilised categories from ISO 12620 would aid in
the machine processing of terms. For example, the SortKey is used to implement
Welsh sort order for all exported lists or dictionaries. Some data categories
were created in order to store efficiently the data of the typical Welsh/English dictionary
entry given earlier.
• Welsh Part-Of-Speech (WelshPartOfSpeech)<br />
The standard picklist of values for representing part-of-speech for Welsh terms<br />
could be maintained with the addition of the WelshPartOfSpeech data category.<br />
• Welsh Plural (WelshPlural)<br />
Further grammatical information for a term such as the plural can be stored under<br />
this data category.<br />
• Disambiguator (concept-disambiguator)
A simple disambiguating field containing a brief explanation of the context of a term
had been mandated by previous guidelines in cases of Welsh or English homonyms, for
example:
primate (=bishop) archesgob
primate (=monkey) primat
The default language for the concept-disambiguator data category is English.
However, when a Welsh language disambiguator needs to be included, this would be
contained within the appropriate TBX XML tag.
• Dictionary (originatingDictionary)<br />
The dictionaries from which a term originates can be noted in the TBX representation<br />
via the use of this new category.<br />
• Responsible Organisation (responsibleOrganisation)<br />
The organisation responsible for the standardisation of the term can be noted in<br />
this new category. The additional data categories specification is given in the extract<br />
below from TBXDv04_CY.xml:<br />
[The XML mark-up of this XCS extract was lost in conversion. Its surviving content
specifies data categories at the termEntry level, a part-of-speech picklist with the
values adf ans be eb eg eg/b ell adj n v, and further categories applicable at the
langSet, termEntry and term levels.]
An example of a TBX XML string with Welsh data categories in action (the element
mark-up below is reconstructed, as the original tags were lost in conversion):

<termEntry>
  <descrip type="concept-disambiguator">(of person)</descrip>
  <admin type="elementWorkingStatus">consolidatedElement</admin>
  <langSet lang="en">
    <ntig>
      <termGrp>
        <term>home help</term>
        <termNote type="termType">entryTerm</termNote>
        <termNote type="grammaticalNumber">singular</termNote>
        <termNote type="sortKey">home help$sf$(of person)$sf$</termNote>
      </termGrp>
    </ntig>
  </langSet>
  <langSet lang="cy">
    <ntig>
      <termGrp>
        <term>cynorthwyydd cartref</term>
        <termNote type="termType">entryTerm</termNote>
        <termNote type="WelshPartOfSpeech">eg</termNote>
        <termNote type="WelshPlural">cynorthwywyr cartref</termNote>
        <termNote type="grammaticalNumber">singular</termNote>
        <termNote type="sortKey">cynorthwyydd cartref$sf$$sf$eg</termNote>
      </termGrp>
    </ntig>
  </langSet>
</termEntry>
c) Database Design<br />
The TBX XML representation needs to be stored in a relational database. Database
schemas can usually be generated from XML Schemas. However, since the Welsh National
Online Terminology Database has the simple purpose of being a storage facility for the
dissemination and presentation of standardised, and thus fixed or published, terminology
data in various formats, a sufficient yet effective solution is to store the entire
TBX representation in the database as a string. Thus, the database design contains a
single table holding all terminological entries.
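This design can be sketched as follows. The sketch is illustrative only (the project itself runs on SQL Server, and the TBX string here is a placeholder); the sample values mirror the example entry in Table 3 below.

```python
import sqlite3

# A sketch of the two-table design: the full TBX string lives in
# TermEntries, and TermIndex holds one row per term with an XPath
# into its containing TBX string. Illustrative only; the project
# itself uses SQL Server.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE TermEntries (TermEntry_ID INTEGER PRIMARY KEY, TermEntry TEXT)")
db.execute("""CREATE TABLE TermIndex (
                  TermIndex_ID    INTEGER PRIMARY KEY,
                  TermEntry_ID    INTEGER,
                  Language_ID     TEXT,
                  Term            TEXT,
                  TermEntry_XPath TEXT,
                  SortKey         TEXT)""")
db.execute("INSERT INTO TermEntries VALUES (?, ?)",
           (1328, "<termEntry>...</termEntry>"))   # placeholder TBX string
db.execute("INSERT INTO TermIndex VALUES (?, ?, ?, ?, ?, ?)",
           (2614, 1328, "cy", "golwg grŵp",
            "//termGrp[@id='tid-tg-2554-cy-1']", "golwg grwp$sf$$sf$eb"))

# A search joins the index back to the stored TBX strings, ordered by
# the Welsh sort key.
rows = db.execute("""SELECT e.TermEntry
                     FROM TermIndex i
                     JOIN TermEntries e ON e.TermEntry_ID = i.TermEntry_ID
                     WHERE i.Language_ID = 'cy' AND i.Term = 'golwg grŵp'
                     ORDER BY i.SortKey""").fetchall()
```

The index table gives relational access to terms while the authoritative TBX strings remain opaque to the database.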
Table 1: Database Table Design for Storing the Terminological Entries

TermEntries
PK  TermEntry_ID
    TermEntry
Term data within XML strings is not accessible in a relational manner for searching,
querying, and so forth. Therefore, an index table is added, consisting of each
term's XPath pointing to its location in the containing TBX string stored under
TermEntry_ID.
Table 2: Database Table Design for Index to Terminological Entries

TermIndex
PK  TermIndex_ID
PK  TermEntry_ID
PK  Language_ID
    Term
    TermEntry_XPath
    SortKey

Table 3: An Example Entry in the TermIndex Table

TermIndex        Value
TermIndex_ID     2614
TermEntry_ID     1328
Language_ID      cy
Term             golwg grŵp
TermEntry_XPath  //termGrp[@id='tid-tg-2554-cy-1']
SortKey          golwg grwp$sf$$sf$eb

4.3 TBX transformations
Presenting and transforming the TBX strings in the database into various other
formats such as HTML, CSV, and so forth involves simple XSLT transformations.
• CSV<br />
A review of the import process of commercial termbase products used by Welsh<br />
translators showed that, in one way or another, all of them support importing<br />
terminology data from CSV formatted files.<br />
The following XSLT creates English-to-Welsh CSV data for the term
'group view' (XPath = termGrp id="tid-tg-2554-en-1"):

[The XSLT listing was lost in conversion to this format.]
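Purely as an illustration of the same transformation, the following Python sketch extracts each Welsh translation of an English term and emits CSV lines. The TBX fragment is simplified and assumed (a plain lang attribute is used in place of xml:lang), as is the disambiguator value; the project itself performs this step with XSLT.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Illustrative only: the project performs this step with XSLT. The TBX
# fragment is simplified and assumed (plain lang attribute, invented
# disambiguator value).
tbx = """
<termEntry>
  <descrip type="concept-disambiguator">(of application)</descrip>
  <langSet lang="en"><tig><term>group view</term></tig></langSet>
  <langSet lang="cy"><tig><term>golwg grŵp</term></tig></langSet>
</termEntry>
"""

root = ET.fromstring(tbx)
disambig = root.findtext("descrip[@type='concept-disambiguator']", default="")
english = [t.text for ls in root.findall("langSet[@lang='en']") for t in ls.iter("term")]
welsh = [t.text for ls in root.findall("langSet[@lang='cy']") for t in ls.iter("term")]

# One CSV line per English/Welsh term pair for the concept.
out = io.StringIO()
writer = csv.writer(out)
for en in english:
    for cy in welsh:
        writer.writerow([en, disambig, cy])
```

As in the text, each output line pairs an English term with one Welsh term for the same concept, carrying the concept-disambiguator value alongside.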
The sample outputs CSV lines that provide each Welsh term translation for a specific
English term for a specific concept (and concept-disambiguator value).
A specific XPath (and therefore XSLT) is required for each terminological entry<br />
transformation. This has made a negligible difference to the performance of<br />
transforming entire dictionary contents and/or multiple entries.<br />
• Trados<br />
Trados' termbase product MultiTerm imports terms marked up in its own MultiTerm
XML format, as well as CSV. Importing terms from CSV files in most translation memory
systems (including Trados) is quite a manual and complex task for some translators,
despite most having wizards to help in mapping CSV fields to their termbase fields
and structures. Fortunately, Trados MultiTerm has its own XML mark-up language to
facilitate importing terminology. Although not entirely compliant with TMF or TBX,
MultiTerm XML does follow the TMF meta structure. Data category selection and
definition is described in XML Termbase Definition files (with extension .xdt).
Thus, the Welsh National Online Terminology Database employs a simple XSLT to
construct MultiTerm XML from TBX by mapping data categories in identical structures,
and provides a fixed accompanying .xdt file.
[The XSLT extract for constructing MultiTerm XML from TBX, and its accompanying
listing, were lost in conversion to this format.]

• Deja Vu
Deja Vu's Termbase can import terminology from a number of simple formats, for
example CSV text, Excel, and Access. It does not support the import of terminological
data in any flavour of XML.
Its import process contains target templates that are similar in structure and data
categories to other common termbases such as Eurodicautom and TBX. However,
the user is left with the manual task of mapping the CSV/Excel flat structure to
the destination template. Simplifying the import process is under continued
investigation.
• WordFast<br />
WordFast does not support CSV, but its terminology support is a simple bilingual
glossary text file, with a source column and a target column, that is easily loaded into
the Wordfast toolbar. The XSLT transformation for exporting to this format is trivial, as
the format is a subset of CSV.
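As an illustration of how simple the target format is, the following sketch (ours, with invented term pairs) writes such a two-column glossary:

```python
# WordFast's glossary is a two-column, tab-separated text file; this
# sketch builds one from (source, target) term pairs. Pairs invented
# for illustration.
pairs = [("group view", "golwg grŵp"), ("home help", "cynorthwyydd cartref")]
glossary = "\n".join(f"{src}\t{tgt}" for src, tgt in pairs)
```

Each line holds one source term and its target, separated by a tab.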
• Cysgeir<br />
Cysgeir is a Welsh/English dictionary software product produced by the e-Welsh
team that incorporates many Welsh NLP features, such as a lemmatizer to aid in the
search for Welsh words and terms. Cysgeir installs onto Windows PCs and integrates
with popular Office productivity software and some translation memory packages, such
as WordFast. Cysgeir supports loading terminology and browsing multiple dictionaries
described in Cysgeir's own proprietary formatted dictionary files. Integrating
functionality for exporting Cysgeir dictionary files from TBX descriptions stored in the
Welsh National Online Terminology Database system is under development.
• Raw TBX<br />
Simple ASP.NET code can expose the raw TBX XML strings to other online terminology
systems from specially crafted URLs. An example URL supported by the Welsh National
Online Terminology Database is:
http://www.e-gymraeg.co.uk/bwrdd-yr-iaith/termau/tbx.aspx?en=view.
The result returned is merely the concatenation of all TBX strings found in the<br />
database containing the English term ‘view’. The system adds header information<br />
along with a link to the data category selection described in our eXtensible Constraints<br />
Specification file - TBXv04_CY.xml.<br />
4.4 Current State<br />
The national online database has been online since August 2005 and was launched
with standardised terminology for Information Technology. The URL is
http://www.e-gymraeg.org/bwrdd-yr-iaith/termau. Despite being a very simple use and
implementation of the TMF on a Microsoft .NET/ASP.NET/SQL Server platform, it
has already proven to be effective in its support of Welsh translators and software
localisers. The website provides a simple search panel and presents results to the
right of the screen, as seen in the screenshot:<br />
Figure 2: Screenshot of Search Results from the Welsh National Online Terminology
Database<br />
During a period of ten weeks between August and mid-October 2005, over six
hundred queries were made to this one dictionary, and it has been downloaded in its
entirety as a CSV file over thirty times, with twenty downloads requesting Welsh as
the source language. The work of importing other pre-existing electronic dictionaries
is about to begin and two new dictionaries covering ecology and retail will soon be<br />
added.<br />
5. Perspectives and Conclusion<br />
This paper has illustrated, for the interest of other lesser-resourced languages,
a very simple implementation of online dissemination of terminology and adoption
of the TMF standards. International standards have helped to steer terminology
standardisation for Welsh (despite it being a lesser-spoken and lesser-resourced
language) on a productive and sound course. Adoption of these standards in the past
facilitated the creation of many dictionaries that, in the future, would prove easy to
include in the Welsh National Online Terminology Database. Frugal reuse of existing
resources is key to the development of language technology for lesser-resourced
languages such as Welsh.
In the meantime, the scope of ISO TC37 has expanded to cover the principles,<br />
methods and applications related to terminology, language resources and knowledge<br />
organisation for the multilingual information society. A family of standards is being
developed with common principles that deal with lexicons, terminology, morpho-<br />
syntactic annotation, word segmentation and data category management. Therefore,<br />
we consider that continued gradual adoption of the ISO standards on terminology and<br />
lexicography will maximise reuse levels of linguistic data and software components<br />
even further.<br />
Parallel work also conducted by the e-Welsh team within the LEXICELT Welsh/<br />
Irish dictionary project, where the ISO lexicography standard (the Lexical Mark-up<br />
Framework ISO/CD 24613) has been used, identified similarities with the ISO TMF, and<br />
opportunities for reusability and merging of language resources.<br />
LMF aims to be a standard for MRD and NLP lexicons and therefore defines,
in a similar way to the TMF, a meta structure along with mechanisms for data category
selection from metadata registries. Nevertheless, whereas TMF is concept-oriented,
LMF is lexeme-based. Most existing lexical standards are said to fit with the LMF, such
as the Text Encoding Initiative (TEI), OLIF, CLIPS, WordNet, and FrameNet.
Ultimately, the same structure and content can be used for a number of different<br />
purposes, from speech technology and printed dictionaries, to machine translation,<br />
where, at the moment, individual standalone application specific standards exist.<br />
Data from interoperable terminology and lexicography resources would not only<br />
enrich each other, but also use the same software components and systems.<br />
A bridge identified by many between terminological and lexical meta-models links<br />
a term entry in the TMF to the lexeme entry in the LMF. Thus, either of the two<br />
database implementations could be linked, or a super structure could be constructed<br />
containing a superset of term entry and sense entry data categories organised on<br />
concept or lexeme basis.<br />
Figure 3: Bridging ISO's TMF and LMF Meta-Model Structures
Linking lexical data and terminological data can improve the richness of a term’s<br />
grammatical information beyond the capabilities of the TMF. As already noted above,
such a link would aid the Welsh terminology standardisation process. In
adhering to ISO 860 recommendations, term candidates are evaluated for inflected<br />
forms. Such inflected forms may exist already under the corresponding lexeme in its<br />
Form->Inflections section, or may be simply added to improve the lexical database.<br />
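The bridge described above can be sketched as follows; the attribute and class names are illustrative, taken from neither standard.

```python
from dataclasses import dataclass

# A sketch of the TMF/LMF bridge: a term section holds a reference to
# the identifier of an LMF lexeme entry, so inflected forms can be
# shared. Names are illustrative, from neither standard.

@dataclass
class LexemeEntry:                     # LMF side
    lexeme_id: str
    lemma: str
    inflections: tuple = ()

@dataclass
class TermSection:                     # TMF side
    term: str
    lexeme_ref: str = ""               # the bridge between the models

lexicon = {"lex-cy-0042": LexemeEntry("lex-cy-0042", "cynorthwyydd",
                                      ("cynorthwywyr",))}
term = TermSection("cynorthwyydd cartref", lexeme_ref="lex-cy-0042")
plural = lexicon[term.lexeme_ref].inflections[0]
```

A term entry thus reuses the inflected forms recorded once under the corresponding lexeme, rather than duplicating them.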
Larger opportunities for reuse exist if we realise that the adoption of standards in
Welsh terminology could go beyond merely standardised translations, in particular
by adopting advanced aspects of the standards such as concept modelling.
Classification of terms and their finer-grained organisation into a conceptual<br />
organisation may offer users of the Welsh National Online Terminology Database system
better discovery and comprehension of terms in Welsh than in English. It may even<br />
contribute to improving English terminology standardisation, since many organisations<br />
and/or terminology dictionaries have different terms representing identical concepts<br />
that hamper and frustrate cooperation.<br />
LMF supports translation by linking senses in the same way that TMF’s term entries<br />
are linked as subordinates to the same concept entity. Simple, direct and bidirectional<br />
links between senses in respective LMF entries may be sufficient for simple bilingual<br />
dictionaries, but more complicated and multilingual lexical dictionaries require an<br />
interlingual concept system to handle the various levels of concept precision. Thus,<br />
further opportunities for the reuse of software components for concept modelling<br />
exist.
Such observations and gradual expansion of international standards adoption for<br />
Welsh language technology can only further ‘future-proof’ Welsh against all future<br />
developments and applications in language technology such as semantic web,<br />
knowledge bases, and machine translation.<br />
6. Acknowledgements<br />
The Welsh National Online Terminology Database project is funded by the Welsh
Language Board.<br />
References<br />
“Cronfa Genedlaethol o Dermau.” Online at http://www.e-gymraeg.org/bwrdd-yr-iaith/termau.
“Déjà Vu.” Online at http://www.atril.com/default.asp.
“Microsoft .NET framework - xsd.exe.” Online at
http://msdn.microsoft.com/library/en-us/cptools/html/cpconXMLSchemaDefinitionToolXsdexe.asp.
ISO/TR 12618:1994 Computational aids in terminology – Creation and use of<br />
terminological databases and text corpora. (TC37/SC3).<br />
ISO 860:1996 Terminology work – Harmonization of Concepts and Terms,<br />
(TC37/SC1).<br />
ISO 12620:1999 Computer Applications in Terminology – Data Categories<br />
(TC37/SC3).<br />
ISO 704:2000 Terminology Work – Principles and Methods (TC37/SC1).<br />
ISO 16642:2003 Computer Applications in Terminology - Terminological Markup<br />
Framework (TC37/SC3).<br />
ISO/CD 24613 Language Resource Management – Lexical Markup Framework (LMF)
(TC37/SC4).
Prys, D. (2003). “Setting the Standards: Ten Years of Welsh Terminology Work.”<br />
Proceedings of the Terminology, Computing and Translation Conference,<br />
Swansea University, March 27-28, 2004. Swansea: Elsevier.
Prys, D. & Jones, J.P.M. (1998). The Standardization of Terms Project. Report prepared<br />
for the Welsh Language Board.<br />
“TBX TermBase eXchange.” Online at http://www.lisa.org/tbx.
“Trados.” Online at http://www.trados.com/.
“Wordfast.” Online at http://www.wordfast.net/.
SpeechCluster: A Speech Data Multitool<br />
Ivan A. Uemlianin<br />
When collecting and annotating speech data (to build a database, for example),
speech researchers face a number of obstacles. The most obvious of these is the
sparseness of data, at least in a usable form. A less obvious obstacle, but one that is<br />
surely familiar to most researchers, is the plethora of available tools with which to<br />
record and process the raw data. Some example packages include: EMU, Praat, SFS,<br />
JSpeechRecorder, Festival, HTK, and Sphinx. Although, prima facie, an embarrassment<br />
of riches, each of these tools proves to address a slightly different set of problems,<br />
to be slightly (or completely) incompatible with the other tools, and to demand a<br />
different area of expertise from the researcher. At best, this is a minor annoyance; at<br />
worst, a project must expend significant resources to ensure that the necessary tools<br />
can interoperate. As this work is no doubt repeated in unrelated projects around the<br />
world, an apparently minor problem becomes a possibly major - and undocumented<br />
- drag on progress in the field. This danger is especially extreme in research on<br />
minority and lesser-spoken languages, where a lack of resources or expertise may<br />
completely preclude research. Researchers need some way of abstracting from all<br />
these differences, so they can conduct their research. The simplest approach would<br />
be to provide an interface that can read and write the existing formats, and provide<br />
other facilities as required.<br />
On the WISPR project-- developing speech processing resources for Welsh and Irish-<br />
- we have adopted this approach in developing SpeechCluster. The intention behind<br />
SpeechCluster is to enable researchers to focus on research rather than file conversion<br />
and other low-level, but necessary preprocessing. SpeechCluster is a freely available<br />
software package, released and maintained under an open-source license.<br />
In this paper, we present SpeechCluster (reviewing the requirements it addresses and<br />
its overall design), we demonstrate SpeechCluster in use, and finally, we evaluate its<br />
impact on our research and outline some future plans.<br />
1. The Context<br />
Lesser-used languages (LULs) are often lesser-resourced languages. Majority<br />
languages have wealthy patron states, with the money and the labour force to develop<br />
language resources as required. For example, Microsoft alone has over 6000 hours of<br />
US English speech data at its disposal (Huang 2005). Patron organisations of lesser-
used languages often do not have access to such power, and they must use their<br />
resources wisely.<br />
Research and development in language technology brings many stimulating<br />
challenges. With LULs especially, these challenges may include considerations about<br />
the status and use of the language (users and patrons of the language are likely to<br />
take an active interest, and often language technology research can become part of<br />
the life of the language itself).<br />
Research and development in language technology also brings a great deal of<br />
tedious labour. Data must be collected and archived, and there are several layers of<br />
processing that need to be done before any ‘interesting’ R&D can begin. Often, the<br />
physical forms of the storage and the processing tools (the file formats and software
implementations) provide obstacles of their own.
Since these obstacles are contingent upon the machinery rather than the research<br />
problem, they are often categorised as ‘chores’ and tackled quite differently to other<br />
tasks on the project. At worst, these obstacles will be tackled manually; at best,<br />
scripts will be written ad hoc as the need arises, to be discarded (or ‘archived’) at<br />
the end of the project. These approaches are inefficient, especially when compared<br />
to the conscious and investigative approach taken with other parts of the project.<br />
Resources are wasted, and specialists can spend significant portions of their time<br />
involved with inappropriate (and more importantly, unpleasant) activities.<br />
In the speech research department of a large corporation, the costs associated<br />
with this waste can be passed on to the customer; in smaller research establishments,<br />
these costs may preclude speech research altogether.<br />
2. Our Problem<br />
Our goals on the WISPR project are to develop speech technology resources for<br />
Welsh and Irish that we can make freely available to commercial companies. There<br />
are currently no such resources at all for Irish, and very limited resources for Welsh
(language resources available for Welsh include two text resources: CEG, a 1-million-word
balanced and tagged corpus (Ellis et al. 2001), and a large collection of webpages
(Scannell 2004), both of which are for non-commercial use only; a telephone speech
corpus (Jones et al. 1998); and a small, experimental speech database (Williams
1999)).
The project must therefore begin from the bottom, starting with data collection<br />
and annotation, moving on to developing necessary speech databases, acoustic models<br />
(AMs) and language models (LMs), and, finally, developing packaged software artefacts<br />
that can be used by external developers. With limited resources of time, money and<br />
labour, every administrative chore added to the workflow reduces resources available<br />
for more delivery-oriented tasks.<br />
The first decision taken on the problem of ‘chores’ was that the solution should be a
Speech Processing Resource in its own right. The solution should consist of a set of<br />
reusable, extensible, shareable tools to be made available to (a) ourselves on future<br />
projects, and (b) other teams working on speech processing projects around the<br />
world.<br />
2.1 Requirements<br />
The main design goals of our solution are as follows:<br />
• researchers should be able to work independently of data format<br />
restrictions;<br />
• necessary, complicated, but uninteresting tasks should be automated;<br />
• interesting, but complicated tasks should be made simple;<br />
• researchers should be able to address linguistic problems with linguistic<br />
solutions;<br />
• the toolkit should be increasingly simple to maintain and develop; and,<br />
• the toolkit should encourage its own use and development.<br />
• Researchers should be able to work independently of data format<br />
restrictions.<br />
Data can be collected, transcribed and stored in a range of formats. Each of the<br />
range of available tools for language technology research accepts or generates data<br />
in its own format, or in a limited range of standard formats. Researchers should not<br />
have to worry about which format works with which application: they should be able<br />
to pick the application necessary (or preferred) for the research problem, and the<br />
data should be readily accessible in the correct format.<br />
• Necessary, complicated, but uninteresting tasks should be automated.<br />
This applies to life in general, of course.<br />
• Interesting but complicated tasks should be made uncomplicated.<br />
Due to the nature of the field, where it is often necessary to process large sets of<br />
data, many of the more interesting problems (e.g. building AMs for speech recognition)<br />
involve procedures that are repetitive (e.g. those that have to be applied to every<br />
item in a corpus) or complicated (e.g. initialising a system). Researchers learning<br />
about a new area are hampered when these tasks dominate learning time.<br />
• Researchers should be able to address linguistic problems with linguistic<br />
solutions.<br />
Often, a linguistic problem, or a problem initially described in language terms<br />
(e.g. retranscribing the data using a different phoneset) has to be redescribed in<br />
programming terms before it can be addressed. Problems should be addressable in the<br />
terms in which they occur.<br />
• The toolkit should be increasingly simple to maintain and develop.<br />
Over its lifetime, any toolkit increases in functionality: new problems occur and<br />
new tasks become possible. If extensions are increasingly difficult to implement, the<br />
toolkit eventually disintegrates (e.g. into a library of loosely-related scripts), becomes<br />
impossible to maintain, and falls into disuse. A well-designed toolkit can avoid this<br />
fate.<br />
• The toolkit should encourage its own use and development.<br />
It should be preferable to use the toolkit than to revert to the bad old ways.<br />
Nevertheless, further use of the toolkit should stimulate researchers to confront it<br />
with new problems, and to think of new areas in which the toolkit might be used.<br />
If possible, the toolkit should be extensible by the researchers themselves, rather<br />
than having to rely on a separate maintainer. In this case, the design of the toolkit<br />
should promote the writing of readable, reusable code.<br />
3. A Solution<br />
3.1 Introduction<br />
Our first (and current) attempt at a software artefact that meets these requirements<br />
is the SpeechCluster package (Uemlianin 2005a). SpeechCluster is a collection of small<br />
programs written in a programming language called Python. Python has a very clear,<br />
readable syntax, and is especially suited for projects with several programmers, or<br />
with novice programmers. As such, it suited our aim of encouraging non-programmers<br />
to write their own tools.<br />
The SpeechCluster package consists of a main SpeechCluster module with the basic<br />
API, and a number of modules that can be used as command line tools. The tools are<br />
intended to be used as such, but they can also be used as ‘examples’, or as a basis for<br />
customisation or further programming with SpeechCluster.<br />
Table 1 lists the tools currently available as part of SpeechCluster. Below,<br />
we look at two of these in more detail before exploring the use of SpeechCluster as an<br />
API. Finally, we look at SpeechCluster in a larger system.<br />
Table 1: SpeechCluster command-line tools<br />
Tool Function<br />
segFake ‘Fake autosegmentation’ of a speech audio file<br />
segInter Interpolates labels into a segmented but unlabelled segment tier<br />
segMerge Merges separate label files<br />
segReplace Converts labels in a label file<br />
segSwitch Converts label file format<br />
splitAll Splits audio/label file pairs<br />
3.2 Using SpeechCluster I: The Tools<br />
a) SegSwitch<br />
SegSwitch is a label file format converter. It converts label files between any of the<br />
formats supported by SpeechCluster (currently, Praat TextGrid, esps and the various<br />
HTK formats [i.e., the simple .lab format and the multi-file .mlf format]). This kind of<br />
format conversion is a very common task. For example, HTK requires files to be in its<br />
own esps-like format, but our team prefers to handlabel files in Praat, which outputs<br />
its own TextGrid format. Festival uses an esps-like format that is slightly different<br />
from HTK’s.<br />
SegSwitch has a simple command-line interface (see Table 2), with which single files<br />
or whole directories can be converted easily and reliably.<br />
Table 2: segSwitch usage<br />
Usage:<br />
segSwitch -i &lt;input file&gt; -o &lt;output file&gt;<br />
segSwitch -d &lt;input directory&gt; -o &lt;output format&gt;<br />
Examples:<br />
segSwitch -i example.lab -o example.TextGrid<br />
segSwitch -d labDir -o textGrid<br />
A simple facility like this has a remarkable effect on the efficiency of a team. The<br />
team no longer has to worry about which file format they have to work in. They can<br />
concentrate on the research task, converting files in and out of particular formats as<br />
needed. In a sense, the two parts of the work (the research and the bookkeeping)<br />
have been separated, and the bookkeeping is done by the tools. This division of labour<br />
is repeated between the tools and the SpeechCluster module itself. As much of the<br />
low-level data manipulation as possible is carried out by SpeechCluster, so that the<br />
tools themselves can be written in simple, task-oriented terms.<br />
Table 3 shows the main code for segSwitch (excluding the command-line parsing and<br />
the loop over files in a directory): all of the work of file format conversion is done by<br />
the code shown. Looking past the Python syntax, this code is a direct implementation<br />
of an intuitive statement of the task (see Table 4).<br />
Table 3: Simplified Python code for segSwitch<br />
Line Code<br />
1 from speechCluster import *<br />
2<br />
3 def segSwitch(inFn, outFn):<br />
4 """<br />
5 Args: string inFn: input filename<br />
6 string outFn: output filename<br />
7 Returns: None<br />
8 Uses filename extensions to determine input<br />
9 & output formats.<br />
10 """<br />
11 spcl = SpeechCluster(inFn)<br />
12 ofext = os.path.splitext(outFn)[1]<br />
13 outFormat = SpeechCluster.formatDict[ofext.lower()]<br />
14 out = spcl.write_format(outFormat)<br />
15 open(outFn, 'w').write(out)<br />
Table 4: segSwitch task statement<br />
Line(s) Task<br />
11 read in an input file<br />
12-13 work out from the output filename what the output format should be<br />
14 generate the output format data<br />
15 write the data out to a file, using the output filename given.<br />
All of the hard programming is hidden in the SpeechCluster module, which is<br />
imported in line 1, and which provides useful facilities like formatDict(ionary) and<br />
write_format(format).<br />
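To give a flavour of these facilities, formatDict presumably maps filename extensions to internal format names, along the following lines; the exact keys and values here are our assumption, not SpeechCluster's actual table:

```python
import os.path

# Hypothetical sketch of an extension-to-format lookup table, in the
# spirit of SpeechCluster.formatDict (the real contents may differ).
formatDict = {
    '.textgrid': 'TextGrid',  # Praat label files
    '.lab': 'htk-lab',        # HTK single-file labels
    '.mlf': 'htk-mlf',        # HTK master label files
    '.esps': 'esps',          # esps-style labels
}

# The lookup mirrors lines 12-13 of Table 3: take the output filename's
# extension, lowercase it, and look up the target format.
ext = os.path.splitext('example.TextGrid')[1]
print(formatDict[ext.lower()])
```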
b) SplitAll<br />
One of the special features of SpeechCluster is that it will treat a sound file (i.e.,<br />
recorded speech) and its associated label file as a pair, and can manipulate them<br />
together. SplitAll shows this in action.<br />
SplitAll addresses the problem of the researcher who requires a long speech file to<br />
be split into smaller units along with its associated label file; for example, one may<br />
require a long utterance containing pauses to be split into its constituent phrases. Of<br />
course, data can be recorded or segmented into shorter units before it is labelled, but<br />
when data is re-used, its requirements often change.<br />
This task is a minor inconvenience if you just have one or two files, but if you have<br />
five hundred (or even just fifty) it becomes important to automate it. Furthermore, it<br />
would be better psychologically if the researcher could envisage this as a single task,<br />
rather than two related tasks (i.e., splitting the wav file; then splitting the label file<br />
to match). The best option is to delegate the task to a machine.<br />
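At the label level, the heart of such a split is easy to state: cut the segment sequence wherever a silence label occurs. A self-contained sketch of just that step (splitAll itself, of course, also cuts the corresponding audio to match; 'sil' as the silence label is our assumption):

```python
# Split a list of (label, start, end) segments at silence labels.
# 'sil' as the silence label is an assumption for illustration.
def split_at_silence(segments, sil='sil'):
    chunks, current = [], []
    for seg in segments:
        if seg[0] == sil:
            # Close the current chunk, if any, and skip the silence.
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(seg)
    if current:
        chunks.append(current)
    return chunks

tier = [('sil', 0.0, 0.3), ('m', 0.3, 0.4), ('ai', 0.4, 0.6),
        ('sil', 0.6, 0.9), ('hh', 0.9, 1.0)]
print(split_at_silence(tier))  # two chunks, split at the silences
```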
As with segSwitch, splitAll has a simple command-line interface (see Table 5).<br />
Table 5: splitAll usage<br />
Usage<br />
splitAll -n &lt;number&gt; -t &lt;Phone | Word | Second&gt; [-l &lt;label&gt;]<br />
inDir outDir<br />
Example Splits into<br />
splitAll -n 5 -t Phone inDir outDir 5 phone chunks<br />
splitAll -n 1 -t Word inDir outDir single words<br />
splitAll -n 5 -t Second inDir outDir 5s chunks<br />
splitAll -n 1 -t Phone -l sil inDir outDir split by silence<br />
SplitAll is intended to be used on directories of files and can process hundreds of<br />
speech/label file pairs in moments. Again, the effect is to separate the researcher<br />
from the drudgery of looking after files.<br />
Apart from a function that parses the command-line parameters into the variable<br />
splitCriteria, the code for splitAll is just as simple as that for segSwitch (see Table<br />
6). The excerpt seen here loops through the filestems in a directory, a filestem being<br />
a filename without its extension (e.g. example.wav and example.lab have the same<br />
filestem 'example'). Line 8 generates a speechCluster from a filestem: this means<br />
that all files with the same filestem (e.g. a wav file and a lab file) are read into the<br />
one speechCluster. Line 9 then calls split, saving the results into the given output<br />
directory.<br />
Table 6: Simplified Python code for splitAll<br />
Line Code<br />
1 from speechCluster import *<br />
2<br />
3 def splitAll(splitCriteria, inDir, outDir):<br />
4 stems = getStems(inDir)<br />
5 for stem in stems:<br />
6 fullstem = '%s%s%s' % (inDir, os.path.sep, stem)<br />
7 print 'Splitting %s.*' % fullstem<br />
8 spcl = SpeechCluster(fullstem)<br />
9 spcl.split(splitCriteria, outDir)<br />
This codewalk tells you nothing about how SpeechCluster.split(splitCriteria) works,<br />
but that’s the point. The SpeechCluster module provides facilities like split() that<br />
allow the researcher to phrase problems and solutions in task-oriented terms rather<br />
than programming-oriented terms.<br />
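One small gap in the codewalk: Table 6 relies on a helper, getStems, which is not shown. A minimal sketch of what such a helper might look like (the real SpeechCluster version may differ in detail):

```python
import os

def getStems(inDir):
    # Collect the unique filestems in a directory: 'example.wav' and
    # 'example.lab' both contribute the filestem 'example'.
    stems = set()
    for fn in os.listdir(inDir):
        stems.add(os.path.splitext(fn)[0])
    return sorted(stems)
```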
3.3 SpeechCluster as an API<br />
The two main design features of SpeechCluster are:<br />
1. it stores segmentation details internally in an abstract format; and,<br />
2. it can treat an associated pair of sound and label files as a unit.<br />
In terms of the facilities SpeechCluster provides, these features translate into the<br />
methods shown in Table 7.<br />
Table 7: SpeechCluster methods<br />
Interface (i.e. read/write) methods<br />
read_format(fn) write_format(fn)<br />
read_ESPS(fn) write_ESPS(fn)<br />
read_HTK_lab(fn) write_HTK_lab(fn)<br />
read_HTK_mlf(fn) write_HTK_mlf(fn)<br />
read_HTK_grm(fn) write_HTK_grm(fn)<br />
read_stt(fn) write_stt(fn)<br />
read_TextGrid(fn) write_TextGrid(fn)<br />
read_wav(fn) write_wav(fn)<br />
Methods for manipulating label and sound files (and label/sound file pairs)<br />
merge(other)<br />
replaceLabs(replaceDict)<br />
setStartEnd(newStart, newEnd)<br />
split(splitCriteria, saveDir, saveSegFormat)<br />
When using SpeechCluster as a library, the researcher/developer can program in the<br />
linguistic terms of the problem, rather than in the terms of the programming language.<br />
There is documentation available (Uemlianin 2005a), and the Python pydoc facility<br />
allows the researcher/developer to access documentation ‘interactively’ (see Figure<br />
1).<br />
3.4 Using SpeechCluster II: Making a New Script<br />
Although there are tools provided as part of the SpeechCluster package, the<br />
SpeechCluster module itself presents an accessible face, and it is hoped that<br />
researcher/developers are able to use SpeechCluster to build new tools for new<br />
problems.<br />
a) SegFake<br />
SegFake provides an example of using SpeechCluster to help write a script to<br />
address a specific problem. Handlabelling is passé: it is laborious, tedious and<br />
error-prone; but sometimes researchers in LULs have to do it. If there are no AMs<br />
to do time-alignment, there seems to be no alternative to labelling the files by hand.<br />
When labelling prompted speech (e.g. recited text), the phone labels are<br />
more-or-less given (i.e., from a phonological transcription of the text). The labeller<br />
is not really providing the labels, only the boundary points. It would be helpful if the<br />
task could be reduced to specifying phone boundaries in a given label file. In other<br />
words, if the task could be divided between SpeechCluster and the human:<br />
SpeechCluster generates a label file in a requested format with approximate times,<br />
and the human corrects it.<br />
This was the idea behind segFake. SegFake detects the end-points of the speech in<br />
the wav file (currently it assumes a single continuous utterance) and evenly spreads a<br />
string of labels across the intervening time. A resulting TextGrid is shown in Figure 1.<br />
Clearly, the probability of any boundary being correct approaches zero, but the task<br />
facing the human labeller has been substantially simplified.<br />
Fig. 1<br />
We can phrase a more explicit description of the problem (see Table 8); once the<br />
problem has been thus specified, translating it into Python is simple (see Table 9), and<br />
then this tool can be used from the command line (see Table 10). segFake results,<br />
viewed in Praat, are shown in Figure 2.<br />
Table 8: Pseudocode representation of the segFake problem<br />
GIVEN: wav file, list of N labels<br />
in the wav file, identify endpoints of speech: START, END<br />
T = END – START<br />
L = T / N<br />
Specify label boundaries, starting at START and incrementing by L<br />
Fig. 2<br />
Table 9: segFake in Python<br />
Line Code<br />
1 def fakeLabel(fn, phoneList, tierName='Phone', outFormat='TextGrid'):<br />
2 seg = SpeechCluster(fn)<br />
3 segStart, segEnd, fileEnd = seg.getStartEnd()<br />
4 width = (segEnd - segStart)*1.0 / len(phoneList)<br />
5 tier = SegmentationTier()<br />
6 # start with silence<br />
7 x = Segment()<br />
8 x.min = 0<br />
9 x.max = segStart<br />
10 x.label = SILENCE_LABEL<br />
11 tier.append(x)<br />
12 for i in range(len(phoneList)):<br />
13 x = Segment()<br />
14 x.label = phoneList[i]<br />
15 x.min = tier[-1].max<br />
16 x.max = x.min + width<br />
17 tier.append(x)<br />
18 # end with silence<br />
19 x = Segment()<br />
20 x.min = tier[-1].max<br />
21 x.max = fileEnd<br />
22 x.label = SILENCE_LABEL<br />
23 tier.append(x)<br />
24 tier.setName(tierName)<br />
25 seg.updateTiers(tier)<br />
26 outFormat = SpeechCluster.formatDict['.%s' \<br />
% outFormat.lower()]<br />
27 return seg.write_format(outFormat)<br />
Table 10: segFake usage<br />
Usage<br />
segFake.py -f &lt;wav file&gt; -o (TextGrid | esps | htklab)<br />
&lt;phone label list&gt;<br />
segFake.py -d &lt;wav directory&gt; -t &lt;transcription file&gt;<br />
-o &lt;output format&gt;<br />
Example<br />
segFake.py -f amser012.wav -o TextGrid<br />
m ai hh ii n y n j o n b y m m y n y d w e d i yy n y b o r e<br />
segFake.py -d wav -t trans.txt -o TextGrid<br />
3.5 Using SpeechCluster III: A Bigger Example: PyHTK<br />
So far, SpeechCluster has been shown in fairly limited contexts, essentially as a file<br />
management tool to protect researchers from administrative drudgery. This was one<br />
of the key goals of SpeechCluster. We have seen this producing a quantitative effect:<br />
giving the researcher a bit more time, but not really changing the kind of work a<br />
researcher can do. The next example shows that SpeechCluster can have a qualitative<br />
effect too.<br />
The Hidden Markov Model Toolkit (HTK) (Woodland et al. 1994) is a toolkit for building<br />
HMMs, primarily for Automatic Speech Recognition (ASR), but it is also beginning to<br />
be used for research in speech synthesis or Text-to-Speech (TTS). HTK also provides<br />
facilities for the language modelling used in ASR, and is increasingly being applied to<br />
problems in other domains, such as DNA sequencing (e.g. Kin 2003). It is a de facto<br />
standard in academic speech technology research, and no doubt has similar penetration<br />
into commercial research and development, particularly with small and medium-sized<br />
enterprises (SMEs). Although it is not open-source software, it is available free of<br />
charge, and the models generated can be used commercially with no license costs.<br />
Compared with other such toolkits (e.g. Sphinx and ISIP) it is usable, powerful, and<br />
accurate. Nevertheless, it is still not easy to use. HTK is:<br />
• Difficult technically: the ideal HTK user is a computer scientist who<br />
understands HMM internals, is comfortable with the command-line, and can write<br />
supporting scripts as required; and,<br />
• Complicated and time-consuming: use of HTK involves writing long chains<br />
of heavily parametrised commands, tests, adjustments and iterations.<br />
This is no criticism of HTK, of course (HMM building is complicated), but the<br />
consequence is that its use is limited to computer scientists already working on speech<br />
technology research projects (mostly ASR). This is normal (all part of the academic<br />
way of institutionalising specialisation); however, it acts as a limit on the usability of<br />
language resources (i.e. corpora), and on the potential of language researchers.<br />
PyHTK (Uemlianin 2005b) aims to change all that. PyHTK is a Python wrapper<br />
around HTK, hiding the complexities of building and using HTK models behind a very<br />
simple command-line interface. A selection of commands from an HTK session is shown<br />
in Figure 3.<br />
These commands are roughly equivalent to the command pyhtk.py -a hmm4.<br />
Nobody would type out all those HTK commands longhand. As with some of the<br />
functions of SpeechCluster, each project writes its own little scripts to generate<br />
the commands. As with SpeechCluster, this reduplication of code is a huge and<br />
invisible waste of effort, and more so here, since writing scripts to run HTK requires<br />
familiarity with HTK and at least some familiarity with the ins and outs of HMMs.<br />
HTK is very far from being ‘plug-n-play’.<br />
With pyHTK all that is required to get started is a speech corpus (i.e., a set of wav<br />
files) and some level of transcription. PyHTK uses SpeechCluster to put everything into<br />
the formats required by HTK, and then runs the necessary HTK commands to build a<br />
model and/or conduct recognition. In other words, SpeechCluster acts as an interface<br />
between HTK and your data, and pyHTK acts as an interface between you and HTK.<br />
With pyHTK as an interface, HTK can be used with no knowledge or understanding<br />
at all of the underlying technology. It is perhaps true that ‘a little knowledge is a<br />
dangerous thing,’ but pyHTK should not be seen as promoting a lack of understanding.<br />
Rather, with pyHTK you can:<br />
• ‘Try out’ ASR research and get more seriously involved if it looks<br />
worthwhile;<br />
Fig. 3<br />
• Learn about the technology at your own pace, while you work, instead of<br />
having to cram up-front; and,<br />
• Start working in a new area without having to hire a new team.<br />
As part of pyHTK, SpeechCluster can have a qualitative effect on a team’s<br />
potential: new areas of research and development (ASR, TTS and language modelling)<br />
become accessible. For example, we have built a diphone voice with Festival (Black<br />
et al. 1999). We gathered the data (recording a phonetically balanced corpus of<br />
around 3000 nonsense utterances). Before the voice could be built, the data had to<br />
be labelled (i.e., provided with time-stamped phonological transcriptions). Labelling<br />
all the data by hand would have taken around 100 person-hours. In a small team, this<br />
kind of labour-time is not available.<br />
Instead, using SpeechCluster and pyHTK, we have been able to do the following:<br />
• Use segFake to generate a starter segmentation of the data (we also manually<br />
transcribed just under 100 of the files, i.e., about 3% of the data).<br />
• Iterate pyHTK twelve times overnight on the given segmentation. This<br />
involved: building an AM based on the given segmentation; re-labelling the<br />
segFake’d training data with the AM; and, saving the generated segmentation<br />
for the next iteration.<br />
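The iteration scheme above reduces to a simple loop. In the sketch below, build_am and relabel are stub functions standing in for the pyHTK calls (the real steps are HTK model training and re-alignment of the data); only the control flow is meant to be faithful:

```python
# Bootstrap labelling: start from a fake (evenly spread) segmentation,
# then repeatedly train a model on it and re-label the data with that
# model. build_am and relabel are stand-ins for pyHTK functionality.
def build_am(segmentation):
    # In reality: train an acoustic model on the current labels.
    return {'trained_on': segmentation}

def relabel(model, data):
    # In reality: re-align `data` with `model` to get better boundaries.
    return model['trained_on'] + 1   # toy refinement step

def bootstrap(initial_segmentation, data, iterations=12):
    seg = initial_segmentation
    for _ in range(iterations):
        am = build_am(seg)       # build an AM on the given segmentation
        seg = relabel(am, data)  # re-label the training data with it
    return seg

print(bootstrap(0, None))
```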
The resulting segmentation would not satisfy a linguist (Figure 4 compares a segFake<br />
segmentation with the output of this process), but the boundaries are sufficiently<br />
accurate to build a voice with Festival. Although we manually labelled a very small<br />
proportion of the data, we hypothesise that this had little effect on the quality of the<br />
final voice. In other words, SpeechCluster and pyHTK have enabled an almost fully<br />
automated build of a synthetic voice.<br />
Fig. 4<br />
4. Conclusion<br />
Using SpeechCluster can save time and avoid a lot of stress. Developing SpeechCluster<br />
(using resources that would have been spent on repetitive chores) has resulted in<br />
a deliverable artefact: a reusable, shareable and extensible software package for<br />
manipulating speech data.<br />
SpeechCluster has been developed as part of the WISPR project, and the facilities<br />
it offers reflect the tasks we have faced on WISPR. In future, SpeechCluster will<br />
accompany our work in a similar way; consequently it is not possible to predict entirely<br />
the development of SpeechCluster. However, two directions can be indicated:<br />
• Corpora: It would be useful if SpeechCluster could treat a speech corpus<br />
with the same abstraction as it treats a sound/label file pair. In this case, the<br />
corpus, as a unit, could be described (e.g. counts of and relations between various<br />
units) and manipulated (e.g. subset selection). This direction will include a layer<br />
for compatibility with EMU.<br />
• Festival: We are developing a Python wrapper for Festival, similar to pyHTK<br />
for HTK. This development may have implications for changes in SpeechCluster.<br />
SpeechCluster can be downloaded from the address given in the documentation<br />
(Uemlianin 2005a). We are making it available under an open-source (BSD) license.<br />
We take seriously the proposition that SpeechCluster should be a usable, shareable<br />
resource. We encourage researchers and developers in the field to use SpeechCluster,<br />
and we shall as far as possible maintain and update SpeechCluster in line with users’<br />
requests.<br />
5. Acknowledgements<br />
This work is being carried out as part of the project ‘Welsh and Irish Speech<br />
Processing Resources’ (WISPR) (Williams et al. 2005). WISPR is funded by the Interreg<br />
IIIA European Union Programme and the Welsh Language Board. I would also like<br />
to acknowledge support and feedback from other members of the WISPR team, in<br />
particular Briony Williams and Aine Ni Bhrian.<br />
References<br />
Black, A. W., Taylor, P. & Caley, R. (1999). The Festival Speech Synthesis System.<br />
http://www.cstr.ed.ac.uk/projects/festival/.<br />
Ellis, N. C. et al. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 Million Word Lexical<br />
Database and Frequency Count for Welsh.<br />
http://www.bangor.ac.uk/ar/cb/ceg/ceg_eng.html.<br />
Huang, X. (2005). Challenges in Adopting Speech Technologies. CSTR-21. Edinburgh,<br />
September 2005.<br />
Jones, R.J. et al. (1998). “SpeechDat Cymru: A Large-Scale Welsh Telephony<br />
Database.” Proceedings of the Workshop on Language Resources for European<br />
Minority Languages, May 27th 1998, Granada, Spain.<br />
Kin, T. (2003). “Designing Kernels for Biological Sequence Data Analysis.” Doctoral thesis.<br />
School of Knowledge Science, Japan Advanced Institute of Science and Technology.<br />
Scannell, K.P. (2004). Corpus Building for Minority Languages. Online at<br />
http://borel.slu.edu/crubadan/index.html<br />
Uemlianin, I. (2005a). SpeechCluster Documentation. Online at<br />
http://www.bangor.ac.uk/~cbs007/speechCluster/README.html<br />
Uemlianin, I. (2005b). PyHTK Documentation. Online at<br />
http://www.bangor.ac.uk/~cbs007/pyhtk/README.html<br />
Williams, B. (1999). “A Welsh Speech Database: Preliminary Results.” Proceedings of<br />
Eurospeech 99, September 1999, Budapest, Hungary.<br />
Williams, B., Prys, D. & Ni Chasaide, A. (2005). “Creating an Ongoing Research<br />
Capability in Speech Technology for two Minority Languages: Experiences from the
WISPR Project.” Interspeech 2005. Lisbon.<br />
http://www.bangor.ac.uk/ar/cb/wispr.php<br />
Woodland, P.C. et al. (1994). “Large Vocabulary Continuous Speech Recognition Using<br />
HTK.” Acoustics, Speech, and Signal Processing, ii, 125-128.<br />
http://htk.eng.cam.ac.uk/<br />
XNLRDF: The Open Source Framework for<br />
Multilingual Computing<br />
Oliver Streiter and Mathias Stuflesser<br />
XNLRDF (Natural Language Resource Description Framework) attempts to collect,<br />
formalise and formally describe NLP resources for the world’s writing systems so<br />
that these resources can be used automatically in language-related applications like<br />
Web-browsers, mail-tools, Web-crawlers, information retrieval systems, or<br />
computer-assisted language learning systems. XNLRDF is free software, distributed in<br />
XML-RDF or as a database dump. It proposes to replace idiosyncratic ad-hoc solutions for<br />
Natural Language Processing (NLP) tasks within the aforementioned applications with<br />
a standard interface to XNLRDF. The linguistic information in XNLRDF extends the<br />
information offered by Unicode so that basic NLP tasks like language recognition,<br />
tokenization, stemming, tagging, term-extraction, and so forth can be performed<br />
without additional resources. With more than 1000 languages in use on the Internet<br />
(and their number continually rising), the design and development of such software<br />
has become a pressing need. In this paper, we describe the basic design of XNLRDF,<br />
the metadata, the type of information the first prototype already provides, and our<br />
strategies to further develop the resources.<br />
1. XNLRDF as a Natural Extension of Unicode<br />
1.1 The Advantages of Unicode<br />
Unicode has simplified the completion of NLP tasks for many of the world’s writing<br />
systems. Whereas, in the past, specific implementations were required, nowadays<br />
programming languages like Java, C++, C or Perl provide an interface to Unicode<br />
properties and operations. Unicode not only describes ‘code elements’ 1 of scripts<br />
by assigning the code element a unique code point, but it also assigns properties like<br />
uppercase, lowercase, decimal digit, mark, punctuation, hyphen, separator, or the<br />
script to the code elements. In addition, it defines operations on the code elements<br />
such as uppercasing, lowercasing and sorting. Thus, computer applications, especially<br />
those operating in multilingual contexts, are better off when processing texts in<br />
Unicode than in traditional encodings such as LATIN1, BIG5 or KOI8-R.<br />
1 Similar, but not identical to characters (cf. our discussion in section 2.1).<br />
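This interface is directly visible in, for example, Python's standard unicodedata module, used here purely to illustrate the point:

```python
import unicodedata

# Query Unicode properties and operations for individual code elements,
# as exposed by a modern programming language's standard library.
print(unicodedata.name('ç'))       # LATIN SMALL LETTER C WITH CEDILLA
print(unicodedata.category('ç'))   # Ll: letter, lowercase
print(unicodedata.category('7'))   # Nd: number, decimal digit
print('ç'.upper())                 # uppercasing as defined by Unicode
```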
1.2 The Inadequacy of the Notions of Unicode for NLP Metadata<br />
On the other hand, the conceptual framework of Unicode is limited. Its principal<br />
notions are code elements and scripts. Important notions such as character, language<br />
or writing system have, astonishingly, no place in Unicode. As Unicode describes<br />
mainly scripts, two languages that use the same script (e.g., Italian and French) are<br />
essentially the same! The fact that French uses a character unknown in Italian, ‘ç’<br />
(the cedilla), is not formally described in Unicode. For this reason, additional<br />
knowledge (e.g., about languages, regions or legacy encodings) has been integrated<br />
into Unicode/Internationalisation programming libraries for a limited number of<br />
languages (e.g., ICU, Basis Technology Rosette, Lextek Language Identifier). 2<br />
As for the notion of language, it is not only absent from the formal framework of<br />
Unicode, but to our knowledge, nobody has attempted, except for limited purposes,<br />
a large-scale mapping between Unicode scripts and the world’s most important<br />
language identification standards (i.e., ISO 639 and the SIL-codes of Ethnologue).<br />
This is astonishing, as neither the language code, nor the code of the locality of a<br />
document, nor the script taken in isolation are sufficiently rich to serve as metadata.<br />
Metadata in language-related applications have the function of mapping a document to<br />
be processed to the adequate NLP resources. In XNLRDF, the notion of writing system<br />
is used for this purpose. It represents a first large-scale attempt to map scripts onto<br />
language identification standards.<br />
1.3 The Writing System in XNLRDF as Metadata<br />
In XNLRDF, the writing system of a text document is an n-tuple of metadata categories,<br />
which include the language, the locality and the script as the most distinguishing ones.<br />
In Belgium, for example, text documents are produced (at least) in Dutch, French<br />
and German. The locality, therefore, is not enough as a single discriminative feature<br />
of these documents. Neither is the category language taken by itself, since Dutch,<br />
French and German are written in other countries as well. Furthermore, even the<br />
tuple language_locality, as it is frequently used (e.g., FR_be, NL_be), is not sufficient<br />
for all text documents and NLP resources. There exist localities that have two or<br />
more alternative scripts for the same language. For example, Serbian in Serbia and<br />
Montenegro is written with the Latin or Cyrillic scripts, and Hausa in Nigeria in the<br />
Latin or Ajami scripts.<br />
An extended analysis of the world’s writing systems reveals that at least four more<br />
categories are required for an unambiguous specification of the writing system. These<br />
2 For a detailed survey, see Unicode Inc. (2006).<br />
categories are: the orthography, the writing standard, the time period of the writing<br />
system, and (for transliterations), a reference to another writing system.<br />
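As a concrete illustration, such an n-tuple might be modelled as below; the field names and example values are our own simplification, not XNLRDF's actual schema:

```python
from typing import NamedTuple, Optional

# Simplified sketch of a writing-system descriptor; XNLRDF's real
# schema is richer, and the field names here are illustrative only.
class WritingSystem(NamedTuple):
    language: str                 # e.g. an ISO 639 code
    locality: str
    script: str
    orthography: str = ''
    standard: str = ''
    period: tuple = (None, None)  # (from, to)
    reference: Optional['WritingSystem'] = None  # for transliterations

# Serbian in Serbia and Montenegro: language + locality alone cannot
# distinguish the Latin and Cyrillic writing systems.
sr_latin = WritingSystem('sr', 'CS', 'Latin')
sr_cyrillic = WritingSystem('sr', 'CS', 'Cyrillic')
print(sr_latin != sr_cyrillic)
```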
Supporting evidence for the necessity of these categories comes, for example,<br />
from Abkhaz. Not only has Abkhaz been written with two different Cyrillic alphabets,<br />
but also with two different Latin alphabets, one between 1926 and 1928, and another<br />
between 1928 and 1937. One might want to distinguish these writing systems by their<br />
name (the standard) or by the time period. In such cases, we do not exclude the first<br />
solution, although there is frequently no standard name for the standard. If possible,<br />
we prefer to use the time period, as it offers the possibility to calculate intersections<br />
with other time constraints (e.g., the production date of the document).<br />
The writing standard is best explained by the different, concurring, isochronic<br />
writing standards for Norwegian: Nynorsk, Bokmål, Riksmål and Høgnorsk are different<br />
contemporaneous conventions essentially representing the same language<br />
(http://en.wikipedia.org/wiki/Norwegian).<br />
The orthography is best illustrated by the spelling reform of German, where the new<br />
orthography came into force in different localities at different times, and overlapped<br />
with the old spelling for a different number of years. Again, use of the time period<br />
is a nice feature, but it does not allow dispensing with the category of orthography.<br />
Unfortunately, orthographies, also frequently lacking a standard name, are referred<br />
to at the time of their introduction as ‘new’ in opposition to ‘old.’ This denomination<br />
of orthographies, however, becomes meaningless in a diachronic perspective where<br />
each ‘new’ orthography will eventually grow ‘old.’<br />
1.4 The Case of Braille and other Transliterations<br />
Reference is necessary to correctly represent transliterations; that is, not only<br />
transliterations in the sense of one-to-one mappings, but also one-to-many or<br />
many-to-many mappings. Reference introduces recursiveness into the metadata of<br />
the writing system, a complexity that is hard to avoid. Braille is a good example of<br />
a transliteration system that changes with the standards and spelling reforms of the<br />
referenced writing system. There exists a Norwegian Braille derived from Nynorsk, and<br />
a Norwegian Braille derived from Bokmål. By the same principle, Braille of the new<br />
German orthography is different from Braille based on the old German orthography.<br />
Similarly, Braille changes with respect to the locality of the Braille documents<br />
that might differ in origin from the locality of the referenced writing system. For<br />
example, Spanish Braille in a Spanish-speaking country is different from the Spanish<br />
of a Spanish-speaking country represented as Braille in the USA. We can only handle<br />
this complexity precisely when we allow writing systems to refer to each other<br />
recursively. Thus Braille, as with other transliteration systems, is represented as a<br />
writing system with its own independent locality, script, standard (e.g., contracted<br />
and non-contracted), and time period. The language of the transliteration and the<br />
referenced writing system are nevertheless the same, although XNLRDF would allow<br />
this to change for the transliteration as well.<br />
A transliteration is thus marked by a reference to another writing system, and, in<br />
the descriptive data, mappings between these two systems in the form of a mapping
table (e.g., between Bokmål and Bokmål Braille). Mappings between writing systems
are a natural component in the description of all writing systems, even if they do
not represent transliterations of each other (e.g., mappings between hànyŭ pīnyīn,
Wade-Giles, Gwoyeu Romatzyh and bopomofo/zhùyīn fúhào). The Braille of Mandarin
in the People’s Republic, incidentally, is a transliteration of hànyŭ pīnyīn.<br />
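Such a mapping table can be applied mechanically once it is available. The following sketch shows how a table of (source, target) pairs, as XNLRDF would hold them, could drive a transliteration; the table entries and function names are illustrative examples, not actual XNLRDF data.

```python
# Illustrative fragment of a pinyin-to-zhuyin mapping table, stored as
# (source, target) pairs. Entries are a toy subset, not real XNLRDF data.
PINYIN_TO_ZHUYIN = [
    ("zh", "ㄓ"), ("h", "ㄏ"), ("n", "ㄋ"), ("m", "ㄇ"),
    ("ao", "ㄠ"), ("i", "ㄧ"), ("a", "ㄚ"),
]

def transliterate(text, table):
    """Greedy longest-match application of a mapping table."""
    pairs = sorted(table, key=lambda p: -len(p[0]))  # try longest source first
    out, i = [], 0
    while i < len(text):
        for src, tgt in pairs:
            if text.startswith(src, i):
                out.append(tgt)
                i += len(src)
                break
        else:
            out.append(text[i])  # pass unmapped characters through
            i += 1
    return "".join(out)

print(transliterate("ni hao", PINYIN_TO_ZHUYIN))  # ㄋㄧ ㄏㄠ
```

Because the table holds pairs rather than a one-to-one function, the same mechanism accommodates the one-to-many and many-to-many mappings mentioned above.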
To sum up, the metadata needed to identify the appropriate or best NLP resources<br />
for the processing of a text-document are much more complex than what current<br />
standards have defined. In other words, relying on only one part of the metadata,<br />
such as the Unicode scripts or the language codes combined with locality codes,<br />
is not always accurate and thus not completely reliable for automated NLP-tasks.<br />
If NLP-technologies on the Web have, until now, not suffered from this important<br />
misconception (e.g., in the metadata specification in the HTTP or XML header), it<br />
is because they either target about two dozen common languages (applying default<br />
assumptions that prevent less frequently used writing systems from being correctly<br />
processed), or because a linguistically informed human mediates between documents<br />
and resources.<br />
2. XNLRDF and Information Needs Beyond Unicode<br />
Let us assume, for expository purposes, that an NLP-application can correctly<br />
identify the writing system of a document to be processed, and that this writing system<br />
contains references to Unicode scripts or code points. In effect, little follows from<br />
this, as Unicode defines only a very limited number of operations, and defines them
only for a script and not a writing system. The task of XNLRDF is thus to reformulate<br />
the operations defined in Unicode in the terms of a writing system, and, secondly, to<br />
enlarge the linguistic material so that more operations than those defined in Unicode<br />
become possible.<br />
2.1 Unicode and Characters: Uppercasing, Lowercasing and Sorting<br />
Contrary to a common sense understanding of Unicode, the conceptual design of<br />
Unicode avoids the notion of character, since this is a language-specific notion, and<br />
languages are not covered by Unicode. Unicode refers instead to code elements,
which frequently coincide with characters, but also contain combining characters<br />
such as diacritics. Characters and code elements further differ, if ligatures (Dutch ‘ij’,<br />
Spanish ‘ll’, ‘ch’, Belorussian Lacinka ‘dz’) are to be treated as one character in a<br />
language. Uppercasing of ligatures is thus essentially undefined, and will uniformly
produce either ‘Xy’ or ‘XY’ from ‘xy’, without knowledge of the requirements of the writing
system. It is thus obvious that specifying the character set of writing systems and
describing the mapping between the characters (e.g., for uppercasing and lowercasing)
is one principal task in XNLRDF, just as lowercasing, for example, is an important step
in the normalisation of a string (e.g., for a lexicon lookup or information retrieval).<br />
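The problem can be made concrete with Dutch, where the ligature ‘ij’ titlecases as a unit. A minimal sketch, assuming a per-language digraph table of the kind XNLRDF would supply (the table contents here are illustrative):

```python
# Writing-system-aware titlecasing: digraphs that count as one character are
# taken from the writing system's character table. Entries are illustrative.
DIGRAPHS = {"nl": ["ij"], "cy": ["ch", "ll"]}  # hypothetical per-language table

def titlecase(word, lang):
    for d in DIGRAPHS.get(lang, []):
        if word.lower().startswith(d):
            return d.upper() + word[len(d):]  # uppercase the whole digraph
    return word[:1].upper() + word[1:]

print(titlecase("ijsland", "nl"))  # IJsland, not Ijsland
```

A generic, script-level titlecasing routine would produce ‘Ijsland’ here, which is exactly the failure the text describes.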
Similarly, the sorting of characters, the second operation defined in Unicode (e.g.,<br />
for the purpose of presenting dictionary entries or creating indices), depends on the<br />
writing system, and can only be approximately defined on the basis of the script.<br />
Thus, Unicode might successfully sort ‘a’ before ‘b’, but already the position of ‘á’<br />
after ‘a’ or after ‘z’ is specific to each writing system. Another example is the Spanish<br />
‘ll.’ Although it is no longer considered a character, it maintains its specific position<br />
between ‘l’ and ‘m’ in a sorted list. Thus, sorting requires basic writing system-specific<br />
information, which XNLRDF sets out to describe. What this example also shows is<br />
that the definition of collating sequences for the characters of a writing system is<br />
independent from the status of the character (base character, composed character,<br />
contracted character, contracted non-character, context-sensitive character, foreign<br />
character, swap character, etc.).<br />
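The traditional Spanish case can be sketched as a collation key built from a writing-system-specific alphabet in which ‘ll’ is a collating element of its own. The alphabet list below is an illustrative, incomplete example:

```python
# Writing-system-specific collation: traditional Spanish, where the contracted
# character 'll' sorts between 'l' and 'm'. Alphabet is illustrative.
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
            "v", "w", "x", "y", "z"]
RANK = {c: i for i, c in enumerate(ALPHABET)}

def sort_key(word):
    key, i = [], 0
    while i < len(word):
        # longest match: try the two-letter collating elements first
        if word[i:i + 2] in RANK:
            key.append(RANK[word[i:i + 2]]); i += 2
        else:
            key.append(RANK.get(word[i], len(ALPHABET))); i += 1
    return key

words = ["luz", "llama", "madre", "lana"]
print(sorted(words, key=sort_key))  # ['lana', 'luz', 'llama', 'madre']
```

A plain code-point sort would place ‘llama’ before ‘luz’; the writing-system-specific key restores the expected order.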
2.2 Linguistic Information: What Else?<br />
The operations covered by Unicode are limited, and most NLP-applications would<br />
require much more linguistic knowledge when processing documents in potentially<br />
unknown writing systems. First, an application might need to identify the encoding<br />
(e.g. KOI-R), the script (Cyrillic), the language (Russian), the standard (civil script),<br />
and orthography (after 1917) of a document. Part of this information might be drawn<br />
from the metadata available in the document, from the Unicode range, or the URL<br />
of a document (in our example, http://xyz.xyz.ru), but filling in the remaining gaps,<br />
(e.g., mapping from the encoding KOI-R to the language Russian, from the language<br />
to potential scripts, or from a script to a language) requires background information<br />
about the legacy encodings and writing systems. This background information is<br />
available in XNLRDF. The automatic identification of writing
systems with no or incomplete metadata will also be supported by XNLRDF in the
form of character n-grams. These n-grams are compiled from classified text samples
or corpora within or outside XNLRDF. Thus, for each writing system, XNLRDF can link
to URLs of other documents (of the same writing system), to raw
text collections, and to elaborated corpora.
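Identification via character n-grams can be sketched as follows; the profiles here are toy examples built from single sentences, where XNLRDF would supply statistics compiled from real corpora:

```python
# Writing-system identification with character n-grams. Profiles are toy
# examples; XNLRDF would supply n-gram statistics from classified corpora.
from collections import Counter

def ngrams(text, n=3):
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def identify(text, profiles):
    """Score each candidate writing system by n-gram overlap."""
    doc = ngrams(text)
    def score(profile):
        return sum(min(doc[g], c) for g, c in profile.items())
    return max(profiles, key=lambda ws: score(profiles[ws]))

profiles = {
    "en": ngrams("the quick brown fox jumps over the lazy dog"),
    "de": ngrams("der schnelle braune fuchs springt über den faulen hund"),
}
print(identify("the dog sleeps", profiles))  # en
```

Real systems use larger profiles and better-normalised scores, but the principle of matching a document's n-gram distribution against stored profiles is the same.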
From the identified writing system, the application can start to retrieve additional<br />
resources that support the segmentation, stemming, hyphenation, and so forth of<br />
the document. A Web-crawler, for example, would try to find those text units (words<br />
and phrases) that are suitable for indexing. In most cases, the document will be<br />
segmented into words using a limited number of writing-specific word-separating<br />
characters (e.g., empty space, comma, hyphen, etc.). Although Unicode should provide<br />
this information, writing systems also differ as to which characters are unambiguous<br />
word separators, ambiguous ones, or not word separators at all. Thus, within those<br />
languages using Latin script, some integrate an empty space into a word, for example,<br />
Northern Sotho (Prinsloo & Heid 2006), while others like Lojban integrate ‘,’ and<br />
‘.’ in the middle of a word. Unconditionally splitting a text in these languages with<br />
the empty space character ( ), a comma (‘,’), or a period (‘.’) would cut words into<br />
undefined chunks.<br />
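Segmentation should therefore be driven by the writing system's own separator inventory rather than a fixed rule. A minimal sketch, with illustrative separator sets and hypothetical writing-system keys:

```python
# Tokenization driven by a per-writing-system separator list. The separator
# sets and writing-system identifiers are illustrative assumptions.
import re

SEPARATORS = {
    "en-Latn": [" ", ",", ".", ";"],
    "jbo-Latn": [" "],  # Lojban: ',' and '.' occur inside words; only space splits
}

def tokenize(text, writing_system):
    seps = SEPARATORS[writing_system]
    pattern = "|".join(re.escape(s) for s in seps)
    return [t for t in re.split(pattern, text) if t]

print(tokenize("la .alis. klama", "jbo-Latn"))  # ['la', '.alis.', 'klama']
print(tokenize("la .alis. klama", "en-Latn"))   # ['la', 'alis', 'klama'] — wrong for Lojban
```

The second call shows the damage done by applying another writing system's separators: the Lojban name ‘.alis.’ is cut into an undefined chunk.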
For writing systems that do not mark word boundaries (e.g., Han-characters or<br />
Kanji), the Web-crawler should index either each character individually (this is what<br />
Google does), or identify words through word lists and/or rules. Spelling variants<br />
(humour, humor), writing variants (Häuser, Haeuser, H&auml;user or 灣, 湾, Wan),
inflected word forms (come, came), abbreviations (European Union, EU) should be<br />
mapped onto their base forms to improve the quality of document retrieval. All these<br />
are basic operations XNLRDF sets out to cover.<br />
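Mapping such variants onto base forms before indexing can be sketched as a simple lookup; the variant table below is a hypothetical fragment of the kind of data XNLRDF would hold:

```python
# Mapping spelling, writing, and inflectional variants onto base forms before
# indexing. The variant table is a hypothetical fragment.
VARIANTS = {
    "humour": "humor",        # spelling variant
    "haeuser": "häuser",      # writing variant
    "came": "come",           # inflected form
    "eu": "european union",   # abbreviation
}

def normalise(token):
    token = token.lower()
    return VARIANTS.get(token, token)

print([normalise(t) for t in ["Humour", "came", "EU"]])
# ['humor', 'come', 'european union']
```

In practice the inflectional part would come from morphological rules or full-form lexica rather than an enumerated table, but the indexing effect is the same.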
3. Difficulties in Obtaining Information beyond Unicode<br />
The need for a linguistic extension of Unicode has long been recognised, and
most of the information that applications such as the one sketched above require is
available from online resources. Thus, NLP-applications could, at least theoretically,
retrieve it automatically from the Web. If this were unproblematic, XNLRDF would
be a redundant copy of other online information. However, for several reasons,<br />
the resources on the Web or the information contained within cannot be accessed,<br />
extracted and integrated by these applications (and by humans only with difficulty).<br />
First, there might exist difficulties to find and access the resources:<br />
• Resources cannot be found because metadata are not available; or,
• The resource is not directly accessible for applications: for example, accessing it<br />
requires transactions like registering, submitting passwords, entering the credit card<br />
number, etc.<br />
Then, once a resource is found and accessed, there might be difficulties to extract<br />
or understand the necessary information, such as:<br />
• The resource is not formally structured;<br />
• The information within the resource is formally structured, but the syntax of the<br />
structure is not defined: for example, fields are separated by a special character, but<br />
the character used is not specified;<br />
• The information is ambiguous, as in the following example: “Abkhaz is a North<br />
West Caucasian language with about 105,000 speakers in Georgia, Turkey and Ukraine<br />
(...) The current Cyrillic-based system” (http://www.omniglot.com/writing/abkhaz.<br />
htm), which does not specify which region actually is using or not using the Cyrillic-<br />
based script at present;<br />
• The syntax is defined, but the semantics of the units are not as well defined as
they could be, for example through the use of XML namespaces. With a namespace, a
NOUN-tag can be linked to the URL containing the definition of the tag; thus, different
NOUN-tags could be used without confusion;
• The information in the different resources is not compatible, that is, the notion<br />
of language varies greatly between resources. To give one example, what the Office<br />
of the High Commissioner for Human Rights (Universal Declaration of Human Rights)<br />
describes as Zapoteco is not covered in Omniglot, and is split into more than fifty<br />
languages by Ethnologue and the Rosetta Project; or
• Most resources are language-centred and do not put the writing system into the<br />
centre of the description. To understand how serious this misconception is, imagine
searching for a Chinese document and retrieving Chinese in Braille, which is Chinese
to the same degree as the document you expected to get.
In view of all this, there is an enormous need to bring the available resources<br />
together and make them compatible, available and parsable; otherwise, the information<br />
will be barely usable for NLP-applications. This compiling work necessarily involves
a combination of linguists’ careful classification and description with automated
approaches to knowledge acquisition. Both techniques will first exploit other resources
relevant for XNLRDF.<br />
4. Related Activities and Available Resources<br />
Fortunately, XNLRDF is embedded in a wide field of research activities that create,<br />
document and make accessible natural language resources. What makes XNLRDF<br />
stand out in its field is its focus on Natural Language Processing resources on the<br />
one hand, and fully-automated access to the data by an application on the other.<br />
Nevertheless, XNLRDF will try to profit from related projects and to comply with<br />
available standards.<br />
Repositories of the world’s languages are available online. Figuring most
prominently among them are Omniglot, Ethnologue, The Rosetta Project, TITUS,
and the Language Museum (http://www.language-museum.com). Although these<br />
resources offer rich information on scripts and languages, they are almost unusable<br />
for computer applications, as they are designed for human users. The difficulties in<br />
using Ethnologue, for example, derive from its focus on spoken languages and its<br />
tendency to introduce new languages where others just see regional variants of the<br />
same language. This problem has been inherited by the Rosetta Project and the World<br />
Atlas of Language Structures (Haspelmath et al. 2005). In addition, some sites (e.g.,<br />
the Language Museum) use scanned images of characters, words and texts that of<br />
course are almost impossible to integrate into NLP resources. Still other sites (e.g.,<br />
TITUS) use mainly transcriptions or transliterations that are equally worthless without<br />
a formal definition of the mappings applied. Currently, the information available on<br />
these sites is checked and integrated manually into the XNLRDF data structure.<br />
OLAC, the Open Language Archives Community project, is setting up a network
of interoperating repositories and services for hosting and accessing NLP resources.<br />
The project’s aims and approaches are very close to those of XNLRDF, and we foresee<br />
a considerable potential for synergy. The metadata and their definition are what will
be most relevant to XNLRDF. However, the OLAC user scenario assumes a human user<br />
looking for resources and tools, whereas XNLRDF is designed to allow applications to<br />
find resources autonomously given a text document to be processed and a task to be<br />
achieved.<br />
Closely related to OLAC is the E-MELD project, which supports the creation of<br />
persistent and reusable language resources. In addition, queries over disparate
resources are envisaged. To what extent XNLRDF can profit from E-MELD has yet to be
investigated in detail.<br />
Data consortia like ELRA or LDC host NLP resources that can be identified<br />
through the machine-readable metadata in OLAC. However, resources are not freely<br />
accessible. Commercial transactions are required between the identification of the<br />
resource and the access to the resource. For this reason, these resources will remain
unexplored, even if prices are modest. Although ELRA and LDC have their merits, for<br />
small languages, better solutions are available for the hosting of data (cf. Streiter<br />
2005).<br />
Project Gutenberg provides structured access to its 16,000 documents (comprising<br />
about thirty languages) through an XML-RDF. Unfortunately, information characterising
text T1 as a translation of T2 is still not provided; that is, although parallel corpora
are implicitly present, they are not identifiable as such. In theory, the documents of
Project Gutenberg could be used to build up corpora in XNLRDF. Such a copying of<br />
resources, however, might only be justifiable for writing systems for which little corpus<br />
material is available. More important might be a mapping from the writing system of<br />
XNLRDF to the documents of Project Gutenberg, thus translating the available XML-<br />
RDF in terms of XNLRDF.<br />
Free monolingual and parallel corpora are available at a great number of sites,<br />
most prominently at http://www.unhchr.ch/udhr/navigate/alpha.htm (Universal<br />
Declaration of Human Rights in 330 languages), http://www.translatum.gr/bible/<br />
download.htm (The Bible), and The European Parliament Proceedings Parallel Corpus<br />
(http://people.csail.mit.edu/koehn/publications/europarl), among others. Those
documents that support otherwise underrepresented writing systems will be integrated<br />
into XNLRDF in the form of corpora.<br />
The Wikipedia project is interesting for XNLRDF for a number of reasons. First,<br />
it provides documents that can be used to build corpora without infringing upon<br />
copyrights. Second, as the Wikipedia is available in more than one hundred languages,<br />
thousands of quasi-parallel texts become accessible. Third, the model of cooperation<br />
in Wiki projects, and the underlying software, will indicate the way XNLRDF will go.<br />
Thus, XNLRDF will gradually enlarge the community of researchers involved to the<br />
point that the world’s linguists will be able to collect the data they need for their<br />
writing systems. This issue will be further discussed below.<br />
5. Conceptual Design of XNLRDF<br />
The purpose of XNLRDF is to find adequate NLP resources to process a text document.<br />
To this end, the metadata of the document and the resource are matched. The better<br />
the match, the more suitable the resource is for the processing of the document. The<br />
metadata matched are those categories that make up the writing system.<br />
5.1 Finding Resources via the Writing System<br />
The writing system has a function similar to SUBJECT.LANGUAGE in the OLAC-<br />
metadata, defined in Simon & Bird (2001) as “[…] a language which the content of<br />
the resource describes or discusses.” A writing system in XNLRDF is defined by an
n-tuple of categories: language, locality, script, orthography, standard, time period,
and reference to another writing system. The writing system is a property of the
text document and the resource. In XNLRDF, for each writing system there is a more<br />
abstract writing system (e.g., without constraints in locality) as a fallback solution to<br />
fill in empty categories with default assumptions. In general, for each language there<br />
is one writing system without a locality that provides a default locality in the event<br />
that no locality can be derived from the document. (Cf. Plate 1: Different Writing<br />
Systems for Mandarin Chinese. The first row is the fallback with the default locality.)<br />
These underspecified writing systems are currently also used (perhaps incorrectly)
for supranational writing systems (e.g., English-based writing in the UN).
Plate 1: Writing Systems for Mandarin Chinese. Note the first row showing Chinese without<br />
locality as a super-regional language. In case of doubt, the application has to assume<br />
China as the locality where the text-document originated.<br />
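The n-tuple and its fallback behaviour can be sketched as follows; the field values and lookup logic are illustrative assumptions about how an application might consume XNLRDF data:

```python
# Sketch of the writing-system n-tuple with a locality fallback. Field values
# (codes, scripts) are illustrative, not actual XNLRDF records.
from typing import NamedTuple, Optional

class WritingSystem(NamedTuple):
    language: str
    locality: Optional[str]      # None = super-regional fallback entry
    script: str
    orthography: Optional[str]
    standard: Optional[str]

SYSTEMS = [
    WritingSystem("cmn", None, "Hans", None, None),   # fallback, default locality
    WritingSystem("cmn", "TW", "Hant", None, None),
    WritingSystem("cmn", "CN", "Hans", None, None),
]

def find(language, locality):
    """Prefer an exact locality match, else fall back to the unconstrained entry."""
    exact = [ws for ws in SYSTEMS
             if ws.language == language and ws.locality == locality]
    if exact:
        return exact[0]
    return next(ws for ws in SYSTEMS
                if ws.language == language and ws.locality is None)

print(find("cmn", "TW").script)  # Hant
print(find("cmn", "SG").script)  # no exact match: falls back to Hans
```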
The inclusion of a writing system into XNLRDF is pragmatically handled. Included<br />
are all writing systems for which text documents with yet uncovered combinations of<br />
language, locality, and so forth can be found. The same pragmatic approach is used
to distinguish (or not distinguish) languages and dialects. Thus, dialects are treated identically to
languages, whenever documents of that variant are found (e.g., Akan Akuapem, Akan<br />
Asante and Akan Fante). A writing that claims to represent a language family is
registered with a writing system of this language family. The same goes for localities;<br />
whenever a document is reasonably associated with one region - even if that region<br />
is not a recognised geographical, administrative or economic body - the region will
be included as locality.<br />
5.2 The Names of Metadata Categories<br />
All this leads to the overall problem that, for the main categories of the writing<br />
system, no standardised identifiers are available. We already discussed the lack of<br />
standard names for the standard and orthography of a writing system. But in addition,<br />
languages, localities and scripts do not necessarily have standard names or standard<br />
codes, albeit XNLRDF tries to integrate the ISO 639 codes for languages (ISO 639 2006)
(the 2-letter code for languages, ISO-639-1, and the 3-letter code for languages, ISO-
639-2), the SIL-codes of Version 14 of Ethnologue, the Unicode naming of scripts, and
ISO-3166 (ISO 3166 2006) encoding of localities (countries, regions, and islands).<br />
A number of limitations, however, make these codes difficult to use: ISO-639-1<br />
covers only a few languages; ISO-639-2 assigns more than one code to one language;<br />
both ISO norms assign one code to sets of languages, language families, and so on;<br />
and, SIL-codes change from version to version (about every four years), and do not<br />
cover historic languages, artificial languages, language groups or languages that exist<br />
only as written standard.<br />
The situation for the encoding of languages will improve with the adoption of<br />
the draft ISO/DIS 639-3 as a standard (presumably in 2006), as it will combine the<br />
respective advantages of the SIL-codes and the ISO-codes. Until then, applications<br />
will continue to use the RFC 3066 standard for HTTP headers, HTML metadata and in<br />
the XML lang attribute. 2- and 3-letter codes are interpreted as ISO-639-1 or ISO-639-
2 respectively. ISO-639-1 can be mapped on ISO-639-3, and ISO-639-2 is identical to<br />
ISO-639-3, so that, in the future, only ISO-639-1 (transitional) and ISO-639-3 will be<br />
needed (for more information on this development, consult the webpages http://
en.wikipedia.org/wiki/ISO_639-3, http://www.ietf.org/rfc/rfc3066.txt and http://
www.ethnologue.com/codes/default.asp). SIL-codes will then become superfluous
in XNLRDF, and languages that are not written can be removed from XNLRDF. The<br />
advantage of ISO-639-3 is that it can group together individual spoken languages (such<br />
as two dozen spoken Arabic languages) to ‘macro languages’ (Arabic), thus preventing<br />
writing systems from being fragmented due to the fragmentation of languages.<br />
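Normalising a language tag to a three-letter code, with macro-language grouping of the kind envisaged for ISO 639-3, can be sketched as below; the code tables are tiny illustrative fragments, not the full standards:

```python
# Normalising RFC 3066-style tags to 3-letter codes, grouping members under
# their macro language. Tables are illustrative fragments of the code lists.
ISO639_1_TO_3 = {"ar": "ara", "no": "nor", "zh": "zho"}
MACRO = {"arz": "ara", "apc": "ara", "nno": "nor", "nob": "nor"}  # member -> macro

def normalise_tag(tag):
    code = tag.split("-")[0].lower()   # strip region subtag (e.g. 'no-NO')
    if len(code) == 2:
        code = ISO639_1_TO_3.get(code, code)
    return MACRO.get(code, code)

print(normalise_tag("arz"))    # ara (Egyptian Arabic grouped under Arabic)
print(normalise_tag("no-NO"))  # nor
```

Grouping the individual spoken varieties under one macro code is exactly what prevents writing systems from fragmenting along the lines of spoken-language classification.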
Most reliably, however, the categories of the writing system can be accessed with
their natural language names in one of the world’s major writing systems, for which
XNLRDF guarantees an unambiguous match. As a consequence of this recursion, as
outlined in Gödel’s ‘Incompleteness Theorems’, neither the names nor the categories<br />
can be formally defined; they can only be explained by the use they are put to (e.g.,<br />
the material that is attached to a name). Fortunately, this problem is not inherent<br />
to XNLRDF, but is also shared by other classification standards like ISO norms and SIL<br />
codes.<br />
5.3 Linguistic Information for Writing Systems<br />
A writing system is associated via a resource type with the corresponding resources.<br />
Writing systems stand in a many-to-many relation to encoding (Plate 2), numerals (Plate<br />
3), and function words (Plate 4); characters; sentence separators; word separators;<br />
URLs (classified according to genres); dictionaries; monolingual and parallel corpora;<br />
and, n-gram statistics.<br />
Plate 2: A writing system (Mandarin Chinese in Taiwan) related to ENCODING.<br />
Plate 3: A writing system (Thai) related to NUMERALS.<br />
Plate 4: A writing system (Thai) related to FUNCTION_WORDS.<br />
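Relationally, each of these many-to-many relations is a link table between the writing system and the resource type, as in this sketch (table and column names are hypothetical, not the actual XNLRDF schema):

```python
# Many-to-many relation between writing systems and encodings, modelled
# relationally. Table and column names are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE writing_system (ws_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE encoding       (enc_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ws_encoding    (ws_id INTEGER, enc_id INTEGER);  -- link table
""")
db.execute("INSERT INTO writing_system VALUES (1, 'Mandarin Chinese; Taiwan')")
db.executemany("INSERT INTO encoding VALUES (?, ?)", [(1, 'Big5'), (2, 'UTF-8')])
db.executemany("INSERT INTO ws_encoding VALUES (?, ?)", [(1, 1), (1, 2)])

rows = db.execute("""
    SELECT e.name FROM encoding e
    JOIN ws_encoding we ON we.enc_id = e.enc_id
    WHERE we.ws_id = 1 ORDER BY e.name
""").fetchall()
print([r[0] for r in rows])  # ['Big5', 'UTF-8']
```

The same link-table pattern serves for numerals, function words, separators, and the other resource types listed above.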
5.4 Methods and Implementation<br />
The data-model is implemented in a relational database, which provides all means<br />
to control the coherence of the data, create backups, and allow people from different<br />
parts of the world to work on the same set of data. For applications working with<br />
relational databases, this data can be downloaded under the GNU General Public License
as a database dump (PostgreSQL). An interface to the database has been created as a
working tool for the creation and maintenance of data.<br />
An additional goal is to make XNLRDF available in XML-RDF. RDF, a framework for<br />
the description of resources, has been designed by the W3C to describe resources<br />
with their necessary metadata for applications rather than for people (Manola & Miller<br />
2004). Whereas, in the relational database, the defaulting mechanism is programmed<br />
as a procedure, in XML-RDF defaults are compiled out. In this way, the information in<br />
XNLRDF can be accessed through a simple look-up with a search string such as ‘Thai’,<br />
‘Thailand’, ‘Thai;Thailand’, and so forth.<br />
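Because the defaults are compiled out, the RDF consumer needs no inference step; a flat lookup suffices, as in this sketch (keys and values are illustrative assumptions, not the actual XNLRDF serialisation):

```python
# With defaults compiled out, every search string maps directly to a complete
# record. Keys and field values are illustrative.
INDEX = {
    "Thai":          {"script": "Thai", "locality": "TH"},  # default materialised
    "Thai;Thailand": {"script": "Thai", "locality": "TH"},
}

def lookup(key):
    return INDEX.get(key)

print(lookup("Thai")["locality"])  # TH
```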
6. Envisaged Usage and Impact<br />
In order to give a word-for-word translation, for example, within a Web-browser, the
Web-browser has to know where to find a dictionary and how to use it. With only one<br />
such resource, a special function within the Web-browser might handle this (e.g., a<br />
number of FIREFOX add-ons do exactly this). But with hundreds of language resources,<br />
a more general approach is required that not only involves adequate resources, but<br />
also metadata with an NLP-specific metadata dictionary and metadata syntax. NLP-<br />
operations like tagging or meaning disambiguation for annotated reading have then<br />
to be defined recursively in the metadata syntax: in this way, a tagger can call a<br />
tokenizer if it can’t perform tokenization itself.<br />
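The recursive definition of operations can be sketched as a dependency chain that the application resolves at run time; the operation names and registry structure are hypothetical:

```python
# Recursive operation resolution: each operation declares its prerequisites,
# and the application resolves the chain. Names are hypothetical examples.
OPERATIONS = {
    "tokenize": {"requires": [], "run": lambda text: text.split()},
    "tag":      {"requires": ["tokenize"],
                 "run": lambda tokens: [(t, "X") for t in tokens]},
}

def apply(op, data):
    for dep in OPERATIONS[op]["requires"]:
        data = apply(dep, data)   # run the prerequisite operation first
    return OPERATIONS[op]["run"](data)

print(apply("tag", "a small test"))
# [('a', 'X'), ('small', 'X'), ('test', 'X')]
```

A tagger without its own tokenizer thus simply names "tokenize" as a prerequisite and lets the metadata machinery supply it.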
The substantiation of the concept of XNLRDF will thus consist of compiling XNLRDF
into a Mozilla-compatible RDF and integrating it into an experimental Mozilla module.
Not only is Mozilla a base for a great number of very popular applications (e.g.,
Firefox, Thunderbird, Bugzilla, Netscape, Mozilla Browser, and Mozilla e-mail), but it
also includes an RDF engine that can be accessed via JavaScript and XPConnect
(Boswell et al. 2002). A minor test-application of XNLRDF in Mozilla might thus have
a tremendous impact.<br />
Less spectacular than the still pending integration into Mozilla is the testbed where<br />
XNLRDF is currently used. It serves as a linguistic database for Gymn@zilla, a CALL<br />
system that handles about twenty languages, with new languages added on a regular<br />
basis (Streiter et al. 2005). In general, CALL systems are very likely to be the first<br />
applications to profit from XNLRDF. They are frequently applicable to many languages<br />
and require relatively uncomplicated operations. In fact, many CALL modules are<br />
freely available, and, to some extent, language independent (e.g., Hot Potatoes).<br />
In practice, however, they are often only suited for an undefined group of languages<br />
(e.g., they require a blank to separate words). With the linguistic intelligence of<br />
XNLRDF, such modules could not only extend the range of languages, but also generate<br />
better exercises and provide better feedback.<br />
Web-crawlers and IR systems are other candidates that will certainly profit from<br />
XNLRDF. While most IR systems may be tuned to one or a few languages, they generally
lack the capacity to process a wide range of languages. The large number of NLP
components integrated in Google shows the importance of linguistic knowledge in IR.
To sum up, we not only hope to bring many more languages to text document<br />
processing applications, but hope to do this in a standard format that can be easily<br />
processed by XML or XML-RDF-enabled applications.<br />
7. Status of the Project and Future Developments<br />
The project is still an unfunded garage project. In the previous project phase,<br />
we defined the base and implemented the first model in a relational database.<br />
An interface to that database has been created to allow new data to be entered<br />
via the WWW. After inserting more than 1000 writing systems and getting a better<br />
understanding of the framework necessary to describe a writing system, we are<br />
currently adding linguistic information to describe the writing systems. The data<br />
structures for characters, corpora, dictionaries, and so forth are still changing as
new requirements or linguistic complexities are encountered. URLs and corpora are
collected to support the description of the writing systems and as useful material to be
integrated in XNLRDF for NLP-applications (e.g., for the creation of word lists).
In the meantime, we hope to attract more researchers to collaborate in the project.<br />
It is impossible to say now whether or not the project will be as open as the
Wikipedia. It is certain, however, that this endeavour will require the collaboration of
a wide range of researchers around the globe. Very likely, small tools will be created
around XNLRDF that will illustrate the uses the resource can be put to, and motivate
linguists to enter data for their language (writing system). Such tools will have the
additional advantage of checking the accuracy and completeness of the data.
8. Glossary<br />
Language<br />
Language is one of the discriminating features that defines a writing system in XNLRDF.<br />
XNLRDF uses language identification standards such as ISO-639-1 and ISO-639-2 to map<br />
language names to unambiguous language codes.<br />
Locality<br />
Locality is one of the discriminating features that defines a writing system. ISO 3166<br />
is the standard that defines locality codes. However, XNLRDF pragmatically includes<br />
a region as locality, whenever there is a document that is reasonably associated with<br />
the region. This applies also even if the region is not a recognised geographical,<br />
administrative or economic body.<br />
Metadata Categories<br />
Metadata in language-related applications have the function to map a document to<br />
be processed to the appropriate NLP (natural language processing) resources. XNLRDF<br />
uses the categories of language, locality, script, orthography, standard, time period,<br />
and reference to another writing system. Together, these NLP metadata categories in<br />
XNLRDF define a writing system.<br />
Natural Language Resource<br />
A natural language resource in XNLRDF refers to structured linguistic information<br />
and/or NLP applications that are accessible to machines via a clearly defined writing<br />
system. Types of resources include, for example, encoding, numerals, function words,<br />
characters, sentence separators, word separators, URLs, dictionaries, corpora, and n-<br />
gram statistics, as well as applications for basic NLP tasks such as language recognition,
tokenization, stemming, tagging, segmentation, hyphenation, and indexing, and
complex NLP implementations such as term extraction, document retrieval, meaning
disambiguation, and Computer-Assisted Language Learning tools.
Orthography<br />
Orthographies can sometimes be tracked using the time period category. However,<br />
different orthographies might coexist for a certain time span (e.g., in German,<br />
after the latest orthography reform). Therefore, orthography is one of the discriminating
features that defines a writing system.
Reference<br />
Reference is used in XNLRDF to describe transliterations. The transliteration is a<br />
writing system on its own, but can only be understood and correctly processed when<br />
referring to another underlying writing system. For example, a text written in Braille<br />
can only be understood and processed when referred to the underlying writing system<br />
(e.g., Braille referring to standard German in Austria in ‘new orthography’). Reference<br />
is a recursive category. It is one of the discriminating features that defines a writing<br />
system.<br />
Script<br />
In Unicode, legacy scripts are named (e.g., Latin, Arabic and Cyrillic). XNLRDF uses<br />
these script names as a discriminating feature to define writing systems.<br />
Time Period<br />
The time period offers the possibility to calculate intersections with other time<br />
constraints (e.g., between the validity of an orthography and the production date<br />
of the document). Therefore, time period is one of the discriminating features that<br />
defines a writing system.<br />
Writing Standard<br />
Sometimes the same language can be written in different, concurrent, isochronic
writing standards. For example, Nynorsk, Bokmål, Riksmål and Høgnorsk are different<br />
contemporaneous conventions that represent Norwegian. Therefore, writing standard<br />
is one of the discriminating features that defines a writing system.<br />
Writing System<br />
The writing system helps to map a document to the adequate NLP resources necessary<br />
to process the document. A writing system in XNLRDF is defined by the n-tuple of<br />
language, locality, script, orthography, standard, time period and reference to another<br />
writing system. In XNLRDF, each writing system is also associated with more abstract writing<br />
systems (e.g., those without a locality constraint) that serve as a fallback to fill in the missing<br />
information with default assumptions.<br />
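This tuple-plus-fallback lookup can be sketched in a few lines of Python. This is a minimal illustration only: the field names, resource names and the order in which constraints are dropped are our assumptions, not the actual XNLRDF schema.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class WritingSystem:
    # Hypothetical record mirroring the n-tuple described above.
    language: str
    locality: Optional[str] = None
    script: Optional[str] = None
    orthography: Optional[str] = None
    standard: Optional[str] = None
    time_period: Optional[str] = None
    reference: Optional[str] = None  # e.g. Braille referring to standard German

# Resources registered for a specific and for a more abstract writing system.
RESOURCES = {
    WritingSystem("de", locality="AT", orthography="new"): "de-AT-tokenizer",
    WritingSystem("de"): "de-generic-tokenizer",
}

def find_resource(ws: WritingSystem) -> Optional[str]:
    """Fall back to more abstract writing systems by dropping constraints."""
    candidates = [
        ws,
        replace(ws, locality=None),                    # no locality constraint
        replace(ws, locality=None, orthography=None),  # fully generic fallback
    ]
    for candidate in candidates:
        if candidate in RESOURCES:
            return RESOURCES[candidate]
    return None
```

A Swiss German request (locality="CH") would then fall through to the generic German tokenizer instead of failing outright.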
XNLRDF<br />
XNLRDF stands for ‘Natural Language Resource Description Framework.’ It is an Open<br />
Source Framework for Multilingual Computing, designed to allow applications to find<br />
language resources autonomously, given a text document to be processed and a task<br />
to be achieved. XNLRDF is distributed either as XML-RDF or as a database dump.<br />
References<br />
Boswell, D. et al. (2002). Creating Applications with Mozilla. Sebastopol: O’Reilly.<br />
“E-MELD”. Online at http://emeld.org.<br />
“Ethnologue”. Online at http://www.ethnologue.com.<br />
“European Parliament Proceedings Parallel Corpus 1996-2003”. Online at http://people.csail.mit.edu/koehn/publications/europarl.<br />
Haspelmath, M. et al. (eds) (2005). The World Atlas of Language Structures. Oxford: Oxford University Press.<br />
“Hot Potatoes”. Online at http://web.uvic.ca/hrd/halfbaked.<br />
ISO 639 (1989). Code for the representation of the names of languages.<br />
ISO 3166-1 (1997). Codes for the representation of names of countries and their subdivisions -- Part 1: Country codes.<br />
ISO 3166-2 (1998). Codes for the representation of names of countries and their subdivisions -- Part 2: Country subdivision code.<br />
ISO 3166-3 (1999). Codes for the representation of names of countries and their subdivisions -- Part 3: Code for formerly used names of countries.<br />
Manola, F. & Miller, E. (eds) (2004). “RDF Primer”. W3C Recommendation, 10 February 2004. Online at http://www.w3.org/TR/rdf-primer/.<br />
“Norwegian” (2006, March 7). In Wikipedia, The Free Encyclopedia. Retrieved March 7, 2006. Online at http://en.wikipedia.org/wiki/Norwegian.<br />
“OLAC, the Open Language Archives Community project”. Online at http://www.language-archives.org/documents/overview.html.<br />
“Omniglot”. Online at http://www.omniglot.com.<br />
Prinsloo, D. & Heid, U. (this volume). “Creating Word Class Tagged Corpora for Northern Sotho by Linguistically Informed Bootstrapping”, 97-115.<br />
“Rosetta Project”. Online at http://www.rosettaproject.org.<br />
Simons, G. & Bird, S. (eds) (2001). OLAC Metadata Set. Online at http://www.language-archives.org/OLAC/olacms.html.<br />
Streiter, O. (this volume). “Implementing NLP-Projects for Small Languages: Instructions for Funding Bodies, Strategies for Developers”, 29-43.<br />
Streiter, O. et al. (2005). “Dynamic Processing of Texts and Images for Contextualized Language Learning”. Proceedings of the 22nd International Conference on English Teaching and Learning in the Republic of China (ROC-TEFL), Taipei, June 4-5, 278-98.<br />
“TITUS”. Online at http://titus.uni-frankfurt.de.<br />
“Translatum”. Online at http://www.translatum.gr/bible/download.htm.<br />
“Unicode Enabled Products”. Online at http://www.unicode.org/onlinedat/products.html.<br />
“Universal Declaration of Human Rights”. Online at http://www.unhchr.ch/udhr/index.htm.<br />
Speech-to-Speech Translation for Catalan<br />
Victoria Arranz, Elisabet Comelles and David Farwell<br />
This paper focuses on a number of issues related to adapting an existing interlingual<br />
representation system to the idiosyncrasies of Catalan in the context of the FAME<br />
Interlingual Speech-to-Speech Machine Translation System for Catalan, English and<br />
Spanish. The FAME translation system is intended to assist users in making hotel<br />
reservations when calling or visiting from abroad. Following a brief presentation of<br />
the Catalan language, we describe the system and review the results of a major<br />
user-centered evaluation. We then introduce Interchange Format (IF), the interlingual<br />
representation system underlying the translation process, and discuss six types of<br />
language-dependent problems that arose in extending IF to the treatment of Catalan,<br />
along with our approach to dealing with these problems. They include the lack of<br />
dialog-level structural relationships, conceptual gaps, the lack of register distinctions<br />
(specifically, formality), the treatment of proper names, the lack of a method<br />
for dealing with partitives and conceptual overgranularity. Finally, we summarise the<br />
contents and suggest some future directions for research and development.<br />
1. Introduction<br />
The goal of this paper is to review a number of problems that arose in adapting<br />
an existing Interlingua, Interchange Format (IF), to the treatment of Catalan, and<br />
to describe our approach to dealing with them. As classes, these problems are not<br />
peculiar to Catalan per se, but the language presents an interesting case study in<br />
terms of their particular manifestation and how they might be dealt with. They<br />
include the need for representing dialogue-level structural relations, dealing with<br />
conceptual gaps, the need for representing register distinctions, a semi-productive<br />
method for dealing with proper names, the need for representing partitive references,<br />
and dealing with conceptual overgranularity. This effort was part of the development<br />
of the FAME Interlingual Speech-to-Speech Machine Translation System for Catalan,<br />
English and Spanish, which was carried out between 2001 and 2004.<br />
In section 2, we provide a background for the discussion, giving some information<br />
on the Catalan language, and describing the project and the translation system. In section<br />
3, we briefly describe an evaluation procedure, and present some results from a<br />
major user-centred evaluation. This section demonstrates the feasibility of the adaptation<br />
and the success of the system, which was publicly demonstrated at the 2004 Forum<br />
of Cultures in Barcelona (with a very positive outcome). In section 4, we discuss<br />
Interchange Format (IF), the interlingua underlying the translation process, the<br />
inadequacies encountered and the modifications made while adapting the framework<br />
to Catalan and Spanish. Finally, in Section 5, we summarize the results and conclude<br />
with a discussion of future directions.<br />
2. Background<br />
The Catalan language, with all its variants, is spoken in the Països Catalans, which<br />
include the Spanish regions of Catalonia, Valencia and Balearic Islands, the French<br />
department of the Pyrénées-Orientales, and the Italian area of Alghero. Within<br />
Spanish territory, Catalan is also spoken in some parts of Aragon and Murcia. Catalan<br />
is a Romance language and shows similarities with other languages belonging to the<br />
Romance family, in particular with Spanish, Galician and Portuguese. Nowadays,<br />
Catalan is understood by 9 million people and spoken by 7 million people.<br />
The FAME Interlingual Speech-to-Speech Translation System for Catalan, English<br />
and Spanish was developed at the Universitat Politècnica de Catalunya (UPC), Spain,<br />
as part of the recently completed European Union-funded FAME project (Facilitating<br />
Agents for Multicultural Exchange) that focused on the development of multi-modal<br />
technologies to support multilingual interactions (see http://isl.ira.uka.de/fame<br />
for details). The FAME translation system is an extension of the existing NESPOLE!<br />
translation system (Metze et al. 2002; Taddei et al. 2003) to Catalan and Spanish in<br />
the domain of hotel reservations. At its core is a robust, scalable, interlingual speech-<br />
to-speech translation system having cross-domain portability that allows for effective<br />
translingual communication in a multi-modal setting. Although the system architecture<br />
was initially based on NESPOLE!, all the modules have now been integrated on an<br />
Open Agent platform (Holzapfel et al. 2003; for details see http://www.ai.sri.com/<br />
~oaa). This type of multi-agent framework offers a number of technical features for<br />
a multi-modal environment that are highly advantageous for both system developers<br />
and users.<br />
Broadly speaking, the FAME translation system consists of an analysis component<br />
and a generation component. The analysis component automatically transcribes spoken<br />
source language utterances and then maps that transcription into an interlingual<br />
representation. The generation component maps from interlingua into target language<br />
text that, in turn, is passed to a speech synthesiser that produces a spoken version<br />
of the text. The central advantage of this interlingua-based architecture is that, in<br />
adding additional languages to the system (such as Catalan and Spanish), it is only<br />
necessary to develop new analysis and generation components for each new language<br />
in order to be able to translate into and out of all of the other existing languages in<br />
the system. In other words, no source-language-to-target-language specific transfer<br />
modules are required, as would be the case for transfer systems, with the result that<br />
the development task is considerably simplified.<br />
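The saving can be made concrete with a little arithmetic: an interlingual architecture needs one analysis and one generation component per language, whereas a transfer architecture needs a dedicated module for every ordered language pair. The following is a schematic count, not a description of the actual FAME code base:

```python
def interlingua_modules(n: int) -> int:
    # One analysis component plus one generation component per language.
    return 2 * n

def transfer_modules(n: int) -> int:
    # One dedicated transfer module per ordered source-target pair.
    return n * (n - 1)

# Adding Catalan and Spanish to an existing language set: the transfer
# count grows quadratically, the interlingual count only linearly.
for n in (3, 4, 6):
    print(f"{n} languages: {interlingua_modules(n)} vs {transfer_modules(n)} modules")
```

Beyond three languages the transfer count overtakes the interlingual count, which is why adding a language to FAME only required new analysis and generation components.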
For both Catalan and Spanish speech recognition, we used the JANUS Recognition<br />
Toolkit (JRTk) developed at Universität Karlsruhe and Carnegie Mellon University<br />
(Woszczyna et al. 1993). For the text-to-text component, the analysis side utilises<br />
the top-down, chart-based SOUP parser (Gavaldà 2000) with full domain action level<br />
rules to parse input utterances. Natural language generation is done with GenKit, a<br />
pseudo-unification-based generation tool (Tomita et al. 1988). For both Catalan and<br />
Spanish, we use a <strong>Text</strong>-to-Speech (TTS) system fully developed at UPC, which uses a<br />
unit-selection-based, concatenative approach to speech synthesis.<br />
The Interchange Format (Levin et al. 2002), the interlingua used by the C-STAR<br />
Consortium (see http://www.c-star.org for details), has been adapted for this effort.<br />
Its central advantage in representing dialogue interactions such as those typical of<br />
speech-to-speech translation systems is that it focuses on identifying the speech acts<br />
and the various types of requests and responses typical of a given domain. Thus,<br />
rather than capturing the detailed semantic and stylistic distinctions, it characterises<br />
the intended conversational goal of the interlocutor. Even so, in mapping into or<br />
out of IF, it is necessary to take into account a wide range of structural and lexical<br />
properties related to Catalan and Spanish.<br />
For the initial development of the Spanish analysis grammar, the already existing<br />
NESPOLE! English and German analysis grammars were used as a reference point.<br />
Despite using these grammars, great efforts had to be made to overcome important<br />
differences between English, German and the Romance languages in focus. The<br />
Catalan analysis grammar, in turn, was adapted from the Spanish analysis grammar,<br />
and, in this case, the process was rather straightforward. The generation grammars<br />
for Catalan and Spanish were mostly developed from scratch, although some of the<br />
underlying structure was adapted from that of the NESPOLE! English generation<br />
grammar. Language-dependent properties such as word order, gender and number<br />
agreement, and so forth needed to be dealt with representationally, but on the whole,<br />
starting with existing structural descriptions proved to be useful. On the other hand,<br />
the generation lexica play a significant role in the generation process and these had to<br />
be developed from scratch. As for the generation grammars, however, a considerable<br />
amount of work took place in parallel for both Romance languages, which contributed<br />
to a more efficient development of both the Catalan and Spanish generation lexica.<br />
3. Evaluation<br />
The evaluation was carried out with real users of the speech-to-speech<br />
translation system, in order to both:<br />
• examine the performance of the system in as real a situation as possible, as if<br />
it were to be used by a real tourist trying to book accommodation in Barcelona;<br />
and,<br />
• study the influence of using speech input, and thus Automatic Speech Recognition<br />
(ASR), in translation.<br />
3.1 Task-Oriented Evaluation Metrics<br />
A task-oriented methodology was developed to evaluate both the end-to-end<br />
system (with ASR and TTS) and the source language transcription to target language<br />
text subcomponent. An initial version of this evaluation method had already proven<br />
useful during system development, since it allowed us to analyse content and form<br />
independently, and thus contributed towards practical system improvements.<br />
The evaluation metric used recognises three main categories (Perfect, Ok and<br />
Unacceptable), where the second is further subdivided into Ok+, Ok and Ok-. During<br />
the evaluation, this metric was independently applied to two separate parameters,<br />
form and content. In order to evaluate form, only the generated output (text or<br />
speech) was considered by the evaluators. To evaluate content, evaluators took into<br />
account both the input utterance or text and the output text or spoken utterance.<br />
Accordingly, the meaning of the metrics varies depending on whether they are being<br />
used to judge form or to judge content:<br />
• Perfect: well-formed output (form) or communication of all the information the<br />
speaker intended (content).<br />
• Ok+/Ok/Ok-: acceptable output, grading from only some minor error of form (e.g.<br />
missing determiner) or some minor uncommunicated information (Ok+) to some<br />
more serious problem of form or uncommunicated information content (Ok-).<br />
• Unacceptable: unacceptable output, either essentially unintelligible (form) or<br />
information unrelated to the input (content).<br />
3.2 Evaluation Results<br />
The results obtained from the evaluation of the end-to-end translation system for<br />
the different language pairs are shown in Tables 1, 2, 3 and 4. The results obtained<br />
from the translation of clean audio-transcriptions are summarised in Tables 5, 6, 7<br />
and 8. From the results, we can conclude that many of the errors are caused by the<br />
ASR component. This is particularly so when translating from English 1 into Catalan<br />
or Spanish. For instance, if we consider the form parameter, Tables 7 and 8 show<br />
that there are no unacceptable translations when using the text-to-text interlingual<br />
translation system for the English-Catalan and English-Spanish, while Tables 3 and 4<br />
show that performance drops by 5.99% and 9.60%, respectively, when using the speech-<br />
to-speech system.<br />
In fact, the interlingual translation component performs very well when used on<br />
text input and degrades when using speech input. However, it should be pointed<br />
out that, even so, results remain rather good for the end-to-end system. For the<br />
worst of our language pairs (English-Spanish), a total of 62.4% of the utterances were<br />
judged acceptable in regard to content. This is comparable to evaluation results of<br />
other state-of-the-art systems such as NESPOLE! (Lavie et al. 2002), which obtained<br />
slightly lower results and was performed on Semantic Dialog Units (see below) instead<br />
of utterances (UTT), thus simplifying the translation task. The Catalan-English and<br />
English-Catalan pairs were both quite good with 73.1% and 73.5% of the utterances<br />
being judged acceptable, respectively, and the Spanish-English pair performs very<br />
well with 96.4% of the utterances being acceptable.<br />
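The acceptability percentages quoted here are simply the complements of the ‘Unacceptable’ content rows in Tables 1-4 below; recomputing them is a quick sanity check (our own arithmetic on the published figures):

```python
# 'Unacceptable' content percentages, end-to-end system with ASR (Tables 1-4).
unacceptable_content = {
    "Catalan-English": 26.90,
    "Spanish-English": 3.58,
    "English-Catalan": 26.50,
    "English-Spanish": 37.60,
}

# Everything that is not Unacceptable (Perfect, Ok+, Ok, Ok-) counts as acceptable.
acceptable = {pair: round(100 - pct, 1) for pair, pct in unacceptable_content.items()}
print(acceptable)
```

The computed values (73.1, 96.4, 73.5 and 62.4) match the figures quoted in the text.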
Table 1: Evaluation of End-to-End Translation (with ASR)<br />
for the Catalan-English Pair. Based on 119 UTTs.<br />
SCORES FORM CONTENT<br />
Perfect 70.59% 31.93%<br />
OK+ 5.04% 15.12%<br />
OK 6.72% 9.25%<br />
OK- 9.25% 16.80%<br />
Unacceptable 8.40% 26.90%<br />
Table 2: Evaluation of End-to-End Translation (with ASR) for the Spanish-English Pair.<br />
Based on 84 UTTs.<br />
SCORES FORM CONTENT<br />
Perfect 92.85% 71.42%<br />
OK+ 4.77% 11.90%<br />
1 It should be pointed out that the efforts to develop the ASR systems were focused on the Catalan and<br />
Spanish language models. The language model for the English ASR was used as is, as provided by the<br />
NESPOLE! partners. As a result, the English ASR was not as domain-sensitive and, consequently, more<br />
error prone. The only work done was to enlarge its lexicon.<br />
OK 1.19% 7.14%<br />
OK- 0% 5.96%<br />
Unacceptable 1.19% 3.58%<br />
Table 3: Evaluation of End-to-End Translation (with ASR) for the English-Catalan Pair. Based<br />
on 117 UTTs.<br />
SCORES FORM CONTENT<br />
Perfect 64.96% 34.19%<br />
OK+ 15.39% 11.97%<br />
OK 8.54% 14.52%<br />
OK- 5.12% 12.82%<br />
Unacceptable 5.99% 26.50%<br />
Table 4: Evaluation of End-to-End Translation (with ASR) for the English-Spanish Pair.<br />
Based on 125 UTTs.<br />
SCORES FORM CONTENT<br />
Perfect 64.80% 17.60%<br />
OK+ 4.80% 10.40%<br />
OK 12.00% 18.40%<br />
OK- 8.80% 16.00%<br />
Unacceptable 9.60% 37.60%<br />
Table 5: Evaluation of Translation for Audio Transcription of the Catalan-English Pair. Based<br />
on 119 UTTs.<br />
SCORES FORM CONTENT<br />
Perfect 85.72% 73.10%<br />
OK+ 5.89% 13.45%<br />
OK 2.52% 4.20%<br />
OK- 4.20% 6.73%<br />
Unacceptable 1.69% 2.52%<br />
Table 6: Evaluation of Translation for Audio Transcription of the Spanish-English Pair. Based<br />
on 84 UTTs.<br />
SCORES FORM CONTENT<br />
Perfect 96.42% 91.66%<br />
OK+ 2.38% 3.57%<br />
OK 0% 0%<br />
OK- 0% 3.57%<br />
Unacceptable 1.20% 1.20%<br />
Table 7: Evaluation of Translation for Audio Transcription of the English-Catalan Pair. Based<br />
on 117 UTTs.<br />
SCORES FORM CONTENT<br />
Perfect 89.75% 88.89%<br />
OK+ 8.55% 1.70%<br />
OK 1.70% 0.85%<br />
OK- 0% 4.28%<br />
Unacceptable 0% 4.28%<br />
Table 8: Evaluation of Translation for Audio Transcription of the English-Spanish Pair. Based<br />
on 125 UTTs.<br />
SCORES FORM CONTENT<br />
Perfect 95.2% 82.4%<br />
OK+ 4% 7.2%<br />
OK 0.8% 3.2%<br />
OK- 0% 5.6%<br />
Unacceptable 0% 1.6%<br />
4. Interchange Format<br />
In this section, we discuss the use of the Interchange Format (IF) for Machine<br />
Translation, and then we examine some problems of applying IF to new languages such<br />
as Catalan and Spanish.<br />
4.1 Introduction to Interchange Format and Discussion<br />
IF is based on Searle’s Theory of Speech Acts (Searle 1969). It tries to represent<br />
the speaker’s intention rather than the meaning of the sentence per se. In the hotel<br />
reservation domain, there are several speech acts, such as giving information about<br />
a price, asking for information about a room type, verifying a reservation, and so<br />
on. Since domain concepts such as prices, room type and reservation are included in<br />
the representation of the act, in our interlingua, such speech acts are referred to as<br />
Domain Actions (DAs), and it is these actions that are discussed below. These DAs<br />
are formed by different combinatory elements expressing the semantic information<br />
that needs to be communicated.<br />
Generally speaking, an IF representation has the following elements:<br />
Speaker’s Tag + DA + Arguments<br />
The Speaker’s Tag may be ‘a’ for the agent’s contributions, or ‘c’ for the client’s.<br />
Inside the DA we find the following elements:<br />
• Speech Act: a compulsory element that can appear alone or followed by other<br />
elements. Examples of Speech-Acts include: give-information, negate, request-<br />
information, etc.<br />
• Attitude: an optional element that represents the attitude of the speaker when<br />
explicitly present. Some examples are: +disposition, +obligation, and so on.<br />
• Main Predication: a compulsory element that represents what is talked about.<br />
Examples of these elements are: +contain, +reservation, and so on; and,<br />
• Predication Participant: optional elements that represent the objects talked<br />
about, for instance, +room, +accommodation, and so on.<br />
The DA is followed by a list of arguments. These elements are expressed by<br />
argument-value pairs positioned inside a list and separated by a “,”.<br />
By way of example, an IF representation of the sentence in (1) contains all the<br />
elements mentioned above:<br />
(1) Would you like me to reserve a room for you?<br />
IF: a: request-information+disposition+reservation+room (for-whom=you,<br />
who=i, disposition=(who=you, desire), room-spec=(quantity=1, room))<br />
From this representation we know that the speaker is the agent and that he is<br />
asking for some information. The attitude expressed here is a desire of a second<br />
person singular, that is, the client. The main predication is to make a reservation and<br />
the predication participant is one room.<br />
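The Speaker’s Tag + DA + Arguments layout can be mimicked with a toy parser. This is our own simplification for exposition; the real IF specification and its argument grammar are far richer:

```python
import re

def parse_if(expr: str) -> dict:
    """Split a simplified IF expression into speaker tag, domain action and arguments."""
    speaker, rest = expr.split(":", 1)
    match = re.match(r"([\w+-]+)\s*(?:\((.*)\))?$", rest.strip(), re.S)
    elements = match.group(1).split("+")
    return {
        "speaker": speaker.strip(),   # 'a' = agent, 'c' = client
        "speech_act": elements[0],    # compulsory first element of the DA
        "domain_action": elements,    # speech act plus attitude/predications
        "arguments": match.group(2),  # raw argument-value list
    }

parsed = parse_if(
    "a: request-information+disposition+reservation+room "
    "(for-whom=you, who=i, disposition=(who=you, desire), room-spec=(quantity=1, room))"
)
```

Applied to example (1), this yields speaker ‘a’, speech act request-information, and the remaining DA elements and arguments as strings.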
Interchange Format is heavily influenced by English, and this may cause problems<br />
when using it to represent Romance languages such as Catalan or Spanish. Most of<br />
these problems are solvable, however, and in general, IF works rather well to represent<br />
both languages. The following subsections describe six different issues that have been<br />
encountered when adapting the IF to Catalan and Spanish.<br />
4.2 Dialogue Context Ambiguity<br />
The meaning of an expression sometimes changes depending on the dialog context.<br />
That is to say, a unique expression can have different meanings depending on its place<br />
in the conversation. This is the case, for instance, of the Catalan expression digui’m.<br />
In Catalan, it has a different meaning depending on whether it is used when answering a<br />
telephone call or when responding to a suggestion. This difference in meaning is seen<br />
here in examples (2) and (3):<br />
(2) 9-ENG-CLIENT: Shall I give you my Visa number then?<br />
10-CAT-AGENT: Digui’m.<br />
Go ahead.<br />
10-IF:a: request-action+proceed (who=you, communication-mode=phone)<br />
(3) CAT-AGENT: Viatges Fame, digui‘m?<br />
Fame Travel. Hello?<br />
IF: a: introduce-self (who=name-viajes_fame)<br />
a: dialog-greet (who=you, to-whom=I, communication-mode=phone)<br />
In example (2), the agent uses the expression to indicate to the client that he is<br />
ready and that the client may proceed to give his Visa number. However, in (3), the<br />
expression appears at the beginning of a conversation, as a kind of greeting indicating<br />
to the client that the agent is already listening to him.<br />
There is currently no way to represent dialog structure information within the<br />
interlingual formalism, and so only one of the translations (go ahead) is used as a<br />
default; but a solution would not be difficult to implement. The first step would be to<br />
represent various types of conversational contexts (opening, response-to-offer, etc.),<br />
and then to modify the analysis grammars to parse differently according to context. In<br />
this case, the analyser recognises that it is parsing (and thus interpreting) a dialogue-opening<br />
segment (indicating one meaning, i.e., hello) or a post-offer-information<br />
segment (indicating a different meaning, i.e., go ahead).<br />
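The proposed fix amounts to indexing the analysis of an ambiguous expression by conversational context. A minimal sketch follows; the context labels and the fallback policy are our illustration, not part of the IF specification:

```python
# Hypothetical context-indexed IF interpretations of Catalan "digui'm".
DIGUIM_IF = {
    "dialogue-opening":  "a: dialog-greet (who=you, to-whom=I, communication-mode=phone)",
    "response-to-offer": "a: request-action+proceed (who=you, communication-mode=phone)",
}

def interpret_diguim(context: str) -> str:
    # Without context information, fall back to the current single default
    # reading ('go ahead'), as the system described in the text does.
    return DIGUIM_IF.get(context, DIGUIM_IF["response-to-offer"])
```

Once the analyser knows which segment type it is parsing, the lookup resolves the ambiguity between examples (2) and (3).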
4.3 Formality Feature<br />
In Catalan and Spanish, there is a distinction between formal and informal personal<br />
pronouns, especially for second person singular and plural. However, as the IF is<br />
influenced by English, this distinction is not reflected in this interlingua. In example<br />
(4), the verbal form ajudar-lo (to help you) implies a formal relationship between<br />
the speaker and singular addressee, while in (5) ajudar-te (to help you), the implied<br />
relationship is familiar.<br />
(4) CAT-AGENT: ¿En què puc ajudar-lo?<br />
IF: a: offer+help (help=(who=i, to-whom=you))<br />
(5) CAT-AGENT: ¿En què puc ajudar-te?<br />
IF: a: offer+help (help=(who=i, to-whom=you))<br />
But if we inspect the IF representations for both examples, we see that they are<br />
the same. This is due to the lack of a formality feature in this interlingua. This does<br />
not imply any problem when translating from Catalan/Spanish into English, as the<br />
latter does not have any formal register; but it could cause a loss of meaning when<br />
translating from Catalan into Spanish or vice versa, for instance, or from either of<br />
these two languages into French, for example, which also makes a second person<br />
register distinction.<br />
To solve this problem of representing register, we can add a new argument-<br />
value pair to the IF with the argument [formal=] and the values (yes) or (no). When<br />
implementing this new feature, the IF representation for examples (4) and (5) would<br />
be (6) and (7), respectively.<br />
(6) IF: a: offer+help (help=(who=i, (to-whom=you, formal=yes)))<br />
(7) IF: a: offer+help (help=(who=i, (to-whom=you, formal=no)))<br />
Through the use of these new argument-value pairs, we would be able to<br />
communicate the feature of formality and have it available in the target language, if<br />
applicable.<br />
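On the generation side, the proposed [formal=] argument would then steer clitic selection. A toy sketch using the surface forms from examples (4) and (5); the lookup itself is ours:

```python
def generate_offer_help(formal: str) -> str:
    """Realise offer+help in Catalan from the proposed formal=(yes|no) value."""
    forms = {
        "yes": "¿En què puc ajudar-lo?",  # formal addressee, example (4)
        "no":  "¿En què puc ajudar-te?",  # familiar addressee, example (5)
    }
    return forms[formal]
```

The same two-way switch would serve Spanish, or any target language with a second-person register distinction.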
4.4 Conceptual Gaps in Catalan and Spanish<br />
Another problem we had to overcome when developing the Catalan and Spanish<br />
grammars had to do with the lexicons. Since IF was developed with English as point<br />
of reference, there are IF values that refer to lexical items that do not exist in<br />
Catalan or Spanish per se. In essence, the semantic field is not divided equivalently<br />
between the languages. Sometimes, it is a word or an expression, such as Christmas<br />
crackers, that does not exist in either Catalan or Spanish culture. When facing<br />
this problem, we keep the English word, as there is no cultural equivalent.<br />
Sometimes the solution is not that straightforward, however, given that the word<br />
without equivalent in Catalan or Spanish is an important word in the dialogue. This is<br />
the case of king-size bed and queen-size bed, as shown in example (8). Both words are<br />
rather important within the hotel reservation domain we work in, and what’s more,<br />
the client is supposed to be an English speaker, so he would most definitely use them. As a<br />
consequence, we could not adopt the solution proposed in the previous example, and<br />
we had to introduce phrasal equivalents based on already existing Catalan/Spanish<br />
words referring to bed types to cover those two values. The Catalan and Spanish<br />
equivalents would be un llit extragran and un llit gran, for Catalan, and una cama<br />
extragrande and una cama grande, for Spanish.<br />
(8) ENG-CLIENT: I would like a room with a king-size bed.<br />
IF: c: give-information+disposition+bed (disposition=(who=I, desire),<br />
room-spec=(quantity=1, room, contain=(quantity=1, king-bed)))<br />
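The gap-filling strategy reduces to a small lexicon that maps IF values without a native lexeme to the phrasal equivalents introduced above, and passes the English form through otherwise. The dictionary form and the fallback value name are our sketch; the equivalents are those given in the text:

```python
# Phrasal equivalents introduced for bed-type values with no native lexeme.
GAP_LEXICON = {
    ("king-bed", "cat"):  "un llit extragran",
    ("king-bed", "spa"):  "una cama extragrande",
    ("queen-bed", "cat"): "un llit gran",
    ("queen-bed", "spa"): "una cama grande",
}

def realise(if_value: str, lang: str) -> str:
    # Culture-bound values with no equivalent at all (e.g. Christmas
    # crackers) keep the English word, as described in the text.
    return GAP_LEXICON.get((if_value, lang), if_value)
```

This keeps the IF values language independent while letting each generation lexicon decide how to verbalise them.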
4.5 Proper Nouns<br />
Currently, all proper names are included in the IF by the use of values. That is<br />
to say, each proper name is represented by a different value. In our domain, proper<br />
names are mainly person names, street names, city names, hotel names, names of<br />
monuments, museums and other attractions, and so on. For instance, the proper<br />
name Hotel Duc de la Victoria is represented in the IF under the class *barcelona-<br />
hotel-names* by the value [name-hotel_duc_de_la_victoria]. Although this is a good<br />
way to represent proper names when they are well known to the interlocutors, we<br />
should also point out that it implies a great effort on the developer’s part. Whenever<br />
a new proper name is added, it should be first included in the IF Specification files,<br />
and then both analysis and generation grammars of all languages have to be updated<br />
to include this new proper name. As a consequence, all developers have to be aware<br />
of the new values included in the IF specifications, especially those working on the<br />
analysis side. Otherwise, this proper name will not be analysed.<br />
Moreover, when developing our analysis grammars, we had to deal with the<br />
phenomena of bilingualism in Barcelona, and in Catalonia in general. In Barcelona,<br />
both Catalan and Spanish are spoken, and when using a proper name there is a certain<br />
degree of code mixing. As a result, the name may be in Catalan, in Spanish, or in a<br />
mixture of both languages, as shown in (9). When including proper names in Catalan<br />
analysis grammars, we took into account all those forms: the Catalan name, the<br />
Spanish name and the hybrid version were added under the IF value representing the<br />
proper name, as shown in (10):<br />
(9) CAT: Carrer Pelai (Pelai Street)<br />
SPA: Calle Pelayo<br />
SPA/CAT: Calle Pelai<br />
(10) [name-carrer_pelai]<br />
(carrer pelai)<br />
(calle pelayo)<br />
(calle pelai)<br />
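Handled this way, the analysis grammar is essentially a many-to-one mapping from attested surface variants to a single IF value. Schematically (the dictionary form is ours, not the SOUP grammar formalism):

```python
# All attested variants from (9) map to the single IF value in (10).
PROPER_NAME_VARIANTS = {
    "carrer pelai": "[name-carrer_pelai]",  # Catalan
    "calle pelayo": "[name-carrer_pelai]",  # Spanish
    "calle pelai":  "[name-carrer_pelai]",  # Catalan/Spanish code mixing
}

def analyse_proper_name(surface: str):
    # Unknown names yield None -- exactly the maintenance problem noted
    # above whenever a new proper name enters the IF specification.
    return PROPER_NAME_VARIANTS.get(surface.lower())
```

The None case makes the cost visible: every new name requires updating this mapping (and its counterparts in every other language's grammars).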
In any case, while this way of treating proper names is adequate, a good way to<br />
avoid the significant effort it entails would be to generate proper name representations<br />
automatically, or scrap proper name representations altogether and pass proper name<br />
forms directly to the target language as strings. Either way, the key is to deal with<br />
proper name translation independently from translation of other expressions.<br />
4.6 Catalan ‘de’ Partitive<br />
In Catalan there is a phenomenon called the de partitive. This construction is used<br />
when a qualifying adjective has an elided head, or when it is used in construction<br />
with the impersonal pronoun en, as shown in example (11a). In the sentence (11b),<br />
there is a noun phrase llit extragran (king-size bed) formed by a head noun (llit – bed)<br />
and a qualifying adjective (extragran – king-size). In Catalan, this noun phrase can be<br />
transformed by eliding the head noun llit, inserting the pronoun en in its place and<br />
introducing the adjective extragran by the preposition de (in this case d’).<br />
(11a) CAT: En tenim un d’extragran<br />
ENG: We have one in king-size.<br />
(11b) CAT: Tenim un llit extragran<br />
ENG: We have a king-size bed<br />
At first sight, an interlingua representation for (11a) would be example (12). In this<br />
representation, we have an [object-spec=] argument that<br />
contains a subargument [size=] with the value (king-bed-size). The focus of the<br />
representation is on the size of the object without explicitly mentioning the type per<br />
se.<br />
(12) give-information+existence+object (provider=we, object-spec=<br />
(quantity=1, size=king-bed-size))<br />
However, (11a) actually continues to refer to a king-size bed. Furthermore, ideally,<br />
the interlingua should be language independent. It would not be fair to create a new<br />
value such as king-bed-size just for Catalan, especially since we could represent this<br />
sentence through already existing values and arguments, as in example (13).<br />
(13) give-information+existence+object (provider=we, object-spec=<br />
(quantity=1, bed-spec=king-bed))<br />
In this representation, the segment un d’extragran is represented by the<br />
subargument [bed-spec=], used to represent the types of beds.<br />
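The mapping just described can be sketched in code. This is a toy illustration, not the FAME implementation; the function name and the adjective table are invented for the example, and only the IF values from examples (12)–(13) come from the text.

```python
# Toy sketch (not the FAME system): mapping the Catalan "de" partitive onto
# the IF of example (13). The adjective table is illustrative: 'extragran'
# is folded into the existing bed-spec values rather than a new size value.

def catalan_partitive_to_if(quantity, adjective):
    """Map 'En tenim un d'ADJ' to an IF using an existing subargument."""
    adj_map = {"extragran": ("bed-spec", "king-bed")}  # hypothetical table
    slot, value = adj_map[adjective]
    return ("give-information+existence+object",
            {"provider": "we",
             "object-spec": {"quantity": quantity, slot: value}})

speech_act, args = catalan_partitive_to_if(1, "extragran")
print(speech_act)          # give-information+existence+object
print(args["object-spec"])
```

The point of the sketch is that the elided-head construction is resolved at analysis time, so the interlingua never needs a Catalan-specific value.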
4.7 Excess of Conceptual Granularity<br />
The Interchange Format is a formalism intended to express the meanings of<br />
different parts of a sentence. In some cases, however, the representation of this<br />
meaning is too specific with respect to one or another of the languages in the system.<br />
This is especially common in regard to representing modifiers such as adjectives.<br />
For example, (14) shows two different IF values that, with respect to Catalan and<br />
Spanish, can both be taken to mean the same thing: ‘old.’ The corresponding term
for both values is vell in Catalan and viejo in Spanish. The difference between the
English lexical counterparts has less to do with semantics as such than with
their distributional properties.
(14) [ancient]<br />
[antique]<br />
In this case then, the IF values are too specific, since the meaning they convey<br />
could be included under a single value. The solution is to introduce an IF class value
subsuming [ancient] and [antique], and then to map the Catalan or Spanish lexeme
to or from this class value. In translating from English to Catalan or Spanish, one
simply moves from the particular value to the class value, since Catalan and Spanish
have no distinct equivalents for the particular values. When translating from Catalan or
Spanish into English, the appropriate English lexeme is selected on the basis of the class
of the head element modified.
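The class-value strategy can be sketched as follows. The class name old and the head-class table are assumptions invented for illustration; only the particular values [ancient] and [antique] and the two-direction mapping come from the text.

```python
# Illustrative sketch of the class-value idea for [ancient]/[antique].
# 'old' as the class name and the head-class table are invented examples,
# not actual IF inventory.

PARTICULAR_TO_CLASS = {"ancient": "old", "antique": "old"}

# When generating English from the class value, the lexeme is picked from
# the class of the modified head (table contents are illustrative).
LEXEME_BY_HEAD_CLASS = {"monument": "ancient", "furniture": "antique"}

def to_interlingua(english_value):
    # English -> Catalan/Spanish direction: collapse to the class value
    return PARTICULAR_TO_CLASS.get(english_value, english_value)

def to_english(class_value, head_class):
    # Catalan/Spanish -> English direction: select lexeme by head class
    if class_value == "old":
        return LEXEME_BY_HEAD_CLASS[head_class]
    return class_value

print(to_interlingua("antique"))       # old
print(to_english("old", "furniture"))  # antique
```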
5. Conclusions<br />
This presentation began with a brief description of the FAME Speech-to-Speech<br />
Machine Translation System, and the results of a user-oriented evaluation of the<br />
system for both voice (with ASR) and clean audio-transcription inputs. It was observed
that the interlingual translation component performs very well on clean input
and that, as expected, its performance degrades on spoken input.
Nonetheless, we are satisfied with system performance, although we acknowledge that<br />
further work should be done, especially in order to improve the ASR throughput.<br />
Next, Interchange Format (IF) was introduced, and we examined a number of<br />
language-particular problems that arose while applying IF to the representation of<br />
Catalan. In each case, we described the solutions we used, or propose to use, to<br />
overcome these problems, including improvements to IF that should widen its coverage
and make it easier for developers to use.
In the future, we hope to continue to develop the system along three general
lines:
• We would like to implement the changes and improvements proposed for
IF, and see how they work and to what extent they help to widen the coverage of the
interlingua;
• We would like to improve the ASR component of our translation system,<br />
and try to find solutions to overcome possible problems due to spontaneous speech<br />
and disfluencies; and,<br />
• We also expect to extend the coverage of our grammars and lexica,<br />
not only to other areas of the travel domain, but also to other domains such as<br />
medicine.<br />
6. Acknowledgments<br />
This research has been partially financed by the FAME (IST-2001-28323) and ALIADO<br />
(TIC2002-04447-C02) projects. We would especially like to thank Climent Nadeu and<br />
Jaume Padrell, for all their help and support in numerous aspects of the project.<br />
We are also grateful to other UPC colleagues, such as José B. Mariño and Adrià de
Gispert, and to our colleagues at CMU, Dorcas Alexander, Donna Gates, Lori Levin, Kay<br />
Peterson and Alex Waibel, for all their feedback and assistance.<br />
References
“C-STAR.” Online at http://www.c-star.org.
“FAME.” Online at http://isl.ira.uka.de/fame.
Gavaldà, M. (2000). “SOUP: A Parser for Real-world Spontaneous Speech.” Proceedings<br />
of the 6th International Workshop on Parsing Technologies (IWPT-2000), Trento,<br />
Italy.<br />
Holzapfel, H. et al. (2003). FAME Deliverable D3.1: Testbed Software, Middleware<br />
and Communication Architecture.<br />
Lavie, A. et al. (2002). “A Multi-Perspective Evaluation of the NESPOLE! Speech-to-<br />
Speech Translation System.” Proceedings of ACL-2002 Workshop on Speech-to-Speech<br />
Translation: Algorithms and Systems. Philadelphia, PA, 121-128.<br />
Levin, L. et al. (2002). “Balancing Expressiveness and Simplicity in an Interlingua<br />
for Task based Dialogue.” Proceedings of ACL-2002 Workshop on Speech-to-Speech<br />
Translation: Algorithms and Systems. Philadelphia, PA, 53-60.<br />
Metze, F. et al. (2002). “The NESPOLE! Speech-to-Speech Translation System.”<br />
Proceedings of HLT-2002, San Diego, California.<br />
Searle, J. (1969). Speech Acts: An Essay in the Philosophy of Language. Cambridge,<br />
UK: Cambridge University Press.<br />
Taddei, L. et al. (2003). NESPOLE! Deliverable D17: Second Showcase Documentation.<br />
http://nespole.itc.it.<br />
“The Open Agent Architecture™.” Online at http://www.ai.sri.com/~oaa.
Tomita, M. & Nyberg, E.H. (1988). “Generation Kit and Transformation Kit, Version 3.2,
User’s Manual.” Technical Report CMU-CMT-88-MEMO, Center for Machine Translation,<br />
Carnegie Mellon University, Pittsburgh, PA.<br />
Woszczyna, M. et al. (1993). “Recent Advances in JANUS: A Speech Translation System.”<br />
Proceedings of Eurospeech-1993, Berlin.<br />
Computing Non-Concatenative Morphology: The Case of Georgian¹

Olga Gurevich
Georgian (Kartvelian) is a less commonly studied language, with a complex, non-<br />
concatenative verbal morphology. This paper examines characteristics of Georgian<br />
that make it a challenge for language learners and for current approaches to<br />
computational morphology. We present a computational model for generation and<br />
recognition of Georgian verb conjugations, and describe one practical application of<br />
the model to help language learners.<br />
1. Introduction<br />
Georgian (Kartvelian) is the official language of the Republic of Georgia and has
about 4 million native speakers. Georgian morphology is largely synthetic, with
complex verb forms that can often express the meaning of a whole sentence. Georgian
has sometimes been called agglutinative (Hewitt 1995), but such a classification does
not fully describe the complexity of the language.
Descriptions of Georgian verbal morphology emphasise the large number of<br />
inflectional categories, the large number of elements that a verb form can contain,<br />
the dependencies between the occurrence of various elements, and the large number<br />
of regular, semi-regular, and irregular patterns of formation of verb inflections. All of<br />
these factors make computational modeling of Georgian morphology a rather daunting<br />
task. To date, no successful large-scale models of parsing or generation of Georgian<br />
are available.<br />
In this paper, I propose a computational model for parsing and generation of a<br />
subset of Georgian verbal morphology that relies on a templatic, word-based analysis<br />
of the verbal system, rather than assuming compositional rules for combining<br />
individual morphemes. I argue that such a model is viable, extensible, and capable of<br />
capturing the generalisations inherent in Georgian verbal morphology at various levels<br />
of regularity.<br />
1 This research was in part supported by the Berkeley Language Center. Thanks to Mark Kaiser, Claire<br />
Kramsch, Lisa Little, Nikolas Euba, David Malinowski, and Sarah Roberts for many hours of productive<br />
discussion and wonderful suggestions, and to Aaron Siegel for technical support. I am eternally grateful<br />
to Vakhtang Chikovani and Shorena Kurtsikidze for help in creating the website, and for introducing me<br />
to Georgian. I alone am to blame for any errors and omissions.
I begin with a brief overview of Georgian verbal morphology, emphasising the<br />
factors that complicate its computational modelling. I present an analysis grounded<br />
in word-based approaches to morphology and Construction Grammar, and suggest that<br />
this type of analysis lends itself more easily to computational implementations than<br />
analyses that assume morpheme-based compositionality. Following a brief overview<br />
of existing approaches to computational morphology, I propose a model for Georgian<br />
and describe it in detail. The model is currently implemented as a cascade of finite-<br />
state transducers (Beesley & Karttunen 2003), but probabilistic and connectionist<br />
extensions or alternative implementations are plausible. Finally, I describe a practical<br />
application of this model for language learning: an online database of Georgian verb<br />
conjugations.<br />
2. An Overview of Georgian Verbal Morphology<br />
The morphosyntax of Georgian verbs is characterised by a variety of lexical<br />
(irregular), semi-regular, and completely regular patterns. The verb forms themselves<br />
are made up of several kinds of morphological elements that recur in different<br />
formations. These elements can be formally identified in a fairly straightforward<br />
fashion; however, their function and distribution defy a simple compositional analysis
and are instead determined by the larger morphosyntactic and semantic contexts in
which the verbs appear (usually tense, aspect, and mood) and by the lexical properties<br />
of the verbs themselves. The combination of morphosyntactic and lexical factors also<br />
determines the case marking on the verb’s arguments.<br />
The specific types of morphological elements and peculiarities in their function<br />
and distribution are described below. The main point of this section is that a language<br />
learner and a computational model are faced with patterns in which formal elements<br />
(morphs) do not have identifiable, context-independent meanings that can be combined<br />
compositionally to form whole words. Rather, they must contend with a variety of<br />
patterns at various degrees of regularity. In computational terms, this amounts to a<br />
series of rules of varying specificity, backed up by defaults.<br />
The linguistic analysis at the core of the computational model splits Georgian<br />
verbs into several lexical classes. The lexical classes are described on the basis of<br />
example paradigms, using frequent verbs belonging to each class. This is in contrast<br />
to a more rule-oriented description in which lexical classes may be identified by some<br />
morphological or syntactic feature. In the rest of this section, I argue that an
example-based description is the only one plausible for learners of Georgian, and provides a
good basis for computational modeling as well.<br />
2.1 Series and Screeves<br />
Georgian verbs inflect in tense / mood / aspect (TAM) paradigms called screeves<br />
(from mck’rivi ‘row’). There are a total of eleven screeves in Modern Georgian,<br />
although only ten are actively used. Screeves can be grouped into three series based<br />
on morphological and syntactic commonalities, as in Table 1:<br />
Table 1 – Series and Screeves

  Series I                                    Series II             Series III
  Present sub-series     Future sub-series    (aorist)              (perfect)
  Present                Future               Aorist                Perfect
  Imperfect              Conditional          --                    Pluperfect
  Present subjunctive    Future subjunctive   Aorist subjunctive    (Perf. subj.)*
Knowing the series and screeve of a verb form is essential for being able to conjugate<br />
it. Screeve formation exhibits a number of lexical, semi-regular, and regular patterns,<br />
some of which are examined below.<br />
Georgian verbs are often divided into four conjugation classes, based mostly on<br />
valency (cf. Harris 1981). For now, I will concentrate on transitive verbs; it will be<br />
necessary to mention the other classes (unergative, unaccusative, and indirect) in the<br />
discussion of case-marking below. The structure of a verb form can be described using<br />
the following (simplified) template:<br />
(Preverb₁)-(Pron1₂)-(PRV₃)-root₄-(TS₅)-(Scr₆)-(Pron2₇)²
The approximate function of each element is as follows:<br />
• Preverb – marks aspectual distinctions, lexically associated with each<br />
verb (similar to verbal prefixes in Slavic or German).<br />
• Pron1 – Prefixal pronominal agreement slot.<br />
• PRV – pre-radical vowel slot, serves a variety of functions in different<br />
contexts.<br />
• Root – the only required part of the verb form.<br />
• TS – Thematic Suffix. Participates in the formation of several tenses,<br />
predicts certain inflectional properties of the verb.<br />
* The Perfect Subjunctive is almost never used in contemporary Georgian.<br />
2 Cf. Hewitt 1995.<br />
• Scr – Screeve marker. This is a screeve (tense) ending which may depend<br />
on verb class and agreement properties.<br />
• Pron2 – suffixal agreement slot.<br />
The preverb, root, and thematic suffix must be lexically specified in all cases,<br />
although their distribution follows a somewhat regular pattern described in the next<br />
section. Other elements in the template are distributed according to more or less<br />
regular principles, although some lexical exceptions do exist.<br />
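As a rough illustration, the template can be rendered as a slot-filling function. The helper itself and the slot assignment chosen for the Perfect example are my own simplifications for the sketch, not an analysis claim; the form da-g-i-xat’-avs is taken from Table 2 (ASCII apostrophes stand in for the ejective mark).

```python
# Toy sketch of the template (Preverb)-(Pron1)-(PRV)-root-(TS)-(Scr)-(Pron2):
# optional slots default to empty; only the root is obligatory.

SLOTS = ("preverb", "pron1", "prv", "root", "ts", "scr", "pron2")

def build_form(**slots):
    # The root is the only required element of the verb form
    assert slots.get("root"), "the root is the only required element"
    return "".join(slots.get(s, "") for s in SLOTS)

# da-g-i-xat'-avs 'You have painted' (Perfect, from Table 2); the split of
# -avs into TS -av plus screeve -s is an illustrative assumption.
print(build_form(preverb="da", pron1="g", prv="i", root="xat'", ts="av", scr="s"))
# A bare Present form with no optional prefixes:
print(build_form(root="xat'", ts="av"))
```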
The templatic composition of the Georgian verb forms suggests, at first blush, an<br />
agglutinative structure. However, a closer examination of the morphological elements<br />
in the verbal template and their function provides evidence against such an analysis. In<br />
particular, the morphological elements do not have identifiable meanings independent<br />
of context, and their meanings do not compositionally comprise the meanings of the<br />
words in which they participate. As argued in Gurevich (2003), the morphological<br />
elements of Georgian cannot be thought of as morphemes, or smallest meaningful<br />
elements of form. Rather, word-level constructions determine both the meaning of<br />
the whole word, and the collection of morphological elements that comprise the word.<br />
This combination of templatic morphological structure and non-compositional meaning<br />
construction makes Georgian inflectional morphology look non-concatenative.<br />
As an illustration, let us examine the formation of the verb xat’va ‘paint’ in<br />
Table 2. The screeves (and, more generally, series) govern the distribution of the<br />
morphological elements.<br />
Table 2: Screeves of xat’va ‘paint’

Series                  Screeve        2SgSubj, 3Obj form
I    Pres. subseries    Present        xat’-av           ‘You paint’
                        Imperfect      xat’-av-di        ‘You were painting’
                        Pres. Subj.    xat’-av-de        ‘You should paint’
     Fut. subseries     Future         da-xat’-av        ‘You will paint’
                        Conditional    da-xat’-av-di     ‘You would paint’
                        Fut. Subj.     da-xat’-av-de     ‘If you could paint’
II                      Aorist         da-xat’-e         ‘You painted’
                        Aor. Subj.     da-xat’-o         ‘You have to paint’
III                     Perfect        da-g-i-xat’-avs   ‘You have painted’
                        Pluperfect     da-g-e-xat’-a     ‘You should have painted’
In addition to the multitude of morphological elements in any given verb form, the
distribution and lexical dependency of the elements make a learner’s task difficult.
Preverbs, thematic suffixes and screeve endings present particular difficulties.<br />
The preverbs form a closed class of about eight. A preverb (da- for the verb ‘paint’)<br />
appears on forms from the Future subgroup of series I, and on all forms of series II<br />
and III in transitive verbs. The preverbs are by origin spatial prefixes that now mark<br />
perfective aspect. However, the presence of a preverb on a verb form signals more<br />
than just a change in aspect. For example, the preverb differentiates the Conditional<br />
from the Imperfect, and the meaning of the two screeves differs in more than aspect.<br />
An additional difficulty is in the lexical connection between prefixes and verb roots,<br />
similar to the verbal prefixes in Slavic or German. Table 3 demonstrates some of<br />
the lexically-dependent morphological elements, including several different preverbs<br />
(row ‘Future’).<br />
Similarly, thematic suffixes (otherwise known as screeve suffixes or screeve<br />
formants) form a closed class and are lexically associated with verb roots. In general,<br />
thematic suffixes do not appear to have independent meaning. Rather, they serve to<br />
mark the inflectional class of the verb, because they determine certain patterns of<br />
inflectional behavior in different screeves.<br />
On transitive verbs, thematic suffixes appear in all series I forms. Their behavior<br />
in other series differs by individual suffix: in series II, most suffixes disappear, though<br />
some seem to leave partial ‘traces’. In series III, all suffixes except –av/-am disappear<br />
in the Perfect screeve; and in Pluperfect, all suffixes disappear, but the inflectional<br />
ending that takes their place does depend on the original suffix (rows ‘Present’ and<br />
‘Perfect’ in Table 3).<br />
The next source of semi-regular patterns comes from the inflectional endings in<br />
the individual screeves and the corresponding changes in some verb roots (row ‘Aorist’<br />
in Table 3).<br />
Finally, another verb form relevant for learners is the masdar, or verbal noun,<br />
which is the closest substitute of the infinitive in Georgian. The masdar may or may<br />
not include the preverb and/or some variation of the thematic suffix (last row in Table<br />
3). The formation of the masdar is particularly important, as it is the reference form<br />
listed in most Georgian dictionaries, even though it might not even start with the<br />
same letter as an inflected verb form.<br />
Table 3: Lexical Variation

                        ‘Bring’         ‘Paint’        ‘Eat’
Present                 igh-eb-s        xat’-av-s      ch’-am-s
Future                  c’amo-ighebs    da-xat’avs     she-ch’ams
Aorist, 3Sg Subject     c’amoigh-o      daxat’-a       shech’am-a
Perfect                 c’amough-ia     dauxat’-avs    sheuch’am-ia
Masdar (verbal noun)    c’amo-gh-eba    da-xat’-va     ch’-am-a
In many cases, the inflectional endings and root changes can be determined if we<br />
know the thematic suffix of the verb (cf. the painstakingly detailed description of such<br />
patterns in Hewitt 1995). However, there are exceptions to most such connections,<br />
and learning the patterns based on explicit rules seems virtually impossible.<br />
On the other hand, screeve formation in some instances exhibits remarkable
regularity. Thus, the Imperfect and First Subjunctive screeves are regularly formed
from the Present. Similarly, the Conditional and Future Subjunctive are formed from<br />
the Future. And for most (though not all) transitive verbs, the Future is formed from<br />
the Present via the addition of a preverb.<br />
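These regular derivations can be sketched directly from the forms in Table 2. This is a toy illustration; the suffix segmentation simply follows that table (ASCII apostrophes stand in for the ejective mark), not a claim about underlying morphemes.

```python
# Sketch of the regular derivations just described, using xat'va 'paint'
# (2SgSubj forms from Table 2).

def future(present, preverb):
    # For most (though not all) transitive verbs: Future = preverb + Present
    return preverb + present

def imperfect(present):
    # The Imperfect is regularly formed from the Present (suffix -di in Table 2)
    return present + "di"

def conditional(future_form):
    # The Conditional is regularly formed from the Future (same -di suffix)
    return future_form + "di"

pres = "xat'av"
print(future(pres, "da"))               # daxat'av
print(imperfect(pres))                  # xat'avdi
print(conditional(future(pres, "da")))  # daxat'avdi
```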
Additionally, the number of possible combinations of inflectional endings, root<br />
changes and other irregularities is also finite, and some choices tend to predict other<br />
choices in the paradigm of a given verb (e.g. the selection of thematic suffix or Aorist<br />
2Sg Subj ending often predicts the Aorist Subjunctive ending). Although the rule-<br />
based analysis is unproductive, Georgian verbs can be classified according to several<br />
example paradigms, or inflectional (lexical) classes. This is similar to the inflectional<br />
class distinctions made in Standard European languages; the major difference is that<br />
the number of classes is much greater in Georgian than in other languages. One such<br />
classification is presented in Melikishvili (2001), distinguishing seventeen inflectional<br />
classes for transitive verbs alone, and over sixty classes overall. While the exact<br />
number of inflectional classes is still in question (see the discussion in section 4.4),<br />
the general example-based approach seems the only one viable for Georgian.<br />
The next section deals with subject and object agreement, a completely regular<br />
yet non-concatenative phenomenon.<br />
2.2 Subject and Object Agreement<br />
A Georgian verb can mark agreement with both its subject and its object via a<br />
combination of prefixal and suffixal agreement markers, as in Table 4:<br />
Table 4: Agreement in Present

                                        OBJECT
Subj     1SG            1PL             2SG           2PL           3
1SG      --             --              g-xat’av      g-xat’av-t    v-xat’av
1PL      --             --              g-xat’av-t    g-xat’av-t    v-xat’av-t
2SG      m-xat’av       gv-xat’av       --            --            xat’av
2PL      m-xat’av-t     gv-xat’av-t     --            --            xat’av-t
3SG      m-xat’av-s     gv-xat’av-s     g-xat’av-s    g-xat’av-t    xat’av-s
3PL      m-xat’av-en    gv-xat’av-en    g-xat’av-en   g-xat’av-en   xat’av-en
The distribution and order of attachment of agreement affixes have been the subject
of much discussion in the theoretical morphological literature (Anderson 1992; Halle &
Marantz 1994; and Stump 2001). To simplify matters for the computational model, I<br />
assume here that the prefixal and suffixal markers attach to the verb stem at the same<br />
time, and indicate the combined subject and object properties of a paradigm cell.<br />
While the prefixal markers and the suffix –t appear in all screeves, the suffixes<br />
in 3Sg and 3Pl Subject forms are screeve-dependent (cf. row ‘Aorist’ in Table 3).<br />
These suffixes therefore belong to the semi-regular patterns, while the rest of the<br />
agreement system is completely regular.<br />
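Because this part of the system is completely regular, Table 4 can be generated by a small function. The sketch below simply reads the prefix and suffix choices off the table; the person/number labels and the function itself are my own notation, and ASCII apostrophes stand in for the ejective mark.

```python
# Sketch reproducing Table 4 (Present-screeve agreement). The single prefix
# slot and the suffix choices below are read directly off the table.

def agree(subj, obj, stem="xat'av"):
    # Prefix slot: object markers win; v- appears only with 3rd-person objects
    if obj == "1sg":
        prefix = "m"
    elif obj == "1pl":
        prefix = "gv"
    elif obj in ("2sg", "2pl"):
        prefix = "g"
    else:  # 3rd-person object
        prefix = "v" if subj.startswith("1") else ""
    # Suffix slot: -en for 3pl subjects, -s for 3sg subjects, -t for plurality
    if subj == "3pl":
        suffix = "en"
    elif subj == "3sg":
        suffix = "t" if obj == "2pl" else "s"
    elif subj in ("1pl", "2pl") or obj == "2pl":
        suffix = "t"
    else:
        suffix = ""
    return "-".join(part for part in (prefix, stem, suffix) if part)

print(agree("1sg", "3"))    # v-xat'av
print(agree("3sg", "2pl"))  # g-xat'av-t
print(agree("2sg", "1pl"))  # gv-xat'av
```

Note how the single prefix slot forces object markers and the subject marker v- to compete: this multiple exponence is exactly what a naive one-morpheme-per-arc FST model cannot express directly.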
Another difficulty arises in series III for transitive verbs. Here, the subject and<br />
object agreement appears to be the inverse of that in series I and II (Table 5; notice<br />
the different designation of rows and columns). This phenomenon, called inversion,<br />
corresponds to a reverse case marking of the nominal arguments (see next section).<br />
Several analyses have been proposed suggesting that, in inversion, the semantic subject<br />
corresponds to a ‘surface’ indirect object, and the semantic object corresponds to a<br />
‘surface’ subject (Harris 1981). However, a simple difference in linking does not fully<br />
explain the paradigm composition. In inverted paradigms, plural number agreement<br />
is still sensitive to the semantic arguments (namely, the semantic subject / agent<br />
triggers plural agreement regardless of other agreement or case-marking facts).<br />
Table 5: Agreement in Perfect

                                        SUBJECT
Object   1SG            1PL             2SG           2PL           3
1SG      --             --              g-xat’av      g-xat’av-t    v-xat’av
1PL      --             --              g-xat’av-t    g-xat’av-t    v-xat’av-t
2SG      m-xat’av       gv-xat’av       --            --            xat’av
2PL      m-xat’av-t     gv-xat’av-t     --            --            xat’av-t
3SG      m-xat’av-s     gv-xat’av-s     g-xat’av-s    g-xat’av-t    xat’av-s
3PL      m-xat’av-en    gv-xat’av-en    g-xat’av-en   g-xat’av-en   xat’av-en
2.3 Subject and Object Case Marking<br />
Case marking of nominal arguments in Georgian is not constant, but depends on<br />
the conjugation (valency) class of the verb and the series / screeve of the verb forms.<br />
Transitive verbs can follow one of three patterns, depending on series:<br />
(1) k’ac-i dzaγl-s xat’avs<br />
man-NOM dog-DAT paint.Pres.3SgSubj
“The man paints / is painting the dog.” (Series I, Present – Pattern A)<br />
(2) k’ac-ma dzaγl-i daxat’a<br />
man-ERG dog-NOM paint.Aor.3SgSubj<br />
“The man painted the dog.” (Series II, Aorist – Pattern B)<br />
(3) k’ac-s dzaγl-i t’urme dauxat’avs<br />
man-DAT dog-NOM apparently paint.Perf.3SgSubj<br />
“The man has painted the dog.” (Series III, Perfect – Pattern C)<br />
Table 6 demonstrates the case-marking patterns by series for all four conjugation<br />
classes. Only transitive and unergative (active intransitive) verbs show variability by<br />
series. Unaccusative verbs always follow Pattern A (similar to the standard nominative/<br />
accusative pattern in European languages), and indirect verbs always follow Pattern<br />
C (the inverse pattern). In order to assign correct case marking, a learner of Georgian<br />
must recognise the conjugation class of each verb, as well as the series / screeve for<br />
some of the verb classes.<br />
Table 6 – Case-Marking Patterns

Series    Transitive    Unaccusative    Unergative    Indirect
I         A             A               A             C
II        B             A               B             C
III       C             A               C             C
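Table 6, together with the case assignments seen in examples (1)–(3), can be summarised in a small lookup sketch. The function and labels are illustrative only; the pattern data come straight from the table and the glossed examples.

```python
# Sketch of Table 6 plus the case assignments of patterns A-C, as seen in
# examples (1)-(3): A = NOM subject / DAT object, B = ERG/NOM, C = DAT/NOM.

PATTERN = {  # (conjugation class) -> series -> pattern
    "transitive":   {"I": "A", "II": "B", "III": "C"},
    "unaccusative": {"I": "A", "II": "A", "III": "A"},
    "unergative":   {"I": "A", "II": "B", "III": "C"},
    "indirect":     {"I": "C", "II": "C", "III": "C"},
}
CASES = {  # pattern -> (subject case, object case)
    "A": ("NOM", "DAT"),
    "B": ("ERG", "NOM"),
    "C": ("DAT", "NOM"),
}

def case_marking(verb_class, series):
    """Return (subject case, object case) for a class/series combination."""
    return CASES[PATTERN[verb_class][series]]

print(case_marking("transitive", "II"))  # ('ERG', 'NOM'), as in example (2)
```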
2.4 Summary<br />
The formation of the screeves exhibits several irregular, semi-regular, and regular<br />
patterns. The morphological elements in the Georgian verb template are easy to<br />
identify, suggesting an agglutinative structure. However, closer inspection reveals<br />
that the morphological elements may not have easily identifiable meanings or<br />
functions (cf. preverbs, thematic suffixes, and screeve endings). Moreover, even if<br />
we manage to find meanings for these elements, the meanings will not predict the<br />
distribution of such elements across different verbs, verb types, and screeves. Such<br />
non-compositionality in meaning makes Georgian more similar to morphologically<br />
non-concatenative languages such as Arabic and Hebrew.<br />
On the basis of the data above, it is argued in (Gurevich 2003) and (Blevins<br />
forthcoming 2006) that a word-based morphological theory is more appropriate for<br />
Georgian. In such a theory, word formation is determined by whole-word patterns,<br />
such that the whole word carries morphosyntactic properties, and they need not be<br />
assigned to individual morphemes. Gurevich (forthcoming 2006) suggests that such<br />
patterns may be represented as constructions, or form-meaning pairings in which the<br />
elements of form need not match the elements of meaning one-to-one. The analysis is<br />
based on insights of Construction Grammar (Fillmore 1988; Goldberg 1995). It is argued<br />
that the main organising unit and the best level for morphosyntactic constructions in<br />
Georgian is the series. The series provides a base for expressing the more or less<br />
regular patterns of Georgian morphosyntax. The less regular and more lexicalised<br />
information, on the other hand, is best expressed using inflectional (lexical) classes<br />
of verbs.<br />
3. Approaches to Computational Morphology<br />
3.1 Standard Assumptions and Difficulties Presented by Georgian<br />
Many contemporary approaches to computational morphology are based on, or can<br />
be easily translated into, finite-state networks (FSN). In such approaches, an arc in the<br />
FSN often corresponds to a phoneme or morpheme, and the recognition or generation<br />
of each arc advances the state in the network. Many approaches, including Beesley &<br />
Karttunen (2003), are implemented as two-way finite-state transducers (FST) in which<br />
each arc corresponds to a mapping of two elements, for example, a phoneme and its<br />
phonetic realisation, or a morpheme and its meaning. As a result, FST morphology<br />
very often assumes morpheme-level compositionality, the idea that the meaning of<br />
a word is compositionally made up from the meanings of its constituent morphemes.<br />
FST morphology has, for the most part, been applied to concatenative morphological<br />
systems like Finnish, although there have been some recent applications to templatic<br />
morphology such as Arabic (Beesley & Karttunen 2003).<br />
As demonstrated in the previous section, assumptions of morphemic compositionality<br />
do not serve well to describe the verbal morphology of Georgian. The Georgian verb<br />
forms are made up of identifiable morphological elements (i.e., elements of form), but<br />
the meaning of these elements is not easily identifiable, and does not stay constant in<br />
different morphosyntactic contexts.<br />
A computational system appropriate for Georgian should be able to accommodate<br />
the templatic nature of Georgian verb forms and its patterns of regularity and sub-<br />
regularity. Overall it should be able to describe the following:<br />
• Meaning carried by a whole word form rather than by individual<br />
morphemes;<br />
• Lexical root alternations and suppletion;<br />
• Lexical class-dependent screeve formation (e.g. the endings in the<br />
Aorist);<br />
• The dependency between the formation of some screeves from that of<br />
others (e.g. the Imperfect from the Present); and<br />
• The multiple exponence of agreement, that is, the use of suffixes and<br />
prefixes simultaneously, and the simultaneous expression of subject and object<br />
agreement.<br />
The linguistic analysis of Georgian verbal morphology suggested in the previous<br />
section relies on insights from Construction Grammar. Unfortunately, there are<br />
currently no computational implementations of CG capable of handling complex<br />
morphological systems. Bryant (2003) describes a constructional syntactic parser,<br />
based on general principles of chart parsing. However, this parser cannot yet handle<br />
morphological segmentation, and adapting it for Georgian would require substantial<br />
revision.<br />
Fortunately, FST tools for computational morphology have advanced to the point<br />
where they can handle some aspects of non-concatenative morphology. The next<br />
section briefly describes the approach in Beesley & Karttunen (2003) and what makes it<br />
a possible candidate for modelling at least a subset of Georgian verbal morphology.<br />
3.2 Xerox Finite-State Morphology Tools<br />
Beesley & Karttunen (2003) present the state-of-the-art set of tools for creating<br />
finite-state morphological models. The book is accompanied by implementations of<br />
the two Xerox languages: xfst (designed for general finite-state manipulations) and<br />
lexc (designed more specifically for defining lexicons). Since our goal was to reproduce<br />
morphotactic rules of word formation rather than the structure of the lexicon, xfst<br />
was used.<br />
Xfst provides all of the basic commands for building up single or two-level finite-<br />
state networks (i.e., transducers), such as concatenation, intersection, and so forth.<br />
In addition, xfst has several built-in shortcuts that make network manipulation<br />
easier, such as various substitution commands. Xfst distinguishes between words of a<br />
natural language (composed of single characters) and multi-character symbols, used<br />
in our model to indicate morphosyntactic properties such as person or number. Each
complete path through a finite-state network compiled using xfst represents a mapping
between a set of morphosyntactic and semantic properties (on the upper side) and a<br />
full word form that realises those properties (on the lower side).<br />
Another very useful feature of xfst is the ability to create scripts with several<br />
commands in a sequence. The later commands can operate on the output of earlier<br />
commands, and can thus create a cascade of finite-state transducers. Xfst also provides<br />
convenient ways of outputting all the words recognised by a given transducer, which<br />
proved very useful in the creation of the online reference (see section 5). An updated<br />
version of xfst (Beesley & Karttunen forthcoming 2006) also includes support for
UTF-8.
While finite-state technology is very good at generating and recognising regular<br />
expressions, it has a harder time capturing other features of natural language such<br />
as non-concatenative morphological structure. The next section describes some<br />
adaptations that allow FST to handle many of the non-concatenative patterns in<br />
Georgian.<br />
In addition, FST is not designed to represent a dynamic, living mental lexicon of<br />
an actual speaker. It does not provide any mechanisms for probabilistic decisions, or<br />
for recognition and generation of novel inflectional forms. The concluding section<br />
discusses some possible future developments in this area.<br />
4. Computational Model of the Georgian Verb<br />
4.1 General Idea<br />
As argued above, Georgian verb morphology can be described as a series of patterns<br />
at various levels of regularity. Most of the patterns specify particular morphosyntactic<br />
or semantic properties of verb forms and the corresponding combinations of elements<br />
in the morphological templates. In the model proposed here, screeve formation is<br />
viewed as lexical or semi-regular, and pronominal agreement is viewed as completely<br />
regular.<br />
Screeve formation for different conjugation classes (transitive, unergative,<br />
unaccusative, and inverse) is fairly different in Georgian, and so each conjugation class<br />
is implemented as a separate network. Nevertheless, the principles for composing<br />
each network are the same.<br />
The model is implemented as a cascade of finite-state transducers, that is, as<br />
several levels of FST networks such that the result of composing a lower-level network<br />
serves as input to a higher-level network. The levels correspond to the division of<br />
templatic patterns into completely lexical (Level 1) and semi-regular (Level 2). Level<br />
3 contains completely regular patterns that apply to the results of both Level 1 and<br />
Level 2. The result of compiling Level 3 patterns is the full set of conjugations for the<br />
verbs whose lexical information is included in Level 1. The FST model can be used<br />
both for the generation of verbal inflections and for recognition of complete forms.<br />
In general, the most specific or irregular information is contained at the lower<br />
levels. The higher levels, by contrast, contain defaults that apply if there is no more<br />
specific information. The verbs explicitly mentioned in the lexical level (Level 1) are<br />
representative examples of lexical classes, as posited by the linguistic analysis in<br />
section 2. Through the use of diacritics and replacement algorithms, other verbs are<br />
matched to their lexical classes and are included in the resulting network.<br />
The main advantage of this implementation is in the separation of lexical,<br />
or irregular, verb formation patterns from the semi-regular or completely regular<br />
patterns. The initial input to the FST cascade includes only the necessary lexical<br />
information about each verb and verb class; the computational model does the rest<br />
of the work.<br />
The model described here served as the basis for an online reference on Georgian<br />
verb conjugation, described in section 5. This practical application underlies some of<br />
the specific choices in implementing the model.<br />
The current implementation of the model focuses on transitive verbs; however,<br />
there are obvious ways of extending the model to apply to other verb classes.<br />
4.2 Level 1: The Lexicon<br />
The first level of the FST model contains lexically specific information. There are<br />
two separate networks. The first network contains information about the gloss and<br />
masdar (verbal noun) of each verb stem.<br />
The second network contains several complete word forms for each verb stem,<br />
providing all the lexically-specific information needed to infer the rest of the<br />
inflections. For the most regular verbs, these are:<br />
• Present screeve, no overt agreement (corresponds to 2Sg Subject, 3Sg<br />
Object);<br />
• Future screeve, no overt agreement;<br />
• Aorist screeve, no overt agreement;<br />
• Aorist, 3Sg Subject, no overt object agreement; and<br />
• Aorist Subjunctive.<br />
Some verbs need additional forms in order to describe their paradigms:<br />
• Present screeve, 3Pl Subject (most verbs have the ending –en, but some end<br />
in –ian); and<br />
• Perfect screeve.<br />
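A Level 1 entry can thus be pictured as a small record of principal parts. The sketch below shows only the shape of such a record; the verb and its forms are rough placeholders, not vetted Georgian:

```python
# Hypothetical Level 1 record: only the lexically unpredictable forms listed
# above are stored; all remaining inflections are inferred at Levels 2 and 3.
LEVEL1_ENTRY = {
    "stem": "xat-av",          # verb root + thematic suffix, the unique key
    "Pres": "xatav",           # Present screeve, no overt agreement
    "Fut": "daxatav",          # Future screeve, no overt agreement
    "Aor": "daxate",           # Aorist screeve, no overt agreement
    "Aor+3SgSubj": "daxata",   # Aorist, 3Sg subject
    "AorSubj": "daxatos",      # Aorist Subjunctive
    # optional extras for verbs that need them, e.g. "Pres+3PlSubj", "Perf"
}

REQUIRED = {"Pres", "Fut", "Aor", "Aor+3SgSubj", "AorSubj"}
assert REQUIRED <= LEVEL1_ENTRY.keys()
```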
The inflected forms are represented as two-level finite-state arcs, with the verb<br />
stem and morphosyntactic properties on the upper side, and the inflected word on the<br />
lower side, as in Figure 1. The purpose of the stem is to uniquely identify each verb.<br />
Verb roots in Georgian are often very short and ambiguous; therefore a combination<br />
of the verb root plus thematic suffix was used. In some cases, even this combination<br />
is insufficient to identify the verb uniquely; in such cases, the preverb may be<br />
necessary as well. It is only important that the verb stem can be uniquely matched in<br />
the network containing glosses; thus, the stem has no theoretical significance in this<br />
model.<br />
Another challenge is posed by the non-concatenative nature of verb agreement.<br />
Recall from section 2 that verb agreement is realised by a pre-stem affix and a final<br />
suffix. Since many of the word forms in Level 1 contain preverbs, the agreement affix<br />
would need to be infixed into the verb form at a later level. Beesley & Karttunen<br />
provide some fairly complex mechanisms for doing infixation in FST; however, the<br />
fixed position of the agreement affixes in the Georgian verb template allows for a<br />
much simpler solution. The forms on Level 1 contain a place holder “+Agr1” for the<br />
prefixal agreement marker (Figure 1), which is replaced by the appropriate marker in<br />
the later levels.<br />
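Because the template fixes the position of the prefixal slot, "infixation" reduces to plain string substitution. A minimal sketch, with an invented preverb, stem, and prefix:

```python
# A Level 1 form carries the "+Agr1" placeholder where the agreement prefix
# belongs, so later levels substitute into the slot instead of infixing.
form = "da+Agr1xatav"                     # preverb + agreement slot + stem

with_prefix = form.replace("+Agr1", "v")  # overt agreement prefix
no_prefix = form.replace("+Agr1", "")     # null agreement prefix
```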
The Level 1 network is produced via scripts from a table of verb forms containing only<br />
the necessary lexical information. Redundancy in human input is thus minimised.<br />
Figure 1 – Simplified FST Script<br />
4.3 Level 2: Semi-regular Patterns<br />
The purpose of Level 2 is to compile inflectional forms that are dependent on other<br />
forms (introduced in Level 1), and to provide default inflections for regular screeve<br />
formation patterns.<br />
An example of the first case is the Conditional screeve, formed predictably from<br />
the Future screeve. The FST algorithm is as follows:<br />
• Compile a network consisting of Future forms;<br />
• Add the appropriate inflectional suffixes (-di for 1st and 2nd person<br />
subject, -da for 3rd person subject);<br />
• Replace the screeve property “+Fut” with “+Cond”; and<br />
• Add the inflectional properties where needed.<br />
The replacement of screeve properties is done using the ‘substitute symbol’<br />
command in xfst; other operations are performed using simple concatenation<br />
commands.<br />
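The four steps can be mirrored in a few lines of ordinary code; this is a sketch with one invented Future form, in which Python string edits stand in for the xfst network commands:

```python
# Derive Conditional forms from Future forms: suffix -di for 1st/2nd person
# subjects, -da for 3rd person, and relabel the screeve tag.
futures = {"xat+Fut": "daxatav"}        # Future screeve, no overt agreement

conditionals = {}
for analysis, form in futures.items():
    base = analysis.replace("+Fut", "+Cond")       # the 'substitute symbol' step
    conditionals[base + "+1/2Subj"] = form + "di"  # 1st/2nd person subject
    conditionals[base + "+3Subj"] = form + "da"    # 3rd person subject
```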
An example of the latter is the addition of 3Pl Subject forms of the Present screeve.<br />
The default suffix is –en, which is added to all verbs unless an exception is specified<br />
at Level 1. The basic algorithm is as follows:<br />
• Compile a network of Present forms, excluding the forms for which<br />
3Pl Subject forms are already specified;<br />
• Add the suffix –en; and<br />
• Add the morphosyntactic property “+3PlSubj”.<br />
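In the same spirit, the default rule can be sketched as follows (invented stems and forms; the second verb stands for the –ian exception type specified at Level 1):

```python
# Attach the default suffix -en to every Present form unless Level 1 has
# already supplied a 3Pl Subject form for that verb.
presents = {"xat": "xatav", "ts": "tsis"}        # invented Present forms
level1_3pl = {"ts": "tsian"}                     # lexically specified -ian form

third_plural = {}
for stem, form in presents.items():
    third_plural[stem + "+Pres+3PlSubj"] = level1_3pl.get(stem, form + "en")
```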
All of the patterns defined at Level 2 are then compiled into a single network,<br />
which serves as input to Level 3.<br />
4.4 Level 3: Regular Patterns<br />
The purpose of Level 3 is to affix regular inflection, namely, subject and object<br />
agreement. As described in section 2, agreement in Georgian is expressed via a<br />
combination of a prefix and a suffix that are best thought of as attaching simultaneously<br />
and working in tandem to express both subject and object agreement. Thus, the<br />
compilation of Level 3 consists of several steps, each of which corresponds to a<br />
paradigm cell.<br />
In each step, all of the word forms from Level 2 are taken as input. The place<br />
holder for the pre-stem agreement affix is then replaced by the appropriate affix (in<br />
some cases, this is null), and the appropriate suffix is attached at the end, as in Figure<br />
1. The resulting networks are then compiled into a single network.<br />
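A sketch of this per-cell compilation, simplified to two cells with illustrative affixes, and with Python strings again standing in for network operations:

```python
# Each paradigm cell pairs a pre-stem prefix (possibly empty) with a final
# suffix; both are applied in tandem to every Level 2 form.
CELLS = {
    "+1SgSubj": ("v", ""),    # prefix v-, no suffix
    "+3SgSubj": ("", "s"),    # no prefix, suffix -s
}

def compile_level3(level2_forms):
    out = {}
    for analysis, form in level2_forms.items():
        for tag, (prefix, suffix) in CELLS.items():
            out[analysis + tag] = form.replace("+Agr1", prefix) + suffix
    return out

paradigm = compile_level3({"xat+Pres": "+Agr1xatav"})
```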
The only difficulty at this level arises when dealing with the ‘inverted’ screeves<br />
(Perfect and Pluperfect). As demonstrated in section 2, the morphological agreement<br />
in these screeves is sensitive to the case-marking of the nominal arguments, which<br />
is the reverse of the regular pattern. However, the composition of the agreement<br />
paradigm is sensitive to the semantic roles played by the arguments: plural number<br />
agreement is still triggered by the semantic agent. In this case, the computational<br />
implementation was motivated by the practical application of the model to the online<br />
reference. A separate set of paradigm cells was created for the inverted tenses,<br />
interpreting the properties ‘Subject’ and ‘Object’ as semantic. The resulting FST<br />
network thus shows no relation between inverted and non-inverted forms (i.e., it<br />
does not capture the generalisation behind inversion). Such an interpretation was<br />
sufficient for the purposes of the conjugation reference. However, the model could<br />
easily be amended to incorporate a different analysis of inversion that relies on the<br />
distinction between semantic and morphological arguments.<br />
4.5 Treatment of Lexical Classes<br />
The input to Level 1 contains a representative for each lexical class, supplied with a<br />
diacritic feature indicating the class number. Other verbs that belong to those classes<br />
could, in principle, be inputted along with the class number, and the FST model could<br />
substitute the appropriate roots in the process of compiling the networks. There are,<br />
however, several challenges to this straightforward implementation:<br />
• Verbs belonging to the same class may have different preverbs as well as<br />
different roots, thus complicating the substitution;<br />
• For many verbs, screeve formation involves stem alternations such as<br />
syncope or vowel epenthesis, again complicating straightforward substitution;<br />
and<br />
• Suppletion is also quite common in Georgian, requiring completely<br />
different stems for different screeves.<br />
As a result, even for a verb whose lexical class is known, several pieces of<br />
information must be supplied to infer the complete inflectional paradigm. The FST<br />
substitution mechanisms are fairly restricted, and so the compilation of new verbs<br />
is currently done using Java scripts performing simple string manipulations. Such<br />
an implementation still makes use of the division into lexical classes. The scripts<br />
make non-example verbs look like example verbs in Level 1 of the FST network by<br />
creating the necessary inflected forms, but the human input to the scripts need only<br />
include the information necessary to identify the lexical class of the verb. Future<br />
improvements to the computational model may include a more efficient method of<br />
identifying lexical classes within FST itself.<br />
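The job of those scripts, in miniature, is to rewrite the example verb's Level 1 forms with the new verb's root and preverb. All names below are invented, and real entries must also handle the stem alternations and suppletion noted above:

```python
# Clone an example paradigm for a new verb of the same lexical class by
# simple string substitution of root and preverb.
def clone_entry(example, root_map, preverb_map):
    old_root, new_root = root_map
    old_pv, new_pv = preverb_map
    return {key: form.replace(old_root, new_root).replace(old_pv, new_pv)
            for key, form in example.items()}

example = {"Pres": "xatav", "Fut": "daxatav"}          # invented example verb
new_verb = clone_entry(example, ("xat", "ket"), ("da", "ga"))
```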
The exact number of lexical classes has not been established with full certainty.<br />
Melikishvili (2001) relies entirely on morphological characteristics of verb inflection<br />
and categorises verb forms into sixty-three different classes; seventeen of those are<br />
for transitive verbs. This classification, however, makes some distinctions that can be<br />
merged in the computational model; for example, certain types of non-productive<br />
stem extensions can be considered part of the lexically specified verb stem.<br />
Another issue is the psychological reality of the lexical classes. A pilot survey of<br />
morphological productivity, conducted with adult speakers of Georgian, suggests that<br />
speakers conjugating nonce verbs rely more on frequent inflectional patterns than on<br />
rule-based comparison with morpho-phonologically similar existing verbs (Gurevich<br />
forthcoming 2006). Such a reliance on frequency is<br />
not reflected in Melikishvili’s classification. The computational model proposed here<br />
takes a small step in this direction by relying on frequent verbs as example paradigms;<br />
however, the model does not have any built-in way to accommodate the relative<br />
frequency of different inflectional patterns. The concluding section suggests some<br />
possible improvements for the future.<br />
4.6 Case Frames<br />
As described in Section 2, another complicating feature of the Georgian verb is<br />
the variability of case-marking patterns for the verb’s arguments. For the purposes<br />
of the online conjugation reference, it was necessary to present the case-marking<br />
information with each verb. Fortunately, the case marking patterns depend almost<br />
entirely on the conjugation class and TAM series of the verb 3 . Since the goal of the<br />
online reference is to describe the morphosyntactic patterns of Georgian, it was<br />
sufficient to simply mention the case-marking pattern for each verb type.<br />
If the purpose of the morphological transducer is to supplement a syntactic<br />
parser, the case-marking information could be represented as a feature structure and<br />
associated with each verb type.<br />
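One possible shape for that representation is a feature-structure table keyed by conjugation class and TAM series. The labels and case values below are simplified illustrations, not an exhaustive inventory:

```python
# Hypothetical feature-structure table: a case frame per verb type.
CASE_FRAMES = {
    ("transitive", "present"): {"subject": "NOM", "object": "DAT"},
    ("transitive", "aorist"):  {"subject": "ERG", "object": "NOM"},
    ("transitive", "perfect"): {"subject": "DAT", "object": "NOM"},
}

def case_frame(conj_class, series):
    """Return the case frame for a verb type, or None if unlisted."""
    return CASE_FRAMES.get((conj_class, series))
```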
4.7 Summary<br />
The computational model presented here accommodates many properties of<br />
Georgian verbal conjugation that make it challenging: the templatic structure of the<br />
verb forms; the non-concatenative nature of word meaning construction; the large<br />
number of irregular and semi-regular word formation patterns; and the interaction<br />
between word formation and case marking on the verb’s arguments. The model<br />
crucially relies on classification of verbs into lexical classes with example paradigms<br />
for each class. The two-level mappings inherent in FST mean that the model can be<br />
used for generation as well as recognition.<br />
5. Practical Application: An Online Reference<br />
5.1 Purpose<br />
The computational model described here serves as the basis for an online reference<br />
on Georgian verb conjugation. The goal of the online reference is to aid the learners<br />
of Georgian in a number of ways:<br />
• It provides complete conjugation tables for two hundred frequently-used verbs;<br />
• The verb database can be searched using any verb form or its English<br />
translation; and,<br />
• For many verb forms, real-life examples from the Internet, as well as<br />
audio and video sources, are provided (along with translations).<br />
3 One of the exceptions is the verb ic’is ‘he/she knows.’ Although this verb is transitive, its subject must<br />
always be in the Ergative and its object, in the Nominative.<br />
• Several types of exercises are available on the website; answers are<br />
automatically checked for correctness.<br />
• The online reference is meant as an addition to the classroom or to self-study<br />
using a textbook such as Kurtsikidze (forthcoming 2006).<br />
5.2 Website Design<br />
The website is divided into four sections: ‘Verb Conjugation’, ‘Examples’,<br />
‘Exercises’, and ‘Resources’.<br />
The section on verb conjugation is the core of the reference tool. It provides<br />
complete tables of verb conjugations, accessible through browsing by individual verb<br />
(in Georgian or in English), or by searching. The conjugated forms are produced using<br />
the FST model described in the previous section; the forms are then automatically<br />
inputted into a MySQL database and displayed on the website using PHP. In addition to<br />
displaying verb forms, the site allows the user to search for a given verb form, using<br />
the recognition capabilities of the FST network. This search capacity demonstrates a<br />
major advantage of online resources over print.<br />
Many of the verb forms are accompanied by handpicked examples of usage from<br />
print sources (mainly online newspapers and chat rooms), audio (from recorded<br />
naturalistic dialogues), and movie clips. The examples are provided as complete<br />
sentences and short paragraphs; translations are available for all examples. Audio<br />
and video examples are likewise accompanied by transcriptions and translations. I am<br />
very grateful to Vakhtang Chikovani for finding and translating the examples.<br />
The ‘Examples’ section of the website provides a different way to access the print,<br />
audio, and video examples. This can be done through browsing by verb, or by searching<br />
(again, in Georgian or in English).<br />
The ‘Exercises’ section contains several different types of exercises to provide<br />
additional practice for using and conjugating verbs. Many of the exercises are<br />
generated based on the conjugated forms or the handpicked examples, and so the<br />
correctness of the answers can be checked automatically.<br />
Finally, the ‘Resources’ section contains links to various online and bibliographical<br />
resources about Georgian, as well as technical suggestions for using Georgian fonts.<br />
The website will be operational in spring 2006; anyone interested in using it should<br />
contact this author.<br />
6. Conclusions and Further Work<br />
This paper represents a first attempt at modelling Georgian verbal morphology<br />
using easily available, off-the-shelf technology such as FST. Using some adaptations to<br />
accommodate the templatic and non-compositional structure of the Georgian verbs,<br />
we were able to make significant progress and produce one practical application of the<br />
computational model for language learners. In short, the model provides a convenient<br />
method for representing the existing lexicon for computational applications such as<br />
parsing or generation.<br />
Naturally, each technology has its drawbacks. FST provides no way to incorporate<br />
frequency information about the Georgian lexicon, and, in general, is not an accurate<br />
model for how verbs are learned. Unfortunately, creating a statistically sensitive model<br />
of the Georgian lexicon is not currently an easy proposition, as there are no available<br />
corpora of Georgian, and no immediate ways of obtaining statistical distributions.<br />
This project will develop in several ways in the future. First, the existing model<br />
will be enriched with more verb types and more inflectional parameters (such as the<br />
use of pre-radical vowels and productive passivization and causativization processes).<br />
Second, I plan to explore ways to incorporate statistical information into the model,<br />
either through the use of connectionist networks or by putting numerical transition<br />
probabilities on the different arcs in the FST transducers. The eventual goal would be<br />
to create a model that can be used for learning Georgian verb conjugations, which<br />
could produce a finite-state network of complete word forms. We also hope that this<br />
model and the online reference and collection of examples can serve as the basis for<br />
the creation of a corpus of spoken Georgian. Information collected in the corpus can<br />
then be used to inform and improve future computational models.<br />
References<br />
Anderson, S.R. (1992). A-Morphous Morphology. Cambridge, England; New York:<br />
Cambridge University Press.<br />
Beesley, K. & Karttunen L. (forthcoming 2006). Finite-State Morphology. Second<br />
Edition. Cambridge/ New York: Cambridge University Press.<br />
Beesley, K. & Karttunen L. (2003). Finite-State Morphology. Cambridge / New York:<br />
Cambridge University Press.<br />
Blevins, J.P. (forthcoming 2006). Word-Based Morphology. Journal of Linguistics.<br />
Boeder, W. (1969). “Über die Versionen des georgischen Verbs.” Folia Linguistica 2,<br />
82-152.<br />
Fillmore, C.J. (1988) “The Mechanisms of ‘Construction Grammar’.” BLS 14, 35-55.<br />
Goldberg, A.E. (1995). Constructions: A Construction Grammar Approach to Argument<br />
Structure. Chicago: University of Chicago Press.<br />
Gurevich, O. (2003). “On the Status of the Morpheme in Georgian Verbal Morphology.”<br />
Berkeley Linguistic Society 29, 161-172.<br />
Gurevich, O. (forthcoming 2006). Constructional Morphology: The Georgian Version.<br />
PhD Dissertation, UC Berkeley.<br />
Halle, M. & Marantz, A. (1994). “Distributed Morphology and the Pieces of Inflection.”<br />
In Hale, K. & Keyser, S.J. (eds.), The View from Building 20: Essays in Linguistics<br />
in Honor of Sylvain Bromberger. Cambridge, MA: MIT Press.<br />
Kurtsikidze, S. (forthcoming 2006). Essentials of Georgian Grammar. München: LINCOM<br />
Europa.
Melikishvili, D. (2001). Kartuli zmnis ughlebis sist’ema [Conjugation system of the<br />
Georgian verb]. Tbilisi: Logos presi.<br />
Stump, G.T. (2001). Inflectional Morphology: A Theory of Paradigm Structure.<br />
Cambridge, New York: Cambridge University Press.<br />
The Igbo Language and Computer Linguistics:<br />
Problems and Prospects<br />
Chinedu Uchechukwu<br />
Computer Linguistics is a wholly undeveloped and almost unknown area of research<br />
in the study of Nigerian languages. Two major reasons can be given for this state of<br />
affairs. The first is the lack of training of Nigerian linguists in this discipline, and the<br />
second is the general newness of computer technology in the country as a whole. This<br />
situation, however, is most likely to change as a result of the increasing introduction<br />
of computer technology in the country, and in the institutions of higher learning in<br />
particular. Such a change is highly promising and most welcome, but it also makes<br />
obvious three main aspects of computer technology that have to be properly addressed<br />
before one can speak with confidence of computer linguistics in connection with any<br />
Nigerian language. These three aspects are: appropriate font programs, good input<br />
systems, and compatible software. This paper looks at the Igbo language in the light of<br />
these points. Section 1, which serves as an introduction, presents the major problems<br />
confronting the language with regard to its realisation in the new technology. Section<br />
2 presents the strategies adopted to take care of these problems. Section 3 examines<br />
the benefits of such strategies for the development of an Igbo corpus and lexicography,<br />
as well as the issue of computer linguistic tools (like spell checkers) for the language.<br />
Finally, section 4, the conclusion, examines the prospects of full-fledged computer<br />
linguistics in the Nigerian setting.<br />
1. Introduction<br />
There are several issues that constitute a major hindrance to the development of<br />
computer linguistics in Nigeria. These range from the implementation problems that<br />
confront the information technology policies of the different Nigerian governments,<br />
to the lack of harmony between such policies and the Nigerian educational systems,<br />
as well as the effects of all these on the different Nigerian languages.<br />
With regard to the government policy, Nigeria has already had two different<br />
computer-related policies. The first was the Nigerian Computer Policy of 1988, and<br />
the second is the newly enacted Nigerian National Policy for Information Technology<br />
(IT) of 2001. As regards the Nigerian National Computer Policy, a comparison of<br />
its goals with actual practice has revealed that the policy itself has not been fully
implemented at all. While Yusuf (1998) sees its lack of success as resulting from<br />
teachers’ incompetence, Jegede & Owolabi (2003) have come to the following<br />
conclusions in their own survey: the policy’s software and hardware stipulations are<br />
completely outdated and not maintained, its teachers’ in-service training provision<br />
has never been practiced, and the stipulated number of computers per secondary<br />
school is rarely to be found. All these findings can only confirm the conclusion that “the<br />
current pronouncements are obsolete and need to be updated within the dynamic<br />
world of computers” (Jegede & Owolabi 2003: 9). The recent Nigerian National<br />
Policy for Information Technology has not fared any better. With the assumption that<br />
information technology can be said to have started more intensively in the country<br />
with the return of democracy in 1999 (Ajayi 2003), the failure of the previous national<br />
computer policy is often overlooked, so that the failings of the older policy are simply<br />
being repeated. But some of what is being presented as the ‘achievements’ of the new<br />
policy actually diverts attention from simple core issues that need to be addressed<br />
before such achievements can be effective nationwide. The first example of such an<br />
achievement is a concentration of energy on the readily visible Internet access and on<br />
IT workshops for high officials of the government at the federal level (while leaving<br />
the average civil servants of the lower cadre to find the means of helping themselves).<br />
However, it is especially those of the lower cadre who are sure to be involved<br />
in the actual implementation and execution of the policies. The second example is the<br />
establishment of the National Information Technology Development Agency (NITDA)<br />
with the sum of about $10 million (Nigerian National Policy for Information Technology<br />
2001:vii). One of the agency’s achievements is its ‘Nigerian Keyboard’ project, which<br />
only led to the production of a single downloadable keyboard DLL file for the Microsoft<br />
operating system (NITDA: http://www.nitda.org/projects/kbd/index.php). It shall be<br />
shown below how the effort that is being made by the private sector, with little or<br />
no financial support, is yielding more benefit. Finally, just like the computer policy of<br />
1988, the more recent policy has also not contributed much to the educational sector.<br />
In his examination of the impact of the most recent policy on Nigeria’s educational<br />
system, Yusuf (2005) sees it as not providing any specific provision for (or application<br />
to) education, being market driven, dependent on imported software, and without<br />
any specific direction at the institutional levels. His conclusions are that:<br />
The need for integration in teaching and learning, the need for quality professional<br />
development programs for pre-service and serving teachers, research, evaluation<br />
and development, and the development of local context software are not addressed<br />
(Yusuf 2005:320).<br />
The overall conclusion from this overview is that the two policies, as well as their<br />
implementation, have not contributed much to the Nigerian educational system.<br />
While agreeing with the analysis of the authors cited above, I would add, however,<br />
that one should also bear in mind that “nobody can give what he does not have.”<br />
For example, one should not expect the bureaucrats, who have always had their<br />
secretaries type their letters, to suddenly understand and fully implement IT policies<br />
that they did not encounter in the course of their training. The same also applies to<br />
the institutions of higher learning. Here one should not, for example, expect lecturers<br />
in the Departments of Linguistics, who did not do any computer-based research work<br />
in the course of their studies, to suddenly start supervising PhD projects in computer<br />
linguistics. In other words, two groups are involved: the civil service and the teaching<br />
force. The civil service is a system that has operated over many decades without the<br />
help of computer technology and that consequently can neither fully appreciate nor<br />
implement the computer technology-related policies as they affect the educational<br />
sector. That also explains why much of the input into making the computer technology<br />
immediately relevant to the Nigerian educational system is coming from other channels<br />
than the federal civil service system itself. The second side is the teaching force within<br />
the Nigerian educational institutions. The majority of Nigerian linguists of the past and<br />
present generation were not trained in the area of computer-based research for the<br />
simple reason that the ordinary typewriter usually was the best machine available to<br />
them at the time they were trained. It is obvious that these scholars would train their<br />
successors in line with what they knew. For this simple reason, it is unjustifiable to<br />
expect them, as well as those they trained, to suddenly start teaching computational<br />
linguistics. The conclusion is that, just like within the civil service, the proper use<br />
of computer technology within the educational sector, especially with regard to the<br />
Nigerian languages, has to come through other channels than those established by<br />
the government, and would involve the inevitable but voluntary contribution of both<br />
private institutions and individuals. This is the simple reality that confronts most<br />
Nigerian languages today.<br />
Finally, the above state of affairs in both the civil service and the educational<br />
sector can be described as human resources related. It is also from this angle that<br />
most analysts of the computer and IT policies of the Nigerian National Computer<br />
Policy (1988) and the Nigerian Policy on Information Technology (2001) have looked<br />
at it. A further confirmation of this is the drive of one single state of the federation,<br />
Jigawa State, to simply ignore the slowly grinding federal structure and seriously<br />
invest in information and communication technology hardware, as well in the training<br />
of its civil servants and teachers. This exemplary position, which other states of the<br />
federation are now imitating 1 , was facilitated through an agreement between the<br />
Jigawa state government and Informatics Holdings of Singapore in 2001 2 . A further<br />
boost to the effort of this single state is the recent W2i (Wireless Internet) award<br />
assigned to the State Governor, Alhaji Ibrahim Saminu Turaki, by the Wireless Internet<br />
Institute of the United States of America, which is an international recognition of<br />
Jigawa State’s investment in ICT and human resources development. However, taking<br />
care of the human resources issue does not simultaneously take care of the<br />
technical needs of the Nigerian languages; on the contrary, it makes these technical<br />
needs ever more apparent. Thus, the limited increase in computer literacy is enough<br />
to make apparent that the Nigerian languages are confronted with two main problems:<br />
(1) an appropriate input system in the form of keyboards; and, (2) the fonts for a<br />
satisfactory combination of diacritics for the scripts of the individual languages.<br />
In the next section, I shall give an overview of the effort made so far to take care<br />
of these two problems for the Igbo language.<br />
2. Computer-Related Problems and their Solutions<br />
The input system and the appropriate font to display the inputted characters are<br />
so intertwined that progress in one cannot take place without a similar progress in<br />
the other. For the Igbo language (as well as other Nigerian languages), this has meant<br />
a constant pendulum movement between fixing the input system and fixing the font.<br />
But generally, the effort to solve the input problem for the Igbo language and other<br />
Nigerian languages had two main phases: the typewriter phase and the computer<br />
phase.<br />
2.1 The Typewriter Phase<br />
The typewriter phase was engineered by Kay Williamson and her colleagues, first<br />
at the University of Ibadan and later at the University of Port Harcourt. This involved<br />
removing certain foreign symbols on the standard typewriter keyboard and replacing<br />
them with special symbols used in Nigerian languages, like the hooked letters of<br />
Hausa, or by diacritics like tone marks and sub-dots. The diacritic keys were changed<br />
to become ‘dead’ keys, so that the diacritic (tone mark or subdot or both) was typed<br />
before the letter which bears it. With the start of the National Computer Policy in<br />
1988 this effort no longer had any further support and consequently came to an end.<br />
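The dead-key mechanism described above can be simulated in software: the diacritic key emits nothing by itself and is instead attached to the base letter that follows it. The sketch below is illustrative only; the key assignments are hypothetical, not Williamson's actual layout.

```python
import unicodedata

# Illustrative dead-key assignments (NOT the actual typewriter layout):
# the diacritic key is struck BEFORE the letter that will bear it.
# U+0323 = combining dot below, U+0301 = acute (high tone),
# U+0300 = grave (low tone).
DEAD_KEYS = {";": "\u0323", "'": "\u0301", "`": "\u0300"}

def type_sequence(keys):
    """Compose a stream of keystrokes: dead keys are held back and
    attached to the next base letter, then normalised to NFC."""
    out, pending = [], []
    for k in keys:
        if k in DEAD_KEYS:
            pending.append(DEAD_KEYS[k])   # hold the diacritic
        else:
            out.append(unicodedata.normalize("NFC", k + "".join(pending)))
            pending = []
    return "".join(out)

# ';' (subdot) followed by 'o' yields the single character ọ (U+1ECD)
print(type_sequence([";", "o"]))
```

Normalisation to NFC means that where Unicode defines a precomposed character (as it does for the subdotted vowels), the output is a single code point rather than a base letter plus combining mark.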
1 http://www.onlinenigeria.com/articles/ad.asp?blurb=117<br />
2 http://www.e-lo-go.de/html/modules.php?name=News&file=article&sid=7512<br />
2.2 The Computer Phase<br />
The computer phase, on the other hand, witnessed different stages. The first<br />
stage, from 1985 onwards, was not concerned with a physical input system (i.e. a<br />
keyboard), but mainly with the development of the appropriate font that could be<br />
inputted with the available English keyboard. This effort was led by Victor Manfredi<br />
on behalf of the Journal of the Linguistic Association of Nigeria (JOLAN), supported by<br />
Edward Ọgụejiofor, a Macintosh programmer in Boston. The first version of the font<br />
was called JolanPanNigerian; it was expanded to include symbols for other major<br />
languages of West Africa, and was consequently renamed PanKwa. The main drawback<br />
with PanKwa was its restriction to Macintosh computers, which were not used in<br />
Nigeria; it has also not been possible to adapt it to other operating systems such<br />
as DOS, Windows or Linux (for further details see Uchechukwu 2004). This situation<br />
remained unchanged until 2000.<br />
The next stage has been aided by a convergence of favourable factors in the 21st
century: the availability of virtual keyboards, the growing adoption of Unicode, and
the drive towards the development of a physical keyboard for Nigerian languages.
Together, these have led to four major lines of effort: the
aforementioned Nigerian Keyboard Project of the National Information Technology<br />
Development Agency (NITDA), the independent endeavours of KỌYIN/Lancor, Alt-I,<br />
and my collaboration with Andrew Cunningham (http://www.openroad.net.au).<br />
2.2.1 The Nigerian Keyboard Project<br />
The keyboard layout produced by NITDA is messy: it is not fully Unicode compatible
and provides no means of adding certain diacritics, such as the macron. It is
not surprising that little work has gone into the project since the release of
its downloadable DLL file for Microsoft Windows. Finally, the efforts described in the
sections below can only buttress the point that NITDA’s Nigerian Keyboard Project has<br />
become another white elephant.<br />
2.2.2 The KỌYIN Keyboard<br />
The KỌYIN keyboard is not a Nigerian keyboard layout project, but a business<br />
venture of LANCOR Technologies of Boston, MA in the United States. The LANCOR<br />
Multilingual Keyboard Technology (LMKT) (http://www.lancorltd.com/konyin.html) has been<br />
used by the company to create multilingual keyboards for different languages. The<br />
KỌYIN keyboard applies this same technology to create a physical multilingual
keyboard for the Nigerian languages.
Some observable changes that the keyboard has undergone can be summarised as<br />
follows. First, for the characters (especially the vowels) that combine with
diacritics placed beneath them, the company initially used the Yoruba Ife system,
in which a short vertical line is placed under the appropriate character.
This was demonstrated on the company’s website
with an instruction on how to key-in the Yoruba name of the president that contained<br />
such characters. Later, the vertical line was replaced with a dot, which is more<br />
widespread in the scripts of other Nigerian languages, including Igbo. This could be<br />
seen as an improvement, as it would also increase the marketability of the company’s<br />
keyboard in Nigeria.<br />
2.2.3 The ALT-I Keyboard<br />
The African Languages Technology Initiative (Alt-I) is an organisation whose aim
is to appropriate modern ICTs for use in African languages. It hopes to achieve
this through ICT advocacy and the delivery of ICT-related services. More relevant
here is the organisation’s effort to produce a
Yoruba keyboard. The organisation listed some of its achievements in this regard on<br />
its website. These include:<br />
• The production of an installable keyboard driver, which it hoped to start<br />
marketing by 2004;<br />
• Demonstrations (including installation) of its keyboard driver at the<br />
following seven universities in the Western part of Nigeria: University of Ibadan;<br />
Olabisi Onabanjo University, Ago-Iwoye; University of Ilorin; Lagos State University,<br />
Ojo; University of Lagos; Adekunle Ajasin University, Akungba; and Obafemi<br />
Awolowo University, Ile Ife;<br />
• The endorsement of its keyboard by the Yoruba Studies Association of
Nigeria (YSAN) at the association’s 2003 Annual Conference, held from 4th to 8th
November 2003; and,
• The 2003 IICD award of the African Information Society Initiative (AISI) on<br />
Local Content Applications.<br />
Finally, the organisation has started to reach out to other Nigerian languages outside<br />
the Western and predominantly Yoruba-speaking part of the country. There is no doubt<br />
that the aim is a ‘Nigerian Keyboard.’ A hint in this direction is the comparison of the<br />
KỌYIN keyboard with the ALT-I keyboard: “the Alt-I keyboard is superior to the Lancor<br />
product on the grounds that Alt-I considered a lot of human factor engineering and<br />
other social issues, which Lancor seems to have overlooked in their keyboard design”<br />
(Adegbola 2003). I do not know the details of the issues between the two efforts, but<br />
with a population of about 120 million the Nigerian market is large enough for more<br />
keyboards. I now turn to the development of the Igbo keyboard.<br />
2.2.4 The Igbo Keyboard<br />
This is simply an effort that arose through my collaboration with Andrew<br />
Cunningham. The effort is not supported by any business or charity organisation. From<br />
the outset, the focus was to find a solution that would exploit the already available<br />
keyboard layouts and adapt them for the Igbo language without building a physical<br />
keyboard from scratch.<br />
There are many virtual keyboards on the net that could be altered to that effect,<br />
but Tavultesoft’s (www.tavultesoft.com) ‘Keyman’ program was found to be the best.<br />
Two possible physical keyboards came into consideration at the initial stage: the<br />
German keyboard and the English keyboard. The drawback of the English keyboard is<br />
the requirement to hold down or combine no fewer than three different keys in order
to realise a single subdotted character. Such a method is tedious and not particularly<br />
appealing. That is why I chose the German keyboard. The special German characters<br />
can thus be replaced with specific Igbo characters as shown in Table 1 below.<br />
The third column of Table 1 shows a further combination of the subdotted<br />
characters with tone marks. Through the collaboration with Andrew Cunningham,<br />
all these and many other changes (especially with regard to the consonants) were<br />
incorporated and used to build an Igbo keyboard layout that can freely be downloaded<br />
from the Tavultesoft website. Later a similar keyboard map was also made for the<br />
English keyboard for people who have access only to the English keyboard. But as<br />
has already been pointed out, the users of the English keyboard simply have to cope<br />
with the tedious key combinations. I have therefore donated the Tavultesoft keyboard<br />
program, together with physical German keyboards, to the Department of Nigerian<br />
Languages at the University of Nigeria, Nsukka, as well as to some other Igbo scholars;<br />
since then I have been receiving feedback that was further incorporated to refine<br />
the program for both the average user as well as for the linguist’s most complicated<br />
needs. This has led to the development of the second version of the program. Due to<br />
the use of the English language in Nigeria, the third version of the program has now<br />
been made QWERTY-based like the English and Danish keyboards, thus replacing the<br />
QWERTZ layout of the German keyboard. In addition, it also has a much better display<br />
of the characters than is shown in the third column of Table 1. Like the previous<br />
versions, the program shall also be freely available.<br />
Table 1: Some Igbo Characters<br />
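Table 1 itself does not survive in this text version, but the remapping it describes can be sketched as follows. The pairing of German special keys with Igbo characters below is hypothetical, not the actual assignments of Table 1 or of the Keyman layout:

```python
import unicodedata

# Hypothetical pairing of German special keys with Igbo characters;
# the actual assignments in Table 1 may differ.
GERMAN_TO_IGBO = {"\u00f6": "\u1ecd",   # ö -> ọ
                  "\u00fc": "\u1ee5",   # ü -> ụ
                  "\u00e4": "\u1ecb",   # ä -> ị
                  "\u00df": "\u1e45"}   # ß -> ṅ

def remap(text):
    """Replace the German special characters with Igbo ones."""
    return "".join(GERMAN_TO_IGBO.get(ch, ch) for ch in text)

# A subdotted vowel from a German key can then carry a tone mark
# (third column of Table 1): here the combining acute, U+0301.
o_dot_high = unicodedata.normalize("NFC", remap("ö") + "\u0301")
print(o_dot_high)   # ọ with subdot plus high tone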
Finally, while the Igbo Keyman keyboard is a virtual keyboard, its transformation<br />
into a physical Igbo keyboard shall be taken up at the appropriate time. For the time<br />
being, it has contributed to taking care of the language’s input system. The problem<br />
of the appropriate font programs to go with the keyboard has also been taken care<br />
of through the increasing number of Unicode-based font programs. Thus, the two<br />
aspects of (1) an appropriate input system and (2) the fonts mentioned at the end of<br />
section 1 of this paper have been addressed. The next step is to use these facilities to<br />
tackle specific linguistic problems of the language. In the next section, I shall present<br />
my efforts in this direction, especially with regard to the development of a corpus<br />
model for the Igbo language.<br />
3. Computer Technology and the Igbo Corpus Model<br />
Of all the different activities involved in developing the Igbo Corpus Model, 3
obtaining the proper OCR software was indeed difficult, but more difficult was (and
still is) finding an adequate corpus development, manipulation or query system and
using that software to process Igbo texts.
Some of the programs I initially experimented with were either theory-dependent,
required some manipulation of the system, or required personally writing the internal<br />
components needed; all this would involve an initial preoccupation with more<br />
theoretical issues than with the practical development of the corpus itself. It is for<br />
this reason that I chose the following three pieces of software: WordSmith, Corpus<br />
Presenter, and the Word Sketch Engine (WSE). Some factors influenced my choice.<br />
3 Partly supported through the Laurence Urdang Award 2002 of the European Association for<br />
Lexicography.<br />
These programs are relatively theory-neutral, have friendly GUIs, and explicitly
claim to be Unicode-based.
With regard to the Igbo texts, one can differentiate between two types. The first<br />
type is made up of texts without tone marks:<br />
Ị ka nọrịrị Obinna? Obinna tụgharịrị hụ Ogbenyeanụ...<br />
The second type is tone marked like the text below:<br />
Ị̀ kà nọ̀rị̀rị̀ Óbìńnà? Óbìńnà tụ̀ghàrị̀rị̀ hụ́ Ógbènyèánụ<br />
A typical Igbo text written by native speakers for fellow native speakers is usually<br />
not tone-marked, simply because many find it tedious, although tone marking Igbo<br />
texts would make a great deal of difference. However, for any serious linguistic work<br />
or research, the Igbo texts are usually tone-marked. The tone-marked Igbo text above<br />
was produced with version 2.0 of the Igbo Keyman program. The rendition in version 3,<br />
which is soon to be released, is much better. However, I used mainly Igbo texts without<br />
tone marks in my work with the above-named three corpus programs. I examined<br />
them based on (1) how they handled the text input; and, (2) how they handled Igbo<br />
words, with and without diacritics. I shall present the programs individually.<br />
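The relation between the two text types can be stated precisely in Unicode terms: decomposing to NFD, dropping the combining tone marks (grave U+0300, acute U+0301) while keeping the orthographic subdot (U+0323), and recomposing to NFC turns a tone-marked text into its unmarked counterpart. A minimal sketch:

```python
import unicodedata

TONE_MARKS = {"\u0300", "\u0301"}   # combining grave (low), acute (high)

def strip_tones(text):
    """Turn a tone-marked Igbo text into the unmarked type:
    decompose, drop the tone marks, keep the subdots, recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

print(strip_tones("h\u1ee5\u0301"))   # tone-marked hụ́ -> unmarked hụ
```

The reverse direction, adding tone marks, cannot of course be automated this way, which is one reason fully tone-marked corpora remain the harder target.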
3.1 WordSmith<br />
The basic problem encountered with WordSmith is that not ALL components of the<br />
software are able to handle the Igbo texts appropriately.<br />
For text input, it is not possible to add words with diacritics, either directly through<br />
the Igbo Keyman keyboard or simply by copying and pasting. For example, a keyboard<br />
input of the word ọgụ results in ‘ügß’ (see the WordSmith-Concordance screenshot),<br />
while pasting the word only yields a blank in the entry box. In both cases, activating<br />
‘Search’ does not yield any results.<br />
Figure 1: Text Input in WordSmith
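Mangling of this kind is typical of software that internally assumes an 8-bit character set: the subdotted Igbo vowels lie outside Latin-1, so they cannot survive the round trip. A quick diagnostic (illustrative, not part of WordSmith) predicts which words such a component will corrupt:

```python
def at_risk(word, encoding="latin-1"):
    """Return True if a tool that internally assumes the given 8-bit
    encoding cannot represent the word without corrupting it."""
    try:
        word.encode(encoding)
        return False
    except UnicodeEncodeError:
        return True

print(at_risk("onwe"))            # plain ASCII survives
print(at_risk("\u1ecdg\u1ee5"))   # ọgụ: U+1ECD and U+1EE5 are outside Latin-1
```

This matches the observation below: the diacritic-free word onwe is handled correctly, while ọgụ is not.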
In terms of processing an entry, the concordance component of the software does<br />
not function properly. As long as a word to be searched does not bear any diacritics<br />
(like a subdot), the program sorts it appropriately, as can be seen in the concordance<br />
of ONWE.<br />
Generally, WordSmith 4.0 is much better than its previous version, but it still has<br />
the problem of not being fully Unicode compatible at all levels. Its use for the Igbo<br />
language is therefore extremely limited.<br />
Figure 2: Concordance Search for ‘onwe’ in WordSmith
3.2 Corpus Presenter
The problems encountered here are the same as in WordSmith: not all components<br />
are fully Unicode-based.<br />
Inputting the Igbo word ndị with the Keyman program, or simply copying and
pasting it, yields nd?, the same result as in WordSmith (as can be seen in the
screenshot of the Corpus Presenter search):
Figure 3: Text Input in Corpus Presenter
In its Dataset component, the program processes an Igbo text in a different manner.<br />
The text is usually well displayed as a dataset, as can be seen from the screenshot of<br />
the Corpus Presenter Dataset.<br />
Figure 4: Corpus Presenter Dataset<br />
But switching to the ‘Text Retrieval Level’ to manipulate the displayed text
simply turns the characters with subdots into question marks, as can be seen
in the screenshot of the Text Retrieval component.
Figure 5: Corpus Presenter Text Retrieval
Finally, one major point of difference between WordSmith and Corpus Presenter<br />
is in the making of word lists. While WordSmith can do it without loss of data, Corpus<br />
Presenter simply leaves out ALL the characters with diacritics. This can be seen in the<br />
screenshot of the Corpus Presenter and WordSmith wordlists.<br />
Figure 6: Corpus Presenter Word List Figure 7: WordSmith Word List<br />
Generally, the two programs are very good for manipulating texts of European<br />
languages, with Corpus Presenter also having the further advantage of the capacity<br />
for POS-tagging. But with regard to the Unicode scripts of an African language like<br />
Igbo, they both have their limitations.<br />
I shall now turn to the next program, which is the most promising, the most user-<br />
friendly, and the most suitable for use in teaching corpus linguistics at an elementary<br />
or advanced stage.<br />
3.3 Word Sketch Engine (WSE)<br />
WSE is so far the only relatively theory-neutral program that has been able to<br />
handle Igbo texts without tone marks. The different components of the program are<br />
also reliable. For example, words can be keyed in or copied into the Concordance
component of the program without any loss or distortion. This applies to both words
without subdots and those with subdots, as can be seen from the two screenshots:<br />
Figure 8: A Word Without Subdots in WSE<br />
Figure 9: A Word With Subdots in WSE<br />
The collocation analysis of WSE does not present any difficulties; neither does it distort<br />
the characters of the language. This can be seen in the collocation screenshot:<br />
Figure 10: Collocation in WSE<br />
Finally, the texts processed with the three programs are all Igbo texts
without tone marks. The way each program handles such a text determines the extent
to which it can be considered for texts with more complex character
combinations. This simply means that, already at the level of texts without
tone marks, WordSmith and Corpus Presenter are of limited use. For WSE, however,
an investigation is still to be made into its further use for processing Igbo texts that<br />
have been complicated through the combination of more diacritics with the subdotted<br />
words. In the next section, I shall briefly discuss the effect of the above problems on<br />
Igbo lexicography, as well as the efforts to develop a spellchecker for the language.<br />
3.4 Igbo Lexicography and the Igbo Spellchecker<br />
The three lexicographic works of the language to be examined here are Williamson’s<br />
Igbo-English Dictionary (1972), Echeruo’s Igbo-English Dictionary (1998), and Igwe’s<br />
Igbo-English Dictionary (1999). The spellchecker is a project on which I am currently
working with Kevin Scannell (http://borel.slu.edu/nlp.html).
Each of the three dictionaries bears the imprint of the technological stage of the
time when it was written. Williamson’s dictionary was produced with the typewriter. Its
legacy is the imprint it has left on Igbo orthography. A particular tone of the language<br />
known as the ‘downstep’ was marked in her dictionary by placing a raised dash (a
macron) over the sound segments that carry the downstep. This yields the following
forms for the vowels without subdots: ā ē ō ī. The same was also done for the subdotted
vowels. But as was pointed out in section 2 above, this was achieved through the<br />
physical adjustment of the typewriter. With this method, Igbo texts could be
properly tone-marked on the old typewriter.
The presentation of the Igbo characters in Echeruo’s dictionary has been strongly<br />
influenced by the available fonts within the Microsoft operating system. The author<br />
simply used the German umlauted vowels (shown in the first column of Table 1 above)<br />
instead of the Igbo subdotted vowels. But, with his method, an Igbo word like ụ̀tọ́<br />
‘sweet, sweetness’, whose tone is indicated on the word itself, is written as ütö<br />
[LH]. This amounts to marking the tone separately from the word itself. Such a
method, however, breaks down when one tries to use it to represent a fully
tone-marked Igbo text. The method has not found much acceptance (Uchechukwu 2001), but the author
has also agreed to change it in line with the existing orthography (Anyanwu 2000).<br />
There is no doubt that this would become easier for him as a result of the growing<br />
improvements in computer technology, including the freely available Igbo keyboard<br />
and Unicode-based font programs.<br />
Igwe’s dictionary was written in line with the existing orthography; however, the<br />
author’s effort involved using the ‘Symbols’ window in Microsoft Word to
painstakingly click on the individual Igbo characters of his 850-page dictionary! The only
mark the method left on his work can be seen in the combination of the vowel with<br />
a tone mark. In the Microsoft 95/98 system used by the author, a lower-case
vowel is automatically changed into an upper-case vowel through such a combination.
Thus, a combination of the vowel with the high tone symbol (acute accent) 
yields ; and a combination with the low tone symbol (grave accent) is realized
as . This holds regardless of whether the letter occurs at the beginning of a word,
in the middle, or at the end. But with the current improvements in the different<br />
operating systems, as well as the available Igbo keyboard, such problems can now be<br />
completely addressed.<br />
The spellchecker is still an ongoing project between Kevin Scannell and myself.
For the time being, it is restricted to Aspell and Igbo texts without tone marks. The
development of a spellchecker for fully tone-marked Igbo texts shall be taken into<br />
consideration at the appropriate time.<br />
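A first step towards such a spellchecker is extracting a word list from tone-unmarked texts. The sketch below is illustrative of that step only and is not the project's actual code; the naive `\w+` tokenisation would need refinement for real Igbo orthography:

```python
import re
from collections import Counter

def wordlist(text):
    """Frequency-sorted word list from a tone-unmarked Igbo text,
    as raw material for a spellchecker dictionary."""
    # In Python 3, \w is Unicode-aware, so subdotted letters such as
    # ị, ọ, ụ are kept as word characters.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common()

sample = "Ị ka nọrịrị Obinna? Obinna tụgharịrị hụ Ogbenyeanụ"
for word, freq in wordlist(sample):
    print(word, freq)
```

A list produced this way can then be compiled into an Aspell dictionary with Aspell's own tooling; the subdotted characters survive because the whole pipeline is Unicode-based.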
4. Prospects for Computer Linguistics<br />
The situation of the Igbo language described above highlights not only the stages
in the language’s struggle with modern technologies, but also how this
development can be enhanced.
The texts of the Igbo language without tone marks can be processed to some extent<br />
by some corpus processing or development systems. Such texts are of little research<br />
significance, however, since they cannot graphically represent the very phenomena<br />
that are of interest to both the ordinary linguist and the computational linguist.<br />
Compared with the situation of such matters just a few years back, it is already a<br />
great advancement to have some software that can handle Igbo texts without tone<br />
marks. But laying a good foundation for future computational linguistics within
the Nigerian setting requires that the various pieces of sophisticated corpus
software, acoustic-phonological systems, spellcheckers and so on be able to
handle the scripts of the language, whether with or without tone marks.
The conclusion is that through developing the Igbo keyboard, as well as through<br />
the availability of freely downloadable Unicode-based fonts, the problem confronting<br />
the average Nigerian language is now solely a software problem and no longer the<br />
problem of a physical keyboard or an operating system. Two additional points support<br />
this conclusion. First of all, the same Keyman program can be used to write more<br />
keyboard maps for other Nigerian languages, thus making it unnecessary to invest<br />
in building a physical keyboard from scratch. All that the users with different native<br />
languages within the Nigerian setting need to do is simply click on their language<br />
keyboard map. This solution is not likely to change, because the English keyboard<br />
has become part and parcel of the computer hardware within the Nigerian setting.<br />
Thus, the production of a physical keyboard for a Nigerian language would definitely<br />
involve an expansion of the physical English keyboard. The second point is the present
effort by Andrew Cunningham and Tavultesoft to port the Keyman program to the
Linux operating system. This should give Linux users the same keyboard facility
that the Keyman program has provided for the Windows operating system. Both
developments only further the point that the keyboard and font
stages have been addressed. Computer linguistics within the Nigerian setting now<br />
faces the problem of developing the necessary programs that make use of the facilities<br />
presently available.<br />
Finally, the picture of the Igbo language presented here shows that the current<br />
excitement about the new technology should not make us overlook the simple fact<br />
that the three necessary elements for the development of computational linguistics<br />
for any African language are: (1) an appropriate font program; (2) a good input system;<br />
and (3) compatible computer programs. Thus, the development of computational<br />
linguistics for an average African language depends on the extent to which these three<br />
aspects have been taken care of for the respective language.<br />
References<br />
Adegbola, T. (2003). 2003 Annual report on the activities of African Languages<br />
Technology Initiative (Alt-I). http://alt-i.org/2003Report.doc.<br />
Ajayi, G.O. (2003). “NITDA and ICT in Nigeria.” Paper presented at the 2003 Round
Table Talk on Developing Countries Access to Scientific Knowledge. (The Abdus Salam
ICTP, Trieste, Italy, 23 October 2003).
Anyanwu, R.J. (2000). “Echeruo, Michael J.C. 1998. Igbo-English Dictionary: A
Comprehensive Dictionary of the Igbo Language, with an English-Igbo Index.”<br />
Frankfurter Afrikanistische Arbeitspapier. 12: 147-150.<br />
Echeruo, M.J.C. (1998). Igbo-English Dictionary: A Comprehensive Dictionary of the
Igbo Language, with an English-Igbo Index. New Haven: Yale University Press.
Egbokhare, F.O. (2004). Breaking Barriers: ICT-Language Policy and Development.<br />
Dugbe, Ibadan: ODU’A Printing & Publishing Company Ltd.<br />
Jegede, P.O. & Owolabi, J.A. (2003). “Computer Education in Nigerian Secondary<br />
Schools: Gaps Between Policy and Practice.” Meridian: A Middle School Computer<br />
Technologies Journal. Raleigh, NC: NC State University, 6(2).<br />
http://www.ncsu.edu/meridian/sum2003/nigeria/print.html.<br />
Nigeria National Computer Policy (1988). Lagos: Federal Ministry of Education.<br />
Nigerian National Policy for Information Technology (IT) (2001).<br />
http://www.nitda.gov.ng/docs/policy/ngitpolicy.pdf<br />
Uchechukwu, C. (2001). “Echeruo na Eceruo, kedu nke ka mma…?” KWENU, 1(8), 16-<br />
22.<br />
Uchechukwu, C. (2004). “The Representation of Igbo with the Appropriate Keyboard.”
Paper presented at the International Workshop on Igbo Meta-Language. (University of<br />
Nigeria, Nsukka, 18 April 2004).<br />
Yusuf, M.O. (1998). “An Investigation into Teachers’ Competence in Implementing<br />
Computer Education in Nigerian Secondary Schools.” Journal of Science Teaching and<br />
Learning, 3(1 & 2), 54-63.<br />
Yusuf, M.O. (2005). “Information and Communication Technology and Education:<br />
Analysing the Nigerian National Policy for Information Technology.” International<br />
Education Journal, 6(3), 316-321.<br />
Annotation of Documents for Electronic<br />
Editing of Judeo-Spanish Texts: Problems and
Solutions<br />
Soufiane Roussi and Ana Stulic<br />
The result of an interdisciplinary process comprising Linguistics, Information and<br />
Computer Sciences, this contribution consists of modelling the annotated electronic<br />
editing of Judeo-Spanish 1 texts written in Hebrew characters, following the principle<br />
of document generation in a collaborative work environment. Our approach is based<br />
on the concept of digital document annotation that places mark-up at any text level,<br />
starting with the character resulting from the transcription. Our point of view is that the<br />
annotations of a ‘translated/interpreted’ document can have two different purposes:<br />
to interpret (to add new mark-up in order to propose a different interpretation from<br />
the one formulated at the beginning); and, to comment (make a comment on the<br />
interpretation done by another author). Our aim is to make it possible for the reader/<br />
user to interact with the document by adding his own interpretation (translation)<br />
and/or comments on an interpretation made by another author. We present a model<br />
for the description of annotation in response to our problem.<br />
1. Introduction<br />
In this paper, we will explore the problem of digital document annotations in<br />
application to the very specific problem of building a Judeo-Spanish corpus. We will<br />
briefly present the interest of such an enterprise, together with some difficulties<br />
related to building corpora in general, as well as those specific to the Judeo-Spanish<br />
case. Considering the recent developments in information technology (IT), we will take<br />
into account the concepts of digital documents, automatic generation of documents,<br />
and production of digital documents in collaborative mode, and then, apply them to<br />
our problem. Finally, we will propose a prototype model for Judeo-Spanish corpus<br />
building in the context of a collaborative environment. This proposal offers some<br />
conceptual and methodological solutions based on existing technologies, but leaves<br />
open the question of technical realisation.<br />
1 The language of the Sephardic Jews, who, after being expelled from Spain at the end of 15th century,<br />
settled in the greater Mediterranean area.<br />
2. Building a Judeo-Spanish Corpus<br />
2.1 The Research Value of a Judeo-Spanish Corpus<br />
Judeo-Spanish is the language spoken by the Sephardic Jews, who, after being<br />
expelled from Spain at the end of 15th century, settled in the greater Mediterranean<br />
area. It represents a variety of Spanish that has followed its own development path
since the late 15th century (though not without contact with the Iberian Peninsula),
and is relatively well documented. Many original documents in Judeo-Spanish, such<br />
as manuscripts, books and other printed material, have been conserved. Linguistic
fieldwork from the beginning of the 20th century also yielded a certain number of
oral transcriptions.
From a linguistic point of view, Judeo-Spanish is very interesting because it offers
numerous possibilities for comparative and historical linguistic analysis, since
peninsular Spanish itself is very well documented for the pre-expulsion period.
Equally, the original sources are of great value for historical and cultural research.<br />
Unfortunately, in many countries where it was kept alive for centuries, Judeo-<br />
Spanish has been in progressive decline since the beginning of the 20th century, and,<br />
at present, it is no longer spoken in many cities of the Balkan Peninsula (formerly the
centres of Judeo-Spanish culture). Therefore, the editing of original Judeo-Spanish<br />
sources can also contribute to the preservation of knowledge about this language.<br />
The approach adopted here in the treatment of Judeo-Spanish documents has been
primarily oriented to their usage as a corpus for linguistic research, but it can be
extended to other uses as well.
2.2 General Problems of Linguistic Corpora Editing
In the humanities (philology, history, literature and linguistics), the word ‘corpus’<br />
has traditionally referred to any body of texts that are used for research. In modern<br />
linguistics, it refers most often to relatively large collections of texts that represent<br />
a sample of a particular variety or use of language(s). Language corpora can be in the<br />
form of manuscripts, paper-printed, sound recordings (spoken corpora), or machine-<br />
readable. Nowadays, it has become common to think of a corpus especially in this latter
form, and not without reason. The development of computer-readable corpora has
enlarged the possibilities of linguistic research by simplifying search tasks, and making<br />
possible the use of large portions of texts.<br />
Compilations of electronic corpora, especially those with historical dimensions,<br />
rely upon existing written documents, which are frequently philological editions. In<br />
electronic corpora, regardless of whether the electronic text is made from a source<br />
document (such as a manuscript or original edition) or a philological edition, the<br />
authors of the corpus must develop annotation strategies in order to represent the<br />
source document from which the text is derived. The information commonly provided<br />
in annotations concerns metadata about the digital document itself (as well as the<br />
source document), but it can also deal with the linguistic properties of the text, such<br />
as parts of speech, lexemes, prosody, phonetic transcription, semantics, discourse<br />
organisation, co-reference, and so forth. Designed to be global (extending to the<br />
entire corpus) and universal in their validity, most types of linguistic annotations<br />
represent complex tasks that are economically very expensive and inconvenient for<br />
lesser-used language corpora that are designed principally for scientific research. On<br />
the other hand, de Haan’s proposal of problem-oriented tagging offers another point<br />
of view (de Haan 1984). In this approach, the users take a corpus, either annotated<br />
or unannotated, and add to it their own form of annotation, oriented to their own<br />
research goal. This type of annotation seems very promising in the context of specific<br />
research needs, provided that the annotational system in question is supplied with a<br />
dynamic and interactive dimension.<br />
Although in corpus building the emphasis is frequently placed on developing search software and linguistic annotations, one crucial question remains: where do the texts come from? Though reproducing the raw text of a source document may seem a simple task, it often isn’t, especially when dealing with ancient texts, or with texts in writing systems other than the Latin script. In current corpus-building practice, the old philological problems remain of current interest. In a traditional philological paper edition, the text is determined by the editor (on the basis of the source document[s]), together with the critical apparatus and the writing system, and the reader/researcher can only accept the editor’s interpretation. In electronic corpora the approach is similar: while the source document remains inaccessible for practical reasons, whatever the editorial choices of the corpus author (including critical apparatus, writing system and annotations), the user cannot intervene or adapt them to his/her own purposes.
On the other hand, source documents have never been more accessible in technical terms, provided that they are available in digital form, as an image or as a sound file. The possibility of consulting the digital image of a source document in parallel with its electronic transcription would enable the researcher to be critical of the editor, and would resolve some of the philological problems caused by the necessarily arbitrary choices one is forced to make in a philological edition.
2.2 Specific Problems in the Judeo-Spanish Context<br />
The most salient feature of Judeo-Spanish texts is the writing system in which they were composed; it also represents the most important difficulty in their editing and computer processing.
The Judeo-Spanish documents produced in the post-expulsion period were<br />
commonly written in an adaptation of Hebrew script (in this context its distinctive<br />
Rashi version is frequently used, but the square Hebrew script is equally found; the<br />
difference between the two is only in the form of the letters). The practice of using<br />
Hebrew script for texts in Romance languages was already very common before the<br />
expulsion (see Minervini 1992).<br />
In the history of writing systems, the adaptation of a script originally designed for<br />
one language into the writing system of another language is a cultural phenomenon<br />
that has been frequently repeated; it leads to the development of new conventions<br />
adapted for the target language that involve the use of graphemes coming from the<br />
source language’s writing system.<br />
In its original form, the Hebrew script made no use of vocalic graphemes, because in most cases the contrast they would have expressed was grammatical rather than lexical in character. In order to avoid certain ambiguities, some letters (yod, waw, he and aleph) progressively acquired vocalic value in certain contexts in Biblical Hebrew. The fully vocalised writing system of Hebrew was designed much later, and has been mostly reserved for the texts of the Bible (for more details see Sampson 1997:123-129).
As in the Hebrew writing tradition, in Spanish texts the fully vocalised script was reserved for translations of the Bible and sacred texts, but in other texts the use of letters with vocalic value was extended to all contexts, with the particularity that two letters, yod and waw, could each denote two vowels: /e/ and /i/, and /o/ and /u/ respectively. In addition, a diacritic sign (of different shape in different times and traditions) was introduced above certain letters to create new graphemes for consonants that had no counterpart in the Hebrew writing system, or that lacked phonological value in Hebrew (Sampson 1997:123-129).
Nevertheless, the history of this adaptation shows many variations in the application of conventions. One source of variation is the possibility of using different letters for the same phoneme (this kind of variation is found even within the same text). Although some of the basic principles of the adaptation of Hebrew script for the Spanish language were probably transmitted over generations, the reading and
writing of Hebrew represented the only constant knowledge, so conventions applied<br />
to Spanish could be updated at any time. On the other hand, Judeo-Spanish evolved<br />
phonologically, and this was also reflected in the writing system (Pascual Recuero<br />
1988).<br />
The main, though not the only, problem in editing Judeo-Spanish texts is the underspecified use of vowel graphemes. This becomes even more complex if we consider that the vowel system has undergone some modifications and that reconstruction on the basis of 15th-century Spanish cannot be completely reliable (especially if we take into account the fact that 15th-century Spanish is known through the different variations it presented, including in the vowel system).
Two approaches are possible: (1) conserving the original script conventions in every respect, by transliteration of the original documents, which means replacing each grapheme by another one; and (2) interpreting the vowel graphemes, by transcription, which means specifying the vowels where their presence is indicated. If carried out in the traditional sense of a philological edition, both have their advantages and drawbacks. Transliteration conforms to the source, so that the researcher can rely on its fidelity, but it does not make the text more accessible in terms of intelligibility. Transcription is certainly more intelligible, but the interpretive choices of vowels made on the basis of reconstruction are determined (and fixed forever) by the transcriber, and fidelity to the source is lost². The need for both transliteration and transcription as research tools has been recognised in the study of Judeo-Spanish texts; both are used in different contexts, and sometimes parallel versions of texts are even offered, as in the edition of Jewish medieval texts from Castilla and Aragon by Laura Minervini (1992); a similar solution is also proposed independently in Stulic (2002) for the editing of a 19th-century Judeo-Spanish newspaper, El amigo del puevlo.
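The combinatorial effect of this underspecification can be sketched programmatically. In the toy example below, capital W and Y are hypothetical one-letter codes (not a real transliteration scheme) for the two ambiguous graphemes described above: waw may be read as /o/ or /u/, yod as /e/ or /i/; every other symbol is taken as unambiguous.

```python
from itertools import product

# Hypothetical codes for the ambiguous graphemes: waw (W) -> /o/ or /u/,
# yod (Y) -> /e/ or /i/. All other symbols are treated as unambiguous.
VOWEL_READINGS = {"W": ["o", "u"], "Y": ["e", "i"]}

def candidate_transcriptions(transliteration):
    """Enumerate every vowel interpretation of a transliterated form."""
    options = [VOWEL_READINGS.get(ch, [ch]) for ch in transliteration]
    return ["".join(combo) for combo in product(*options)]

# A form with two waws yields 2 x 2 = 4 candidate readings,
# among them "puevlo" (as in El amigo del puevlo).
print(candidate_transcriptions("pWevlW"))
```

A transcriber must pick one of these candidates on the basis of reconstruction and fix it in the edition; a transliteration, by contrast, leaves the choice (and the ambiguity) intact.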
In this paper, we wanted to model a solution for the electronic editing of such texts that could encompass both approaches, and perhaps even offer something more as a research tool. Although we took as a starting point a very concrete Judeo-Spanish text from the 19th century, the problems we are endeavouring to solve could arise in any text edition where a choice between the intelligibility of the text and fidelity to the source is imposed.
² A text transcribed for syntactic analysis wouldn’t be useful, for example, for research on phonological issues or on the history of the writing system.
3. The Digital Document and Annotation<br />
3.1 What is a Digital Document?<br />
The early 2000s saw an increasing spread of online environments, facilitated by the use of databases for the storage of content, the automatic generation of digital documents, and the use of interactive ‘fill-in’ forms. It is this favourable situation, together with numerous developments in Web technologies, that led us to the creation of a Judeo-Spanish corpus in a digital environment.
First of all, what is a digital document? Following the definitions in use among French scholars, it can be defined on three levels: as an object (material or immaterial), as a sign (meaningful element) and as a relation (communication vector). As an object, a digital document can be defined as “a data set organised in a stable structure associated with formatting rules to allow it to be read both by its designer and its readers” (Pédauque 2003). As a sign, a digital document is “a text whose elements can potentially be analysed by a knowledge system in view of its exploitation by a competent reader” (Pédauque 2003). From the social perspective, a digital document can be viewed as “a trace of social relations reconstructed by computer systems” (Pédauque 2003).
For our approach, the perspective of the digital document as an object (material or immaterial) seems the most relevant. As such, the digital document raises three principal issues: storage, plasticity, and programmability (Rouissi 2004). It is this last characteristic that captures our attention here. Dealing with the annotation of electronic documents, our aim is to make use of the solutions made possible by automatic generation.
In a functional approach, a digital document can be considered like any other document. In documentalist theory, the definition of a document emphasised the function (something that serves as evidence) more than the actual physical form (paper, stone or antelope) (see Schürmeyer 1935; Briet 1951; Otlet 1990). The irrelevance of the medium only became more evident in the case of the digital document. Buckland writes: “The algorithm for generating logarithms, like a mechanical educational toy, can be seen as a dynamic kind of document unlike ordinary paper documents, but still consistent with the etymological origins of ‘docu-ment’, a means of teaching - or, in effect, evidence, something from which one learns” (Buckland 1998). In accordance with this point of view, and considering the digital document as a research tool, we would like to explore its programmable possibilities.
3.2 Automatic Generation of Electronic Documents<br />
Current Web technology offers the possibility of creating a document upon request. In this context, the result of the execution of a computer program is a document: a Web page obtained on the basis of one or multiple resources (program, database, cascading style sheets, etc.). The automatic generation of Web documents is based on the use of scripting languages whose execution takes place on the server. These technologies accelerate processing, making it possible to surpass the limits of HTML (Hypertext Markup Language), which remains static and only permits the handling of the document’s layout. A connection is also established with the databases where the information to be supplied in a given context is stored. Thanks to these principles of functioning, and with an appropriate editing program, individual appropriation by non-IT professionals has developed. The ease with which documentary resources existing on the Web can be reused only contributes to an even greater growth of digital document production. The possibilities for correction, modification and addition facilitate digital document production in an autonomous mode. Production in autonomous mode is defined by a user who elaborates the content and defines its layout for his personal use, or for that of other users. This production is realised with the help of suitable IT equipment and programs. The autonomous mode means that the user has complete creative freedom (choice of layout, colours, fonts, format, file names, diffusion and storage hosting, etc.), but it also presupposes that he has all the technical competence needed. On the other hand, by conserving the autonomy of production while facilitating exchange, the semi-autonomous mode can be applied in a collaborative environment where production rules need to be followed. It is in this context that we wish to situate our work on digital document annotation. Our aim is to put at the user’s disposal (in this case, scholars working on Judeo-Spanish documents) a tool modelled on the principle of semi-autonomous production of digital documents.
In this context, the rules are defined at the technical level (layout rules and data structuring), but the decision to produce and to publish is made by the user (Rouissi 2005). This implies, of course, that he can be identified by the system, and that a maintenance and technical assistance service is provided. The user doesn’t produce in an isolated manner, but in a collaborative work environment, which somewhat constrains his production but, on the other hand, offers a conception of the whole and facilitates the integration of individual work.
The emergence of online environments, and within them the possibility of producing and uploading digital documents, has changed the role of the user from passive user/reader to active user/author. In semi-autonomous mode, what is defined in advance concerns the common vocabulary, the visual aspect, and the structure of digital documents³.
The principal advantages of production in semi-autonomous mode are:<br />
• the durability of the system and the possibility of its evolution (the<br />
contents can evolve more easily);<br />
• the autonomy of handling (the use of ‘fill-in’ forms allows for the handling<br />
of the data by the users themselves);<br />
• the minimal technical competence that is required (the systems remain<br />
intuitive and easy to handle); and,<br />
• the common vocabulary.<br />
It is in terms of these principles that we envisage the development of the digital<br />
document annotation model for Judeo-Spanish corpus edition.<br />
3.3 Production in a Collaborative Mode<br />
Production in a collaborative mode, already widely present in different forms as collective websites nourished by individual contributions (forums, blogs, wikis, etc.), seems particularly suited to the collaborative work of specialists in the documents under discussion. In this sense, annotation can play an important role in the evaluation, interpretation and production of a document, which thereby itself becomes dynamic and subject to evolution. The final (and collective) document obtained is the result of the contribution of individual fragments (though not necessarily the sum of the contributions), and it can also result from choices the user has made.
An annotation is something added to the document. It can be a remark, a comment, or, in our case, even a proposed interpretation. Already in 1945, Vannevar Bush envisaged for the Memex (a device that was supposed to create links between related topics in different research papers) that the owner could add his own comments (Bush 1945). More recently, numerous developments in annotation management systems have appeared with real promise in the direction of sharing and exchanging information. Several widely available office programs (some versions of MS Word) and the W3C project Annotea in the Web domain (Annotea 2005) are just some examples of applications aimed at sharing annotations. A more exhaustive list can be found in Perry (2005).
³ We are not dealing here with the application of XML (eXtensible Markup Language), which plays an important role in data structuring and data exchange.
There are two types of annotations: semantic annotations (in the sense of standardised metadata Web annotations) and free annotations. The former are tied to current work on the Semantic Web, and are based on the development of metadata and/or ontologies used in the description of the document, with the purpose of facilitating its localisation, identification and automatic recognition. Without neglecting this important issue, we will focus here on free annotations, because they are used – in philological and linguistic analysis – to interpret and comment upon documents, and we therefore consider that they can constitute an important factor in the development of collaborative digital production, improving, at the same time, communication among the specialists of the domain in question.
From our point of view, annotation can have two purposes. The first concerns the interpretation of the original document (how to translate or read it). The second adds comments to a part of the document and/or to an interpretation already made. Annotation can be placed at several levels. The global level concerns annotation made on the document under discussion as a whole; such an annotation can consist of free comments on the entire document, or be a reaction to a global annotation already made. We consider it useful, for the sake of analysis, that the zone of annotation can be freely marked in the document. The smallest markup unit is the character; therefore part of a word, a word, a line, several lines, a paragraph, or several paragraphs can equally be the target of the markup.
Collaborative work is situated in the context of semi-autonomous production. Every member of the collaborative community participates in a responsible way, benefits from the result of the work of the community, and receives feedback on his work. Two models of work in which everyone’s autonomy can be expressed are: cooperative work (wherein everyone accomplishes a part of the work and shares it with others) and collaborative work (wherein several autonomous individuals work together in order to produce collectively). We’ll see how the concept of a digital document and its production can help in our case.
4. Prototype Model for the Description of Annotations
4.1 General Properties<br />
Considering the problems related to the treatment of Judeo-Spanish texts and<br />
to the building of a corpus, and taking into account the theoretical approach to the<br />
digital document (particularly from the point of view of collaborative work), we<br />
propose a model that can respond to the needs we have identified. Our work focuses<br />
on the definition of needs without making choices that could constrain future program<br />
implementation. In this sense, our contribution is situated at the analytic level, prior to the concrete realisation of the project.
One of our first preoccupations is the constitution of the documentary corpus. In order to achieve this, the model must be conceived as a digital repository of documents described with specifications that are sufficiently fine-grained but open to interoperability. The collaborative dimension must take into consideration the management of users. Our intention is to describe annotations and to build a typology of annotations that will emerge as the system begins to function.
We briefly summarise here the requirements that the model should satisfy:
• the source document should be accessible, as a transliterated version or, ideally, as a collection of image files;
• the transcribed version is given as a starting point for discussion/analysis;
• metadata annotations (according to the widely accepted Dublin Core standard) are provided with the transcribed version and the image files;
• the authorised user can add free annotations at the global level or on any text zone, down to the single character; they may include new interpretations (including corrections) and/or comments; and,
• the authorised user can export the result of his own or another’s work, choosing whether or not to use the annotations that he or another has made.
The annotation management system has still to be developed, and it will use the model presented here as its basis.
4.2 Data Description Model<br />
We have seen that the particularities of Judeo-Spanish texts are the source of many methodological and technical problems. Among them, we’ll concentrate on annotation, which represents the form of document production in collaborative mode. Annotation can help to handle various interpretations, as well as the comments made by reader-users. Here we see the possibility of developing real support for documentary information and communication: the document in question remains the carrier of the different contributions and represents, at the same time, the archive of the exchanges made, as well as the basis for different reading possibilities.
We offer here a proposal for a data description model adapted to our needs in the study of Judeo-Spanish texts. We have chosen a representation based on Codd’s relational model (Codd 1970), which presents the meaningful data in the form of relations, as grouped properties, and as a whole. The choice of representation
inspired by the relational model is guided by the desire to preserve freedom in the definition of the necessary fields (type, size, etc.) in the future implementation.
The relational model for the annotation of Judeo-Spanish documents consists of four relations. The primary keys are in bold and underlined; the foreign keys are in bold and followed by the # sign.
• ANNOTATION (annotation_num, annotation_date_creation, annotation_date_lastmodified, annotation_comment_title, annotation_comment_text, annotation_language, annotation_position_begin, annotation_position_end, annotation_status, annotation_commented_num#, annotation_type_num#, document_num#, author_num#).
The relation ANNOTATION, whose identifier annotation_num (primary key) is created automatically, keeps a record of the date of its creation (annotation_date_creation) as well as the date of the last modification (annotation_date_lastmodified). An annotation is also described by the language of its author, in the property annotation_language. The status carried by the property annotation_status allows the author to say whether the annotation is considered active or not: the value 1 stands for active and public (the default), while 0 signifies inactive or private (i.e., reserved to its author, who considers it of no use to the public while in draft status, or for some other reason). An annotation can be deactivated, because it evolves over time and has no permanent character (the notion of the duration of an annotation). The contribution added by the identified author (foreign key author_num) to the given document (document_num) has a short title (annotation_comment_title), which will be used for publishing lists of comments, and a text field (annotation_comment_text) of variable size. The position of the annotation in the given text/document is determined by its starting point (annotation_position_begin) and its ending point (annotation_position_end). Where the annotation concerns the whole document and not only one of its fragments, the position is indicated as follows: annotation_position_begin = annotation_position_end = 0.
The annotation_commented_num property allows for the formal identification<br />
of the annotation on which the comment is made. In the opposite case, where the<br />
annotation is not made over another annotation (without a link to another annotation),<br />
the value of annotation_commented_num is 0. The type of annotation can be specified<br />
(otherwise the value 0 is attributed) with the annotation_type_num property, which<br />
points to the common vocabulary shared by the members of the community.<br />
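As a sketch of how the ANNOTATION relation might be implemented, the following SQLite fragment creates the table with the fields listed above and inserts one global annotation. The column types and defaults are illustrative choices: the model deliberately leaves them open to the future implementation.

```python
import sqlite3

# Illustrative SQLite rendering of the ANNOTATION relation; field names
# follow the model, types and defaults are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annotation (
        annotation_num               INTEGER PRIMARY KEY AUTOINCREMENT,
        annotation_date_creation     TEXT DEFAULT CURRENT_TIMESTAMP,
        annotation_date_lastmodified TEXT,
        annotation_comment_title     TEXT,
        annotation_comment_text      TEXT,
        annotation_language          TEXT,
        annotation_position_begin    INTEGER,
        annotation_position_end      INTEGER,
        annotation_status            INTEGER DEFAULT 1,  -- 1 = active/public
        annotation_commented_num     INTEGER DEFAULT 0,  -- 0 = no linked annotation
        annotation_type_num          INTEGER DEFAULT 0,  -- 0 = untyped
        document_num                 INTEGER,
        author_num                   INTEGER
    )
""")

# A global annotation on document 1: begin = end = 0 marks the whole document.
conn.execute(
    "INSERT INTO annotation (annotation_comment_title, annotation_comment_text,"
    " annotation_language, annotation_position_begin, annotation_position_end,"
    " document_num, author_num) VALUES (?, ?, ?, 0, 0, 1, 1)",
    ("General remark", "Vocalisation is inconsistent throughout.", "en"),
)
row = conn.execute(
    "SELECT annotation_num, annotation_status, annotation_commented_num"
    " FROM annotation"
).fetchone()
print(row)  # defaults applied: status 1 (active), commented_num 0 (no link)
```

The defaults encode the conventions above (active status, no linked annotation, untyped) so that a contributor only has to supply what is specific to his annotation.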
• ANNOTATION_TYPE (annotation_type_num, annotation_type_vocabulary, annotation_type_description, annotation_type_mode_edit).
The property annotation_type_mode_edit indicates (with the value 0 or 1) whether or not the annotation aims at proposing a text to be substituted for the one initially put into discussion. This kind of annotation corresponds to an editing action on a document.
Some example annotation_type_vocabulary values would be: interpret, comment, refuse, confirm, accept, and so forth. This vocabulary can also be established a posteriori, by observing users’ practices and with their help; elements can be added to the table on the basis of users’ proposals. The annotation_type_description property allows for the inclusion of additional information, making the chosen vocabulary more precise.
• AUTHOR (author_num, name, email, login, password).<br />
The AUTHOR relation describes the users who are authorised to interact with the document, to add annotations, or to modify existing ones (an author can only modify his own annotations). Whoever wants to propose a contribution must be identified.
• DOCUMENT (document_num, title, creator, keywords, description,<br />
publisher, contributor, date, resourcetype, format, identifier, source, language,<br />
relation, coverage, rightsmanagement).<br />
The DOCUMENT properties follow the recommendations of the Dublin Core Metadata Initiative (Dublin Core 2005). The identifier document_num can serve to name the different resources of a document (the original document can be presented as a collection of image files, but also as a transcription in ASCII format). The export formats envisaged here are HTML, XML or even PDF. Some difficulties remain to be overcome, since our annotation scheme in theory allows for overlapping (crossing) markup. Documents are uploaded by the administrator on the proposal of one of the members of the community.
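By way of illustration, a DOCUMENT record following this Dublin Core-based description might look as follows. The field names come from the relation above; the values are hypothetical sample data (the 1888 Belgrade issue of El amigo del puevlo is mentioned in the text as a studied source, but this record is not taken from an actual catalogue).

```python
# Hypothetical sample record; only the field names come from the model.
document = {
    "document_num": 1,
    "title": "El amigo del puevlo, I (1888)",
    "creator": "unknown",
    "keywords": "Judeo-Spanish; periodical; Belgrade",
    "description": "Digitised issue with transcription under discussion",
    "publisher": "unknown",
    "contributor": "unknown",
    "date": "1888",
    "resourcetype": "Text",
    "format": "image/jpeg; text/plain",
    "identifier": "doc-0001",
    "source": "unknown",
    "language": "lad",  # ISO 639 code for Ladino (Judeo-Spanish)
    "relation": "",
    "coverage": "Belgrade",
    "rightsmanagement": "",
}

# One document_num can name several resources: the page images
# and the ASCII transcription, as noted above.
resources = {
    "doc-0001/page-001.jpg": "image",
    "doc-0001/transcription.txt": "transcription",
}
print(len(document), "Dublin Core-style fields")
```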
The program implementation must take into consideration the different possible applications. The process of document annotation consists of two complementary phases. The first comprises the contributive action of adding or modifying annotations (on the entire document or on one of its fragments). The second concerns reading, through the exploitation of the existing annotations. The reading possibilities include the choice of exporting and saving files in different formats.
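The reading phase can be sketched as follows: an export applies only the annotations the user has selected, substituting the text of active edit-mode annotations for the spans they mark. The shortened field names (num, begin, end, status, mode_edit) are illustrative stand-ins for the model's properties; overlapping spans, noted above as a remaining difficulty, are not handled here.

```python
# Sketch of the export step: apply only chosen, active, edit-mode
# annotations. Edits are applied right-to-left so that the character
# offsets of earlier spans stay valid.
def export_text(base, annotations, chosen):
    edits = [a for a in annotations
             if a["num"] in chosen and a["status"] == 1 and a["mode_edit"] == 1]
    for a in sorted(edits, key=lambda a: a["begin"], reverse=True):
        base = base[:a["begin"]] + a["text"] + base[a["end"]:]
    return base

transcription = "el amigo del pueblo"
annotations = [
    # proposes reading "b" as "v" (characters 16-17)
    {"num": 1, "begin": 16, "end": 17, "text": "v", "status": 1, "mode_edit": 1},
    # proposes capitalising the article (characters 0-2)
    {"num": 2, "begin": 0, "end": 2, "text": "El", "status": 1, "mode_edit": 1},
    # a global comment (begin = end = 0), not an edit: ignored by the export
    {"num": 3, "begin": 0, "end": 0, "text": "", "status": 1, "mode_edit": 0},
]
print(export_text(transcription, annotations, chosen={1, 2}))
# -> El amigo del puevlo
```

A different choice of annotations would yield a different export, which is precisely the point: the reader, not the editor, decides which interpretations enter his working text.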
5. Conclusion and Prospects<br />
The work on a documentary corpus as specific as Judeo-Spanish texts opens many<br />
questions concerning the design of the proposed electronic model.<br />
In the context of this particular type of document, it is necessary to share the results of study, remarks, comments and interpretations among the members of a relatively small and geographically dispersed scientific community. A possible solution is to develop a tool for the management and rationalisation of individual work.
In our approach to the problem, we have taken as a theoretical basis recently<br />
developed concepts related to digital documents, focusing chiefly on the programmable<br />
aspect of a digital document. Taking into account that the documents in question can<br />
be considered as digital documents (over which it is possible to act), we have worked<br />
on the modelling of contributions that can be added to these objects of study. This<br />
has led us to a model that describes the annotations made on documents collected in<br />
a digital repository.<br />
The program implementation, which is still to be carried out, must be situated in a fully Web-based approach in order to satisfy the conditions of collaborative work and to remain easy to use from a Web browser.
Some questions that will certainly appear in the implementation phase are not accounted for by the proposed model, such as how to apply a modification starting from one sequence (an annotation that proposes that one sequence be replaced by another) to the whole document, or whether all users should have the same profile and the same possibilities to act within the documents.
In this sense, the proposed model leaves many questions to be answered, but the<br />
direction in which we are pointing seems rather promising.<br />
References
“Annotea Project.” Online at http://www.w3.org/2001/Annotea.
Briet, S. (1951). Qu’est-ce que la documentation? Paris: EDIT.
Buckland, M.K. (1998). “What is a ‘Digital Document’?” Document numérique 2, 221-230.
Bush, V. (1945). “As We May Think.” The Atlantic Monthly, July 1945. http://www.theatlantic.com/doc/194507/bush.
Codd, E.F. (1970). “A Relational Model of Data for Large Shared Data Banks.” Communications of the ACM 13(6), June 1970, 377-387.
de Haan, P. (1984). “Problem-oriented Tagging of English Corpus Data.” In Aarts, J. & W. Meijs (eds) (1984). Corpus Linguistics. Amsterdam: Rodopi, 123-139.
Dublin Core. Dublin Core Metadata Element Set, Version 1.1: Reference Description. http://dublincore.org/documents/dces/.
Marshall, C.C. (1998). “Toward an Ecology of Hypertext Annotation.” Proceedings of ‘Hypertext 98’. New York: ACM Press. http://www.csdl.tamu.edu/~marshall/ht98-final.pdf.
Minervini, L. (1992). Testi giudeospagnoli medievali (Castiglia e Aragona). 2. Napoli: Liguori Editore.
Otlet, P. (1990). International Organization and Dissemination of Knowledge: Selected Essays. (FID 684). Amsterdam: Elsevier.
Pascual Recuero, P. (1988). Ortografía del ladino. Granada: Universidad de Granada, Departamento de los Estudios Semíticos.
Pédauque, R.T. (2003). “Document: Form, Sign and Medium, As Reformulated for Electronic Documents.” Version 3, July 8, 2003. http://archivesic.ccsd.cnrs.fr/documents/archives0/00/00/05/94/sic_00000594_02/sic_00000594.html.
Perry, P. (2001). “Web Annotations.” http://www.paulperry.net/notes/annotations.asp.
Röscheisen, M., Mogensen, C. & Winograd, T. (1997). Shared Web Annotations As A Platform for Third-Party Value-Added Information Providers: Architecture, Protocols, and Usage Examples. Technical Report CSDTR/DLTR. http://dbpubs.stanford.edu:8091/diglib/pub/reports/commentor.html.
Rouissi, S. (2004). Intelligence et normalisation dans la production des documents numériques. Cas de la communauté universitaire. PhD thesis, Bordeaux 3 University.
Rouissi, S. (2005). “Production de document numérique en mode semi-autonome au service de la territorialité.” Colloque Les systèmes d’information élaborée, Ile Rousse, June 2005.
Sampson, G. (1997). Sistemas de escritura. Análisis lingüístico. Barcelona: Gedisa. First published in 1985 as Writing Systems. London: Hutchinson.
Schürmeyer, W. (1935). “Aufgaben und Methoden der Dokumentation.” Zentralblatt für Bibliothekswesen 52, 533-543. Reprinted in FRA 78, 385-397.
Stulic, A. (2002). “Recherches sur le judéo-espagnol des Balkans: l’exemple de la revue séfarade ‘El amigo del puevlo’ (I, 1888, Belgrade).” MS thesis, Bordeaux 3 University.
Wynne, M. (2003). Writing a Corpus Cookbook. http://ahds.ac.uk/litlangling/linguistics/IRCS.htm.
“W3-Corpora Project” (1996-1998). Online at http://www.essex.ac.uk/linguistics/clmt/w3c/corpus_ling/content/introduction.html.
Il ladino fra polinomia e standardizzazione:<br />
l’apporto della linguistica computazionale<br />
Evelyn Bortolotti, Sabrina Rasom 1<br />
Dolomitic Ladin is a polynomic language: it is characterised by a rather large variety of<br />
local idioms, which have been undergoing a process of normalisation and standardisation.<br />
This process can be supported by the development of computer-based infrastructures<br />
and tools. The efforts of the major Ladin institutions and organisations have led to the<br />
creation of lexical and terminological databases, electronic dictionaries, concordancing<br />
tools for corpus analysis and, eventually, to the development of spell-checkers and<br />
‘standard adapters/converters’.<br />
1. Introduzione<br />
Il ladino delle Dolomiti (Italia) è caratterizzato da una grande varietà interna, che<br />
ha reso necessario un intervento di normazione e standardizzazione, nel rispetto del<br />
carattere polinomico della lingua stessa.<br />
Nelle cinque valli ladine dolomitiche si vanno formando lingue di scrittura, o<br />
standard di valle. Alcuni idiomi di valle sono piuttosto unitari ed è stato sufficiente<br />
codificarli, ma in Val Badia (con Marebbe) e in Val di Fassa la loro varietà ha portato<br />
alla proposta di una normazione che si sovrapponesse alle sottovarianti di paese. Ad<br />
esempio il badiot unitar, basato principalmente sull’idioma centrale (San Martin),<br />
ma aperto anche a elementi provenienti da idiomi di altri paesi, e similmente<br />
il fascian standard, orientato verso l’idioma cazet, la cui scelta come variante<br />
standard è giustificata anche dal fatto che questo idioma è molto più vicino nelle sue<br />
caratteristiche linguistiche agli altri idiomi dolomitici.<br />
Infine si è sentito il bisogno di elaborare un livello ancora più alto di standardizzazione<br />
valido per l’intera Ladinia, sulle orme del Rumantsch Grischun, dando il via<br />
all’elaborazione del Ladin Dolomitan, o Ladin Standard (LS).<br />
Dal punto di vista della polinomia quindi, da una situazione linguistica molto<br />
differenziata, si è passati prima a un livello più alto di normazione che consente,<br />
prendendo la valle come unità di riferimento, di raccogliere più varietà in una norma<br />
unica. A seguire si è raggiunto un terzo livello che permette di avere a disposizione<br />
1 I paragrafi “Introduzione” e “Risorse e infrastrutture linguistiche e lessicali” sono stati scritti da Evelyn<br />
Bortolotti; il paragrafo “Correttori ortografici con adattamento morfologico” è stato scritto da Sabrina<br />
Rasom.<br />
un unico idioma di riferimento, una norma o lingua standard per tutte e cinque le<br />
vallate.<br />
La standardizzazione ha riguardato in un primo tempo la forma grafica: si è cercato<br />
di adottare una grafia il più possibile comune alle diverse varianti ladine dolomitiche,<br />
per garantire, nella diversità, il riconoscimento dell’appartenenza alla stessa famiglia<br />
linguistica e un maggiore grado di coesione e uniformità del sistema.<br />
L’utilizzo della lingua ladina scritta nelle scuole, nelle pubbliche amministrazioni,<br />
nella stampa ecc. comporta, a seconda del grado di standardizzazione del ladino<br />
utilizzato, un più o meno marcato sforzo di avvicinamento alla norma da parte<br />
dello scrivente e richiede una grande consapevolezza delle differenze fra la propria<br />
sottovarietà e lo standard utilizzato nello scrivere.<br />
Di fondamentale importanza in questo processo di standardizzazione è stato<br />
e continua a essere l’apporto della linguistica computazionale. La diffusione<br />
della tecnologia informatica permette infatti la creazione e lo sviluppo di risorse<br />
linguistiche e di infrastrutture di supporto al trattamento automatico della lingua,<br />
soprattutto nell’ambito della lessicografia moderna e tradizionale e della terminologia<br />
settoriale basate su corpora e della standardizzazione linguistica. Inoltre favorisce la<br />
realizzazione di strumenti di aiuto alla scrittura che facilitino il passaggio verso la<br />
norma standard.<br />
Nel caso del ladino delle valli dolomitiche, i vari progetti relativi all’informatizzazione<br />
delle risorse lessicali e allo sviluppo di strumenti per il trattamento automatico sono<br />
stati portati avanti attenendosi al principio di conservazione e valorizzazione della<br />
ricchezza e della varietà in una visione unitaria. Questo principio deriva dalla riflessione<br />
teorica del linguista còrso Jean-Baptiste Marcellesi, che per primo ha introdotto il<br />
concetto di “lingue polinomiche” (Langues Polynomiques) [Chiorboli 1991].<br />
Le principali istituzioni coinvolte nei progetti di modernizzazione e di trattamento<br />
automatico del ladino promossi o realizzati in collaborazione con l’Istitut Cultural<br />
Ladin “Majon di Fascegn” sono l’Union Generela di Ladins dles Dolomites e l’Istitut<br />
Ladin “Micurà de Rü”.<br />
I principali obiettivi perseguiti in campo linguistico computazionale sono:<br />
• l’informatizzazione del patrimonio lessicale ladino con la creazione di<br />
una banca dati generale lessicale ladina (BLAD), di banche dati strutturate delle<br />
varietà locali e di una banca dati centrale dello standard;<br />
• l’elaborazione di dizionari degli standard di valle (per il fassano standard:<br />
DILF “Dizionario Italiano – Ladino Fassano / Dizionèr talian-ladin fascian”, per<br />
il badiotto unitario: Giovanni Mischì, Wörterbuch Deutsch – Gadertalisch /<br />
Vocabolar todësch – ladin (Val Badia); per il gardenese: Marco Forni, Wörterbuch<br />
Deutsch – Grödner-ladinisch / Vocabuler tudësch – ladin de Gherdëina) e del ladino<br />
standard (DLS “Dizionar dl Ladin Standard”) anche in versione elettronica e alcuni<br />
consultabili online;<br />
• la raccolta di glossari terminologici, parzialmente consultabili online<br />
(glossari di ambiente, botanica, materie giuridico-amministrative, medicina,<br />
architettura e costruzioni, pedagogia, musica e trasporto turistico);<br />
• la creazione di corpora elettronici analizzabili tramite un’apposita<br />
interfaccia: il web-concordancer;<br />
• la realizzazione di strumenti informatici per facilitare l’uso e<br />
l’apprendimento delle varianti standard: dizionario elettronico, e-learning,<br />
correttori ortografici e adattatori per il fassano standard e per il Ladin Standard.<br />
2. Risorse e infrastrutture linguistiche e lessicali<br />
2.1 BLAD: Banca dac Lessicala Ladina<br />
La banca dati BLAD consente l’accesso:<br />
• allo SPELL base, il database che raccoglie circa 15.000 schede con LS e idiomi<br />
di valle (lessico prevalentemente moderno), da cui è stato elaborato il DLS;<br />
• alle banche dati locali di lessico tradizionale, per un totale di circa 90.000<br />
schede, in cui sono confluiti i dati raccolti dai dizionari e dai database di<br />
lessico patrimoniale (per il fassano: Dell’Antonio 1972, Mazzel 1995, De Rossi<br />
1999; per il badiotto: Pizzinini-Plangg 1966; per il gardenese: Lardschneider-<br />
Ciampac 1933 e 1992, Martini 1953; per il fodom: Masarei in stampa; per<br />
l’ampezzano: Comitato 1997) (descrittivi);<br />
• alle banche dati dei dizionari moderni (normativi), per un totale di circa<br />
250.000 schede (DILF, Mischì, Forni);<br />
• alle banche dati terminologiche elaborate nell’ambito del progetto TERM-LeS,<br />
in cui sono raccolte circa 16.000 schede.<br />
Fig. 1: L’interfaccia di ricerca della banca dati BLAD: la ricerca può essere effettuata in<br />
italiano, tedesco, LS e negli idiomi di valle.<br />
Fig. 2: Esempio di scheda: dal pannello “Idioms”, in cui accanto al lemma in LS vengono<br />
riportate la traduzione italiana e tedesca e le forme corrispondenti negli idiomi di valle, si<br />
ha anche accesso alle singole banche dati locali.<br />
2.2 I dizionari normativi: le versioni elettroniche online del DILF e del DLS<br />
Il DILF e il DLS sono strumenti linguistici la cui accessibilità e semplicità d’uso<br />
consentono la facile consultazione di risorse lessicali di grande importanza per le<br />
persone che si trovano a dover scrivere in fassano standard o in ladino standard.<br />
Nel DILF (Dizionario Italiano – Ladino Fassano / Dizionèr talian-ladin fascian) il<br />
repertorio lessicale tradizionale registrato nei dizionari descrittivi è stato integrato<br />
con un’ampia selezione di voci moderne il cui uso è ampiamente documentato nella<br />
produzione linguistica contemporanea. Questa versione elettronica, corrispondente<br />
alla seconda edizione cartacea (2001), è stata realizzata con la collaborazione dell’ITC-<br />
IRST di Trento, avviata nell’ambito di progetti relativi al trattamento automatico e<br />
allo sviluppo di infrastrutture informatiche per il ladino (progetto “TALES”, iniziato<br />
nel 1999).<br />
Fig. 3: DILF online: esempio di ricerca dal ladino fassano all’italiano, con visualizzazione<br />
dei risultati.<br />
Fig 4: Esempio di scheda di lemma con traducenti ladini e fraseologia<br />
Accanto al DILF, è disponibile anche la versione online del Dizionar dl Ladin<br />
Standard (DLS). A differenza dei consueti dizionari bilingui, il DLS registra i lemmi in<br />
ladino standard con accanto i termini corrispondenti negli idiomi di valle, dai quali<br />
la forma standard è stata ricavata secondo un articolato complesso di criteri. Inoltre<br />
viene riportato il traducente sia in italiano che in tedesco, lingue di adstrato delle<br />
valli ladine dolomitiche.<br />
Fig. 5: Interfaccia di ricerca: la parola può essere ricercata in ognuna delle varianti<br />
registrate nel dizionario e nei traducenti italiani e tedeschi<br />
Fig. 6: Esempio di scheda di lemma LS con traduzione italiana<br />
e tedesca e forme locali corrispondenti<br />
2.3 Il progetto TERM-LeS: Standardizzazione lessicale e terminologia per le<br />
lingue ladina e sarda 2<br />
Il progetto, condotto negli anni 2001-2003, ha previsto l’elaborazione di terminologia<br />
moderna e la creazione di banche dati terminologiche in ladino standard nei settori in<br />
cui l’uso della lingua ladina è obbligatorio (amministrazione) e in altri settori rilevanti<br />
per la realtà territoriale (architettura e costruzioni, ambiente, medicina, botanica,<br />
musica, pedagogia, trasporto turistico). Alcuni di questi glossari sono stati realizzati nel<br />
quadro del progetto Linmiter, promosso dalla Direzione Terminologia e Industrie della<br />
Lingua (DTIL) dell’Unione Latina, in coordinamento con altre minoranze linguistiche<br />
europee neolatine.<br />
2 Il ladino e il sardo sono le lingue oggetto dello stesso progetto di standardizzazione terminologica e<br />
lessicale finanziato dalla Comunità Europea tra il 2001 e il 2003.<br />
Fig. 7: Esempio di scheda terminologica: l’interfaccia di lavoro permette una visione<br />
sinottica sullo standard e sulle varianti. Da essa è inoltre possibile accedere direttamente<br />
alle banche dati degli idiomi di valle, ai dizionari moderni e ai corpora testuali.<br />
Tanto nell’elaborazione lessicografica quanto in quella terminologica in ladino<br />
standard, la polinomia (la varietà e la diversità degli idiomi ladini) è la base di<br />
partenza per la standardizzazione; la lingua standard attinge quindi dalle varianti<br />
locali riassumendole in una norma comune, mirando nel contempo a essere non<br />
uno strumento per soffocare le differenze, ma al contrario un tetto, un ombrello di<br />
protezione contro gli influssi e le interferenze esterne, e un punto di collegamento fra<br />
i diversi idiomi per permetterne uno sviluppo parallelo e armonico. La struttura delle<br />
banche dati e l’interfaccia di lavoro tengono quindi conto dell’esigenza di avere facile<br />
e immediato accesso a tutte le risorse linguistiche utili e necessarie.<br />
2.4 I corpora elettronici<br />
Nell’ambito del progetto TALES sul trattamento automatico della lingua ladina<br />
sono state create delle raccolte organiche di testi ladini, sia nel ladino standard che<br />
nei singoli idiomi. I corpora raccolti (fassano, gardenese, badiotto e ampezzano)<br />
contengono complessivamente circa 6.500.000 parole. I testi selezionati coprono<br />
un periodo che va dal XIX secolo fino ai giorni nostri, con preponderanza di testi<br />
appartenenti alla seconda metà del XX secolo. Per garantire un certo equilibrio fra i<br />
vari generi, sono stati inseriti sia testi letterari (prosa, poesia, teatro, memorialistica,<br />
testi sul folclore e le tradizioni, libri di preghiere), sia testi non letterari (testi giuridici<br />
e amministrativi, modulistica, testi di informazione giornalistica e pragmatici, testi<br />
di divulgazione scientifica e culturale, testi scolastici). Attualmente il corpus fassano<br />
è quello nella fase più avanzata di elaborazione. La sua strutturazione, che fornisce<br />
per ogni testo informazioni rilevanti (data, luogo di provenienza, tipologia testuale,<br />
autore), permette di affinare la ricerca secondo una serie di criteri predeterminati.<br />
I corpora sono consultabili tramite il concordancer, uno strumento elaborato ad<br />
hoc e rivolto anzitutto al linguista e allo studioso del ladino: esso permette l’analisi<br />
dei testi attraverso la ricerca di concordanze, collocazioni e frequenze secondo la<br />
modalità KWIC (Keyword In Context), ossia un sistema che permette di visualizzare la<br />
parola oggetto della ricerca con il suo contesto a corredo.<br />
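Il principio della modalità KWIC può essere schematizzato in poche righe di codice<br />
(abbozzo illustrativo, non il codice del concordancer reale):<br />

```python
def kwic(testo, parola, contesto=3):
    """Restituisce la parola cercata con n parole di contesto
    a sinistra e a destra, secondo la modalità Keyword In Context."""
    tokens = testo.split()
    righe = []
    for i, t in enumerate(tokens):
        if t.lower() == parola.lower():
            sinistra = " ".join(tokens[max(0, i - contesto):i])
            destra = " ".join(tokens[i + 1:i + 1 + contesto])
            righe.append((sinistra, tokens[i], destra))
    return righe
```

Su un corpus reale l'interfaccia aggiunge a questo nucleo l'evidenziazione a colori e<br />
la scelta dell'ampiezza del contesto da parte dell'utente.<br />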
Una sezione del concordancer è dedicata ai corpora amministrativi bi- e trilingui<br />
allineati: questa raccolta è di particolare utilità nel lavoro di realizzazione di glossari<br />
settoriali.<br />
Il lavoro preliminare per lo sviluppo dello strumento di analisi di corpora è<br />
consistito nella creazione di corpora testuali: i testi selezionati sono stati acquisiti<br />
elettronicamente oppure manualmente e sono stati elaborati rispettando precisi<br />
criteri di archiviazione. In seguito sono stati classificati in base alla loro appartenenza<br />
diatopica (individuazione della variante in cui sono scritti) e diacronica (dalle prime<br />
testimonianze scritte in ladino sino ai testi contemporanei) e alla tipologia testuale<br />
(testi letterari e non letterari con individuazione del genere specifico). Per ogni testo<br />
è stato creato un frontespizio elettronico che riassume tutte queste informazioni:<br />
periodo, autore, genere, nome del file, titolo originale, numero di parole, variante.<br />
Il frontespizio è stato linkato al testo corrispondente, cosicché le informazioni in esso<br />
contenute possano essere utilizzate per circoscrivere la ricerca.<br />
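Un frontespizio di questo tipo si presta a una selezione programmata dei testi; ad<br />
esempio, in forma schematica (campi e valori ipotetici):<br />

```python
# Frontespizi elettronici di esempio: i campi ricalcano quelli citati nel testo
# (periodo, genere, variante, titolo, numero di parole).
corpus = [
    {"variante": "fascian", "periodo": "1990-2000", "genere": "prosa",
     "titolo": "Testo A", "parole": 1200},
    {"variante": "badiot", "periodo": "1950-1960", "genere": "poesia",
     "titolo": "Testo B", "parole": 800},
]

def seleziona(corpus, **criteri):
    """Circoscrive la ricerca ai testi il cui frontespizio
    soddisfa tutti i criteri indicati."""
    return [t for t in corpus
            if all(t.get(campo) == valore for campo, valore in criteri.items())]
```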
I corpora consultabili attraverso il concordancer si rivelano una risorsa di<br />
fondamentale importanza per diversi campi di applicazione: per lo studio del lessico,<br />
della sintassi e della morfologia, per l’elaborazione di strumenti normativi e didattici,<br />
per le operazioni di corpus planning, per i progetti relativi alla standardizzazione<br />
della lingua e per l’elaborazione di banche dati lessicografiche e di terminologia<br />
multilingue.<br />
Fig. 8: Esempio di ricerca nel concordancer: la parola cercata viene visualizzata in un<br />
breve contesto e in rosso per essere facilmente riconosciuta. Anche la parola che la<br />
precede o segue può essere evidenziata in un colore diverso. L’interfaccia di ricerca<br />
permette all’utente di decidere quante parole devono apparire nel contesto.<br />
3. Correttori ortografici con adattamento morfologico<br />
Nell’ambito del progetto SPELL-TALES, nell’anno 2002, l’Istituto Culturale Ladino<br />
ha realizzato il correttore ortografico del ladino fassano in collaborazione con l’ITC-<br />
IRST di Trento e col sostegno finanziario dell’Unione Europea, del Comprensorio Ladino<br />
di Fassa C11 e della Regione Trentino Alto-Adige. L’Istituto Culturale Ladino ha curato<br />
la parte linguistica del progetto riguardante la creazione delle regole morfologiche,<br />
mentre la parte informatica è stata seguita dall’ITC, nella persona del dott. Claudio<br />
Giuliano, che ha elaborato e applicato il programma di generazione delle forme. La<br />
realizzazione del software è poi stata affidata alla ditta Expert System di Modena.<br />
Nel corso del 2003 è stato messo a punto anche il correttore ortografico del ladino<br />
standard – SPELL-checker –, elaborato con le stesse modalità del correttore fassano.<br />
I due software di correzione sono realizzati in ambiente Windows e Macintosh<br />
per tutti gli applicativi Consumer della suite Microsoft Office e sono corredati di<br />
installazione automatica e di guida e assistenza all’installazione. Le funzionalità<br />
previste da questi due strumenti, similmente ai correttori ortografici disponibili per<br />
le lingue maggioritarie, prevedono la correzione di errori di digitazione, di ortografia<br />
e di morfologia direttamente durante la redazione di un testo, oppure in un secondo<br />
momento, sottoponendo a verifica un testo già scritto. I correttori ortografici in<br />
questione si basano su forme ricavate dai dizionari di riferimento, rispettivamente<br />
il DILF per il fassano standard e il DLS per il ladino standard; il formario di base<br />
fassano è poi stato implementato con forme ottenute dallo spoglio di alcuni testi<br />
amministrativi e giornalistici (Usc di Ladins) esportati tramite il concordancer e con i<br />
dizionari personalizzati realizzati dagli utenti che hanno usato il correttore per circa<br />
un anno nell’ambito dell’amministrazione. Per quanto riguarda il formario del ladino<br />
standard l’implementazione è avvenuta attraverso il dizionario personalizzato creato<br />
dai redattori del sito Noeles.net e da export delle banche terminologiche.<br />
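Il nucleo di un correttore basato su un formario è il confronto di ogni parola del testo<br />
con la lista delle forme ammesse; in forma schematica (formario giocattolo, non quello<br />
effettivamente ricavato dal DILF o dal DLS):<br />

```python
import re

# Formario giocattolo: nella realtà le forme provengono dai dizionari
# di riferimento e dalle integrazioni descritte sopra.
formario = {"la", "lingua", "ladina", "de", "mendranza"}

def verifica(testo, formario):
    """Restituisce le parole del testo assenti dal formario
    (possibili errori di digitazione, ortografia o morfologia)."""
    parole = re.findall(r"\w+", testo.lower())
    return [p for p in parole if p not in formario]
```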
L’Istituto Culturale Ladino “Majon di Fascegn”, in collaborazione con l’Istituto<br />
Ladino “Micurà de Rü” e con il supporto tecnico-informatico della Ditta Open Lab di<br />
Firenze, sta ora lavorando a una seconda generazione di correttori ortografici delle<br />
varietà dolomitiche (badiotto, fassano, gardenese) e del ladino standard, non più<br />
ancorata alla Suite Office di Microsoft. Si tratta di una scelta all’avanguardia che<br />
prevede la realizzazione di software e sistemi aperti (open source) disponibili in rete<br />
e non più dipendenti da programmi specifici.<br />
Il motore alla base dei correttori delle diverse varianti sarà unico e il sistema totalmente<br />
internazionalizzato: l’interfaccia d’uso a lingua multipla permetterà di scegliere la<br />
lingua stessa di interfaccia e la lingua di correzione all’atto della configurazione. Le<br />
novità pratiche più importanti di questi strumenti stanno in un’accurata ricerca delle<br />
corrispondenze interne al formario, che non si presenterà più come una semplice<br />
lista di forme non ancorate fra loro, bensì avrà una sua coerenza interna, riconoscerà<br />
la categoria grammaticale a cui appartiene ogni forma, la rispettiva forma base<br />
di riferimento, la coniugazione o declinazione e la marca d’uso, per poi suggerire<br />
l’eventuale forma corretta. Inoltre, nel processo di sofisticazione delle opzioni di<br />
correzione che verranno fornite, le varietà ladine inserite nel correttore saranno<br />
corredate da uno specifico algoritmo fonetico (soundslike) che non sarà più quello<br />
Metaphone classico dell’inglese (usato fra l’altro dalla maggior parte dei correttori<br />
ortografici esistenti), ma verrà elaborato sui soundslike specifici delle varietà in<br />
questione, permettendo quindi opzioni di correzione più precise.<br />
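L'idea di una chiave fonetica specifica, in alternativa al Metaphone inglese, può essere<br />
resa con un abbozzo volutamente semplificato (le regole qui sotto sono inventate a puro<br />
scopo illustrativo e non corrispondono all'algoritmo in corso di elaborazione):<br />

```python
def chiave_fonetica(parola):
    """Chiave 'soundslike' semplificata: normalizza alcune grafie e
    scarta le vocali interne, come nei classici algoritmi fonetici."""
    p = parola.lower()
    for grafia, suono in [("ch", "k"), ("sc", "s"), ("c", "k"), ("j", "i")]:
        p = p.replace(grafia, suono)
    return p[0] + "".join(c for c in p[1:] if c not in "aeiouàèéëìòù")

def suggerisci(errata, formario):
    """Propone come correzioni le forme del formario
    che condividono la chiave fonetica della parola errata."""
    k = chiave_fonetica(errata)
    return sorted(f for f in formario if chiave_fonetica(f) == k)
```

Un algoritmo costruito sui suoni reali delle varietà ladine segue lo stesso principio,<br />
ma con un inventario di regole molto più ricco e mirato.<br />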
Fig. 9: Interfaccia del nuovo correttore ortografico open source<br />
accessibile direttamente da internet.<br />
Nel progetto di elaborazione di questa nuova tipologia di strumenti di correzione<br />
l’Istituto Culturale Ladino “Majon di Fascegn” sta sperimentando un’ulteriore<br />
funzione nell’ambito del correttore ortografico open source per l’assistenza a chi<br />
scrive in ladino fassano e in ladino standard. Si tratta di una funzione di adattamento<br />
morfologico che permetterà di passare “automaticamente” dalla variante locale<br />
fassana (cazet, brach, moenat) alla variante fassana standard, oppure dalle varietà<br />
standard di valle (fassano standard, badiotto unificato e gardenese) al ladino standard<br />
durante la digitazione di un testo.<br />
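Nella sua forma più elementare, l'adattamento può essere pensato come una sostituzione<br />
guidata da tavole di corrispondenza fra le forme della sottovarietà e quelle dello<br />
standard (le coppie dell'esempio sono fittizie):<br />

```python
def adatta(testo, tavola):
    """Sostituisce ogni forma locale con la forma standard corrispondente;
    le parole assenti dalla tavola restano invariate."""
    return " ".join(tavola.get(parola, parola) for parola in testo.split())

# Tavola di corrispondenza puramente fittizia (sottovarietà -> standard)
tavola_demo = {"aua": "ega", "ciasa": "cèsa"}
```

Lo strumento in sviluppo opererà invece durante la digitazione, appoggiandosi alle regole<br />
morfologiche del correttore anziché a una semplice lista di parole.<br />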
I nuovi strumenti di correzione si rendono quanto mai utili nel momento in cui<br />
una lingua polinomica viene riconosciuta come lingua ufficiale e si ritrova quindi a<br />
dover far fronte alle esigenze della comunicazione in ambito pubblico-amministrativo<br />
e nella scuola. Come è stato già osservato, l’apporto della linguistica computazionale<br />
nel processo di standardizzazione si è rivelato di primaria importanza per facilitare il<br />
passaggio dalla sottovarietà dello scrivente (impiegato, insegnante, studente o semplice<br />
appassionato) a una lingua standard ufficiale e unificata. I correttori ortografici sono<br />
quindi un passaggio fondamentale verso la realizzazione di strumenti ausiliari sempre<br />
più sofisticati per coloro che lavorano ogni giorno con la lingua ladina.<br />
Bibliografia<br />
Bortolotti, E. & Rasom, S. (2003). “Linguistic Resources and Infrastructures for the<br />
Automatic Treatment of Ladin Language.” Proceedings of TALN 2003. RECITAL 2003.<br />
Tome 2. Batz-sur-Mer, 253-263.<br />
Chiocchetti, N. & Iori, V. (2002). Gramatica del Ladin Fascian. Vigo di Fassa: Istitut<br />
Cultural Ladin “majon di fascegn”.<br />
Chiorboli, J. (ed.) (1991). Corti 90: actes du Colloque international des langues<br />
polynomiques. PULA n° 3/4, Université de Corse.<br />
Comitato del Vocabolario delle Regole d’Ampezzo (1997). Vocabolario Italiano -<br />
Ampezzano. Cortina d’Ampezzo: Regole d’Ampezzo e Cassa Rurale ed Artigiana di<br />
Cortina d’Ampezzo e delle Dolomiti.<br />
Dell’Antonio, G. (1972). Vocabolario ladino moenese – italiano. Trento: Grop Ladin da<br />
Moena.<br />
De Rossi, H. (1999). Ladinisches Wörterbuch: vocabolario ladino (brach)-tedesco. A<br />
cura di Kindl, U. e Chiocchetti, F. Vigo di Fassa: Istitut Cultural Ladin “majon di<br />
fascegn”/Universität Innsbruck.<br />
Forni, M. (2002). Wörterbuch Deutsch - Grödner-Ladinisch. Vocabuler Tudësch – Ladin<br />
de Gherdëina. San Martin de Tor: Istitut Ladin “Micurà de Rü”.<br />
Giuliano, C. (2002). “A tool box for lexicographers.” Proceedings of EURALEX 2002.<br />
Copenhagen: Center for Sprogteknologi (CST), 113-118.<br />
Istitut Cultural Ladin “majon di fascegn”/SPELL (2001). DILF: Dizionario italiano -<br />
ladino fassano con indice ladino - italiano = Dizionèr talian-ladin fascian con indesc<br />
ladin-talian. II ed., 1. rist. Vigo di Fassa: Istitut Cultural Ladin “majon di fascegn”/SPELL.<br />
Lardschneider-Ciampac, A. (1933). Wörterbuch der Grödner Mundart. (Schlern-<br />
Schriften ; 23). Innsbruck: Wagner.<br />
Lardschneider-Ciampac, A. (1992). Vocabulér dl ladin de Gherdëina: Gherdëina -<br />
Tudësch. Übera. von Mussner, M. & Craffonara, L. San Martin de Tor: Istitut Ladin<br />
“Micurà de Rü”.<br />
Martini, G.S. (1953). Vocabolarietto gardenese – Italiano. Firenze: Francolini.<br />
Masarei, S. (2005). Dizionar Fodom – Talián - Todésch. Colle Santa Lucia: Istitut Cultural<br />
Ladin “Cesa de Jan” - SPELL.<br />
Mazzel, M. (1995). Dizionario Ladino fassano (cazet) – Italiano: con indice italiano-<br />
ladino. 5. ed. riv. e aggiornata (prima ed. 1976). Vigo di Fassa: Istitut Cultural Ladin<br />
“majon di fascegn”.<br />
Mischì, G. (2000). Wörterbuch Deutsch - Gadertalisch. Vocabolar Todësch – Ladin (Val<br />
Badia). San Martin de Tor: Istitut Ladin “Micurà de Rü”.<br />
Pizzinini, A. & Plangg, G. (1966). Parores Ladines. Vocabulare badiot – tudësk. Ergänzt<br />
und überarbeitet von G. Plangg. Innsbruck: L.F. Universität Innsbruck.<br />
SPELL (2001). Gramatica dl Ladin Standard. Urtijëi: SPELL.<br />
SPELL (2002). DLS - Dizionar dl Ladin Standard. Urtijëi: SPELL.<br />
Schmid, H. (2000). Criteri per la formazione di una lingua scritta comune della ladinia<br />
dolomitica. San Martin de Tor/Vich: Istitut Ladin “Micurà de Rü”/Istitut Cultural Ladin<br />
“majon di fascegn”.<br />
Valentini, E. (2002). Ladin Standard. N lingaz scrit unitar per i ladins dles Dolomites.<br />
Urtijëi: SPELL.<br />
Videsott, P. (1997). “Der Wortschatz des Ladin Dolomitan: Probleme der<br />
Standardisierung.” Iliescu, Maria (Hrsg.) et al.: Ladinia et Romania. Festschrift für<br />
Guntram Plangg zum 65. Geburtstag. Vich/Vigo di Fassa: ICL 149-163. [Mondo Ladino,<br />
21].<br />
Il progetto “Zimbarbort” per il recupero del<br />
patrimonio linguistico cimbro<br />
Luca Panieri<br />
In past centuries, the people living in the mountain territory between the rivers Adige<br />
and Brenta in northern Italy spoke a Germanic language usually known as ‘Cimbro’,<br />
brought there by Bavarian colonists in the Middle Ages. Surrounded by Italian speakers,<br />
and isolated from the rest of the German-speaking world, Cimbro developed as an<br />
autonomous language, preserving many of its original old German features while<br />
becoming strongly influenced by Italian lexis and syntax.<br />
Since this language is nowadays commonly spoken only in Luserna (a village south<br />
of Trento), the local township has set up a project (presented in this paper) for the<br />
creation of a database of Cimbro lexis.<br />
The main purpose of the project is to create a virtual memory of the Cimbrian language,<br />
where all known records of the Cimbrian language tradition can be stored. The first<br />
written records in Cimbro date back to around 1600, so the aim of the project is to<br />
give back to the Cimbrian language tradition its forgotten historical roots. We are sure<br />
that by looking into the deep historical layers of the language tradition, we will help<br />
the surviving Cimbrian community of Luserna to face the present.<br />
Premessa<br />
Con questo breve contributo si illustrano le linee guida di un progetto strategico<br />
finalizzato al recupero del patrimonio lessicale della tradizione linguistica cimbra.<br />
Tale progetto ha ottenuto l’approvazione del Comune di Luserna (l’isola linguistica<br />
cimbra più consistente), che ha erogato per l’anno in corso un primo finanziamento. Lo<br />
scrivente, membro del Comitato Scientifico dell’Istituto di Cultura Cimbra di Luserna,<br />
è stato da esso designato Coordinatore del progetto.<br />
L’Istituto di Cultura Cimbra, mediante la presentazione del progetto al Convegno<br />
Eurac “Lesser Used Languages and Computer” ha inteso soprattutto mettere a<br />
conoscenza gli esperti di linguistica computazionale dell’esistenza di tale iniziativa,<br />
illustrandone i contenuti, le finalità e la sua struttura operativa, allo scopo di sollecitare<br />
eventuali proposte sulle modalità tecniche della sua realizzazione. In tal senso, grazie<br />
all’occasione d’incontro con gli specialisti fornita dal Convegno di Bolzano, i promotori
del progetto sono effettivamente riusciti a suscitare vivo interesse e concrete proposte<br />
di collaborazione per la realizzazione della banca dati lessicale.<br />
Si deve quindi premettere che il presente contributo non è che la trasposizione<br />
scritta della presentazione del progetto, inteso nei termini suddetti. Non si tratta<br />
quindi di un articolo specialistico di contenuto teorico o sperimentale, bensì della<br />
descrizione dell’iniziativa concreta che l’Istituto Cimbro intende promuovere per la<br />
salvaguardia del patrimonio lessicale della propria tradizione linguistica. Abbiamo<br />
demandato agli specialisti d’informatica il compito di indicarci le soluzioni tecnologiche<br />
più opportune alla sua realizzazione e gestione.<br />
Quanto detto sul carattere di questo contributo spiega anche la mancanza quasi<br />
totale di riferimenti bibliografici, che sono tuttavia presenti in misura modesta<br />
nella sola introduzione, essendo essa finalizzata a portare a conoscenza del lettore<br />
la particolare realtà linguistica cimbra. Il resto della trattazione, invece, come già<br />
evidenziato, consiste nella semplice esposizione delle linee guida del progetto.<br />
1. Introduzione<br />
L’idea di questo progetto nasce dalla consapevolezza della situazione precaria in<br />
cui versano le tre isole linguistiche cimbre sopravvissute nei secoli fino ai giorni nostri:<br />
Giazza (VR), Roana-Mezzaselva (VI) e Luserna (TN). In particolare, la condizione<br />
relativamente rosea in cui fortunatamente ancora si trova la varietà cimbra di<br />
Luserna impone l’attuazione di ogni possibile strategia di difesa e consolidamento<br />
del patrimonio linguistico cimbro, essendo diventata Luserna l’ultima roccaforte di<br />
un gruppo etnico un tempo disseminato in tutto il territorio prealpino tra l’Adige<br />
e il Brenta. 1 Tale tradizione fu un tempo capace di trovare originale espressione<br />
letteraria e politico-amministrativa, in particolar modo sull’Altopiano d’Asiago, dove<br />
la Reggenza dei Sette Comuni riuscì a conservare la propria autonomia di governo<br />
locale per molti secoli, sopravvivendo all’avvicendarsi delle potenti signorie dell’Italia<br />
settentrionale e mantenendo una propria fisionomia linguistica e culturale anche nei<br />
confronti del vasto mondo di lingua tedesca, tanto geograficamente vicino. 2<br />
Ai nostri giorni, quando ormai l’area linguistica cimbra si è drasticamente ridotta,<br />
soppiantata quasi ovunque dal dialetto veneto o dalla lingua italiana, ed è rimasta vitale<br />
soltanto a Luserna, insorge la necessità di evitare che il patrimonio lessicale espresso<br />
dalla civiltà cimbra nel corso dei secoli cada per sempre nell’oblio. Non consideriamo<br />
1 Tra i vari testi consultabili sulla questione dell’origine degli insediamenti “cimbri” e sulla loro lingua<br />
rimane tuttora fondamentale lo studio del grande dialettologo bavarese Johann Andreas Schmeller<br />
(1985).<br />
2 Per una sintesi efficace sulla storia istituzionale della comunità cimbra dei Sette Comuni dell’Altopiano<br />
d’Asiago, basata sulla documentazione, si veda anche Antonio Broglio (2000).<br />
ciò solamente un’operazione dettata dal rispetto per la memoria storica di una<br />
civiltà, ma soprattutto un intervento preventivo di rilevante importanza strategica<br />
e finalizzato a salvaguardare la tradizione linguistica cimbra. Oggigiorno infatti la<br />
comunità di Luserna si trova in una situazione di bilinguismo nettamente sbilanciato,<br />
in cui la lingua italiana predomina come mezzo di comunicazione atto a esprimere il<br />
panorama concettuale astratto della cultura moderna, mentre il cimbro è soprattutto<br />
la lingua materna della sfera affettiva, quella che esprime con genuina spontaneità<br />
i moti dell’animo, il sentimento di appartenenza alla comunità e al suo territorio<br />
naturale. Per quanto questa ripartizione complementare dell’uso delle due lingue<br />
possa apparire accettabile, se non addirittura comoda, essa pone il cimbro in posizione<br />
debole nei confronti dell’italiano. I continui stimoli e cambiamenti socio-economici<br />
e culturali del mondo moderno e il loro influsso globalizzante scardinano la coesione<br />
tradizionale delle “piccole patrie” di un tempo e ne catapultano gli appartenenti in un<br />
contesto socio-culturale del tutto diverso e di più ampie dimensioni, il cui baricentro<br />
è al di fuori della stessa comunità che ne subisce l’influenza. Questo mondo si<br />
esprime soprattutto mediante le lingue nazionali della scolarizzazione di massa, come<br />
appunto l’italiano o il tedesco. La lingua cimbra rimane quindi legata e, purtroppo,<br />
confinata all’ambito delle relazioni socio-economiche e dei valori tradizionali della<br />
piccola comunità di un tempo. Ma con gli inevitabili e troppo repentini mutamenti di<br />
prospettiva dovuti alla modernizzazione, la lingua connaturata alla tradizione locale<br />
cede il passo a quella delle relazioni esterne, della cultura tecnologica, scientifica e<br />
amministrativa, sempre più preponderanti.<br />
La sopravvivenza della tradizione linguistica cimbra dipende quindi dalla sua<br />
capacità di rinnovarsi ed espandere il proprio dominio espressivo agli ambiti concettuali<br />
tipici della cultura moderna.<br />
2. Strategic motivations and objectives<br />
In the light of the above, we consider it necessary to act in defence of the Cimbrian<br />
language by consolidating the historical foundations of its linguistic tradition,<br />
through the creation of a global database of the Cimbrian lexical heritage. Into it<br />
will flow the lexical data extracted from all the available written sources, from the<br />
earliest historical attestations in literary texts, such as the seventeenth-century<br />
Cimbrian Catechism, down to the Cimbrian language of today. The underlying idea is<br />
to create a kind of virtual home for the collective linguistic memory of Cimbrian<br />
civilisation, one that gathers as many lemmas as possible from all the historical<br />
varieties of Cimbrian, represented today by the three well-known linguistic islands of<br />
Giazza, Roana-Mezzaselva and Luserna.<br />
Beyond its undoubted historical and documentary value, such an operation makes<br />
it possible, on the strategic level, to give the Cimbrian language still in use the<br />
lexicological tools it needs to face the immediate threat of progressive erosion of its<br />
original vocabulary. The aim is to foster the recovery of the expressive resources<br />
of the Cimbrian linguistic tradition as a whole, seeing in it the soundest point of<br />
reference for consolidating the language of Luserna. With regard also to the current<br />
need to develop a Cimbrian lexicon capable of expression beyond the familiar and<br />
traditional sphere, the coining of neologisms should in the first instance draw on the<br />
community's own linguistic tradition, even broadly understood, before recourse is had<br />
to the Italian or German model. Either should be adopted only once the lack of<br />
internal linguistic resources has been ascertained.<br />
It may be objected that drawing on the historical Cimbrian lexicon to make good<br />
the semantic deficiencies of present-day speech in the more abstract conceptual<br />
fields of modern expression seems paradoxical: how could the lexical inventory of<br />
the past yield adequate ways of expressing concepts that in many cases no one had<br />
yet imagined, for instance in technology or in certain new fields of science? Obviously<br />
we will not expect to "rediscover" in the historical Cimbrian lexicon the exact word for<br />
'computer' or for 'ecology', but it will certainly not be difficult to render such concepts<br />
by starting from the lexical roots that, by semantic approximation and/or structural<br />
analogy, best describe them. To stay with these examples, we may treat the computer<br />
as a 'calculator', since that is its primary function, its first Italian name (calcolatore)<br />
and the literal meaning of the borrowed English term; one could then propose<br />
designating it with a Cimbrian term derived from the traditional verbal root for<br />
'to calculate'. For the concept of 'ecology', the starting point will be its possible<br />
paraphrase in everyday words that make the concept clear, such as 'science of the<br />
environment' or 'science of nature'. At that point the "modern" term has been brought<br />
back into conceptual fields already known to the Cimbrian linguistic tradition, those<br />
of 'knowledge' and 'nature'. Naturally, the theoretical solutions proposed are not to<br />
be imposed by authority: they can take root in everyday use only if the linguistic<br />
community perceives them as useful for spontaneous communication and in harmony<br />
with the sense every native speaker has of his or her own linguistic roots.<br />
Certainly, compared with the larger national linguistic communities, the Cimbrian<br />
community of Luserna, for all its disadvantages, at least enjoys closer cohesion<br />
between institutions and citizens. This in itself favours the success of every initiative<br />
undertaken by the local institutions, in which citizens see themselves directly<br />
reflected, in a climate of constructive participation. It therefore also works in favour<br />
of targeted language-policy measures sponsored by the local institutions.<br />
3. Operational structure<br />
The creation of the global database of the Cimbrian lexicon (the Zimbarbort project)<br />
falls essentially into two phases: the collection of the primary sources in the Cimbrian<br />
language, and the extraction of the individual lexical data and their entry into the<br />
database itself.<br />
3.1 Raccolta delle fonti<br />
In questa fase si procede al reperimento di ogni tipo di testimonianza linguistica<br />
del cimbro. Pur essendo questa fase logicamente preliminare rispetto a quella<br />
dell’estrapolazione e dell’inserimento dei dati nella banca virtuale, essa sarà<br />
destinata a protrarsi nel tempo fino all’esaurimento delle attestazioni storiche sulla<br />
lingua cimbra e continuerà seguendo a mano a mano gli eventuali sviluppi linguistici<br />
che si producono nel momento attuale. Poiché tale fase costituisce il momento di<br />
acquisizione alla “memoria virtuale collettiva” di ogni espressione lessicale integrata<br />
nella tradizione linguistica cimbra, essa sarà destinata ad arricchirsi progressivamente<br />
di ogni futuro neologismo che eventualmente si affermi nell’uso comune.<br />
A prescindere dall’epoca a cui risalgono, le attestazioni della lingua cimbra si<br />
possono ripartire in due categorie, distinte dal diverso supporto in cui sono state<br />
registrate e tramandate ai giorni nostri:<br />
• Fonti scritte<br />
In quest’ambito rientra la moltitudine di attestazioni scritte in cimbro (interamente<br />
o parzialmente) nell’intero corso della storia, fino al tempo presente. Si tratta di<br />
testi scritti di tipologia e di epoca varia, che comprendono opere letterarie, quali<br />
poesie, racconti popolari o testi liturgici, scritti ad uso privato, quali le epistole, e<br />
opere finalizzate allo studio della lingua cimbra, quali grammatiche, glossari, studi<br />
toponomastici, ecc.<br />
Ai fini del presente progetto si tratterà di individuare e raccogliere tutte le fonti<br />
scritte di cui si ha conoscenza per radunarle fisicamente in originale o almeno in<br />
copia fedele e inventariarle in modo ragionato, onde agevolarne la consultazione.<br />
Tra i criteri di catalogazione figureranno sicuramente il genere (poesia, grammatica,<br />
racconto popolare, epistola, ecc.) e il periodo storico.<br />
• Oral sources<br />
This type of attestation comprises all the recordings of the living voice of native<br />
speakers. Such sources are of the greatest importance for the study of phonology and<br />
of all the phenomena characteristic of spoken language.<br />
The availability of this kind of attestation is owed to the technological progress of<br />
the last hundred years, during which the quality of voice reproduction has steadily<br />
improved, while the recording media have changed and multiplied (broadcast,<br />
magnetic, digital, etc.).<br />
Here too the task will be to survey the existing recorded material and to collect it in<br />
the original or in faithful copies. It will then be suitably inventoried according to<br />
criteria that ease consultation. In this case, however, the attestations are much more<br />
homogeneous, both in date (the last century) and in genre (mostly interviews).<br />
3.2 Extraction and entry of the data into the database<br />
This phase can begin as soon as a first batch of attestations, written and/or oral, has<br />
been collected and inventoried; thereafter the collection phase and the extraction<br />
and entry phase can proceed in parallel.<br />
Before starting this phase, however, it is essential to have established the format in<br />
which each datum will be entered in the virtual database. Using the technical term,<br />
we shall call a record each lexical datum entered together with its accompanying<br />
information (source, meaning in Italian, grammatical and phraseological notes,<br />
semantic area, cross-references, etc.).<br />
• Choice of the format and structure of the record<br />
Particular attention must be paid to the preliminary definition of the parameters of<br />
the accompanying information for each entry, since this choice will shape the overall<br />
structure of the database.<br />
In principle, one must allow for the greatest possible number of pieces of information<br />
attributable to a lexical item. Since the headings within each record will take the<br />
form of 'fields' in the database, it will be best to assign a field of its own to every<br />
conceptual category potentially relevant as information. The exemplary record is one<br />
in which all the information fields are filled in, in the full knowledge that in many<br />
cases not all the data will be available. If, for example, we include the phonetic<br />
transcription of the lexical item among the information parameters, the field for this<br />
parameter will certainly remain empty for all the entries of the Cimbrian lexicon<br />
dating from very early historical periods, since the exact pronunciation of the<br />
language of the time cannot be established with sufficient certainty.<br />
• Extraction of the lexical data<br />
The acquisition of the lexical data will be more or less complex depending on the<br />
nature of the sources examined. This will be reflected in the amount of work involved<br />
and in the different skills required for the task.<br />
The simplest case is the scanning of a glossary, since the written source already<br />
presents the lexical data as headwords, with their translations and informative<br />
commentary. In this case the lexical data can be entered into the database almost<br />
at the same time as they are extracted from the text in which they were found.<br />
Moreover, the text itself supplies important grammatical information and information<br />
on the meaning of the lemma.<br />
Far more complex, by contrast, is the extraction of lexical data from recorded oral<br />
sources. Here the operation is particularly difficult in the case of recordings of poor<br />
quality and/or of dialectal provenance other than Luserna. The working group must<br />
then grapple with varieties of Cimbrian now close to extinction, and must be able to<br />
pick out, from the context of spoken discourse, the individual lexical constituents<br />
and recognise their mutual grammatical relations. The operators must then set down<br />
their interpretation of the lexical data in written form, making a reasoned choice<br />
about their graphic representation, and from there proceed to enter them into the<br />
database.<br />
• Entry of the lexical data into the database<br />
As we have already noted, entering the data presupposes the creation of a uniform<br />
format for all the records of the database. For each lexical datum (lemma) extracted,<br />
a specific record will be created, within which the datum will be accompanied by<br />
various informative annotations distributed among the respective fields. Having<br />
entered the lemma in its record, the operator must fill in the fields with the<br />
information available at the time, leaving the other fields empty. For example, the<br />
operator may enter a lemma from an ancient source whose context does not reveal<br />
its grammatical gender; in that case the field for the grammatical gender of nouns<br />
will be left empty.<br />
This procedure leaves open the possibility of later revisions of the records, aimed at<br />
completing the information attached to the lemmas whenever new information about<br />
them comes to light. To stay with the example just given, the scanning of other<br />
sources may later reveal the grammatical gender of that same lemma.<br />
Naturally, the continuous acquisition of sources for analysis often leads to the<br />
extraction of lexical data already known from attestations examined earlier. The<br />
repeated occurrence of one and the same lemma automatically leads to the revision<br />
of the record in which it was first entered, with the addition, step by step, of the<br />
new information drawn from the context of the source.<br />
Besides this "automatic" revision in the course of the work, it is nonetheless advisable<br />
to place alongside the operator entering the lexical data a reviser who checks the<br />
compilation of the records immediately, since in many cases the completeness of the<br />
informative notes on the lemmas depends not only on the context in which they were<br />
found but also on the specialist competence of whoever carries out the task.<br />
Bibliography<br />
Broglio, A. (2000). La proprietà collettiva nei Sette Comuni. Aspetti storico-normativi.<br />
Roana: Istituto di Cultura Cimbra.<br />
Schmeller, J.A. (1985). Über die sogenannten Cimbern der VII und XIII Communen<br />
auf den Venedischen Alpen und ihre Sprache (1811, 1838, 1852, 1855). Landshut:<br />
Curatorium Cimbricum Bavarense.
Stealth Learning with an Online Dog<br />
(Web-based Word Games for Welsh)<br />
Gruffudd Prys and Ambrose Choy<br />
This paper describes issues surrounding developing web-based word games in a<br />
minority language setting, and is based on experience gained from the development of<br />
a project designed to improve the language skills of fluent Welsh speakers undertaken<br />
at Canolfan Bedwyr at the University of Wales, Bangor.<br />
This project was conceived by the BBC as an entertaining way of improving the<br />
language skills of fluent Welsh-speakers, especially those in the 18-40 age range.<br />
Funded by ELWa, the body responsible for post-16 education and training outside<br />
higher education in Wales, it was to form part of BBC Wales’ “Learn Welsh” website.<br />
The BBC’s Welsh language web pages are immensely popular, attracting a high<br />
proportion of younger Welsh-speakers. A survey conducted by the BBC in April and May<br />
2003 revealed that 43% of users of the BBC Welsh-language online news service<br />
“Cymru’r Byd” belonged to the 15-34 age group, with a high level of workplace usage,<br />
peaking at lunchtimes. The project was to provide this audience with word games, a<br />
self-marking set of language-improvement exercises, and an online answering service dealing with<br />
grammatical and other language problems. In order to appeal to the target audience,<br />
it was important that they be entertaining and attractive in addition to being<br />
educational. It was also intended that the project should emphasise progressive youth<br />
culture rather than old-fashioned Celtic themes, and this would be incorporated into<br />
the design and feel of the games.<br />
This paper will concentrate specifically on the development of the interactive online<br />
games and puzzles, showing how digital language resources originally created for<br />
previous digital language projects were adapted and recycled, allowing the e-Welsh<br />
team at the University of Wales, Bangor, to produce a working website within a few<br />
short months. It will also detail some of the innovations created as part of the<br />
project, with a view to building a modularised set of components that will provide a<br />
versatile resource bank for future projects.<br />
1. The Ieithgi Name<br />
Welsh has a peculiar word for people intensely interested in language. It is ieithgi,<br />
the literal translation of which would be ‘language dog.’ Perhaps ‘language terrier’<br />
would be a meaningful image for English speakers, as it denotes someone who,<br />
having got hold of a particularly tasty bone to gnaw, is unwilling to let it go. It may be<br />
a question of some obscure Welsh grammar rule, or the origin of some Welsh place-<br />
name, but the ieithgi will not let the subject drop without knowing the answer.<br />
By coincidence, a project aimed at Welsh learners was using an animated dog,<br />
called Cumberland, and his owner, Colin, to introduce Welsh to new audiences. In the<br />
Colin and Cumberland storyline, Colin has no Welsh, whereas his dog Cumberland is a<br />
fluent, knowledgeable and slightly pompous Welsh speaker. As Colin and Cumberland<br />
was aimed at the same demographic age group as the Ieithgi project, and possessed<br />
a design that was modern, contemporary and attractive, it was therefore a short step<br />
for Cumberland, the know-all dog in the animated cartoons, to become the namesake<br />
and mascot of the Ieithgi project, on hand to answer questions on Welsh grammar as<br />
well as guide users through the games and exercises.<br />
2. Macromedia Flash and XML<br />
The brief received from the BBC specified that the games were to be created using<br />
Macromedia Flash. Flash is a multimedia authoring program that creates files that can<br />
be played on any computer, Mac or PC, where Flash Player is installed (Macromedia<br />
claim a coverage of 98% of all desktops worldwide).<br />
Flash can combine vector and raster graphics, and uses a native scripting language<br />
called ActionScript, which is similar to JavaScript. It can communicate with external XML<br />
files and databases, and, when used intelligently, produces small files which are quick<br />
to download. Flash also allows easy collaboration between a software engineer and<br />
a designer.<br />
Figure 1: Colin and Cumberland – The BBC Cartoon for Learners<br />
3. Technical Challenges<br />
The main technical challenge posed by the games was the need to adapt game<br />
formulas already existing in English to work with the characteristics of the Welsh<br />
language. This meant that new code specific to the needs of Welsh had to be created.<br />
The lack of ready-made Welsh language components available to form the building<br />
blocks needed to create the word games was a significant disadvantage when compared<br />
with developers creating similar games in a major language. These building blocks for<br />
Welsh had to be created as part of the project.<br />
In order to keep down costs, the project hoped to reuse resources developed<br />
originally for previous digital language projects undertaken by Canolfan Bedwyr. This<br />
is one way that a minority language such as Welsh can keep costs down and make<br />
frugal use of existing components in an attempt to keep pace with greater resourced<br />
languages.<br />
4. Resource Audit<br />
Over the years, as part of its mission to address the needs of the Welsh language in<br />
a digital environment, Canolfan Bedwyr has built up a library of language resources,<br />
including digital dictionaries, spelling and grammar checkers as well as the assorted<br />
components such as lexicons and lemmatizers that combine to create such tools. Many<br />
of these resources are either useful or essential when attempting to create games<br />
such as Ieithgi; although seemingly quite different, digital dictionaries share many<br />
prerequisites with word games.<br />
As the Ieithgi project was a low budget, tight deadline project, it was imperative<br />
that we make as much use as possible of our existing resources, as opposed to<br />
reinventing the wheel. However, we also recognised that new tools for manipulating<br />
the Welsh language would have to be forged in order for some aspects of Welsh<br />
to function properly in a digital online setting.<br />
Below is a list of the relevant resources available to Canolfan Bedwyr and the<br />
games in which they would be used:<br />
• Lexicon: To be used in Cybolfa (conundrum) and Dic Penderyn<br />
(hangman)<br />
• Place-name databases (AMR and Enwau Cymru): To be used in Rhoi Cymru<br />
yn ei lle (locate and identify place-names);<br />
• Proverb database: To be used in Diarhebol (guess the proverb);<br />
• Alphabet order sorter: To be used in Cybolfa, Diarhebol, Dic Penderyn,<br />
Pos croeseiriau (crossword) and Ystyrlon (identify the correct meaning).<br />
5. The Games<br />
Six games were to be produced for the Ieithgi project. Of these six, three were<br />
to be open-ended games. These games draw randomly from a large list of words or<br />
phrases each time the game is played, giving the user a fresh challenge every time<br />
they start a new game, and ensuring that the games have enormous replay value.<br />
Each instance of a closed game, on the other hand, must be created manually by a<br />
games designer, and this means in practice that there are fewer unique instances of<br />
closed games than of open games. However, conversely, the content of closed games<br />
can be more complex, as they do not need to be designed to conform to such tight<br />
technical constraints.<br />
Below is a list of the games divided by category:<br />
• Open Ended<br />
Dic Penderyn (hangman)<br />
Cybolfa (conundrums)<br />
Diarhebol (guess the proverb)<br />
• Closed<br />
Pos croeseiriau (crosswords)<br />
Rhoi Cymru yn ei Lle (locate and identify place-names)<br />
Ystyrlon (identify the correct meaning)<br />
6. Open Ended Games<br />
From a technical point of view, the open-ended games posed the greatest challenge.<br />
Cybolfa, Dic Penderyn and Diarhebol all make use of XML word lists that are used<br />
to supply the games with random words or phrases that test the player’s language<br />
skills.<br />
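The random draw from an XML word list can be sketched as follows. This is a minimal JavaScript illustration of the idea only; the word element name and the flat list layout are assumptions, not the actual Ieithgi file format.<br />

```javascript
// Minimal sketch: pick a random word from an XML word list.
// The <word> element name is an assumption; the real list format may differ.
function randomWord(xml) {
  // Collect the text content of every <word> element, then pick one at random.
  const words = [...xml.matchAll(/<word>([^<]+)<\/word>/g)].map(m => m[1]);
  return words[Math.floor(Math.random() * words.length)];
}

const list = '<words><word>gormes</word><word>gormod</word><word>goroer</word></words>';
const w = randomWord(list); // one of the three words, chosen at random
```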
6.1 Dic Penderyn<br />
Dic Penderyn, named after a Welsh folk hero, is our version of the popular Hangman<br />
game. Drawing a word at random from an XML file, Dic Penderyn gives the person<br />
playing the game ten attempts to guess the word before a set of gallows is built and<br />
a caricature of Colin, Cumberland’s owner, is hanged, signalling ‘Game Over.’<br />
From an educational point of view, Dic Penderyn nurtures spelling ability by having<br />
the player think in terms of the letter patterns present in the language in order to<br />
correctly identify the game word. The game also increases the player’s vocabulary<br />
by sometimes suggesting unfamiliar words (as the word list contains words of varying<br />
degrees of familiarity).<br />
The XML wordlist was drawn from the lexicon compiled for the BBC’s Learn Welsh<br />
dictionary, which had been created previously for the BBC by Canolfan Bedwyr. This<br />
had the bonus of making it possible to link each of the words in the wordlist to a<br />
definition on the dictionary’s Webpage. The link would appear each time the player<br />
failed to identify the word, increasing the educational value of the game by providing<br />
definitions of words that had proved unfamiliar.<br />
The lexicon itself included words taken from Corpws Electroneg o Gymraeg (CEG),<br />
the tagged 1 million word Welsh language corpus developed at the University of Wales<br />
Bangor in the early nineties.<br />
Having a part-of-speech tagged lexicon proved extremely useful, as it enabled a<br />
game designer to tweak the content of the word list created from it.<br />
After some initial playtesting, it was decided that conjugated verbs would be excluded<br />
from the wordlist. These are sometimes included in English versions of Hangman,<br />
as English has limited conjugation possibilities. In Welsh, however, as in Romance<br />
languages, most verbs follow a regular pattern of conjugation, with separate but<br />
regular conjugations for the different persons as well as tenses.<br />
Table 1: Conjugation of rhedeg (to run)<br />
Present | Imperfect | Past | Pluperfect | Subjunctive | Imperfect Subjunctive | Imperative<br />
rhedaf | rhedwn | rhedais | rhedaswn | rhedwyf | rhedwn | rhed, rheda<br />
rhedi | rhedit | rhedaist | rhedasit | rhedych | rhedit | rheded<br />
rhed, rheda | rhedai | rhedodd | rhedasai | rhedo | rhedai | rhedwn<br />
rhedwn | rhedem | rhedasom | rhedasem | rhedom | rhedem | rhedwch<br />
rhedwch | rhedech | rhedasoch | rhedasech | rhedoch | rhedech | rhedent<br />
rhedant | rhedent | rhedasant | rhedasent | rhedont | rhedent | rheder<br />
rhedir | rhedid | rhedwyd | rhedasid | rheder | rhedid |<br />
As is apparent from Table 1, many of the verb forms above are far too similar for<br />
a player to differentiate between them in a game of hangman. Coupled with the fact<br />
that conjugated verbs seem unfamiliar outside the context of a sentence, this meant<br />
that their inclusion would have made the game too difficult and unrewarding from a<br />
playability perspective.<br />
6.1.1 Mutation<br />
The lemmatizer also allowed us to prevent the initial consonant mutation that is<br />
a feature of Welsh words within sentences from making its way from sentences in the<br />
corpus to words in the word list.<br />
A word such as ci (dog) can have the following mutations:<br />
ci nghi gi chi<br />
For example:<br />
Fy nghi (my dog)<br />
Dy gi (your dog)<br />
Ei chi (her dog)<br />
Eu ci (their dog)
Mutations never occur when words appear in isolation (as words do in Dic<br />
Penderyn). It was therefore inappropriate to include mutated forms in the XML word<br />
list.<br />
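The lemmatizer's role here can be sketched, for the forms of ci shown above, as a simple reverse lookup. This JavaScript sketch is ours rather than the project's code, and a real Welsh lemmatizer covers far more forms and rules.<br />

```javascript
// Sketch of undoing initial consonant mutation so that only dictionary
// (radical) forms reach a word list. Covers only the mutations of ci
// shown above; a real lemmatizer handles many more mappings.
const radicalOf = {
  nghi: 'ci', // nasal mutation (fy nghi)
  gi: 'ci',   // soft mutation (dy gi)
  chi: 'ci'   // aspirate mutation (ei chi)
};

function demutate(word) {
  return radicalOf[word] || word; // unmutated forms pass through unchanged
}
```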
6.1.2 Dic Penderyn XML Word List Example<br />
Sample taken from the list of 3,000+ six-letter words:<br />
gormes<br />
gormod<br />
goroer<br />
6.1.3 Digraphs<br />
The Welsh alphabet contains digraphs:<br />
ch dd ff ng ll ph rh th<br />
These digraphs count as single letters rather than a combination of two separate<br />
letters. This means that Welsh, unlike most other languages that use the Roman<br />
alphabet, has two-character letters in addition to single-character letters.<br />
Take for example the word llefrith (milk), which has eight characters:<br />
L, L, E, F, R, I, T, H<br />
But six letters:<br />
LL, E, F, R, I, TH<br />
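The split of llefrith into six letters can be sketched as a greedy left-to-right scan. The JavaScript below is illustrative only (the project's own code is in C# and ActionScript), and it deliberately ignores ambiguous sequences such as n + g, which need exception look-up lists.<br />

```javascript
// Split a Welsh word into letters, treating the eight digraphs as single
// units. Greedy left-to-right scan; ambiguous cases such as the n + g of
// Bangor are beyond a simple scan and need exception look-up lists.
const DIGRAPHS = ['ch', 'dd', 'ff', 'ng', 'll', 'ph', 'rh', 'th'];

function splitWelshLetters(word) {
  const letters = [];
  let i = 0;
  while (i < word.length) {
    const pair = word.slice(i, i + 2).toLowerCase();
    if (DIGRAPHS.includes(pair)) {
      letters.push(word.slice(i, i + 2)); // consume the digraph whole
      i += 2;
    } else {
      letters.push(word[i]);
      i += 1;
    }
  }
  return letters;
}

// splitWelshLetters('llefrith') → ['ll', 'e', 'f', 'r', 'i', 'th'], six letters
```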
6.1.4 Digraph Problems<br />
The existence of digraphs in Welsh creates a number of problems:<br />
• Simple character count functions can’t be used to count letters.<br />
Due to the existence of digraphs, a function that simply counts the number of<br />
characters in a word cannot accurately count the number of letters in a Welsh<br />
word. Using a simple character count to create the XML six-letter word list would<br />
have excluded words such as llefrith that have six letters but more than six<br />
characters, and would erroneously have included words with fewer than six letters<br />
but six characters. In order to count the letters correctly, a digraph filter was<br />
created.<br />
Every word in the lexicon had to be passed through the filter. The filter identifies<br />
the character pairs that form Welsh digraphs (ch, dd, ff, ng, ll, ph, rh, th) and<br />
treats them as single characters. The number of letters in a word can then be counted<br />
correctly, so that only six-letter words are added to our XML list of six-letter words,<br />
whether they contain digraphs or not.<br />
Here is an example of the code used:<br />
public static int welshCharSplit(string word, ArrayList charArray)<br />
{<br />
    int letterCount = 0, x = 0;<br />
    charArray.Clear();<br />
    word = word.ToUpper();<br />
    string digraff = String.Empty;<br />
    for (x = 0; x < word.Length; x++)<br />
    {<br />
        digraff = String.Empty;<br />
        if (x > 0)<br />
        {<br />
            // Check for Ch, Ph, Rh, Th<br />
            if (word[x] == 'H')<br />
            {<br />
                if ((word[x-1] == 'C') || (word[x-1] == 'P') || (word[x-1] == 'R') || (word[x-1] == 'T'))<br />
                {<br />
                    digraff = word[x-1].ToString() + word[x].ToString();<br />
                }<br />
            }<br />
            // Check for Ng<br />
            if (word[x] == 'G')<br />
            {<br />
                if (word[x-1] == 'N')<br />
                {<br />
                    digraff = word[x-1].ToString() + word[x].ToString();<br />
                }<br />
            }<br />
            // Check for Dd, Ff, Ll<br />
            if ((word[x] == 'D') || (word[x] == 'F') || (word[x] == 'L'))<br />
            {<br />
                if (word[x-1] == word[x])<br />
                {<br />
                    digraff = word[x-1].ToString() + word[x].ToString();<br />
                }<br />
            }<br />
        }<br />
        string buff = String.Empty;<br />
        if (digraff != String.Empty)<br />
        {<br />
            // Replace the single character added on the previous pass with the digraph<br />
            charArray.RemoveAt(charArray.Count - 1);<br />
            buff = digraff;<br />
        }<br />
        else<br />
        {<br />
            buff = word[x].ToString();<br />
        }<br />
        charArray.Add(buff);<br />
    }<br />
    letterCount = charArray.Count;<br />
    return letterCount;<br />
}<br />
This creates a word list containing six-letter words, including those with digraphs.<br />
• Some combinations of characters can be a digraph or two separate<br />
letters.<br />
Bangor (a compound word of ban + côr): pronounced ‘n-g’<br />
angor (meaning anchor): pronounced ‘ng’<br />
Fortunately, digraph look-up lists had previously been developed at Canolfan<br />
Bedwyr in order to correctly sort dictionaries according to the Welsh alphabet (ng<br />
follows g in the Welsh alphabet, so ng and n are sorted quite differently). These<br />
look-up lists could then be used to prevent confusion between digraphs and similar<br />
character combinations.<br />
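Welsh-alphabet sorting of this kind can be sketched with a comparator over letter sequences. The JavaScript below is an illustration under the assumption that words have already been split into Welsh letters; it is not the actual Canolfan Bedwyr implementation.<br />

```javascript
// Sketch of comparing two words by the Welsh alphabet, in which each
// digraph is one letter and ng sorts immediately after g. Both arguments
// are arrays of Welsh letters (i.e. already split by a digraph filter).
const ORDER = ['a', 'b', 'c', 'ch', 'd', 'dd', 'e', 'f', 'ff', 'g', 'ng',
               'h', 'i', 'j', 'l', 'll', 'm', 'n', 'o', 'p', 'ph', 'r',
               'rh', 's', 't', 'th', 'u', 'w', 'y'];

function welshCompare(a, b) {
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    const d = ORDER.indexOf(a[i]) - ORDER.indexOf(b[i]);
    if (d !== 0) return d; // first differing letter decides the order
  }
  return a.length - b.length; // shorter word sorts first
}

// welshCompare(['ng'], ['g']) > 0 and welshCompare(['ng'], ['h']) < 0:
// ng falls between g and h, unlike in English sorting.
```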
• Inputting letters using the keyboard becomes more complicated.<br />
Designing an online interface that can differentiate elegantly between an inputted<br />
d and an inputted dd is a challenge. Unicode text in common use has no single-<br />
character representation of the Welsh digraphs, and no specific Welsh keyboard has digraph keys (UK English<br />
QWERTY keyboards are generally used). In practice this makes a keyboard-based<br />
approach to inputting awkward, especially when playing against the clock as in many<br />
of the Ieithgi games.<br />
It was decided that a visual interface would be devised in order to allow the user to<br />
input these characters quickly and efficiently. This came in the form of an on-screen<br />
keyboard featuring all the letters of the Welsh alphabet. Although the keyboard takes<br />
up some of the game’s screen space, it gives the player valuable feedback such as<br />
which letters have been chosen and which letters remain, as well as serving as a<br />
visual reminder to users more familiar with the English alphabet that Welsh considers<br />
digraphs to be single letters.<br />
The on-screen keyboard is used in Cybolfa, Diarhebol and Pos Croeseiriau in addition<br />
to Dic Penderyn, shown below:<br />
Figure 2: Screenshot showing Dic Penderyn after an Unsuccessful Attempt.<br />
6.2 Cybolfa<br />
The digraphs pose another problem when generating the Welsh words for Cybolfa,<br />
a game where the player must attempt to create words from a jumbled set of letters.<br />
Cybolfa uses the same six-letter XML word list as Dic Penderyn to supply the main<br />
six-letter game word. However, Cybolfa must then scramble the word so that it is<br />
difficult for the player to recognise. In English, this could be done fairly quickly by<br />
scrambling each individual character in a word. The same method cannot be applied to Welsh words because of the existence of digraphs; the word must therefore be passed through a filter that identifies the Welsh alphabetical letters in the word before scrambling it. To return to the earlier example of llefrith, an ActionScript digraph filter within the Flash file identifies the digraphs as distinct letters, so that all six letters can be identified (LL, E, F, R, I, TH).
The function welshFilter below, written in ActionScript 2.0 in Flash MX, receives the word in the form of an array, checks it for digraphs, and returns the length of the word. If a digraph is found, its two characters are merged into a single element of the array.
_global.welshFilter = function(WordArray) {
    for (x = 1; x < WordArray.length; x++) {
        //Check for Rh, Th, Ph, Ch
        if (WordArray[x] == "H") {
            if ((WordArray[x-1] == "R") || (WordArray[x-1] == "T") ||
                (WordArray[x-1] == "P") || (WordArray[x-1] == "C")) {
                WordArray[x-1] = WordArray[x-1] + WordArray[x];
                WordArray.splice(x, 1);
            }
        }
        //Check for Ng
        if (WordArray[x] == "G") {
            if (WordArray[x-1] == "N") {
                WordArray[x-1] = WordArray[x-1] + WordArray[x];
                WordArray.splice(x, 1);
            }
        }
        //Check for Dd, Ff, Ll
        if ((WordArray[x] == "D") || (WordArray[x] == "F") || (WordArray[x] == "L")) {
            if (WordArray[x-1] == WordArray[x]) {
                WordArray[x-1] = WordArray[x-1] + WordArray[x];
                WordArray.splice(x, 1);
            }
        }
    }
    return (WordArray.length);
};
Once both digraphs and single-character letters have been identified as single<br />
elements, the word can be scrambled and displayed to the player in an unfamiliar<br />
letter order whilst still retaining the digraph integrity (TH, F, R, LL, I, E).<br />
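The scramble step itself then operates on the filtered array rather than on raw characters. A minimal Python sketch (not the game's ActionScript) of shuffling llefrith without splitting its digraphs:

```python
import random

# llefrith after digraph filtering: LL and TH are single elements,
# so no shuffle can ever split them across separate squares.
letters = ["LL", "E", "F", "R", "I", "TH"]
random.shuffle(letters)
print("".join(letters))  # e.g. THFRLLIE
```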
6.3 Anagram Maker<br />
As described previously, the word list for the Cybolfa games is derived from the Dic<br />
Penderyn word list. Each time Cybolfa is played, a random six-letter word is drawn<br />
from the list and an anagram maker within the actionscript code generates a list of<br />
all possible anagrams for that word. This is achieved by cross-referencing Canolfan<br />
Bedwyr’s Welsh spellchecker list with the original word’s possible letter combinations.<br />
In programming terms, this is done by a one-to-one mapping of letter values to prime<br />
numbers, allowing words to be represented as composite numbers by multiplying<br />
together the primes that map each letter in the word. Words formed from the same<br />
letters, regardless of order, will then map to the same composite number. Therefore,<br />
if a word’s number divides exactly into another word’s, the first word’s letters must<br />
all appear in the second word.<br />
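The scheme can be sketched in Python as follows (an illustration of the technique described above, not the project's actual code; the assignment of primes to the 29 letters of the Welsh alphabet is an assumption for the example):

```python
# Map each of the 29 Welsh letters (digraphs included) to a prime.
# The ordering follows the Welsh alphabet; the concrete assignment
# is illustrative, not the project's actual table.
WELSH_ALPHABET = ["A", "B", "C", "CH", "D", "DD", "E", "F", "FF", "G",
                  "NG", "H", "I", "J", "L", "LL", "M", "N", "O", "P",
                  "PH", "R", "RH", "S", "T", "TH", "U", "W", "Y"]
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53,
          59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109]
LETTER_PRIME = dict(zip(WELSH_ALPHABET, PRIMES))

def word_value(letters):
    """Multiply the primes of a digraph-filtered word's letters."""
    value = 1
    for letter in letters:
        value *= LETTER_PRIME[letter]
    return value

def is_sub_anagram(candidate, game_word):
    """candidate's letters all occur in game_word (with multiplicity)
    exactly when candidate's product divides game_word's product."""
    return word_value(game_word) % word_value(candidate) == 0

game = ["G", "W", "E", "L", "W", "I"]              # gwelwi, digraph-filtered
print(is_sub_anagram(["G", "L", "E", "W"], game))  # True  (glew)
print(is_sub_anagram(["E", "L", "W"], game))       # True  (elw)
print(is_sub_anagram(["LL", "I"], game))           # False (no LL in gwelwi)
```

Because multiplication is commutative, the letter order is irrelevant, which is exactly what an anagram test requires.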
For example, take the word gwelwi.
By looking up the spellchecker list and using the anagram checker function, the following list is generated in XML:

GWELWI
GLIW
ELI
ELW
EWIG
GWELW
GLEW
GWIWI
IGLW
This list is then used as the content of a game in Cybolfa. Below is a screenshot of a completed game where polisi was the six-letter game word.
Figure 3: Screenshot of Cybolfa Showing all Possible Anagrams.
6.4 Diarhebol
Diarhebol is in essence very similar to Dic Penderyn, the main difference being<br />
that rather than guessing a random six-letter word, the player must attempt to guess<br />
a Welsh proverb. Once again, players have a limited number of chances to achieve<br />
their objective before the game ends. If needed, a clue is provided in the form of an<br />
English translation of the proverb, and, whilst guessing a whole proverb may at first<br />
seem daunting, the higher probability of a sentence as opposed to a word containing<br />
a specific letter ensures that the game is of a similar level of difficulty.<br />
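The intuition behind that last claim can be checked with a rough count of distinct characters (an illustration only, using raw characters rather than digraph-filtered Welsh letters):

```python
word = "LLEFRITH"
proverb = "YR HEN A WYR YR IEUANC A DYBIA"

word_letters = set(word)                # distinct characters in one word
proverb_letters = set(proverb) - {" "}  # distinct characters in a proverb

print(len(word_letters), len(proverb_letters))  # 7 12
```

A random letter guess is therefore considerably more likely to hit somewhere in the proverb than in the single word.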
An XML proverb list replaces the XML word list used by both Dic Penderyn and<br />
Cybolfa, and an example is shown below.<br />
Yr afal mwyaf yw’r pydraf ei galon
(The biggest apple has the rottenest heart)

Yr euog a ffy heb neb yn ei erlid
(The guilty flees when no-one chases him)

Yr hen a wyr, yr ieuanc a dybia
(The old know, the young suppose)

A fo’n ddigwilydd a fo’n ddigolled
(The shameless will be without loss)
Figure 4: Screenshot of Successful Attempt at Diarhebol
7. Closed Games
Unlike the open-ended games, which draw their content from a list, the content for closed games must be created manually in advance, because of its more involved nature.
7.1 Pos Croeseiriau<br />
Pos Croeseiriau is an online Welsh crossword puzzle. Crossword puzzles have been<br />
popular for some time in Welsh language publications such as local papers, where<br />
the custom of representing digraphs as a single letter within a single square has<br />
long been established. Due to the complexity of creating crosswords, both the clues and the answers have been hardcoded. However, Cysgeir, Canolfan Bedwyr's electronic dictionary, can be used to aid the creation of crosswords, as it can suggest words that contain specific letters in specific positions within a word.
Take for instance a situation where the crossword designer has decided on the two<br />
words cyfarth and cynffon to form the answers for 1 and 2 across, and needs a word<br />
that will fit in the space for 1 down:<br />
Figure 5: The crossword grid with cyfarth (1 across) and cynffon (2 across) in place

       ¹_
¹C  Y  F  A  R  TH
²C  Y  N  FF  O  N
The designer only has to type ??F??N??? into Cysgeir (where '?' represents an empty square) to be provided with a list of compatible words. In this case Cysgeir provides 32 different words that fulfil the requirements, of which elfennol ('elementary') is chosen.
Figure 6: The completed grid, with elfennol (1 down) crossing cyfarth and cynffon

       ¹E
        L
¹C  Y  F  A  R  TH
        E
        N
²C  Y  N  FF  O  N
        O
        L
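A Cysgeir-style wildcard look-up can be sketched in Python (an assumed illustration, not Cysgeir's implementation; the naive longest-match splitter below would, as noted earlier, need look-up lists to handle exceptions such as Bangor):

```python
DIGRAPHS = ("CH", "DD", "FF", "NG", "LL", "PH", "RH", "TH")

def welsh_letters(word):
    """Split an upper-case word into Welsh letters, trying digraphs first."""
    letters, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            letters.append(word[i:i + 2])
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters

def matches(pattern, word):
    """pattern is a list of Welsh letters where '?' is an empty square."""
    letters = welsh_letters(word)
    return (len(letters) == len(pattern) and
            all(p == "?" or p == l for p, l in zip(pattern, letters)))

# A six-square slot whose last square must hold N: cynffon fits
# (C Y N FF O N is six Welsh letters), cyfarth does not (it ends in TH).
pattern = ["?", "?", "?", "?", "?", "N"]
print(matches(pattern, "CYNFFON"))  # True
print(matches(pattern, "CYFARTH"))  # False
```

Matching on the digraph-filtered letters, rather than raw characters, is what lets one pattern square stand for a two-character digraph.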
Pos Croeseiriau uses a slightly modified version of the on-screen keyboard found
in the open-ended games to give players the ability to delete unwanted letters, and<br />
the resulting interface is simple and easy to use despite the complications caused by<br />
the Welsh digraphs.<br />
Figure 7: Pos Croeseiriau Screenshot Showing a Completed Game.<br />
7.2 Rhoi Cymru yn ei Lle<br />
Rhoi Cymru yn ei Lle was designed as a game that would educate people as to<br />
the geographical location of Welsh place-names. Players must attempt to drag a<br />
place-name to its correct position on a map, with themed clues relevant to each<br />
place providing some assistance. There are various themes, including sport, religion,<br />
culture, and history, so that the player learns a little about different aspects of their<br />
country as they play, and gain satisfaction from being able to locate an unfamiliar<br />
place on a map.<br />
When creating the content, Cronfa Archif Melville Richards and Enwau Cymru (both developed by Canolfan Bedwyr) were invaluable in the identification and placement of place-names and their associated clues. Cronfa Archif Melville Richards
is a fully searchable online database of historic Welsh place-name forms that contains<br />
location information and grid references, whilst Enwau Cymru is an online database of<br />
modern Welsh place-names dealing in particular with bilingual place-names and again<br />
giving location information. As with Pos Croeseiriau, due to its complexity, the game<br />
content is coded into the game itself.<br />
7.3 Ystyrlon<br />
Figure 8: Ystyrlon Screenshot Showing a Game in Progress<br />
Ystyrlon is similar to the popular game Call my Bluff in that the player is given an uncommon word (hopefully unfamiliar to them), and is then asked to guess
the correct definition from a choice of three. From a technical viewpoint, this is a<br />
very simple game, the hard work being the creation of original content, choosing the<br />
unfamiliar words, and creating humorous and misleading definitions that will entertain<br />
those who play the game and keep them on their toes.<br />
As the content, once created, is quite simple, it is stored as an XML file that is<br />
then referenced by the Flash game file. This aids the production of new games, as it<br />
enables the creation of new content without having to use or understand the Flash<br />
programming application.<br />
Figure 9: Screenshot of Ystyrlon Following an Incorrect Guess
8. Results
Usually, academic establishments do not undertake commercial projects such as<br />
Ieithgi, concentrating on research that can then be exploited and taken forward by<br />
the private sector. However, in a minority language situation, the technical expertise<br />
and experience needed to create such language-specific products may not exist in<br />
the private sector, or the financial returns may not be high enough to justify the<br />
investment of time and money. In such a situation, centres such as Canolfan Bedwyr<br />
that see their goal as catering to the needs of a modern, living minority language,<br />
must sometimes fulfil both roles if the language is ever to see such products.<br />
The successful realisation of such a product has been one positive result of this<br />
venture.<br />
The Ieithgi project has also led to the creation of new digital resources, including a<br />
Welsh anagram maker and digraph filter, as well as a process for integrating resources<br />
through XML into Flash; these add to and enhance the resources available to Canolfan<br />
Bedwyr for future projects.<br />
The need to repackage existing digital resources to facilitate further reuse as<br />
part of future projects has also been identified, leading to a new programme of<br />
modularization of lexical components for future projects.<br />
A sure sign of a successful product is that it leads to further commissions, and the success of Ieithgi has resulted in a further commission to develop a similar set
of stealth educational Welsh language online games targeted at adults with below<br />
average literacy.<br />
It is hoped by Canolfan Bedwyr that the Ieithgi project will serve as an example<br />
of how to make a little go a long way, and that building up language resources and<br />
corpora can benefit a minority language in more ways than by producing dictionaries<br />
and spellcheckers, allowing existing resources to stretch further.<br />
References

“Archif Melville Richards Historical Place-name Database.” Online at http://www.bangor.ac.uk/amr.
“Canolfan Bedwyr Website.” Online at http://www.bangor.ac.uk/ar/cb/.
“Colin and Cumberland Website.” Online at
http://www.bbc.co.uk/colinandcumberland/.<br />
Corpws Electronig o Gymraeg (CEG). “A 1 Million Word Lexical Database and Frequency Count for Welsh.” Online at http://www.bangor.ac.uk/ar/cb/ceg/ceg_eng.html.
“Cysgeir Electronic Dictionary Information Website.” Online at
http://www.bangor.ac.uk/ar/cb/meddalwedd_cysgair.php.<br />
Davies, G. (2005). “Beginnings, New Media and the Welsh Language.” North American<br />
Journal of Welsh Studies, 5(1).<br />
“Enwau Cymru Modern Place-name Database.” Online at
http://www.e-gymraeg.org/enwaucymru.<br />
Hicks, W.J. (2004). “Welsh Proofing Tools: Making a Little NLP go a Long Way.”<br />
Proceedings of the 1st Workshop on International Proofing Tools and Language<br />
Technologies. Greece: University of Patras.<br />
“Learn Welsh - The BBC’s Website for Welsh Learners.” Online at
http://www.bbc.co.uk/wales/learnwelsh/.<br />
Prys, D. & Morgan, M. (2000). “E-Celtic Language Tools.” The Information Age, Celtic Languages and the New Millennium. Ireland: University of Limerick.
“The Ieithgi Website.” Online at http://www.bbc.co.uk/cymru/lieithgi/.
Alphabetical list of authors & titles with keywords

Victoria Arranz (& Elisabet Comelles, David Farwell)
Speech-to-Speech Translation for Catalan
Keywords: Catalan, Multilingualism, Speech-to-Speech Translation,<br />
Interlingua.<br />
Ermenegildo Bidese (& Cecilia Poletto, Alessandra Tomaselli)<br />
The relevance of lesser used languages for theoretical linguistics: the case of<br />
Cimbrian and the support of the TITUS corpus<br />
Keywords: Cimbrian, clitics, Wackernagelposition, Agreement, TITUS.<br />
Evelyn Bortolotti (& Sabrina Rasom)<br />
Il ladino fra polinomia e standardizzazione: l’apporto della linguistica<br />
computazionale<br />
Keywords: lessicografia, terminologia, corpus testuale, correttore ortografico,<br />
strumenti per la standardizzazione.<br />
Sonja E. Bosch (& Elsabé Taljard)<br />
A Comparison of Approaches towards Word Class Tagging: Disjunctively vs<br />
Conjunctively Written Bantu Languages<br />
Keywords: word class tagging, Bantu languages, disjunctive writing system,<br />
conjunctive writing system, morphological analyser, disambiguation rules,<br />
tagsets.<br />
Ambrose Choy (& Gruffudd Prys)<br />
Stealth Learning with an on-line dog: Web-based Word Games for Welsh
Keywords: Stealth learning, Welsh, on-line games.<br />
Elisabet Comelles (& Victoria Arranz, David Farwell)<br />
Speech-to-Speech Translation for Catalan<br />
Keywords: Catalan, Multilingualism, Speech-to-Speech Translation,<br />
Interlingua.<br />
David Farwell (& Elisabet Comelles, Victoria Arranz)<br />
Speech-to-Speech Translation for Catalan<br />
Keywords: Catalan, Multilingualism, Speech-to-Speech Translation,<br />
Interlingua.<br />
Olya Gurevich<br />
Computing Non-Concatenative Morphology: the Case of Georgian<br />
Keywords: computational linguistics, morphology, Georgian, non-<br />
concatenative, construction grammar.<br />
Ulrich Heid (& Danie Prinsloo)<br />
Creating word class tagged corpora for Northern Sotho by linguistically informed<br />
bootstrapping<br />
Keywords: POS-tagger, Bantu languages, tagger lexicon, tagging reference corpus.
Dewi Jones (& Delyth Prys)<br />
The Welsh National On-line Database<br />
Keywords: terminology standardization, Welsh, termbases, terminology markup<br />
framework.<br />
Cecilia Poletto (& Ermenegildo Bidese, Alessandra Tomaselli)<br />
The relevance of lesser used languages for theoretical linguistics: the case of<br />
Cimbrian and the support of the TITUS corpus<br />
Keywords: Cimbrian, clitics, Wackernagelposition, Agreement, TITUS.<br />
Danie Prinsloo (& Ulrich Heid)<br />
Creating word class tagged corpora for Northern Sotho by linguistically informed<br />
bootstrapping<br />
Keywords: POS-tagger, Bantu languages, tagger lexicon, tagging reference corpus.
Luca Panieri<br />
Il progetto “Zimbarbort” per il recupero del patrimonio linguistico cimbro<br />
Keywords: cimbro, lessico, patrimonio linguistico.<br />
Delyth Prys (& Dewi Jones)<br />
The Welsh National On-line Database<br />
Keywords: terminology standardization, Welsh, termbases, terminology markup<br />
framework.<br />
Gruffudd Prys (& Ambrose Choy)<br />
Stealth Learning with an on-line dog: Web-based Word Games for Welsh
Keywords: Stealth learning, Welsh, on-line games.<br />
Nicoletta Puddu<br />
Un corpus per il sardo: problemi e perspettive<br />
Keywords: corpus planning, corpus design, Sardinian, non-standardized languages,
XML.<br />
Sabrina Rasom (& Evelyn Bortolotti)<br />
Il ladino fra polinomia e standardizzazione: l’apporto della linguistica<br />
computazionale<br />
Keywords: Lessicografia, terminologia, corpus testuale, correttore ortografico,<br />
strumenti per la standardizzazione.<br />
Soufiane Rouissi (& Ana Stulic)<br />
Annotation of Documents for Electronic Edition of Judeo-Spanish Texts: Problems
and Solutions
Keywords: electronic corpus, Judeo-Spanish, collaborative production, digital<br />
document.<br />
Clau Solèr<br />
Spracherneuerung im Rätoromanischen: Linguistische, soziale und politische<br />
Aspekte<br />
Oliver Streiter<br />
Implementing NLP-Projects for Small Languages: Instructions for Funding Bodies,<br />
Strategies for Developers<br />
Oliver Streiter (& Mathias Stuflesser)<br />
XNLRDF, A Framework for the Description of Natural Language Resources. A proposal<br />
and first implementation<br />
Keywords: XNLRDF, metadata, writing system, Unicode, encoding.<br />
Mathias Stuflesser (& Oliver Streiter)<br />
XNLRDF, A Framework for the Description of Natural Language Resources. A proposal<br />
and first implementation<br />
Keywords: XNLRDF, metadata, writing system, Unicode, encoding.<br />
Ana Stulic (& Soufiane Rouissi)
Annotation of Documents for Electronic Edition of Judeo-Spanish Texts: Problems
and Solutions
Keywords: electronic corpus, Judeo-Spanish, collaborative production, digital<br />
document.<br />
Elsabé Taljard (& Sonja E. Bosch)<br />
A Comparison of Approaches towards Word Class Tagging: Disjunctively vs<br />
Conjunctively Written Bantu Languages<br />
Keywords: word class tagging, Bantu languages, disjunctive writing system,<br />
conjunctive writing system, morphological analyser, disambiguation rules,<br />
tagsets.<br />
Alessandra Tomaselli (& Ermenegildo Bidese, Cecilia Poletto)<br />
The relevance of lesser used languages for theoretical linguistics: the case of<br />
Cimbrian and the support of the TITUS corpus<br />
Keywords: Cimbrian, clitics, Wackernagelposition, Agreement, TITUS.<br />
Trond Trosterud<br />
Grammar-based language technology for the Sámi languages<br />
Keywords: Sámi, transducers, disambiguation, language technology, minority<br />
languages.<br />
Chinedu Uchechukwu<br />
The Igbo Language and Computer Linguistics: Problems and Prospects<br />
Keywords: language technology, lexicography, computer linguistics, linguistic<br />
tools.<br />
Ivan Uemlianin<br />
SpeechCluster: a speech database builder’s multitool<br />
Keywords: annotation, speech data, Welsh, Irish, open-source.<br />
Alphabetical list of contributors<br />
& contact addresses
Victoria Arranz<br />
ELDA-Evaluation and Language<br />
Resources Distribution Agency<br />
arranz@elda.org<br />
Evelyn Bortolotti<br />
Istitut Cultural Ladin “majon di<br />
fascegn”<br />
rep.ling@istladin.net<br />
Ambrose Choy<br />
Canolfan Bedwyr<br />
University of Wales
a.choy@bangor.ac.uk<br />
David Farwell<br />
Institució Catalana de Recerca i Estudis
Avançats TALP-Centre de Tecnologies i<br />
Aplicacions del Llenguatge i la Parla<br />
Universitat Politècnica de Catalunya<br />
farwell@lsi.upc.edu<br />
Ulrich Heid<br />
IMS-CL, Institut für maschinelle<br />
Sprachverarbeitung<br />
Universität Stuttgart
uli@ims.uni-stuttgart.de<br />
Luca Panieri<br />
Istituto Cimbro di Luserna<br />
luca.panieri@fastwebnet.it<br />
Danie Prinsloo<br />
Department of African Languages<br />
University of Pretoria<br />
danie.prinsloo@up.ac.za<br />
Ermenegildo Bidese<br />
Università di Verona/ Philosophisch-<br />
Theologische Hochschule Brixen<br />
ebidese@lingue.univr.it<br />
Sonja E. Bosch<br />
University of South Africa<br />
boschse@unisa.ac.za<br />
Elisabet Comelles<br />
TALP-Centre de Tecnologies i<br />
Aplicacions del Llenguatge i la Parla<br />
Universitat Politècnica de Catalunya<br />
comelles@lsi.upc.edu<br />
Olya Gurevich<br />
UC Berkeley<br />
olya@berkeley.edu<br />
Dewi Jones<br />
Language Technologies<br />
Canolfan Bedwyr<br />
University of Wales, Bangor<br />
d.b.jones@bangor.ac.uk<br />
Cecilia Poletto<br />
Padova-CNR<br />
cecilia.poletto@unipd.it<br />
Delyth Prys<br />
Canolfan Bedwyr<br />
University of Wales
d.prys@bangor.ac.uk<br />
Gruffudd Prys<br />
Language Technologies<br />
Canolfan Bedwyr<br />
University of Wales, Bangor<br />
g.prys@bangor.ac.uk<br />
Sabrina Rasom<br />
Istitut Cultural Ladin “majon di<br />
fascegn” (ICL)<br />
lengaz@istladin.net<br />
Clau Soler<br />
Universität Genf<br />
clau.soler@bluewin.ch<br />
Ana Stulic<br />
University of Bordeaux 3 AMERIBER<br />
etchevers@tele2.fr<br />
Elsabé Taljard<br />
University of Pretoria<br />
elsabe.taljard@up.ac.za<br />
Trond Trosterud<br />
Universitetet i Tromsø<br />
trond.trosterud@hum.uit.no<br />
Ivan Uemlianin<br />
Language Technologies<br />
Canolfan Bedwyr<br />
University of Wales, Bangor<br />
i.uemliani@bangor.ac.uk<br />
Nicoletta Puddu<br />
University of Pavia<br />
attel76@hotmail.com<br />
Soufiane Rouissi<br />
University of Bordeaux 3 CEMIC-GRESIC<br />
Soufiane.Rouissi@u-bordeaux3.fr<br />
Oliver Streiter<br />
National University of Kaohsiung<br />
ostreiter@nuk.edu.tw<br />
Mathias Stuflesser<br />
European Academy of Bolzano<br />
mstuflesser@eurac.edu<br />
Alessandra Tomaselli<br />
Università di Verona<br />
alessandra.tomaselli@univr.it<br />
Chinedu Uchechukwu<br />
Universität Bamberg, Germany
neduchi@netscape.net<br />