The Workshop Programme

Monday, May 26
14:30-15:00 Opening. Nancy Ide and Adam Meyers
15:00-15:30 SIGANN Shared Corpus Working Group Report. Adam Meyers
15:30-16:00 Discussion: SIGANN Shared Corpus Task
16:00-16:30 Coffee break
16:30-17:00 Towards Best Practices for Linguistic Annotation. Nancy Ide, Sameer Pradhan, Keith Suderman
17:00-18:00 Discussion: Annotation Best Practices
18:00-19:00 Open Discussion and SIGANN Planning

Tuesday, May 27
09:00-09:10 Opening. Nancy Ide and Adam Meyers
09:10-09:50 From structure to interpretation: A double-layered annotation for event factuality. Roser Saurí and James Pustejovsky
09:50-10:30 An Extensible Compositional Semantics for Temporal Annotation. Harry Bunt, Chwhynny Overbeeke
10:30-11:00 Coffee break
11:00-11:40 Using Treebank, Dictionaries and GLARF to Improve NomBank Annotation. Adam Meyers
11:40-12:20 A Dictionary-based Model for Morpho-Syntactic Annotation. Cvetana Krstev, Svetla Koeva, Duško Vitas
12:20-12:40 Multiple Purpose Annotation using SLAT - Segment and Link-based Annotation Tool (DEMO). Masaki Noguchi, Kenta Miyoshi, Takenobu Tokunaga, Ryu Iida, Mamoru Komachi, Kentaro Inui
12:40-14:30 Lunch
14:30-15:10 Using inheritance and coreness sets to improve a verb lexicon harvested from FrameNet. Mark McConville and Myroslava O. Dzikovska
15:10-15:50 An Entailment-based Approach to Semantic Role Annotation. Voula Gotsoulia
16:00-16:30 Coffee break
16:30-16:50 A French Corpus Annotated for Multiword Expressions with Adverbial Function. Eric Laporte, Takuya Nakamura, Stavroula Voyatzi
16:50-17:20 On Construction of Polish Spoken Dialogs Corpus. Agnieszka Mykowiecka, Krzysztof Marasek, Małgorzata Marciniak, Joanna Rabiega-Wisniewska, Ryszard Gubrynowicz
17:20-17:40 A RESTful interface to Annotations on the Web. Steve Cassidy
17:40-18:30 Panel: Next Challenges in Annotation Theory and Practice
18:30-19:00 Open Discussion

Workshop Organisers
Nancy Ide, Vassar College (co-chair)
Adam Meyers, New York University (co-chair)
Inderjeet Mani, MITRE Corporation
Antonio Pareja-Lora, SIC, UCM / OEG, UPM
Sameer Pradhan, BBN Technologies
Manfred Stede, Universität Potsdam
Nianwen Xue, University of Colorado
Workshop Programme Committee
David Ahn, Powerset
Lars Ahrenberg, Linköping University
Timothy Baldwin, University of Melbourne
Francis Bond, NICT
Kalina Bontcheva, University of Sheffield
Matthias Buch-Kromann, Copenhagen Business School
Paul Buitelaar, DFKI
Jean Carletta, University of Edinburgh
Christopher Cieri, Linguistic Data Consortium/University of Pennsylvania
Hamish Cunningham, University of Sheffield
David Day, MITRE Corporation
Thierry Declerck, DFKI
Ludovic Denoyer, University of Paris 6
Richard Eckart, Darmstadt University of Technology
Tomaž Erjavec, Jožef Stefan Institute
David Farwell, New Mexico State University
Alex Chengyu Fang, City University Hong Kong
Chuck Fillmore, International Computer Science Institute
John Fry, San Jose State University
Claire Grover, University of Edinburgh
Eduard Hovy, Information Sciences Institute
Baden Hughes, University of Melbourne
Emi Izumi, NICT
Aravind Joshi, University of Pennsylvania
Ewan Klein, University of Edinburgh
Mike Maxwell, University of Maryland
Stephan Oepen, University of Oslo
Martha Palmer, University of Colorado
Manfred Pinkal, Saarland University
James Pustejovsky, Brandeis University
Owen Rambow, Columbia University
Laurent Romary, Max-Planck Digital Library
Erik Tjong Kim Sang, University of Amsterdam
Graham Wilcock, University of Helsinki
Theresa Wilson, University of Edinburgh

Table of Contents

Long papers
From structure to interpretation: A double-layered annotation for event factuality. Roser Saurí, James Pustejovsky ... 1
An Extensible Compositional Semantics for Temporal Annotation. Harry Bunt, Chwhynny Overbeeke ... 9
Using Treebank, Dictionaries and GLARF to Improve NomBank Annotation. Adam Meyers ... 17
A Dictionary-based Model for Morpho-Syntactic Annotation. Cvetana Krstev, Svetla Koeva, Duško Vitas ... 25
Using inheritance and coreness sets to improve a verb lexicon harvested from FrameNet. Mark McConville, Myroslava O. Dzikovska ... 33
An Entailment-based Approach to Semantic Role Annotation. Voula Gotsoulia ... 41

Short papers
A French Corpus Annotated for Multiword Expressions with Adverbial Function. Eric Laporte, Takuya Nakamura, Stavroula Voyatzi ... 48
On Construction of Polish Spoken Dialogs Corpus. Agnieszka Mykowiecka, Krzysztof Marasek, Małgorzata Marciniak, Joanna Rabiega-Wisniewska, Ryszard Gubrynowicz ... 52
A RESTful interface to Annotations on the Web. Steve Cassidy ... 56

Demonstration
Multiple Purpose Annotation using SLAT - Segment and Link-based Annotation Tool. Masaki Noguchi, Kenta Miyoshi, Takenobu Tokunaga, Ryu Iida, Mamoru Komachi, Kentaro Inui ... 61
"# Author Index Bunt, Harry 9 Cassidy, Steve 56 Dzikovska, Myroslava O. 33 Gotsoulia, Voula 41 Gubrynowicz, Ryszard 52 Iida, Ryu 61 Inui, Kentaro 61 Koeva, Svetla 25 Komachi, Mamoru 61 Krstev, Cvetana 25 Laporte, Eric 48 Marasek, Krzysztof 52 Marciniak, Ma!gorzata 52 McConville, Mark 33 Meyers, Adam 17 Miyoshi, Kenta 61 Mykowiecka, Agnieszka 52 Nakamura, Takuya 48 Noguchi, Masaki 61 Overbeeke, Chwhynny 9 Pustejovsky, James 1 Rabiega-Wisniewska, Joanna 52 Saurí, Roser 1 Tokunaga, Takenobu 61 Vitas, Du"ko 25 Voyatzi, Stavroula 48 ! ! # From structure to interpretation: A double-layered annotation for event factuality Roser Saurı́ and James Pustejovsky Lab for Linguistics and Computation Computer Science Department Brandeis University {roser,jamesp}@cs.brandeis.edu Abstract Current work from different areas in the field points out the need for systems to be sensitive to the factuality nature of events mentioned in text; that is, to recognize whether events are presented as corresponding to real situations in the world, situations that have not happened, or situations of uncertain status. Event factuality is a necessary component for representing events in discourse, but for annotation purposes it poses a representational challenge because it is expressed through the interaction of a varied set of structural markers. Part of these factuality markers is already encoded in some of the existing corpora but always in a partial way; that is, missing an underlying model that is capable of representing the factuality value resulting from their interaction. In this paper, we present FactBank, a corpus of events annotated with factuality information which has been built on top of TimeBank. Together, TimeBank and FactBank offer a double-layered annotation of event factuality: where TimeBank encodes most of the basic structural elements expressing factuality information, FactBank adds a representation of the resulting factuality interpretation. 1. Introduction In the past decade, most efforts towards corpus construction have been devoted to encoding a variety of semantic information structures. For example, much work has gone to annotating the basic units that configure propositions (PropBank, FrameNet) and the relations these hold at the discourse level (RST Corpus, Penn Discourse TreeBank, GraphBank), as well as specific knowledge that has proved fundamental in tasks requiring some degree of text understanding, such as temporal information (TimeBank) and opinion expressions (MPQA Opinion Corpus).1 The field is moving now towards finding platforms for unifying them in an optimal way –e.g., Pradhan et al. (2007); Verhagen et al. (2007). It therefore seems we are at a point where the first elements for text understanding can be brought together. Nonetheless, current work from different areas in the field points out the need for systems to be sensitive to an additional level of information; namely, that conveying whether events in text are presented as corresponding to real situations in the world, situations that have not happened, or situations of uncertain status. We refer to this level as event factuality. The need for this further type of information is demonstrated in highly domain-oriented disciplines such as bioinformatics (Light et al., 2004), as well as more genreoriented tasks. For example, Karttunen & Zaenen (2005) discusses the relevance of veridicity for IE. 
Factuality is also critical in the area of opinion detection (Wiebe et al., 2005), given that the same situation can be presented as a fact in the world, a mere possibility, or a counterfact according to different sources. And in the scope of textual entailment, it has been taken as a basic feature in some of the systems participating in (or using the data from) previous PASCAL RTE challenges. For example, Tatu & Moldovan (2005) treat intensional contexts, de Marneffe et al. (2006) look at features accounting for the presence of polarity, modality, and factivity markers in the textual fragments, while Snow & Vanderwende (2006) check for polarity and modality scoping over matching nodes in a graph. Most significantly, the system that obtained the best absolute result in the three RTE challenges, scoring 80% accuracy (Hickl & Bensley, 2007), is based on identifying the set of publicly expressed beliefs of the author; that is, the author's commitments about how things are in the world according to what is expressed in the text, whether asserted, presupposed, or implicated.
Event factuality is a necessary component for representing events in discourse, together with other levels of information such as argument structure or temporal information. Inferences derived from events that have not happened, or that are only possible, are different from those derived from events judged as factual in nature. Factuality is, for instance, basic for temporally ordering the events in a given text. For annotation purposes, however, it poses a representational challenge. The factuality of events is expressed through the interaction of elements from different linguistic categories. It involves, for instance, polarity (events can be presented as positive or negative) as well as modality: epistemic modality, for example, expresses the degree of certainty of a source about what is asserted, and events qualified with other types of modality are generally presented as mere possibilities. Other information at play is evidentiality (e.g., a seen event is presented with a factuality degree stronger than that of an event reported by somebody else) or mood (e.g., indicative vs. subjunctive). Factuality is also a component in the semantics of specific syntactic structures with presuppositional effects (e.g., appositions and relative clauses), as well as of certain types of predicates, most notoriously the so-called factive and implicative predicates, but also others; compare, for instance, the effect that decision in (1a) and refusal in (1b) have on the factuality status of the underlined event.
(1) a. A senior Russian politician has hailed a decision by Uzbekistan to shut down a United States military base.
    b. A senior Russian politician has hailed the refusal by Uzbekistan to shut down a United States military base.
Some of these factuality markers are already encoded in existing corpora (for example, TimeBank annotates polarity particles and modality operators, as well as the aforementioned predicates), but always in a partial way; that is, missing an underlying model capable of representing the factuality value that results from their interaction.
In this paper, we introduce FactBank, a corpus of events annotated with factuality information which has been built on top of TimeBank. Together, TimeBank and FactBank offer a double-layered annotation of event factuality: the former encodes most of the basic structural elements expressing factuality information, whereas the latter represents the resulting factuality interpretation.
In the next section, we set the linguistic grounding of our work by defining event factuality as a semantic property of events, establishing its possible values, and identifying its structural markers. Section 3 then presents the main challenges for automatically recognizing it, which motivate the double-layered corpus annotation. We review some of the existing corpora where this information has already been annotated in section 4. Finally, section 5 focuses on FactBank, which is evaluated in section 6.

2. Linguistic foundations
2.1. What is event factuality
Eventualities in discourse can be couched in terms of a veridicality axis that ranges from truly factual to counterfactual, passing through a whole spectrum of degrees of modality. In some contexts, the factual status of events is presented with absolute certainty. Events are then characterized as facts (2) or counterfacts (5). Other contexts introduce different shades of uncertainty. Depending on the polarity, events are then qualified as possibly factual (3) or possibly counterfactual (4).
(2) Five U.N. inspection teams visited a total of nine other sites.
(3) United States may extend its naval quarantine to Jordan's Red Sea port of Aqaba.
(4) They may not have enthused him for their particular brand of political idealism.
(5) The size of the contingent was not disclosed.
Factuality can therefore be characterized as involving polarity and modality (more precisely, epistemic modality). Polarity is a discrete category with two values, positive and negative. Epistemic modality expresses the speaker's degree of commitment to the truth of the proposition (Palmer, 1986), which ranges from uncertain (or possible) to absolutely certain (or necessary). For methodological reasons, however, we need a discrete categorization of that system.

2.2. Factuality values
Within modal logic, two operators are typically used to express a modal context: necessity (□) and possibility (◇); e.g., Lewis (1968). On the other hand, most of the work in linguistics points towards a three-fold distinction: certain, probable, and possible; e.g., (Lyons, 1977; Halliday & Matthiessen, 2004). Interestingly, Horn (1989) analyzes modality and its interaction with polarity based on both linguistic tests and the logical relations at the basis of the Aristotelian Square of Opposition. He presents modality as a continuous category, yet he provides a good grounding for differentiating the three major modality degrees just mentioned. Based on that, we represent factuality by means of the features in Table 1:

Table 1: Factuality values
                 Certain                              Probable              Possible             Underspecified
Positive         Fact: <CT,+>                         Probable: <PR,+>      Possible: <PS,+>     (NA)
Negative         Counterfact: <CT,−>                  Not probable: <PR,−>  Not certain: <PS,−>  (NA)
Underspecified   Certain but unknown output: <CT,u>   (NA)                  (NA)                 Unknown or uncommitted: <U,u>

The factual value of events is then presented as a tuple <mod,pol>, containing a modality and a polarity value. [2] The polarity axis divides into positive, negative, and unknown, while the modality axis distinguishes among certain (CT), probable (PR), possible (PS), and unknown (U). The unknown values are added to account for cases of uncommitment. The table includes six fully committed (or specified) values (<CT,+>, <CT,−>, <PR,+>, <PR,−>, <PS,+>, <PS,−>) and two underspecified ones: the partially underspecified <CT,u> and the fully underspecified <U,u>. The partially underspecified value, <CT,u>, is for cases where there is total certainty about the factual nature of the event but it is not clear what the output is; e.g., (6). The fully underspecified <U,u>, on the other hand, is used when any of the following situations applies: (i) the source does not know what the factual status of the event is, as in (7a); (ii) the source is not aware of the possibility of the event, e.g., (7b); or (iii) the source does not overtly commit to it, e.g., (7c). The following examples illustrate each of these situations for the underlined event when evaluated by the source John:
(6) John knows whether Mary came.
(7) a. John does not know whether Mary came.
    b. John does not know that Mary came.
    c. John knows that Paul said that Mary came.
For simplicity, in what follows the factuality values will be represented in the abbreviated form CT+, PR−, Uu, etc.
[2] Semantically, this can be interpreted as Val(mod)(Val(pol)(e)); i.e., the modal value scopes over the polarity value.
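Purely as an illustration (this sketch is ours, not part of FactBank's distribution), the value system of Table 1 can be modeled as pairs of a modality and a polarity value:

```python
# Illustrative sketch (ours, not FactBank's): the Table 1 value system
# as <modality, polarity> pairs.
from enum import Enum

class Mod(Enum):
    CT = "certain"; PR = "probable"; PS = "possible"; U = "underspecified"

class Pol(Enum):
    POS = "+"; NEG = "-"; UNK = "u"

# The six fully committed values plus the two underspecified ones.
VALUES = {
    (Mod.CT, Pol.POS): "fact",
    (Mod.CT, Pol.NEG): "counterfact",
    (Mod.PR, Pol.POS): "probable",
    (Mod.PR, Pol.NEG): "not probable",
    (Mod.PS, Pol.POS): "possible",
    (Mod.PS, Pol.NEG): "not certain",
    (Mod.CT, Pol.UNK): "certain but unknown output",
    (Mod.U,  Pol.UNK): "unknown or uncommitted",
}

def abbreviate(mod: Mod, pol: Pol) -> str:
    """Render a value in the paper's abbreviated form, e.g. 'CT+' or 'Uu'."""
    return mod.name + pol.value
```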
2.3. Discriminatory tests
In characterizing the factuality of events, the polarity parameter offers no problem, but distinguishing between the modality values (e.g., between possible and probable) is not always evident. In order to determine the modality parameter, we designed a battery of tests based on the logical relations considered in Horn (1989) to pinpoint the basic categories of epistemic modality; i.e., the Law of Contradiction and the Law of Excluded Middle. They are copredication tests.
Underspecification (U) versus different degrees of certainty (CT, PR, PS): Events with an underspecified value can be copredicated with both: a context in which they are characterized as certainly happening (CT+), and a context in which they are presented as certainly not happening (CT−). For example, sentence (8) can be continued by either fragment in (10), the first of which maintains the original underlined event as certainly happening (CT+), and the second as certainly not happening (CT−). This is not the case, however, for sentence (9), where the underlined event is explicitly characterized as probable.
(8) Iraq has agreed to allow Soviets in Kuwait to leave.
(9) Soviets in Kuwait will most probably leave.
(10) a. ... They will take the plane tomorrow early in the morning. (CT+)
     b. ... However, most of them decided to remain there. (CT−)
Absolute certainty (CT) versus degrees of uncertainty (PR, PS): Eventualities presented as certain (CT) cannot at the same time be assessed as possible (PS) in a context of opposite polarity. In the examples below, the symbol # is used to express that there is some sort of semantic anomaly.
(11) a. Hotels are only thirty (CT+) percent full.
     b. #... but it is possible that they aren't (PS−).
(12) a. Nobody believes (CT−) this anymore.
     b. #... but it is possible that somebody does (PS+).
On the other hand, eventualities characterized with some degree of uncertainty (PS or PR) allow for it:
(13) a. I think it's not going to change (PR−) for a couple of years.
     b. ... but it could happen otherwise. (PS+)
(14) a. He probably died (PR+) within weeks or months of his capture.
     b. ... but it is also possible that the kidnappers kept him alive for a while. (PS−)
In (13), the source expressed by the pronoun I characterizes the underlined event as PR− by presenting it under the scope of the predicate think used in the 1st person. The fragment in (13b) can be added without creating any semantic anomaly. A similar situation is presented in (14): the adverb probably characterizes the event as PR+, and the additional fragment presents the possibility of things being otherwise.
Probable (PR) versus possible (PS): As seen, both degrees of uncertainty (PR and PS) accept copredication with PS in a context of opposite polarity. However, only the lowest degree of uncertainty (PS) accepts copredication with PR in a context of opposite polarity.
(15) a. I think it's not going to change (PR−) for a couple of years.
     b. #... but it probably will. (PR+)
(16) a. It may not change (PS−) for a couple of years.
     b. ... but it most probably will. (PR+)
Table 2 summarizes the different copredication tests just introduced. The resulting epistemic modality values assigned to events are listed in the rows, while the tests are presented in the columns, abbreviated as EM_subindex. EM expresses the epistemic modality value of the context to be copredicated to the original sentence, whereas subindex indicates its polarity: = means a context of the same polarity, and op, a context of opposite polarity.

Table 2: Tests for discriminating among modality degrees.
      CT=   CTop   PRop   PSop
U     ok    ok     ok     ok
PS    ok    #      ok     ok
PR    ok    #      #      ok
CT    ok    #      #      #

For example, given an event e presented under a context of negative polarity in its original sentence, test PRop requires the creation of a new fragment in which e is used in a context where the modality degree is probable and the polarity is positive: PR+. [3]
(17) Original: I think it's not going to change. (PR−)
     Testing e with PRop: #... but it probably will. (PR+)
[3] As can be appreciated, test CT= is non-discriminative. It is included because, when combined with CTop, it makes it possible to distinguish U values from the rest.
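The test matrix lends itself to a mechanical reading. A minimal sketch (our own illustration, with invented function names) of using Table 2 to narrow down the modality degree from annotator judgments:

```python
# Hedged sketch (ours): Table 2 as a lookup table. 'ok' marks a felicitous
# copredication, '#' a semantically anomalous one.
TABLE2 = {
    "U":  {"CT=": "ok", "CTop": "ok", "PRop": "ok", "PSop": "ok"},
    "PS": {"CT=": "ok", "CTop": "#",  "PRop": "ok", "PSop": "ok"},
    "PR": {"CT=": "ok", "CTop": "#",  "PRop": "#",  "PSop": "ok"},
    "CT": {"CT=": "ok", "CTop": "#",  "PRop": "#",  "PSop": "#"},
}

def candidates(judgments: dict) -> list:
    """Modality values consistent with the tests run so far, e.g.
    candidates({"CTop": "#", "PRop": "ok"}) -> ["PS"]."""
    return [v for v, row in TABLE2.items()
            if all(row[t] == j for t, j in judgments.items())]
```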
2.4. Factuality markers
Event factuality in natural language is marked by both lexical items and syntactic constructions.
2.4.1. Lexical Markers
Event Selecting Predicates (ESPs). These are predicates (verbs, nouns, or adjectives) that select for an argument denoting an eventuality of any sort. Syntactically, they subcategorize for that-, gerundive-, and to-clauses, or NPs headed by event-denoting nouns. The ESPs in (18) are in bold face; their embedded events, underlined.
(18) a. Uri Lubrani also suggested Israel was willing to withdraw from southern Lebanon.
     b. Kidnappers kept their promise to kill a store owner they took hostage.
ESPs contribute to characterizing the factuality of the event denoted by their complement. For example, complements to weak assertive predicates (Hooper, 1975) (think, suppose) are depicted as not totally certain; complements of reporting predicates (Bergler, 1992) are presented as certain according to a particular source; factive (regret, know) and implicative predicates (manage, prevent) characterize their embedded complements as either factual or counterfactual (Kiparsky & Kiparsky, 1970; Karttunen, 1970, 1971); and arguments of volition and commitment predicates (wish, offer) are presented as possible at a future point in time.
Modal Particles. These include modal auxiliaries (could, may, must), but also clausal and sentential adverbial modifiers (maybe, likely, possibly).
Polarity Particles. These include elements of a varied nature: adverbs (not, until), quantifiers (no, none), pronouns (nobody), etc. They switch the polarity of their context. When scoping over a modal particle, they also affect its modal interpretation.
2.4.2. Syntactic Contexts
Syntactic structures conveying factuality information involve two clauses, one embedded under the other. In some cases, the embedded event is presupposed as holding; e.g., relative clauses (19), cleft sentences (20), and subordinated temporal clauses.
(19) Rice, who became secretary of state two months ago today, took stock of a period of tumultuous change.
(20) It was Mr. Bryant who, on July 19, 2001, asked Rep. Bartlett to pen and deliver a letter to him.
In others, the event denoted by the embedded clause is intensional in nature; e.g., purpose clauses (21) and conditional constructions (22).
(21) The environmental commission must adopt regulations to ensure people are not exposed to radioactive waste.
(22) EZLN will return to the negotiating table if the conflict zone is demilitarized.

3. Challenges in identifying event factuality
Annotating event factuality poses challenges at two levels. First, factuality is in many cases the result of different factuality markers interacting with each other. They can all be in the local context of the event, but it is also common for them to be at different levels. Second, the factuality of an event is always relative to one or more sources. Hence, these must be included as part of the annotation scheme as well. The following subsections elaborate on these two issues. Refer to Saurí (2008) for a more comprehensive view of event factuality and its identification.
3.1. Interpreting the factuality of events
Event factuality involves local but also non-local information. Consider the following examples: [4]
(23) a. The Royal Family will continue to allow detailed fire brigade inspections_e of their private quarters.
     b. The Royal Family will continue to refuse to allow detailed fire brigade inspections_e of their private quarters.
     c. The Royal Family may refuse to allow detailed fire brigade inspections_e of their private quarters.
[4] As startling as it may seem, the original sentence in this set is (23b), from the BNC.
The event inspections in (23a), where allow is embedded under the factive predicate continue, is characterized as a fact in the world; i.e., there have been such inspections. Example (23b), on the other hand, depicts inspections as a counterfact because of the effect of the predicate refuse scoping over allow. Now contrast the two previous sentences with that in (23c), where the factual status of the event inspections is uncertain due to the modal auxiliary may scoping over refuse. Hence, the factuality status of a given event cannot be obtained from the strictly local modality and polarity operators scoping over that event alone; it requires, where present, appealing to their interaction with non-local markers as well. Consequently, annotating factuality from a surface-based approach, accounting for the structural elements without considering their interaction, will miss an important piece of information.
3.2. Relevant sources
The second challenge to encoding event factuality involves the notion of perspective. Different discourse participants may present divergent views about the factual nature of the very same event. Recognizing these sources is crucial for any task involving text entailment, such as question answering or narrative understanding.
For example, event e in (24) (i.e., Slobodan Milosevic having been murdered in The Hague) will be inferred as a fact in the world if it is not qualified as the assertion of a specific source; namely, Milosevic's son.
(24) Slobodan Milosevic's son said Tuesday that the former Yugoslav president had been murdered_e at the detention center of the UN war crimes tribunal in The Hague.
By default, events mentioned in discourse always have an implicit source, viz., the author of the text. Additional sources are introduced in discourse by means of predicates of reporting (say, tell), knowledge and opinion (e.g., believe, know), psychological reaction (regret), etc. Because of their role in introducing a new source, we call them Source Introducing Predicates (SIPs). The status of the additional sources is, however, different from that of the author of the text. For instance, in (25) the reader learns Izvestiya's position only according to what the author asserts; in other words, the reader does not have direct access to the factual assessment of Izvestiya about event e2, or, for that matter, to the assessment of the G-7 leaders about e3.
(25) Izvestiya said_e1 that the G-7 leaders pretended_e2 everything was OK_e3 in Russia's economy.
Thus, we need to appeal to the notion of nested source as presented in Wiebe et al. (2005). Izvestiya is not a licit source of the factuality of event e2, but rather Izvestiya according to the author, represented here as izvestiya_author. [5] Similarly, the source referred to by the G-7 leaders corresponds to the chain g7leaders_izvestiya_author.
[5] Equivalent to the notation <author,izvestiya> in Wiebe's work.
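A minimal sketch (ours; the identifiers are invented for illustration) of nested-source chains for (25):

```python
# Hedged sketch (ours): nested relevant sources as chains, innermost first,
# for (25): "Izvestiya said the G-7 leaders pretended everything was OK ..."
def nest(*sources: str) -> str:
    return "_".join(sources)

author    = "author"
izvestiya = nest("izvestiya", "author")               # izvestiya_author
g7leaders = nest("g7leaders", "izvestiya", "author")  # g7leaders_izvestiya_author

# The same event may receive one factuality value per relevant source,
# as discussed below for event e3.
factuality_e3 = {g7leaders: "CT+", izvestiya: "CT-", author: "Uu"}
```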
From our perspective, however, the MPQA Opinion Corpus is limited in that it only acknowledges one relevant source for each event. Another limitation in the MPQA annotation scheme is that it is not grammatically grounded. That is, the annotation of text spans is not guided according to the grammatical structure of the sentence, and this can pose an obstacle for tasks of automatic recognition. The Penn Discourse TreeBank (PDTB) seems closer to our perspective in that it contemplates the attribution of abstract objects (corresponding here to what we refer to as events), and encodes both their sources and the degree of factuality associated to them (Prasad et al., 2007). The task is approached from a compositional approach, contrary to the MPQA Opinion Corpus. In spite of these similarities, there are two significant differences. With regard to sources, PDTB does not encode the nesting relation that can hold among them, neither accounts for the possibility of more than one source for a given abstract object (or event). The second difference concerns the factuality degree associated to the attributed event, which is assigned based on 6 http://projects.ldc.upenn.edu/ace/annotation/. Because it still is an ongoing project, we will not comment on that corpus here. the type of action described by the predicate embedding it. In particular, events embedded under communication predicates are characterized as asserted; events embedded by propositional attitude predicates, as beliefs; and events embedded under factive predicates, as facts. As it happens, however, each of these types of predicates is not uniform in terms of the factuality they project to the embedded event. Suggest, for instance, is a communication verb which nevertheless conveys a nuance of belief. Similarly, forget is a factive predicate which, contrary to others in its class, expresses an uncommitted (or ignorant) stance of the source (i.e., the participant expressed by its subject) with regards to the factual status of its embedded complement. The classification misses therefore important factuality distinctions. Finally, PDTB annotation is not concerned with the effect of other markers of modality (modal auxiliaries and adverbials) on the factuality of abstract objects. The last corpus to evaluate is TimeBank, a corpus annotated with TimeML (Pustejovsky et al., 2005), a specification language representing temporal and event information in text. Given the surface-based approach of TimeML, TimeBank is the corpus that takes the most compositional approach to annotation among the three reviewed corpora. The factuality-relevant information encoded in TimeBank is mainly lexical: grammatical particles expressing event modality and polarity, as well as event selecting predicates (cf. section 2.4.1.), which project a factual value to their embedded event by means of subordination links (or slinks). Thus, TimeBank provides us with the basic components expressing factuality information in text –a consequence of the explicit surface-based approach of TimeML. And whereas there is some characterization of event factuality (through slinks), it does not deal with the interaction among the different markers scoping over the same event. 5. Creating a corpus of event factuality 5.1. FactBank FactBank is a corpus annotated with factuality information. It consists of 208 documents and contains a total of 8837 events manually annotated. 
FactBank includes all the documents in TimeBank and a subset of those in the AQUAINT TimeML Corpus (A-TimeML Corpus). [7] The contribution of each of these corpora to FactBank is shown in Table 3.
[7] http://www.timeml.org/site/timebank/timebank.html

Table 3: FactBank sources
              TimeBank     A-TimeML Corpus   Total
# Documents   183 (88%)    25 (12%)          208
# Events      7935 (90%)   902 (10%)         8837

Because both TimeBank and the AQUAINT TimeML Corpus are annotated with the TimeML spec, FactBank incorporates a second layer of factuality information on top of that in the original corpora. Thus, while the former two encode the structural elements expressing factuality information in language, the latter represents the resulting interpretation. The new annotation is kept in separate documents and is linked to the original data by means of the event IDs, which are the same in both annotation layers. [8]
[8] FactBank annotation can be expressed by means of XML tags representing the factuality value assigned by a source to a given event. Because each event can be assigned more than one factuality value (as many as it has relevant sources), these must be non-consuming tags. Alternatively, given the correspondence between event IDs in both layers, the mapping can be established by means of stand-off markup as well.

5.2. Corpus annotation
We argued earlier that identifying event factuality requires linguistic processing at different layers. First, it involves the interaction of local and non-local context. Second, it puts into play at least one, but generally more, relevant sources for each event, which bear a nesting relation among them. Hence, if not structured adequately, the annotation task could become too complex and would inevitably result in a questionable outcome. Annotating event factuality needs to be addressed in steps that both help annotators mentally structure and comprehend the different information layers involved, and allow us to partially automate certain parts of the annotation process. We divide the annotation effort into three consecutive tasks.

5.2.1. Task 1: Identifying Source-Introducing Predicates (SIPs)
Given a text with the events already recognized and marked as such, the annotators identified those that correspond to Source-Introducing Predicates. SIPs were briefly described in section 3.2. as including predicates of reporting, knowledge and opinion, among others. They are the linguistic elements that contribute a new source to the discourse. Such new sources, which must be nested relative to any previous relevant source, will have a role in assessing the factuality of the SIP event complement; recall example (25). This initial task allowed annotators to become familiar with both the notion of source and the notion of SIP as a marker of factuality information. Moreover, for processing purposes, Saurí & Pustejovsky (2007) show that identifying SIPs is fundamental for the automatic computation of relevant sources. The manual annotation resulting from this task was then used to prepare the final task.

5.2.2. Task 2: Identifying sources
The annotator was provided with a text with the following information already annotated: (a) all the SIPs in the text, obtained from the previous task; and (b) for each of these SIPs, a set of elements that can potentially express the new source it introduces; that is, a set of new source candidates. New source candidates had been automatically identified by selecting NP heads holding any of the syntactic functions listed here: [9]
1. Subject of any verbal predicate in the sentence.
2. Agent of a SIP in a passive construction (e.g., The crime was reported by the neighbor.) [10]
3. Direct object of a SIP that has, as one of its arguments, a control clause headed by another SIP (e.g., He criticized Ed for saying...).
4. Complement of preposition to at the beginning of a sentence (e.g., To me, she...).
5. Complement of preposition to that is in a dependency relation with a SIP (e.g., according to me, it seems to me).
6. Complement of preposition of that is in a dependency relation with a noun SIP (the announcement of Unisys Corp.).
7. Possessor in a genitive construction whose noun head is a SIP (e.g., Unisys Corp.'s announcement).
[9] These syntactic functions were obtained from parsing the corpus with the Stanford Parser (de Marneffe et al., 2006).
[10] In this and coming examples, the new source candidate is marked in bold face and the SIP, underlined.
For every SIP, the annotator selected the new source it introduces from among those in the candidate set. Two exceptional situations were also accounted for: (i) The new source did not correspond to any of the candidates in the list. The annotator would in these cases select the option OTHER, and a subsequent adjudication process would pick the adequate text item. (ii) There was no explicit segment in the text referring to the new source, for instance in the case of generic sources (e.g., it was expected/assumed that...). The annotator would then select the option NONE. The new source is then interpreted as generic; i.e., it can be paraphrased as everybody. Such sources are represented as GEN in the resulting chain expressing the relevant source (e.g., GEN_author).

5.2.3. Task 3: Assigning factuality values
This final task was devoted to selecting the factuality value assigned to events by each of their relevant sources. The annotators were provided with a text where every event expression was paired with its relevant sources. Hence, sentences containing events with more than one relevant source were repeated several times, each presenting a different event-relevant source pair. The set of relevant sources for each event had been automatically computed from the new sources manually identified in the previous task, based on the algorithm for finding them presented in Saurí & Pustejovsky (2007). The annotators had to choose among the set of factuality values presented in Table 4, which corresponds grosso modo to Table 1 with the addition of the values PRu and PSu. In establishing the earlier table, these two values were estimated as not relevant, but we wanted to confirm that they were also considered unnecessary by the annotators when looking at real data. Two further values were allowed as well in order to pinpoint potential limitations in our value set: OTHER, covering situations where a different value would be required (e.g., the combinations U+ and U−), or where the annotator did not know what value to select; and NA (non-applicable), for events whose factuality cannot be evaluated. To discern among the different factuality values, the annotators were asked to apply the discriminatory tests presented in section 2.3.

Table 4: Factuality values
Committed values:
  CT+   According to the source, it is certainly the case that X.
  PR+   According to the source, it is probably the case that X.
  PS+   According to the source, it is possibly the case that X.
  CT−   According to the source, it is certainly not the case that X.
  PR−   According to the source, it is probably not the case that X.
  PS−   According to the source, it is possibly not the case that X.
(Partially) uncommitted values:
  CTu   The source knows whether it is the case that X or that not X.
  PRu   The source knows whether it is probably the case that X or that not X.
  PSu   The source knows whether it is possibly the case that X or that not X.
  Uu    The source does not know what the factual status of the event is, or does not commit to it.
Other values:
  Other A different value is required (e.g., U+, U−), or the annotator does not know what value to assign.
  NA    The factuality of the eventuality cannot be evaluated.
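As a purely illustrative sketch (field names and IDs are invented; this is not FactBank's actual file format), a Task 3 item pairs one event with one relevant source and is labeled with a value from Table 4:

```python
# Hedged sketch (ours): a Task 3 annotation item. IDs and field names
# are invented for illustration.
TABLE4_VALUES = {"CT+", "PR+", "PS+", "CT-", "PR-", "PS-",
                 "CTu", "PRu", "PSu", "Uu", "Other", "NA"}

item = {
    "event_id": "e3",                          # TimeML-layer event ID
    "source":   "g7leaders_izvestiya_author",  # relevant (nested) source
    "value":    None,                          # to be filled by the annotator
}

def label(item: dict, value: str) -> None:
    assert value in TABLE4_VALUES, f"not a Table 4 value: {value}"
    item["value"] = value
```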
6. Evaluation
FactBank has been annotated by pairs of annotators. Overall, three annotators participated in the effort: annotators A and B carried out the first task, and annotators B and C carried out tasks 2 and 3. All of them are competent undergraduate Linguistics majors. In addition, two adjudicators handled cases of disagreement in each task before the annotators continued with the next one.
Task 1. The interannotation agreement achieved is k=0.88 over 40% of the corpus (on the number of events). [11] Some of the most common cases of disagreement concern:
• SIP candidates with implicit sources, e.g. generic ones, as in: He's expected to meet with Iraqi deputy prime minister Tariq Aziz later this afternoon.
• SIP candidates lacking an explicit event complement (e.g., The executives didn't disclose the size of the expected gain.).
• Negated SIP candidates (e.g., didn't disclose, did not tell, in the examples above).
[11] We apply Cohen's kappa (Cohen, 1960), hence accepting any potential distortion in the resulting figures due to the skewed distribution of categories (the so-called prevalence problem) as well as to the degree to which the annotators disagree (the bias problem). Refer to Di Eugenio & Glass (2004).
Task 2. The interannotation agreement achieved for this task is k=0.95 over 40% of the corpus (on the number of events). Such good results come as no surprise, since it is a very well-defined task, both in syntactic and semantic terms; essentially, it requires identifying SIP logical subjects. The most common cases of disagreement are those in which:
• There is a second expression in the text coreferring with the new source, for example the first person pronoun in a quoted fragment (e.g., "We are going to maintain our forces in the region for the foreseeable future," said spokesman Kenneth Bacon.). [12] Another common situation arose with relative clauses (e.g., British police officers who had been searching for Howes concluded that ...).
• The new source introduced by the SIP referred to a non-human entity (e.g., Reports attributed to the Japanese foreign ministry said ...). One of the annotators would choose a different option.
[12] In this and the following examples, the SIP is presented in bold face and the new source to be selected in bold face and underlined. If an additional expression enters into consideration as a new source candidate as well, it is only underlined.
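For reference, a minimal sketch (ours) of the kappa statistic used in these evaluations:

```python
# Minimal sketch (ours) of Cohen's kappa (Cohen, 1960) for two annotators
# labeling the same items.
from collections import Counter

def cohen_kappa(ann_a, ann_b):
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    # Observed agreement: proportion of items with identical labels.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Expected (chance) agreement from each annotator's label distribution.
    ca, cb = Counter(ann_a), Counter(ann_b)
    p_e = sum(ca[lab] * cb[lab] for lab in ca.keys() | cb.keys()) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```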
Task 3. Interannotation agreement for this last task scores k=0.82 over 30% of the corpus (in terms of the number of events). We consider this a very acceptable result, given the complexity of the task. In a comparable work devoted to classifying certainty in text according to a five-fold categorization (absolute, high, moderate, low, and uncertain) (Rubin, 2007), the interannotation score obtained was k=0.15, which improved to k=0.41 when stricter annotation instructions were provided. Furthermore, an analysis of disagreement cases on 10% of our corpus shows that around two thirds of them are cases of true ambiguity, originating from different constructions. Some of the most common concerned the scope of a reporting predicate or, in other words, the span of the attributed fragment. In (26), for example, the reporting predicate (in bold face) can be interpreted as scoping over both events want and traveled, or only over traveled.
(26) Authorities want to question the unidentified woman who allegedly traveled with Kopp, according to an investigator quoted by the newspaper.
A second common case of ambiguity is caused by syntactic constructions typically triggering a presupposition (e.g., relative clauses, temporal clauses, appositions) when embedded under a reporting predicate (27). Annotators would disagree on whether the presupposition is projected to the main clause; in our terms, the disagreement concerns whether the author of the text commits to the embedded event (underlined below) as a fact.
(27) The killing of Dr. Barnett Slepian, a gynecologist in Buffalo who performed abortions, has become a factor in at least two campaigns in New York, say political consultants and some campaign advisers.

7. Conclusions
Event factuality is an important component for representing events in discourse, but identifying it poses a two-fold challenge. First, factuality is in many cases the result of different factuality markers interacting with each other. They can all be in the local context of the event, but it is also common for them to be at different levels. Second, the factuality value assigned to events in text must be relative to the relevant sources at play, which may be one or more. In this paper, we introduced FactBank, a corpus of events annotated with factuality. FactBank contributes a semantic layer of factuality information on top of the grammar-based layer provided in TimeBank. The interannotation agreement scores obtained for the three annotation tasks we designed are encouraging. Specifically, for the task of selecting the factuality value assigned to events by each of their relevant sources, we achieved k=0.82 over 30% of the corpus. This suggests that event factuality as modeled in our work is well grounded in linguistic data, and that its identification is achievable using an approach along the lines of the one proposed here. FactBank will be made available to the community in the near future.

References
Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics: 86–90.
Bergler, S. (1992). Evidential Analysis of Reported Speech. PhD thesis, Brandeis University.
Carlson, L., Marcu, D., & Okurowski, M. E. (2002). Building a discourse-tagged corpus in the framework of rhetorical structure theory.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
de Marneffe, M.-C., MacCartney, B., Grenager, T., Cer, D., Rafferty, A., & Manning, C. D. (2006). Learning to distinguish valid textual entailments. In Second PASCAL RTE Challenge (RTE-2).
de Marneffe, M.-C., MacCartney, B., & Manning, C. D. (2006). Generating typed dependency parses from phrase structure parses. In Proceedings of LREC 2006.
Di Eugenio, B. & Glass, M. (2004). The kappa statistic: a second look. Computational Linguistics, 30.
Halliday, M. A. K. & Matthiessen, C. M. (2004). An Introduction to Functional Grammar. London: Hodder Arnold.
Hickl, A. & Bensley, J. (2007). A discourse commitment-based framework for recognizing textual entailment. In Proceedings of the Workshop on Textual Entailment and Paraphrasing: 171–176.
Hooper, J. B. (1975). On assertive predicates. In J. Kimball (Ed.), Syntax and Semantics, IV. New York: Academic Press: 91–124.
Horn, L. R. (1989). A Natural History of Negation. Chicago: University of Chicago Press.
Karttunen, L. (1970). Implicative verbs. Language, 47, 340–358.
Karttunen, L. (1971). Some observations on factivity. Papers in Linguistics, 4, 55–69.
Karttunen, L. & Zaenen, A. (2005). Veridicity. In Katz, G., Pustejovsky, J., & Schilder, F. (Eds.), Dagstuhl Seminar Proceedings, Schloss Dagstuhl, Germany: Internationales Begegnungs- und Forschungszentrum (IBFI).
Kiparsky, P. & Kiparsky, C. (1970). Fact. In M. Bierwisch & K. E. Heidolph (Eds.), Progress in Linguistics. A Collection of Papers. The Hague: Mouton: 143–173.
Lewis, D. (1968). Counterpart theory and quantified modal logic. Journal of Philosophy, 65, 113–126.
Light, M., Qiu, X. Y., & Srinivasan, P. (2004). The language of bioscience: Facts, speculations, and statements in between. In BioLINK 2004: Linking Biological Literature, Ontologies, and Databases: 17–24.
Lyons, J. (1977). Semantics. Cambridge: Cambridge University Press.
Miltsakaki, E., Prasad, R., Joshi, A., & Webber, B. (2004). The Penn Discourse Treebank. In Proceedings of LREC 2004.
Palmer, F. R. (1986). Mood and Modality. Cambridge, England: Cambridge University Press.
Palmer, M., Gildea, D., & Kingsbury, P. (2005). The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1).
Pradhan, S., Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., & Weischedel, R. (2007). OntoNotes: A unified relational semantic representation. In Proceedings of the IEEE International Conference on Semantic Computing.
Prasad, R., Dinesh, N., Lee, A., Joshi, A., & Webber, B. (2007). Attribution and its annotation in the Penn Discourse Treebank. Traitement Automatique des Langues, 47(2).
Pustejovsky, J., Hanks, P., Saurí, R., See, A., Gaizauskas, R., Setzer, A., Radev, D., Sundheim, B., Day, D., Ferro, L., & Lazo, M. (2003). The TimeBank corpus. In Proceedings of Corpus Linguistics 2003: 647–656.
Pustejovsky, J., Knippen, B., Littman, J., & Saurí, R. (2005). Temporal and event information in natural language text. Language Resources and Evaluation, 39(2), 123–164.
Rubin, V. L. (2007). Stating with certainty or stating with doubt: Intercoder reliability results for manual annotation of epistemically modalized statements. In Proceedings of NAACL-HLT 2007.
Saurí, R. (2008). A Factuality Profiler for Eventualities in Text. PhD thesis, Brandeis University.
Saurí, R. & Pustejovsky, J. (2007). Determining modality and factuality for text entailment. In Proceedings of the 1st IEEE International Conference on Semantic Computing.
Snow, R. & Vanderwende, L. (2006). Effectively using syntax for recognizing false entailment. In HLT-NAACL 2006.
Tatu, M. & Moldovan, D. (2005). A semantic approach to recognizing textual entailment. In Proceedings of HLT/EMNLP: 371–378.
Verhagen, M., Stubbs, A., & Pustejovsky, J. (2007). Combining independent syntactic and semantic annotation schemes. In Proceedings of the Linguistic Annotation Workshop.
Wiebe, J., Wilson, T., & Cardie, C. (2005). Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2), 165–210.
Wolf, F. & Gibson, E. (2005). Representing discourse coherence: A corpus-based analysis. Computational Linguistics, 31(2), 249–287.
An Extensible Compositional Semantics for Temporal Annotation
Harry Bunt, Chwhynny Overbeeke
Tilburg University, Department of Communication and Information Sciences
P.O. Box 90153, 5000 LE Tilburg, Netherlands
harry.bunt@uvt.nl, info@chwhynny.nl

Abstract
In this paper we present an event-based formal semantics for temporal annotation, in particular for the ISO-TimeML annotation language under development in the International Organization for Standardization. This semantics takes the form of a compositional translation into First-Order Logic (FOL) using terms that denote concepts in an extended OWL-Time. Since FOL itself has a compositional semantics, our ISO-TimeML semantics is compositional in the sense that the translation of the annotation of a text is a function of the translations of its subexpressions (where any well-formed subexpression can be translated independently of other subexpressions) and of the structure of the annotation, as encoded in its linking tags. The approach presented here has been designed to be extensible to the semantic annotation of information other than temporal information.

1. Introduction
Linguistic annotation, according to Ide & Romary (2004), is the process of adding linguistic information to language data, or that information itself. The primary aim of annotation is usually the identification of certain linguistic patterns, in order to support the investigation of linguistic phenomena illustrated by such patterns, in particular for applying machine learning algorithms. As such, syntactic annotation as well as morphosyntactic, prosodic and pragmatic annotation have been useful in the development of data-driven linguistic models and theories. Semantic annotations are meant to capture some of the meaning of the annotated text. This is not only potentially useful for identifying certain linguistic semantic patterns; the meaning that is captured by the annotation should also support the exploitation of that semantic information in language processing tasks. For instance, Pustejovsky et al. (2003) argue that their annotation language TimeML, designed to support the automatic recognition of temporal and event expressions in natural language text, should also support "temporal and event-based reasoning in language and text, particularly as applied to information extraction and reasoning tasks". (See also Han & Lavie (2004).)
Bunt & Romary (2002) argue that any adequate semantic annotation formalism should have a well-defined semantics. Existing approaches to semantic annotation, by contrast, tend to take the semantics of the annotations for granted. A current development in the area of semantic annotation is the design of an international standard for the annotation of temporal information, undertaken in the project "Semantic Annotation Framework, Part 1: Time and Events", which is carried out by an expert group within the International Organisation for Standardisation (ISO). The annotation language that is defined in this project is based on TimeML and is therefore called ISO-TimeML.
This project includes an effort to provide a formal semantics for the annotation language, based on Pratt-Hartmann's proposal of a formal semantics for TimeML (Pratt-Hartmann, 2007) using Interval Temporal Logic (ITL), a first-order logic for reasoning about time. In this framework, the annotations are interpreted as statements about time intervals associated with events; events are not represented explicitly. While representing a substantial step forward, this semantics, described in the ISO (2007) document, has certain important limitations:
1. it applies only to a rather limited fragment of the annotation language, not including for instance tense, aspect, and durations;
2. it is not compositional, in the sense that it involves a translation from ISO-TimeML to ITL in which the translation of a subexpression of an annotation structure depends on that of other subexpressions;
3. it is applicable to temporal information only, and is not extensible to other kinds of semantic information, such as the identification of the participants in the events whose temporal properties are considered.
In this paper we present an alternative, event-based formal semantics for ISO-TimeML, which applies to a substantially greater part of the annotation language, which is fully compositional, and which is not limited to dealing with temporal information. This approach follows the familiar 'interpretation-by-translation' paradigm, translating ISO-TimeML annotations, as represented in XML, into First-Order Logic (FOL). The compositionality of the approach rests on making this translation compositional. In discussing this approach we will follow the TimeML terminology and speak of 'events' in the generalized sense for which Bach (1981) introduced the term eventualities, covering both states and events, where events may be subcategorized in various ways, for instance into processes and transitions.
This paper is organized as follows. In section 2 we briefly look at temporal information from an (onto-)logical and a linguistic point of view, and at the role that temporal annotation has to play. In section 3 we describe the translation of ISO-TimeML tags into formal representations. In section 4 we discuss the problem of making a formal semantics for XML-based annotations compositional, and present our solution to the problem. We end with concluding remarks in section 5.

2. Temporal Information
From an (onto-)logical point of view, the fundamental concepts relating to time are: the time point; the ordering relation between time points ('before'); the temporal interval; the begin and end points of an interval; the relation 'inside' between points of time and temporal intervals; and the length of a temporal interval, which requires the notion of a temporal unit of measurement. The general framework of Allen (1984), which has been very influential in the computational modelling of time, distinguishes 7 relations (and their inverses) between temporal intervals: equals, before/after, meets, overlaps, starts, finishes, during/contains. These relations can all be defined in terms of the before relation among time points and the begin and end points of intervals. In our FOL translations of ISO-TimeML annotations we will use polymorphic versions of Allen's relations, applying them both to time points and to temporal intervals where appropriate. (For instance, we will use a predicate 'Before' which can apply to two temporal intervals, to two time points, to a time point and a temporal interval, or to a temporal interval and a time point, with the obvious interpretations.)
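As an illustration only (ours, not part of ISO-TimeML), the interval-interval cases of these relations reduce to endpoint comparisons:

```python
# Hedged sketch (ours): Allen's (1984) interval relations, with an interval
# as a (start, end) pair and 'before' on time points as plain '<'.
def equals(i, j):   return i == j
def before(i, j):   return i[1] < j[0]                  # inverse: after
def meets(i, j):    return i[1] == j[0]
def overlaps(i, j): return i[0] < j[0] < i[1] < j[1]
def starts(i, j):   return i[0] == j[0] and i[1] < j[1]
def finishes(i, j): return i[1] == j[1] and j[0] < i[0]
def during(i, j):   return j[0] < i[0] and i[1] < j[1]  # inverse: contains
```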
From a linguistic point of view, the issue is in what way these temporal objects and relations are described by linguistic expressions, and how language relates temporal objects to other concepts, in particular to states and events. Temporal annotation, when endowed with a formal semantics, can be viewed as a bridge between the linguistic encoding of temporal information and the logical modeling of temporal structures and relations. For the formal semantics of ISO-TimeML (ISO, 2007), we will use an extension of the OWL-Time ontology (Hobbs & Pan, 2004). To the basic concepts of OWL-Time (interval, instant, beginning, end, inside, time zone) we add the concepts of temporal unit and duration, and concepts needed for interpreting tense: event time, speech time, and reference time.1

1 Hobbs & Pan (2004) use the term 'duration' to indicate a time span during which an event or state occurs. This is to be distinguished from our use of the term as indicating the length of a time span.

2.1. Dates, Times and Periods
To represent dates, ISO-TimeML follows ISO standard 8601 and uses the format yyyy-mm-dd to encode year, month and day. This representation is unsatisfactory from a logical point of view, as it does not make the components of this information available for reasoning. For specifying a point in time we will use functions like calYear, calMonth, calDay, and clockTime (which specifies a time as shown on the clock in a given time zone):

(1) March 16th 2007 at 10:15 a.m. CET
λt . INSTANT(t) ∧ calYear(t,cet) = 2007 ∧ calMonth(t,cet) = march ∧ calDay(t,cet) = 16 ∧ clockTime(t,cet) = 10:15

It is rather unusual to be as explicit about a time zone as in (1); the time zone in which a clock time is considered is usually assumed to be obvious from the context in which the text fragment occurs that mentions the time. We will use the constant zc to indicate the contextually relevant time zone in which a clock time is intended.

We use predicates like DAY and MONTH to represent intervals such as days, weeks, months, and years. The predicate DAY, for instance, is true of an interval starting at twelve midnight in some time zone and ending 24 hours later. Again following ISO standard 8601, ISO-TimeML represents weekdays according to the format xxxx-wxx-d, where d is the number of the weekday. Thus, Monday is xxxx-wxx-1 and Friday is xxxx-wxx-5. We will use predicates for the weekdays and Allen's relations between temporal intervals to interpret the ISO-TimeML annotation of such expressions:

(2) (a) Friday
λt . ∃T : FRIDAY(T) ∧ Inside(t,T)
(b) every Friday
λP . ∀T : FRIDAY(T) → P(T)
(c) each year in March
λP . ∀T1 : (YEAR(T1) ∧ ∃T2 : calMonth(T2,zc) = march ∧ Before(Start(T1),Start(T2)) ∧ Before(End(T2),End(T1))) → P(T1)

We will use the constant today to refer to an interval that is a day inside which lies the speech time: today ⇔ DAY(T) ∧ Inside(T0,T):

(3) (a) yesterday
DAY(T) ∧ END(T) = START(today)
(b) the day before yesterday
DAY(T1) ∧ DAY(T2) ∧ END(T1) = START(today) ∧ END(T2) = START(T1)
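As a concrete illustration of how the calYear/calMonth/calDay/clockTime functions decompose a date and make its components available for reasoning, here is a minimal Python sketch (our illustration; the helper names and the use of datetime are assumptions, not part of the paper):

from datetime import datetime, timezone, timedelta

# Hypothetical helpers mirroring the paper's calYear/calMonth/calDay/clockTime
# functions: each projects a component of an instant relative to a time zone.
CET = timezone(timedelta(hours=1))  # stand-in for the 'cet' constant

def cal_year(t: datetime, tz) -> int:   return t.astimezone(tz).year
def cal_month(t: datetime, tz) -> int:  return t.astimezone(tz).month
def cal_day(t: datetime, tz) -> int:    return t.astimezone(tz).day
def clock_time(t: datetime, tz) -> str: return t.astimezone(tz).strftime("%H:%M")

def matches_example_1(t: datetime) -> bool:
    # Truth conditions of example (1): March 16th 2007 at 10:15 a.m. CET.
    return (cal_year(t, CET) == 2007 and cal_month(t, CET) == 3
            and cal_day(t, CET) == 16 and clock_time(t, CET) == "10:15")

print(matches_example_1(datetime(2007, 3, 16, 10, 15, tzinfo=CET)))  # True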
2.2. Durations
To define durations we introduce the function TimeAmount, which constructs an amount of time from a numerical specification and a temporal unit, as illustrated in (4):

(4) for 2 hours
λT . DURATION(T) = TimeAmount(2,hour)

A conversion function which specifies a numerical relation between temporal units, such as Conversion(hour,minute) = 60, explains equivalences like TimeAmount(1,day) = TimeAmount(24,hour) (see further Bunt (1985) for a calculus of amounts).

2.3. Tense and Aspect
Following Reichenbach (1947), we analyse tenses in terms of event time, speech time, and reference time (ET, T0, and RT in the formal representations). ISO-TimeML uses PAST, PRESENT, and FUTURE as values of the tense attribute. If an utterance applies to an event in the past, the event time lies before the speech time; if it applies to an event in the present, the speech time is contained in the event time; if it applies to an event in the future, its event time is after the speech time. We can therefore conclude that:

PAST(e) ⇔ Before(ET(e),T0)
PRESENT(e) ⇔ Inside(T0,ET(e))
FUTURE(e) ⇔ Before(T0,ET(e))

Some examples:

(5) (a) Igor coughed.
∃e ∃x : COUGH(e) ∧ AGENT(x,e) ∧ IGOR(x) ∧ Before(ET(e),T0)
(b) Igor coughs.
∃e ∃x : COUGH(e) ∧ AGENT(x,e) ∧ IGOR(x) ∧ Inside(T0,ET(e))
(c) Igor will cough.
∃e ∃x : COUGH(e) ∧ AGENT(x,e) ∧ IGOR(x) ∧ Before(T0,ET(e))

Note that in these examples we consider a literal interpretation of tenses, treating tense as an indicator of the temporal ordering relation between event time, speech time and reference time. Tense information should not always be taken literally, however. For instance, in (6) the event time lies after the speech time, in spite of the present tense of the verb:

(6) I am at the office tomorrow.

The temporal adverb tomorrow determines this, even though the present tense of the verb would suggest that the event time includes the speech time. This is a complication for any semantic interpretation of temporal annotation. One way to handle this problem could perhaps be to assign a different value to the tense attribute in such cases when annotating the text (e.g., Lee (2008) uses 'future present'), but this has the drawback of altering the linguistic concept of tense. Similar problems may arise with the interpretation of other syntactic attributes like gender and number.

The progressive aspect indicates that an event is occurring over a certain period of time and has not yet ended; that is, the speech time lies between the starting point and the end point of the event time. Similarly, the perfective aspect indicates that an event has ended, or refers to a state resulting from an event that has occurred:

(7) Igor had already slept.
∃e ∃x : SLEEP(e) ∧ AGENT(x,e) ∧ IGOR(x) ∧ Before(ET(e),RT) ∧ Before(RT,T0)

2.4. Temporal Anchoring
The Reichenbach (1947) notion of 'event time', originally introduced to interpret tenses, can obviously also be used for describing the temporal anchoring of an event to a time point or a temporal interval:

(8) Igor died between 10 and 11 AM.
∃e ∃x ∃T ∃t1 ∃t2 : DIE(e) ∧ PIVOT(x,e) ∧ IGOR(x) ∧ Interval(t1,t2) = T ∧ clockTime(t1,zc) = 10:00 ∧ clockTime(t2,zc) = 11:00 ∧ Inside(ET(e),T) ∧ Before(T,T0)

ISO-TimeML also supports the temporal anchoring of an event with a specification of frequency, which may involve several temporal elements, such as two hours a day and three days every month. The ISO-TimeML annotation of such cases and our formal representations are as follows:

(9) <TIMEX3 tid="t1" type="SET" value="P1M" quant="EVERY" freq="3D"> three days every month </TIMEX3>
λP . ∀T1 : MONTH(T1) ∧ ∃!3 T2 : DAY(T2) ∧ Inside(T2,T1) ∧ P(T1)
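Before moving on to relations between events, note that the tense equivalences of section 2.3 can be read as a simple decision procedure. The following sketch (ours; the record type and the numeric encoding of times are invented for illustration) makes that explicit:

from dataclasses import dataclass

@dataclass
class EventRecord:
    event_time: tuple  # (start, end) of ET(e), as numbers for illustration

def tense_of(e: EventRecord, speech_time: float) -> str:
    start, end = e.event_time
    if end < speech_time:
        return "PAST"      # Before(ET(e), T0)
    if start <= speech_time <= end:
        return "PRESENT"   # Inside(T0, ET(e))
    return "FUTURE"        # Before(T0, ET(e))

print(tense_of(EventRecord((1.0, 2.0)), speech_time=5.0))  # PAST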
2.5. Relations between events
ISO-TimeML distinguishes three types of relation linking events to temporal elements or to other events. First, TLINK relates two temporal elements to one another, temporal elements to events, or events to one another, like for instance 20 minutes to every Friday and every Friday to RUN in Igor runs 20 minutes every Friday, and LEAVE to ARRIVE in Amy left before Igor arrived. Second, SLINK is a subordination link between events for cases like Igor wants to run and Amy believes that Igor loves her. There are six types of SLINK relations: modal (e.g. PROMISE, WANT), evidential (e.g. SEE), negative evidential (e.g. DENY), factive (e.g. REGRET), counter-factive (e.g. PREVENT), and conditional (e.g. if). SLINK is not a temporal relation, and its interpretation is thus outside the scope of this paper (but see Bunt, 2007). Third, ALINK indicates an aspectual relation between two eventualities: initiation, culmination, termination, continuation, or re-initiation, as exemplified by Igor started to run. These relations are more than just temporal relations: they can be viewed as a thematic relation (notably a THEME relation) plus certain specific properties. In the case of initiation, the specific property is that the starting point of the initiating event equals the starting point of the initiated event. Culmination means that the subordinate event has been completed, whereas termination implies that the subordinate event has not been completed.

3. From Annotations to Formal Representations
We follow the 'interpretation-by-translation' approach for interpreting ISO-TimeML annotations, and formulate a compositional translation from the XML representations of ISO-TimeML annotations into formulas of First-Order Logic. The translation is defined by a set of rules for translating ISO-TimeML subexpressions and a set of operations for combining these translations, ultimately leading to the construction of a formal representation of the annotated text. We mentioned in the beginning of this paper that the proposed ISO-TimeML semantics in terms of Interval Temporal Logic (see Pratt-Hartmann (2007) and the ISO (2007) document) is not fully compositional. In a nutshell, the problem of translating (XML representations of) ISO-TimeML annotations into formulas of a logical language in a compositional way is the following. Compositional translation means that every well-formed subexpression of the source language is translated into the target language independently of other subexpressions; these translations are subsequently combined in a way that is determined by the structure of the source expression as a whole, as encoded in the TLINK, ALINK and SLINK tags that link the various subexpressions. ISO-TimeML annotations contain two kinds of subexpressions: on the one hand the expressions corresponding to events and temporal objects (<EVENT>...</EVENT> and <TIMEX3>...</TIMEX3> subexpressions), and on the other hand subexpressions that indicate temporal, aspectual, or subordinating relations (TLINK, ALINK, and SLINK expressions). The latter type of expression contains attributes whose values are identifiers in the subexpressions denoting events or temporal objects, thereby 'linking' these subexpressions. Now when the various types of subexpressions are translated into logical formulas, this linking information is lost, because the logical formulas do not have identifiers like the XML structures of the ISO-TimeML annotation.
The following example illustrates the problem for the ITL-based semantics of ISO-TimeML provided in the ISO (2007) document.

(10) John <EVENT eiid="ei" type="OCCURRENCE"> drove </EVENT> to Boston on <TIMEX3 tid="t1"> Saturday </TIMEX3>
<TLINK eventInstanceID="ei" relatedToTime="t1" relType="DURING"/>

The event tag is translated into ∃Iei : Pei(Iei), which says that there is a temporal interval Iei for which the predicate Pei holds, i.e. for which it is true that John drove to Boston during that interval:

<EVENT eiid="ei" type="OCCURRENCE"> drove </EVENT> ⇝ ∃Iei : Pei(Iei)

The TLINK structure is subsequently translated in such a way that it takes this latter formula and conjoins it with a formula expressing that the interval Iei is related to another interval It1 (corresponding to Saturday) through the relation specified as the relType value in the TLINK expression:

<TLINK eventInstanceID="ei" relatedToTime="t1" relType="DURING"/> ⇝ ∃Iei : Pei(Iei) ∧ ∃It1 : DURINGr(Iei,It1)

Now note that this formula has not been constructed by independently translating the TLINK structure into a formula which is then combined with the formula that translates the event; in fact, the translation rule operating here says: when translating a TLINK expression, find the EVENT expression that is identified by the value of the eventInstanceID attribute; take the translation of that structure, and build within the scope of the existential event quantifier of that formula a conjunction which adds the temporal relation encoded in the TLINK structure. Kiyong Lee (2007), in trying to provide an alternative semantics for ISO-TimeML, struggled with the same problem, and adopted the solution that is described below. Katz's (2007) attempt to give a denotational semantics to ISO-TimeML also runs into scoping problems.

We present a solution to this problem and specify a fully compositional translation, at the price of having to deal with more complex intermediate representational structures during the translation process. These intermediate representations are triples consisting of a FOL formula plus two components that we call a 'combination index' and a 'derivation index'. The first of these is a list containing the ISO-TimeML identifiers of the subexpressions whose translations are to be combined with the present representation; the second is another list of ISO-TimeML identifiers, indicating the subexpressions whose translations have been used to construct the present representation. As such, they act as a kind of storage which allows us to keep track of (a) which pieces of semantic information should be combined, according to the links in the ISO-TimeML/XML representations, and (b) which pieces have already been combined. With the help of these devices, we can make sure that those and only those translations of ISO-TimeML subexpressions which are linked through TLINK, SLINK or ALINK structures are combined, and in a correct way.

3.1. Translating ISO-TimeML Subexpressions
Here we will deal with the translation of each type of ISO-TimeML tag. (We will not take into account the SIGNAL tag of ISO-TimeML, which has been left out of consideration in this paper, since all it does is assign an index to a signal word so that it can be referred to in other tags.)
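Before turning to the individual tags, it may help to see the two kinds of subexpressions in machine-readable form. The sketch below (our illustration; the <annotation> wrapper element and the simplified attribute set are assumptions, not ISO-TimeML as such) separates the entity tags from the linking tags and shows how the linking tags refer to identifiers:

import xml.etree.ElementTree as ET

# A simplified, well-formed ISO-TimeML-style fragment (invented for this
# sketch): entity tags (EVENT, TIMEX3) versus linking tags (TLINK etc.).
fragment = """<annotation>
  <EVENT eiid="e1" tense="PAST" polarity="POS"/>
  <TIMEX3 tid="t1" type="TIME" value="T11:00"/>
  <TLINK eventInstanceID="e1" relatedToTime="t1" relType="AT"/>
</annotation>"""

root = ET.fromstring(fragment)
entities = {el.get("eiid") or el.get("tid"): el.attrib
            for el in root if el.tag in ("EVENT", "TIMEX3")}
links = [el.attrib for el in root if el.tag in ("TLINK", "SLINK", "ALINK")]

for link in links:
    # The identifier-valued attributes are what 'connect' subexpressions.
    print(link["eventInstanceID"], "--", link["relType"], "->",
          link["relatedToTime"])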
<EVENT eiid="e" tense=T aspect=A polarity="POS"> ❀ λE . λP . ∃e ∈ E : P(e) ∧ T’(e) ∧ A’(e) <EVENT eiid="e" tense=T aspect=A polarity="NEG"> ❀ λE . λP . ¬∃e ∈ E : P(e) ∧ T’(e) ∧ A’(e) The translations of the time and aspect values are given in Table 1. and Table 2, respectively. tense value Translation tense="PAST" λe . Before(ET(e),T0 ) In its other main use in ISO-TimeML, to represent a temporal relation between two events, a TLINK tag is translated as: tense="PRESENT" λe . Inside(T0 ,ET(e)) λ e1 . λ e2 . R’(e1 , e2 ) tense="FUTURE" λe . Before(T0 ,ET(e)) where e1 and e2 correspond to the two related events and R’ translates the value of the relType attribute (which has values like when, while, after). Table 1: Translation table for the EVENT tag attribute tense. aspect value/ Translation aspect="PROGRESSIVE" λe . Before(START(e),T0 ) ∧ Before(T0 ,END(e)) aspect="PERFECTIVE" λe . Before(END(e),RT) aspect="PERFECTIVE PROGRESSIVE" λe . Before(START(e),T0 ) ∧ Before(T0 ,END(e)) ∧ Before(END(e),RT) Table 2: Translation table for the EVENT tag attribute aspect. relType value BEFORE Translation λx . λy . Before(x,y) AFTER λx . λy . Before(y,x) AT lambdax. λy. x=y INCLUDES λT . λe . Before(START(T),START(e)) ∧ ∧ Before(END(e),END(T)) IS INCLUDED λT . λe . Before(START(e),START(T)) ∧ ∧ Before(END(T),END(e)) DURING λe1 . λe2 . Before(START(e2 ),START(e1 )) ∧ ∧ Before(END(e1 ),END(e2 )) Table 3: Translation table for some relType values of the TLINK tag. 3.1.2. The TIMEX 3 Tag ISO-TimeML uses an adapted form of the TIDES 2002 standard (Ferro et al., 2002), called TIMEX3, for marking up descriptions of time points and intervals. In natural language, events are often temporally anchored to an underspecified moment or period. The temporal anchoring of events can be represented in such cases with the (polymorphic) Inside relation (where T2 stands for the underspecified moment or period): <TIMEX3 tid="t2" type=TYPE value=VALUE temporalFunction="TRUE" anchorTimeID="t1"> ❀ λP . λt1 . ∃T2 : Inside(t1 ,T2 ) ∧ P(T2 ) The translation of TIMEX3 tags with specified starting points and end points is quite straightforward: 3.1.4. The ALINK Tag The different possible aspectual relations that can be marked up in an ALINK tag are encoded in the values of its relType attribute. Since an aspectual relation always seems to correspond to a thematic relation plus a temporal relation, we translate all ALINK tags to a formal representation of the form: λe1 .λe2 . THEME(e1 ,e2 ) ∧ τ where τ is the temporal component that depends on the value of the relType attribute. Table 4 specifies the translations of the various relType values. relType value INITIATES Translation component ET (e1 ) = START(e2 ) TERMINTATES ET (e1 ) = END(e2 ) ∧ ¬COMPLETED(e2 ) 3.1.3. The TLINK Tag CULMINATES ET (e1 ) = END(e2 ) ∧ COMPLETED(e2 ) A TLINK tag, used to anchor an event in time, is structured in ISO-TimeML as follows: CONTINUES Before(START(e2 ),ET(e1 )) ∧ ∧ Before(ET(e1 ),END(e2 )) <TIMEX3 tid="t1" type=TYPE value=VALUE beginPoint="t2" end="t3"> ❀ λP . λt2 . λt3 . ∃T1 : START(T1 ) = t2 ∧ END(T1 ) = t3 ∧ P(T1 ) (11) <TLINK eventInstanceID=e1 signalID=s1 relatedToTime=t1 relType=R /> Here, the attribute relType has values corresponding to the use of temporal prepositions such as at, before, in, during; these values correspond to temporal relations in the underlying temporal ontology. The translation of such a TLINK tag has the following form: λ e. λ t. R’(ET(e),t) where R’ is the translation of the relType value. 
4. Combining Translations
In order to compositionally translate an entire ISO-TimeML annotation into FOL, we need to combine the translations of its subexpressions. This poses a problem, as the following example shows.

(12) Igor arrived at 11 AM.
Igor <EVENT eiid="e1" tense="PAST" polarity="POS"> arrived </EVENT> <SIGNAL sid="s1"> at </SIGNAL> <TIMEX3 tid="t1" type="TIME" value="T11:00"> 11 AM </TIMEX3>
<TLINK eventInstanceID="e1" signalID="s1" relatedToTime="t1" relType="AT" />

The respective translations of the EVENT tag, the TIMEX3 tag, and the TLINK tag are as follows (where zc, as before, indicates the contextually relevant time zone for the clock time):

λQ . ∃e1 ∈ ARRIVE : Before(ET(e1),T0) ∧ Q(e1)
λP . ∃t1 : clockTime(t1,zc) = 11:00 ∧ P(t1)
λe1 . λt1 . ET(e1) = t1

We would like to combine these representations, and in this case that is quite simple. However, the simplicity of the example is deceptive. When we consider a more complex example, such as Amy rejoiced when Igor arrived before 11 AM, then we get translations of two event tags, and we must make sure that the translation of the TLINK tag is combined with that of the ARRIVE event, not with that of the REJOICE event. This is an instance of the problem of defining a compositional translation, pointed out above. Here, the problem is that the translations of the EVENT and TLINK tags have lost the linking information captured in the XML tags by the values of the eventInstanceID and relatedToTime attributes; the use of the same variables e1 and t1 in the translations of the tags only optically preserves the linking information; formally, the names of these variables are insignificant.

We resolve this problem by keeping track of the linking information in the annotations and reformulating all translations as using intermediate representations in the form of triples <ci, di, ϕ>, where ci (the 'combination index') contains XML identifiers, such as the values of the eventInstanceID and relatedToTime attributes, for keeping track of the ISO-TimeML tags whose translations should be combined with the present representation, and where di (the 'derivation index') contains XML identifiers, like the value of the eiid attribute in an event tag; this keeps track of which translations of ISO-TimeML subexpressions have already been used in the translation. After translating the various tags in terms of such triples, the rest of the translation process consists of combining these triples, until a triple has been constructed whose combination index is empty and whose derivation index indicates that all the ISO-TimeML subexpressions have been linked together. For the combination of these triples we use a number of formal operations which are defined in the next subsection.

4.1. Combination operations
The operations that we use for combining the translations of ISO-TimeML subexpressions involve a few formula-manipulation operations defined in (Bunt, 2007). The most important one is a type of function application called late unary application, where a one-argument function is applied to an argument expression of the form λx1,...,xk . E(x1,...,xk). The definition of this operation, designated by '✷', is as follows:

F ✷ (λx1,...,xk . λa . E) = λx1,...,xk . F(λa . E)

This operation and the others that we will describe below have to be extended to triples.
In what follows, we will use the same symbols for the operations when applied to triples as when applied to formulas, except in the definitions, where the subscript '3' is used to make clear that an operation is applied to triples. (We will use '·' to indicate concatenation of lists, and '−' subtraction of lists.) For late unary application the triple definition is:2

<ci1, li1, ϕ1> ✷3 <ci2, li2, ϕ2> = <ci2 − li1, li1 · li2 · ci1, ϕ1 ✷ ϕ2>

2 A condition on the applicability of the operation ✷3 is that the combination index ci2 of the second operand has the form ci′2 · li1.

Second, an operation called lambda insertion-application (designated by ⊕) is defined, which combines a lambda abstraction λa . F, where F is a function expression, with an expression of the form λx1,...,xk . E1 ∃z : E2 into λx1,...,xk . λa . E1 ∃z : F(z) ∧ E2. In terms of triples:3

<ci1, li1, ϕ1> ⊕3 <<>, li2, ϕ2> = <ci1 − li2, li1 · li2 · ci1, ϕ1 ⊕ ϕ2>

3 A condition on the applicability of the operation ⊕3 is that the combination index ci1 of the first operand has the form ci′1 · li2.

A variant of this operation, designated by ⊕', swaps the order of its arguments in application, and is defined as follows, with its obvious extension to triples:

(λx1 . λx2 . F) ⊕' A = (λx2 . λx1 . F) ⊕ A

A third operation, called cross-application (designated by ⊗), merges two expressions of the form λv . ∃x : E1(v,x) ∧ E2 and λw . ∃y : E1(y,w) ∧ E3 into ∃x ∃y : E1(y,x) ∧ E2 ∧ E3. In terms of triples:

<ci1, li1, ϕ1> ⊗3 <<>, k · ci1, ϕ2> = <<>, k, ϕ1 ⊗ ϕ2>

Finally, an operation called merge-application (designated by ⊙) is defined for any two representations E1 = <ci1, di1, α> and E2 = <ci2, di2, λz . β>, where the set of first elements in the pairs constituting di1 equals the set of identifiers in ci1; β is not of the form λx . ..., and the length of the sequence of λ-abstractions in E2 equals the length of the list di2. If α is a formula of the form γ Qz δ, where Q is a (generalized) quantifier, then the logical formula resulting from merge-application is γ Qz [λz . β](z) ∧ δ. In terms of triples:4

<<>, li1, ϕ1> ⊙3 <ci2, li2, ϕ2> = <ci2 − li1, li1, ϕ1 ⊙ ϕ2>

4 See footnote 2.

These operations can be applied in any order to any triples that satisfy the properties required in the definitions of the operations, without any further constraints, thus ensuring the compositionality of the process. In the next subsection we will give some examples to illustrate the process.

4.2. Worked examples
(13) Igor arrived at 11 AM.

We considered the ISO-TimeML annotation of this example in the previous subsection (see (12)). We describe the translation step by step. The TIMEX3 tag and the TLINK tag:

T' = <<>, <t1>, λP . ∃t1 : clockTime(t1,zc) = 11:00 ∧ P(t1)>
TLa' = <<e1,t1>, <>, λa . λb . ET(a) = b>

Combination of the two translations using late unary application:

T' ✷ TLa' = <<e1>, <t1>, λa . ∃t1 : clockTime(t1,zc) = 11:00 ∧ ET(a) = t1>

Translation of the EVENT tag:

E' = <<>, <e1>, λQ . ∃e1 ∈ ARRIVE : Before(ET(e1),T0) ∧ Q(e1)>

The EVENT translation is combined with the combination of the TIMEX3 tag and the TLINK tag using late unary application, which delivers the desired end result:

E' ✷ (T' ✷ TLa') = <<>, <t1,e1>, ∃e1 ∈ ARRIVE : ∃t1 : clockTime(t1,zc) = 11:00 ∧ ET(e1) = t1 ∧ Before(ET(e1),T0)>
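To see the index bookkeeping of the triples in isolation from the logic, the following sketch (ours) reimplements just the ✷3 index arithmetic (result combination index ci2 − li1, result derivation index li1 · li2 · ci1), keeping the formulas opaque as strings:

from dataclasses import dataclass

@dataclass
class Triple:
    ci: list   # combination index: identifiers still to be linked in
    di: list   # derivation index: identifiers already consumed
    phi: str   # the FOL formula, kept opaque in this sketch

def late_unary_apply(a: Triple, b: Triple) -> Triple:
    # Applicability condition: a's derivation index occurs in b's ci.
    assert all(i in b.ci for i in a.di), "operand indices must match"
    return Triple(ci=[i for i in b.ci if i not in a.di],  # ci2 - li1
                  di=a.di + b.di + a.ci,                   # li1 . li2 . ci1
                  phi=f"({a.phi} ✷ {b.phi})")

# Example (13): T' ✷ TLa', then E' ✷ (T' ✷ TLa').
T  = Triple([], ["t1"], "T'")
TL = Triple(["e1", "t1"], [], "TLa'")
E  = Triple([], ["e1"], "E'")
step = late_unary_apply(T, TL)      # ci=['e1'], di=['t1']
done = late_unary_apply(E, step)    # ci=[],     di=['e1','t1']
print(done.ci, done.di, done.phi)

When the combination index is empty and the derivation index mentions every identifier in the annotation, the translation is complete.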
Amy <EVENT eiid="e1" tense="PAST" polarity="POS"> left </EVENT> <SIGNAL sid="s1"> before </SIGNAL> Igor <EVENT eiid="e2" tense="PAST" polarity="POS"> arrived </EVENT>. <TLINK eventInstanceID="e1" signalID="s1" relatedToEventInstance="e2" reltype="BEFORE" /> We finally consider an example with three related events, two of which have an aspectual relation and two a temporal ordering relation. (15) Amy started to laugh when Igor arrived. Amy <EVENT eiid="e1" tense="PAST" polarity="POS"> started </EVENT> to <EVENT eiid="e2" tense="NONE" vform="INFINITIVE" polarity="POS"> laugh </EVENT> <SIGNAL sid="s1"> when </SIGNAL> Igor <EVENT eiid="e3" tense="PAST" polarity="POS"> arrived </EVENT>. <ALINK eventInstanceID="e1" relatedToEventInstance="e2" reltype="INITIATES" /> <TLINK eventInstanceID="e3" signalID="s1" relatedToEventInstance="e1" reltype="IDENTITY" /> The translation of Amy started to laugh: E1’ ✷ (E2’ ✷ AL’) = <<>, <e1,e2>, ∃e1 ∈ START : Before(ET(e1 ),T0 ) ∧ ∃e2 ∈ LAUGH : THEME (e2 ,e1 ) ∧ ET(e1 ) = START(e2 )> The ARRIVE event tag: E3’ = <<>, <e3>, λQ . ∃e3 ∈ ARRIVE : Before(ET(e3 ),T0 ) ∧ Q(e3 )> The TLINK tag: The two EVENT tags: TLe ’ = <<e1,e3>, <>, λa . λb . ET(a) = ET(b)> E1’ = <<>, <e1>, λQ. ∃e1 ∈ LEAVE: Before(ET(e1 ),T0 ) ∧ Q(e1 )> E2’ = <<>, <e2>, λQ . ∃e2 ∈ ARRIVE: Before(ET(e2 ),T0 ) ∧ Q(e2 )> Combination of the translation of the third EVENT tag with the that of the TLINK tag using late unary application: The TLINK tag: E3’ ✷ TLe ’ = <<e1>, <e3>, λa . ∃e3 ∈ ARRIVE : Before(ET(e3 ),T0 ) ∧ ET(a) = ET(e3 )> TLe ’ = <<e1,e2>, <>, λa . λb . Before(a,b)> Combination of the translation of the second EVENT tag with that of the TLINK tag using late unary application: Application of lambda-insertion applicaton with swapping of variables: E2’ ✷ TLe ’ = <<e1>, <e2>, λa . ∃e2 ∈ ARRIVE : Before(ET(e2 ),T0 ) ∧ Before(a,e2 )> TLe ’ ⊕’ (E1’ ✷ (E2’ ✷ AL’)) = <<e3>, <e1,e2>, λb . ∃e1 ∈ START : Before(ET(e1 ),T0 ) ∧ ∃e2 ∈ LAUGH : THEME(e2 ,e1 ) ∧ ET(e1 ) = START(e2 ) ∧ ET(e1 ) = ET(b)> Combination of the translation of the first EVENT tag (Amy left) with that of the second EVENT tag plus the TLINK tag (before Igor arrived) using late unary application, gives the desired end result: Application of cross-application to this representation for Amy started to laugh and the translation of when Igor arrived gives the desired end result: E1’ ✷ (E2’ ✷ TLe ’) = <<>, <e1,e2>, ∃e1 ∈ LEAVE : Before(ET(e1 ),T0 ) ∧ ∃e2 ∈ ARRIVE : Before( ET(e2 ), T 0 ) ∧ Before(e1 ,e2 )> (E3’ ✷ TLe ’) ⊗ (TLe ’ ⊕’ (E1’ ✷ (E2’ ✷ AL’))) = <<>, <e1,e2,e3>, ∃e1 ∈ START : Before(ET(e1 ),T0 ) ∧ ∃e2 ∈ LAUGH : THEME (e2 ,e1 ) ∧ ET (e1 ) = START(e2 ) ∧ ∃e3 ∈ ARRIVE : Before(ET(e3 ),T0 ) ∧ ET(e1 ) = ET(e3 )> 5. Discussion and Conclusions The method described in this paper enables a larger part of ISOTimeML to be formally interpreted than the ITL approach, including the interpretation of tense and aspect, the treatment of durations, and that of calendar years, clock times, and so on. A treatment of calendar years and the like in an ITL-based semantics would probably not be hard, adding predicates applicable to certain temporal intervals as we have have done here. It would be more difficult to extend would be difficult to extend the ITL-based semantics with the interpretation of tense and aspect, since tense interpretation for instance requires the representation of event times (as temporally related to speech times and reference times), which is a property of events and thus necessitates the availability of events as such. 
Even more difficult would be the addition of durations, since this requires new concepts (temporal units and amounts of time, defining equivalence classes of pairs of a temporal unit and a numerical value) to be added to the underlying ontology.

More important from a theoretical point of view is that we have specified a fully compositional interpretation of ISO-TimeML. This has been achieved at the price of making use of more complex intermediate representations, but it has, besides the obvious theoretical importance, the advantage of allowing a very flexible translation process, which consists of a number of operations that can be applied in any order.

The attempt to formally interpret ISO-TimeML annotations has also revealed interesting interferences with the annotation of other semantic information, such as semantic roles and quantification. As long as semantic annotation is restricted to temporal annotation only, it may be reasonable to annotate the relations between events for which ISO-TimeML uses SLINK structures in the temporal annotation language, but these relations are not really temporal in nature and would be better treated as semantic role relations which have certain temporal implications. Also, aspectual relations, as captured in ALINK tags, are by their very nature a combination of thematic and temporal relations. Temporal quantification does not have a fully satisfactory treatment in ISO-TimeML, and indeed this only seems possible by taking quantification into account more generally. For ISO-TimeML interpretation only, it might be feasible to cast the formal semantics in terms of a description logic like OWL-DL; however, this would restrict the extensibility of the approach.

An important aspect of the ISO-TimeML semantics outlined in this paper is that it has a richer underlying ontology than Interval Temporal Logic, including events and non-temporal individuals, which makes it possible to extend the approach to the semantic annotation of other information related to events. This would notably include the roles that the participants in an event play ('semantic roles'), as well as other properties of such participants, such as referential relations among participants in different events, and aspects of quantification for dealing with cases where sets of participants are involved in sets of events. The possibilities in this direction are explored in (Bunt, 2007) and Bunt & Overbeeke (2008).

References
Allen, J. (1984). A General Model of Action and Time. Artificial Intelligence, 23(2).
Bach, E. (1981). On Time, Tense, and Aspect: An Essay in English Metaphysics. In Cole, P., editor, Radical Pragmatics. Academic Press, New York.
Bunt, H. (1985). Mass Terms and Model-Theoretic Semantics. Cambridge University Press.
Bunt, H. (2007). The Semantics of Semantic Annotation. In Proceedings of the 21st Pacific Asia Conference on Language, Information and Computation (PACLIC21).
Bunt, H. and Overbeeke, C. (2008). Towards Formal Interpretation of Semantic Annotation. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech. ELRA.
Bunt, H. and Romary, L. (2002). Towards Multimodal Content Representation. In Choi, K. S., editor, Proceedings of LREC 2002, Workshop on International Standards of Terminology and Language Resources Management, pages 54–60, Las Palmas, Spain. Paris: ELRA.
Ferro, L., Gerber, L., Mani, I., Sundheim, B., and Wilson, G. (2002). Instruction Manual for the Annotation of Temporal Expressions.
MITRE Washington C3 Center, McLean, Virginia.
Han, B. and Lavie, A. (2004). A Framework for Resolution of Time in Natural Language. TALIP Special Issue on Spatial and Temporal Information Processing, 3.
Hobbs, J. and Pan, F. (2004). An Ontology of Time for the Semantic Web. TALIP Special Issue on Spatial and Temporal Information Processing, 3(1):66–85.
Ide, N. and Romary, L. (2004). International Standard for a Linguistic Annotation Framework. Natural Language Engineering, 10:211–225.
ISO (2007). Language Resource Management - Semantic Annotation Framework (SemAF) - Part 1: Time and Events. Secretariat KATS. ISO Report ISO/TC37/SC4 N269 version 19 (ISO/WD 24617-1).
Katz, G. (2007). Towards a Denotational Semantics for TimeML. In Schilder, F., Katz, G., and Pustejovsky, J., editors, Annotation, Extraction, and Reasoning about Time and Events. Springer, Dordrecht.
Lee, K. (2008). Against a Davidsonian Analysis of Copula Sentences. In Kadowaki, M. and Kawahara, S., editors, NELS 33 Proceedings.
Pratt-Hartmann, I. (2007). From TimeML to Interval Temporal Logic. In Proceedings of the Seventh International Workshop on Computational Semantics (IWCS-7), pages 166–180, Tilburg, Netherlands.
Pustejovsky, J., Castano, J., Ingria, R., Gaizauskas, R., Katz, G., Saurí, R., and Setzer, A. (2003). TimeML: Robust Specification of Event and Temporal Expressions in Text. In Proceedings of the Fifth International Workshop on Computational Semantics (IWCS-5), pages 337–353, Tilburg, Netherlands.
Pustejovsky, J., Knippen, R., Littman, J., and Saurí, R. (2007). Temporal and Event Information in Natural Language Text. In Bunt, H. and Muskens, R., editors, Computing Meaning, volume 3, pages 301–346. Springer.
Reichenbach, H. (1947). Elements of Symbolic Logic. Macmillan, New York.

Using Treebank, Dictionaries and GLARF to Improve NomBank Annotation
Adam Meyers
New York University
New York, NY
meyers@cs.nyu.edu

Abstract
In the field of corpus annotation, it is common to annotate text multiple times and then adjudicate the results. The resulting annotation is generally regarded as more consistent and more accurate than the results of a single pass. However, it is also very expensive to annotate in this way. Given text corpora that are annotated by many different research groups, another source of comparison is available: annotation of other linguistic information on the same corpora. By exploiting violations of expected relationships between the two annotation schemes, likely errors can be detected. This paper describes such an effort involving the NomBank annotation of noun arguments in the Wall Street Journal Corpus. These techniques made it possible to complete NomBank annotation efficiently and accurately.

1. Introduction
As with many annotation projects, NomBank took longer to finish than its creators initially expected. It eventually became necessary to find a way to complete the annotation in a way that minimized expenses, while maintaining high quality. In many projects involving the manual annotation of corpora with linguistic features, each text is annotated by two different annotators and the differences between their output are adjudicated. The resulting annotation is more consistent than singly annotated corpora, and this increased consistency is usually assumed to indicate a corresponding improvement in accuracy. Due to practical constraints, this was not an option for NomBank.
Fortunately, the NomBank project was annotating a text corpus for which previous annotation already existed (in particular, Penn Treebank annotation). We established several expected relationships between the NomBank and Penn Treebank annotation schemes. When any of these expected relationships did not hold, there were three possibilities: (1) there was an error in NomBank; (2) there was an error in the Penn Treebank; or (3) the expected relationship did not hold for this instance. Given these possibilities, annotation that violated an expected relationship was more likely to contain a NomBank error than randomly selected annotation. In addition, some parts of NomBank annotation had expected relationships with syntactic dictionaries, both ones created during the NomBank project (ADJADV, NOMLEX-PLUS) and existing ones (NOMLEX and COMLEX Syntax). By examining cases where these expected relationships were violated, we could predict likely NomBank (or dictionary) errors. As a result of these efforts, approximately 26% of NomBank manual annotation was predicted to contain likely errors and was examined and corrected by an expert annotator, a substantial savings in time and effort. Methods for evaluating the effectiveness of this effort are under consideration for future work.

NomBank annotators reviewed a total of 200,000 instances of nouns in the Penn Treebank corpus to produce 114,500 NomBank propositions. On average, they looked at about 20–25 noun instances per hour, working at a considerably slower pace than PropBank (Palmer et al., 2005) (less than half the speed). This made double annotation of NomBank impractical. By comparing NomBank annotation to previous annotation, we were able to select approximately 30,000 propositions that were likely to contain errors and review those propositions in a focused way. This made for realistic and effective quality control.

We will now sketch an outline of the remainder of this paper. Section 2. provides an overview of NomBank annotation. Section 3. describes our approach to merging various annotation schemes into a GLARF representation, which we use for our error detection system. Sections 4. through 7. describe the various constraints that we use to detect likely errors. Finally, Section 8. discusses ramifications and future research.

2. NomBank Annotation
NomBank.1.0 (Meyers et al., 2004a) provides a predicate argument structure representation of approximately 114,500 noun instances in the Wall Street Journal corpus. Like PropBank, this representation links particular word instances with words and phrases that are either arguments (ARG0, ARG1, . . .) or belong to one of the classes of non-arguments (ARGMs) defined in the specifications. For each word, there is a dictionary entry (its frame file) which defines the set of possible arguments. The set of markable ARGMs are essentially those that have counterparts in verbal argument structure, e.g., temporal, locative, manner, etc.1 In addition, we mark SUPPORT items, words that link arguments outside of the noun phrase to the nominal predicate. Some example sentences are provided below. The nominal predicate is underlined and the other parts of the proposition are in bold. The labels following the arguments indicate the roles they play in the NomBank proposition. The set of support words in each of these examples forms a chain in that sentence connecting an argument outside the NP to the underlined predicate.
For example, the support chain consisting of gave + dozens + of links John to kisses; the chain should be viewed as filling a single SUPPORT slot in the NomBank proposition.

1 See the NomBank manual, available from nlp.cs.nyu.edu/meyers/NomBank.html, for more information.

1. Mary's/ARG0 promise/ARG1-REF to John/ARG2
2. The Press's/ARG0 criticism of the candidate/ARG1
3. John/ARG0 gave/SUPPORT Mary/ARG2 dozens of/SUPPORT kisses
4. They/ARG0 accorded/SUPPORT minorities/ARG1 an opportunity for/SUPPORT representation.

Like PropBank, each word and phrase in NomBank is represented as a link to one or more nodes of Penn Treebank annotation (Marcus et al., 1994). This contrasts with most approaches to annotation, such as: (a) inline annotation, where the text is modified to include annotation features; and (b) offset annotation, which points to particular spans of text using another document (these text spans are usually referenced by byte offsets from the beginning of the target file). In this sense, NomBank is annotation of annotation, i.e., NomBank assigns features to units defined by pre-existing Penn Treebank annotation.

3. GLARFBANK
As part of the Unified Linguistic Annotation project (Pustejovsky et al., 2005), researchers at several United States universities are studying ways to merge distinct annotation schemes. At New York University (NYU), we are taking an approach to merging that we call "aggressive" because we change incompatible aspects of the input annotation schemes so that they are compatible with each other, i.e., we change tokenization, phrase boundaries and text spans to maximize overlap between the input annotation schemes. In this respect, we are taking annotation created under different theoretical assumptions and converting it into a single-theory analysis. The output of the merging process is formalized as a Typed Feature Structure in the GLARF framework (Meyers et al., 2001a; Meyers et al., 2001b).2 The current GLARF'd version of the Wall Street Journal data annotated for NomBank includes the following annotation schemes: Penn Treebank, PropBank, NomBank, Penn Discourse Treebank (overt relations) (Miltsakaki et al., 2004) and BBN Named Entity tags. Future merged GLARFBANKs will also include Brandeis' TimeML (Pustejovsky et al., 2004) and the University of Pittsburgh's Opinion annotation (Wilson and Wiebe, 2003). The WSJ GLARFBANK also includes various automatically generated features based on both heuristic rules and lexical lookup (COMLEX Syntax, NOMLEX, ADJADV, and others). GLARF rules correct parts of speech, mark focused constituents, fill gaps not covered by Treebank annotation, assign grammatical roles to constituents, add semantic features, etc.3

2 Currently several applications are using GLARF'd data for Information Extraction, including the systems described in (Zhao et al., 2004; Shinyama and Sekine, 2006), as well as NYU's recent Automatic Content Extraction (ACE) submissions. We have also begun a Machine Translation effort at NYU that uses Chinese, Japanese and English GLARF.

A sample (simplified) GLARF representation is
provided as Figure 1. It represents the merger of annotation for the sentence Meanwhile, they made three bids.4 The GLARF representation5 essentially adds structure to the Penn Treebank: if you delete this additional structure, the result would be the original Penn Treebank (with minor changes). We will highlight two of these elaborations here: (1) relational labels like HEAD, ADV, PRD and OBJ, which indicate relations between constituents, e.g., the constituent labeled SBJ is the subject of the sister constituent that is labeled PRD; and (2) Empty Categories that may or may not be part of the original Penn Treebank, e.g., the features prefixed with P- point to empty categories which bear PropBank, NomBank and PDTB relations with the HEAD constituent. These empty categories point to other GLARF constituents, e.g., the NP they has an INDEX feature value of 2. The empty categories that are the values of the P-ARG0 of made and bids both also have this index, representing that they is the PropBank ARG0 of made and the NomBank ARG0 of bids. The P-ARG2 of Meanwhile has as its value the entire sentence, which would appear to include itself. However, by convention, we assume that such arguments exclude what we call the SELF-PHRASE, the ancestor of the predicate (in this case Meanwhile) that is a child of the argument. This same rule is used for marking arguments of parenthetical predicates in PropBank and NomBank. Thus in the following two examples, the entire sentence can be marked as an argument of claimed and request because the self-phrases Mary claimed and at John's request can easily be accounted for: Irving, Mary claimed, is ten feet tall; Mary, at John's request, made ridiculous claims about Irving. The P-ARG1 of Meanwhile refers to the previous sentence (sentence 0, index 0).

(S (ADV (ADVP (HEAD (ADVX (HEAD (RB Meanwhile))
                          (P-ARG1 (S (EC-TYPE PB) (INDEX 0+0)))
                          (P-ARG2 (S (EC-TYPE PB) (INDEX 0)))))
              (INDEX 1)))
   (PUNCTUATION (, ,))
   (SBJ (NP (HEAD (PRP they)) (INDEX 2)))
   (PRD (VP (HEAD (VX (HEAD (VBN made))
                      (P-ARG0 (NP (EC-TYPE PB) (INDEX 2)))
                      (P-ARG1 (NP (EC-TYPE PB) (INDEX 4)))
                      (P-ARGM-TMP (ADVP (EC-TYPE PB) (INDEX 1)))
                      (INDEX 3)))
            (OBJ (NP (T-POS (CD three))
                     (HEAD (NX (HEAD (NNS bids))
                               (P-ARG0 (NP (EC-TYPE PB) (INDEX 2)))
                               (SUPPORT (VX (EC-TYPE PB) (INDEX 3)))))
                     (INDEX 4)))))
   (PUNCTUATION (. .))
   (SENT-NUM 1)
   (INDEX 0))

Figure 1: GLARF for Meanwhile, they made three bids

3 We intend to make a GLARF representation of the ULA shared corpus available at nlp.cs.nyu.edu/wiki/corpuswg/ULA-OANC-1. Prior to the availability of hand annotation, automatically generated features are provided for PropBank, NomBank and the Penn Discourse Treebank. The author intends to make the Wall Street Journal GLARFBANK available either through the Linguistic Data Consortium, or by download should licensing restrictions on this corpus be relaxed.
4 In the GLARF system the typed feature structure includes all the information in GLARF. A multi-level dependency representation is also available that is similar to the 2008 CONLL task representation (www.yr-bcn.es/conll2008/). In fact the latter is partially derived from the former.
5 There are actually several different GLARF representations. The typed feature structure representation contains the most information; a dependency representation is the one most often used for Information Extraction and other applications.
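The INDEX-based linking in Figure 1 can be followed mechanically. The sketch below (our illustration; the dictionary layout is invented and much simpler than GLARF's typed feature structures) resolves the empty categories under bids to their fillers:

# Hand-built toy data mirroring Figure 1: constituents keyed by INDEX.
constituents = {
    2: {"label": "NP", "text": "they"},
    3: {"label": "VX", "text": "made"},
    4: {"label": "NP", "text": "three bids"},
}
# bids' slots: its P-ARG0 is an empty category with INDEX 2,
# its SUPPORT an empty category with INDEX 3 (cf. Figure 1).
bids_slots = {"P-ARG0": {"ec_type": "PB", "index": 2},
              "SUPPORT": {"ec_type": "PB", "index": 3}}

def resolve(slot):
    # Follow an empty category's INDEX to the constituent that fills it.
    return constituents[slot["index"]]["text"]

for role, ec in bids_slots.items():
    print(role, "->", resolve(ec))   # P-ARG0 -> they, SUPPORT -> made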
Our system checks new NomBank data for its compatibility with other annotation frameworks, using the GLARFBANK annotation as a way of incorporating the other annotation into a single representation. The following sections describe these compatibility tests and the subsequent adjudication.6

6 Tests for compatibility between the structure of the GLARFBANK and NomBank are mostly tests for compatibility between the Penn Treebank and NomBank. However, the GLARFBANK actually incorporates structures from other annotation, so the relation is not one to one.

4. Structural Constraints on Internal Arguments of Nouns
We use the GLARF representation as a means to implement several types of constraints. First of all, by recognizing particular kinds of constituents, we can constrain how they appear in NomBank. Relative clauses are typically not markable in NomBank propositions. Thus, given a NomBank proposition for a noun N, if one of the arguments (ARG0 . . . ARG9) or ARGMs is a relative clause, this is flagged as a likely error; e.g., the that-relative in the banner that proclaims the renewal of socialism was detected as a likely error and then removed during adjudication. It is easy to identify relative clause arguments because relative clauses are labeled as such in the GLARF'd version of the Penn Treebank. The GLARF-generating program uses a combination of the representation in the original Penn Treebank (the appearance of empty categories in that-clauses following nouns, the POS markings on that, etc.) and whether or not a that-clause is a possible complement for the head noun (using COMLEX Syntax) to determine whether a structure is a relative clause (if a that-clause follows a noun that cannot take that-complements, the phrase is likely to be a relative clause).

NomBank annotators have the option of linking together constituents in the Penn Treebank to form a single NomBank argument. These combinations often correctly identify constituents not marked in the Penn Treebank, due to (for example) the Penn Treebank's tendency to underspecify prenominal structure; e.g., in a phrase like The ice cream man, ice and cream would probably be left as separate constituents. However, it turns out that some constituent combinations are unlikely to be correct. For example, given two adjacent prenominal modifiers D and N of some head H, if D is a determiner or possessive and N is a noun or adjective, it is unlikely that D and N form a constituent. For example, one annotator marked their financial as a single constituent (an ARG1) of the predicate viability in the phrase their financial viability. In the corrected version, their is marked as an ARG3 and financial is marked as an ARG1. The reason for this error is clear: ARG3 and ARG1 are similar roles for nouns like viability, which belong to the ATTRIBUTE class, and the annotator opted to combine the two rather than mark them separately. The ARG1/ARG3 split in NomBank reflects that viability is an attribute of the financialness, and financial viability is an attribute of them. In this case, the ARG3 is a secondary theme, a type of argument that has this interpretation (as per the NomBank manual). Their financial viability is a phrase that represents the degree or VALUE of the viability trait, and therefore viability is marked as its own ARG2.
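Checks of this kind reduce to simple predicates over the merged representation. The following sketch (ours; the record layout is invented for illustration) shows the shape of the relative-clause test described above; the determiner-plus-modifier test has the same structure:

def flag_relative_clause_args(proposition, constituents):
    # Yield (role, node_id) pairs that are likely annotation errors:
    # a NomBank argument whose filler is labeled a relative clause.
    for role, node_id in proposition["args"].items():
        if constituents[node_id].get("is_relative_clause"):
            yield role, node_id

constituents = {7: {"text": "that proclaims the renewal of socialism",
                    "is_relative_clause": True}}
prop = {"predicate": "banner", "args": {"ARG1": 7}}
print(list(flag_relative_clause_args(prop, constituents)))  # [('ARG1', 7)]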
This error detection routine occasionally identifies non-errors. For example, the GLARF-generating program incorrectly marked the numeral 1 as a determiner in the sentence CBS held the previous record for consecutive No. 1 victories. The annotator had correctly marked No. 1 as a single ARG1, so this annotation was not changed during adjudication.

In a similar vein, annotations of discontinuous constituents are unlikely to be correct. Any series of constituents that forms a NomBank argument is almost always consecutive. Nevertheless, NomBank annotators will occasionally mark discontinuous constituents, the most common reasons being: (1) one token is missed from a sequence, e.g., the comma was not included as part of the ARG1 stock, bond and foreign exchange in the initial marking of the phrase its stock, bond and foreign exchange trading; and (2) as in the determiner-plus-prenominal case above, the two arguments have similar relations to the head noun. For example, although one annotator marked a combination of conversion and on the stock as a single ARG1 of rights in the phrase conversion rights on the stock, the final version of NomBank makes conversion an ARG1 and on the stock an ARG3. The one consistent exception, discussed in Section 6., is where the entire sentence or NP is an argument of the noun (minus the self-phrase containing the nominal predicate). For example, in Mr. Nadeau said discussions are under way with potential purchasers of each of the units, the entire phrase minus under way is an ARG1. Apart from these carefully defined exceptions, there are also 10 cases involving the noun predicate age, where marking discontinuous constituents seemed unavoidable even though the examples did not fit into one of the cases of external arguments of nouns; e.g., we marked under 13 as the ARG2 of age in the phrase 1,859 children under age 13.

5. A Constraint on Empty Categories
Empty categories (the Penn Treebank's way of representing gaps) are not typically noun arguments unless they are part of chains that link the empty category to a (pronounceable) word or phrase (the filler of the gap). Consider, for example, the NomBank annotation of veto in the following sentence: Mr. Bush and some other aides are strongly drawn to the idea of trying out a line-item veto. Mr. Bush and some other aides should be the ARG0 of veto, as mediated by: (1) a number of empty categories in the Penn Treebank: the passive object of drawn and the subject of trying; and (2) the support verb trying. In the initial annotation, a NomBank annotator failed to make the final link from the passive object empty category to the lexical NP, but the error detection program predicted that this was a likely error. Exceptions do occur when an empty category represents an unfilled argument. For example, in the following definition of stock-index arbitrage, the ARG0 of trades should be the same as the empty subject of executing, which itself is unbound: Stock-index arbitrage – Buying or selling baskets of stocks while at the same time executing offsetting trades in stock-index futures or options. The Penn Treebank resolves the referential properties of some, but not all, empty categories. In the following example, a NomBank annotator needed to add the link between the possessive phrase Illinois Supreme Court's and the empty subject of to institute: Illinois Supreme Court's decision to institute the changes. Here institute acts as a support verb linking its subject to the ARG0 position of the noun changes, i.e., the Illinois Supreme Court is assumed to be the AGENT of the changes.
Therefore, only some of the cases where empty categories are not bound in the Penn Treebank need to remain so, and unbound empty categories are unlikely to be correct as NomBank arguments; their presence signals a likely error.

6. Structural Constraints on External Arguments of Nouns
NomBank specifications place restrictions on the markability of a given potential argument A of a noun N that lies outside of the NP headed by N. It turns out that, for the most part, these restrictions were codable in terms of GLARF'd representations of the sentence and therefore could be automatically checked. Although there are some outliers that the automatic system did not handle correctly, the automatic detection system tended to overpredict errors rather than underpredict. This made it possible to accurately identify many cases that we needed to review more carefully, and it resulted in corrections of many NomBank propositions. There are three environments in which external arguments can be licensed: (a) support; (b) predication; and (c) PP constructions containing the nominal predicate. Each of these configurations makes specific requirements on how the NP-external arguments are linked to the nominal predicate. Furthermore, the absence of any of these configurations means that an NP-external argument is unlicensed and thus tagged as a likely error.

6.1. Constraints on Support Structures
A NomBank external argument A is a legal argument of a nominal predicate P, by virtue of support, if there exists a support chain S linking A to P. To be well-formed, a support chain must meet the following criteria7: (1) it consists completely of lexical items (leaf nodes) in the Penn Treebank; (2) forms of be, auxiliaries, infinitival to and modals are skipped, i.e., for purposes of the support chain, we pretend that they do not exist and that the main verb, predicate adjective, or other predicative item is the main predicate of its clause8; (3) at least one item in the support chain must have as its part of speech: noun, adjective, verb or determiner9; (4) each link in the chain must be the head of the phrase containing it (after allowing for 2)10; (5) the first link in the chain must take A as its argument; (6) each link N in the chain must take the phrase headed by link N + 1 as its argument; (7) the last link in the chain must take the phrase headed by P as its argument; and (8) the chain cannot cross any tensed clause phrasal boundaries. A schema of a support chain is provided as Figure 2. Some examples of legal support chains are provided as Figure 3.11

There are several ways in which we use the constraints on support to verify the accuracy of NomBank annotation: (1) we verify that annotated support chains meet the criteria above; (2) we verify that there are external arguments that require support chains and propose the removal of annotated support chains that are extraneous; (3) we automatically generate a support chain and compare it to the one annotated. In each of these cases, we use the error detection procedures to identify potential errors. Should we determine that they are actual errors, we correct them. Given a possible external argument A and a nominal predicate P, we assume that exactly one support chain is structurally possible.
In simple cases, one can think of the typed feature structure as a labeled tree, although it is actually a rooted directed acyclic graph.12 In most cases, to find the support chain, one first must identify the path derived by going up the tree from A to the common ancestor of A and P, and then down the tree to P. The support chain is the set of heads of all the phrases in this path. The complete algorithm for finding support chains must factor in filler-gap constructions and coordination. Filler-gap constructions complicate the simple algorithm because they are responsible for making the tree into a directed acyclic graph. The graph is derived by changing arcs that point to gaps so that they point to the fillers of those gaps instead. Nevertheless, in the entire Wall Street Journal corpus, we have not encountered a single instance in which multiple support chains linked a given A with a given P.13 Special allowances are made so that conjoined predicates can both be part of the same support chain; e.g., in Mary gave and received lots of kisses, gave and received are assumed to be branches of the same support chain (gave + received + lots + of). It is as if the support chain splits in the middle and then merges together again because, for the purpose of a support chain, coordinate structures are assumed to have multiple heads.

Shared Argument → Head_1 → Head_2 → ... → Head_N−1 → Head_N → Nominal Predicate

Figure 2: Schema for a Support Chain

1. IBM/ARG0 made/SUPPORT an agreement
2. This desk/ARG1 has/SUPPORT a height of 25 inches/ARG2
3. their/ARG0 responsibility for/SUPPORT hard/ARGM-MNR decisions
4. The adjuster/ARG0 does/SUPPORT a lot of/SUPPORT work by phone/ARGM-MNR
5. it/ARG1 is scheduled for/SUPPORT completion by Dec. 10/ARGM-TMP
6. I/ARG0 take advantage of/SUPPORT this opportunity to make a plea to readers/ARG1
7. We/ARG0 had lots of/SUPPORT internal/ARGM-MNR debate about this one/ARG1
8. Saab/ARG0 is looking for/SUPPORT a partner/ARG2+SUPPORT for/SUPPORT financial/ARGM-MNR cooperation

Figure 3: Examples of Legal Support Chains

7 For simplicity, we ignore the complications caused by filler/gap constructions (passivization, WH, etc.) and coordination. Nevertheless, these phenomena are handled as well.
8 This is roughly equivalent to a Verb Group analysis, extended to cover copula constructions.
9 The choice of noun, verb and adjective is more limited than the automatically implemented constraints currently allow. One could further limit support items to prepositions, transparent nouns (a variety of problems), determiners in partitive constructions (all of the worst problems), control predicates (try, ability, able), and lexically specific combinations of verbs and nouns (take a walk, make a mistake, etc.).
10 For purposes of discussion, the main verb of a sentence is assumed to be the head.
11 The final example includes partner, a CRISSCROSS noun which is simultaneously a support word for and an argument of cooperation.
12 These graphs are like labeled trees, except they allow shared structure.
13 This is, at least in part, due to the constraint that a support chain cannot cross a tensed sentential node. This prevents, for example, support chains from including predicates on both sides of a relative clause boundary.
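A bare-bones version of the path-based chain-finding procedure described above (ours; the parent-pointer encoding and the toy data for John gave dozens of kisses are invented, and filler-gap arcs and coordination are ignored) might look as follows:

def path_to_root(node, parent):
    # Walk parent pointers up to the root, collecting nodes along the way.
    out = [node]
    while node in parent:
        node = parent[node]
        out.append(node)
    return out

def support_chain(a, p, parent, head_word):
    # Up from argument a to the common ancestor of a and p, then down to p;
    # the chain is the heads of the phrases on that path (endpoints excluded).
    up, down = path_to_root(a, parent), path_to_root(p, parent)
    common = next(n for n in up if n in set(down))
    path = up[1:up.index(common) + 1] + down[down.index(common) - 1:0:-1]
    # deduplicate phrases sharing a head (e.g. S and VP both headed by 'gave')
    return list(dict.fromkeys(head_word[n] for n in path))

parent = {"NP_john": "S", "VP": "S", "NP_dozens": "VP",
          "PP": "NP_dozens", "NP_kisses": "PP"}
head_word = {"S": "gave", "VP": "gave", "NP_dozens": "dozens", "PP": "of"}
print(support_chain("NP_john", "NP_kisses", parent, head_word))
# ['gave', 'dozens', 'of']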
6.2. Constraints on Predication
There are a number of instances in which predication licenses a connection between an argument and a noun predicate which we have determined are legitimate for marking NomBank arguments. We specifically avoid cases in which the argument would duplicate existing arguments, e.g., for argument nominalizations like teacher, we will always mark teacher as its own ARG0 and never NPs linked to it by predication, e.g., Mary in Mary is John's teacher. We recognize the following markable instances of linking external arguments to nouns via predication: (1) when the noun predicate is the subject of the sentence and one of its arguments follows a copula, e.g., Examples 1 and 2 in Figure 4; and (2) when the noun predicate P follows the copula, its argument precedes the copula, and P is either a nominalization of an adjective, an ATTRIBUTE noun (a NomBank class) or in a preposition plus noun construction that has an adjective-like distribution, e.g., Examples 3 and 4 in Figure 4.

1. The real/ARGM-ADV battle is over who will control the market/ARG2
2. This book is about his son/ARG1
3. Trying to time the economy/ARG1 is a mistake
4. They/ARG1 are some/ARG2 distance apart
Figure 4: Linking External Arguments to Nouns Via Predication

A subset of the nouns in COMLEX Syntax that are marked with the feature (COUNTABLE :PVAL) combine with the preposition to form adjective-like constituents, e.g., the entry for alert is marked (COUNTABLE :PVAL ("on")). These entries can be used to identify instances of the aforementioned adjective-like PP construction. Identifying these environments automatically is easy. One merely has to identify copulas, the subjects of those copulas (typically the NP or sentence immediately preceding the copula) and the underlying predicate (typically the phrase immediately following the copula and often marked with the function tag -PRD). Other predicative environments, though rarer, are also easy to detect in the Penn Treebank: small clauses are S constituents consisting of an NP followed by another constituent marked with -PRD; as-constituents begin with the word as; etc.
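A sketch of this detection step, under the reading above; the clause accessors (clauses, main_verb, subject, child_with_function_tag) are hypothetical stand-ins for a Penn-Treebank-style API, not the authors' code:

def predication_pairs(sentence):
    # Pair each copula's subject with the -PRD-marked predicate phrase
    # that follows it. All accessors are assumed, illustrative only.
    pairs = []
    for clause in sentence.clauses():
        verb = clause.main_verb()
        if verb is not None and verb.lemma == "be":           # copula
            subj = clause.subject()                           # precedes the copula
            prd = clause.child_with_function_tag("PRD")       # underlying predicate
            if subj is not None and prd is not None:
                pairs.append((subj, prd))
    return pairs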
6.3. PP Constructions and External Arguments
When the NP headed by a predicate noun is the object of a preposition, the argument-taking properties of that noun may change. This subsection describes a set of argument-taking environments in which such PPs license external arguments according to NomBank guidelines. These environments include: (1) the PP-Parenthetical construction; (2) the PP + Extraposition construction; (3) Subject-Oriented PPs; (4) Noun-Modifying PPs; and (5) other adverbial PPs, including discourse connectives. Examples are provided in Figure 5.

1. Without/ARGM-NEG question, something intriguing is going on/ARG1 [PP Parenthetical]
2. Some last-minute phone calls that Mr. Bush made/ARG1 (at the behest of some conservative U.S. senators/ARG0) to enlist backing for the U.S. position/ARG1 [PP Parenthetical]
3. He/ARG1 was under consideration to succeed Joshua Lederberg/ARG2 [PP + Extraposition]
4. ABC's baseball experience/ARG0 may be of interest to CBS Inc./ARG1 [PP + Extraposition]
5. they/ARG0 exercise for enjoyment [Subject-Oriented PP]
6. Garbage/ARG0 made its debut this fall with the promise to give consumers the straight scoop on the U.S. waste crisis/ARG1 [Subject-Oriented PP]
7. Participants/ARG0 in the meeting [Noun-Modifying PP]
8. the bitterness/ARGM-MNR of the battle [Noun-Modifying PP]
9. That/ARG1 was in addition to $34,000 in direct campaign donations/ARG2 [Discourse Connective]
10. That $130 million gives us some flexibility/ARG1 in case Temple raises its bid/ARG2. [Discourse Connective]
11. In important particulars, the Soviets are different from the Chinese/ARG1 [Discourse Adverbial]
12. In fact, they don't take it seriously at all/ARG1 [Discourse Adverbial]
Figure 5: PP constructions that license External Arguments

Although we can automatically detect most of these environments, we have not implemented ways of detecting all of them. Thus our automatic procedures still flag many of these as instances of unlicensed external arguments. As a result, many of the rarer PP constructions are always revisited during the error detection phase of annotation. The PP-Parenthetical (Examples 1 and 2) and extraposed PP constructions (Examples 3 and 4) are both licensed by COMLEX Syntax dictionary entries and, in the former case, the construction is limited to a short list of prepositions. The configurations are easily defined in terms of syntactic trees (or graphs). The PP-Parenthetical cases are licensed by nouns that take clausal complements, and this lexical information is readily available from a combination of COMLEX Syntax and/or Nomlex (or Nomlex-Plus). These PP phrases (Examples 1 and 2 in Figure 5) are like their verbal counterparts (e.g., the say phrase in Mary, John said, is an incredible botanist) in that they can precede, follow or infix their sentential argument. In addition to the lexical subcategorization of the nominal predicate, another restriction is that only a narrow set of prepositions seems to license this construction: with, without, at, on, in and possibly a few others. The PP is immediately dominated by the sentence that it takes as an argument (the PP is typically marked as a parenthetical in the Penn Treebank or offset by parentheses or commas). The Extraposition cases (Examples 3 and 4) are possible for a subset of nouns marked in COMLEX Syntax with the subcategorization feature EXTRAP-P-NOUN-THAT-S. The COMLEX entry also specifies the preposition. For example, the COMLEX entry for interest includes the subcategorization feature (EXTRAP-P-NOUN-THAT-S :PVAL ("of")). In the Penn Treebank, the nominal predicate is the rightward argument of the copula and the subject of the copula is one argument of the noun. Using a combination of these lexical clues and configurational data, it is easy to see how correctly licensed instances of these constructions can be automatically identified. Subject-oriented adverbial PPs containing a NomBank predicate (Figure 5, Examples 5 and 6) can be identified by the following characteristics: (1) the subject of the sentence is an argument of the NomBank predicate (hence the name subject-oriented); (2) the PP is either a child of the sentential node or a child of the VP; and (3) the preposition belongs to a defined set which includes mainly temporal prepositions (after, before, during), instrumental prepositions (with, without, through, by) and several others. These PPs are similar to other subject-oriented adverbs like willingly, vengefully, etc., which typically select for an animate subject. The fourth case (Figure 5, Examples 7 and 8) involves a noun A that is modified by a PP containing a nominal predicate P, such that P takes A as an argument. This is an easy-to-recognize configuration and is limited to approximately the same set of prepositions as the others. We have yet to fully determine the distribution of the nominal predicates that can occur in this configuration, although it does seem that adjective nominalizations and ATTRIBUTE nouns are the most common.
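The subject-oriented and noun-modifying cases in particular reduce to a few configurational tests. A hedged sketch, with hypothetical PP and clause accessors standing in for the Treebank/GLARF interface:

TEMPORAL = {"after", "before", "during"}
INSTRUMENTAL = {"with", "without", "through", "by"}
LICENSING_PREPS = TEMPORAL | INSTRUMENTAL

def is_subject_oriented_pp(pp, clause):
    # The three characteristics listed above (assumed accessors).
    pred = pp.nombank_predicate()
    return (pred is not None
            and pp.preposition in LICENSING_PREPS                # (3)
            and pp.parent in (clause.s_node, clause.vp_node)     # (2)
            and pred.takes_argument(clause.subject()))           # (1)

def is_noun_modifying_pp(pp):
    # The fourth case: a noun A modified by a PP whose nominal
    # predicate P takes A as an argument.
    pred = pp.nombank_predicate()
    return (pred is not None
            and pp.preposition in LICENSING_PREPS
            and pp.modified_noun() is not None
            and pred.takes_argument(pp.modified_noun()))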
Finally, there are some NomBank frame entries that classify particular nouns as being either a discourse connective (Examples 9 and 10) or a discourse adverbial (Examples 11 and 12). Similar entries are found in the NOMADV dictionary, giving them one of the COMLEX Syntax classes applied to similar adverbs, i.e., the various sub-types of the META-ADV class (the connectives belong to the (META-ADV :CONJ T) class). The discourse adverbials can take entire sentences as arguments, whereas the discourse connectives link two arguments in a similar manner to the discourse connectives in the Penn Discourse Treebank (PDTB). NomBank discourse connectives can link two sentences, two NPs or one NP and one sentence. This contrasts with PDTB connectives, which always link two sentences. The discourse adverbials, like the parentheticals, can precede, follow or be embedded in the sentence they modify. NomBank discourse connectives have a similar configurational distribution to the PDTB connectives: the connective forms a constituent with one argument (e.g., in case Temple raises its bid in Example 10) and the other argument is either the rest of the superordinate phrase (the subject and the verb) or the subject of the sentence (e.g., that in Example 10). However, unlike PDTB, NomBank does not link predicates in one sentence with arguments outside that sentence, e.g., NomBank does not mark the sentence preceding an example like number 12 as an argument of fact. In summary, there are a number of configurations in which a PP containing a NomBank predicate (as the head of the prepositional object) licenses external arguments of that noun. The configurations are easy to define, and additional lexical restrictions make it possible to identify the markable cases in NomBank. As of this writing, we recognize a subset of the admissible cases automatically. The remainder we must verify manually.

6.4. Combining Support with Other Constraints
We end this section with the examples in Figure 6, which combine support with some of the other external argument licensing environments. Both cases involve transparent noun constructions, which are viewed as a type of support in NomBank. After hours of debate is treated as if debate is the main predicate of this subject-oriented PP construction (the subject of the sentence is an argument of debate). The support chain hours + of makes this treatment possible. In a similar way, the support chain pounds + of makes it possible for weight to be connected to the subject of the sentence by predication. The support chains serve to bring the nominal predicate into the position required to link it via these other types of constructions.

1. After hours/SUPPORT+ARGM-TMP of/SUPPORT debate, the jury/ARG0 focuses on the facts
2. John/ARG1 is 40/ARG2 pounds/ARG2+SUPPORT in/ARG2 weight
Figure 6: Combining Support with Other Phenomena

7. Lexical Constraints on NomBank
We will now describe one of the main dictionary-based constraints that we used to correct NomBank. At the same time, we used this constraint to correct the dictionary ADJADV (Meyers et al., 2004b), which we made alongside NomBank. Although the ARG1 ... ARG9 features were applied according to frames for particular words, the distribution of the ARGM features was left to the annotator's interpretation of the NomBank specifications. Nevertheless, to a large extent the ARGM features are also lexical in nature, but of a different sort. ARGMs tend to be determined by the particular modifier (the value of the ARGM itself), rather than by the nominal predicate. For example, the adjective recent is almost always marked ARGM-TMP due to lexical properties of recent, not lexical properties of the noun it modifies. Thus recent should be marked ARGM-TMP in the recent destruction of the documents, their recent marriage and the recent knowledge, regardless of what is in the frame entries of destruction, marriage and knowledge. We observed that the relevant information could not be found in the adjective entries of COMLEX Syntax, but could be found in related adverb entries.
Specifically, recently, the adverb related to recent, has the feature TEMPORAL-ADV. This motivated our construction of ADJADV. Some sample entries are given in Figure 7. This dictionary was created in a semi-automatic way. For the most part, we simply found morphologically related adjective-adverb pairs and generated the entry based on the adverb. However, in some cases, e.g., big, we created an ADJADVLIKE entry based on a semantically related adverb. Given the assumption that specific adjectives tend to be compatible with the same ARGM function tags, we could automatically detect likely errors by comparing the ARGMs assigned to adjective premodifiers in NomBank against the ADJADV dictionary entries for those adjectives.14 We assumed the table of compatibilities between function tags and COMLEX Syntax features listed as Table 1. When an adjective was marked in a NomBank proposition in a way that was incompatible with the ADJADV entry, this would usually lead to either changing the NomBank annotation or changing the ADJADV lexical entry. In this way, we were able to simultaneously improve both NomBank and ADJADV.

(ADJADV :ORTH "abject" :ADV "abjectly" :FEATURES ((MANNER-ADV) (GRADABLE)))
(ADJADV :ORTH "actual" :ADV "actually" :FEATURES ((META-ADV :VIEWPOINT T)))
(ADJADVLIKE :ORTH "big" :ADV "immensely" :FEATURES ((MANNER-ADV) (DEGREE-ADV)))
Figure 7: Sample ADJADV Entries

COMLEX Feature      | ARGM
(META-ADV :CONJ T)  | ARGM-DIS
other META-ADV      | ARGM-ADV
MANNER-ADV          | ARGM-MNR
DEGREE-ADV          | ARGM-MNR
EVAL-ADV            | ARGM-MNR
LOC&DIR-ADV         | ARGM-LOC, ARGM-DIR
TEMPORAL-ADV        | ARGM-TMP
Table 1: ADJADV/ARGM Compatibility

14 Some premodifiers were handled in other ways, e.g., prefixes were specially classified; numbers between 1000 and 2100 were assumed to be potential time modifiers, etc. Also, with respect to hyphenated items, we identified one hyphenated segment (typically the last segment) as the head and looked up the ADJADV entry for that segment. We omit a full description due to space limitations.
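The compatibility check itself is easy to mechanise. The sketch below hard-codes Table 1 and flags premodifier taggings that no ADJADV feature licenses; the proposition accessor and the lemma-keyed dictionary are hypothetical stand-ins for the actual resources:

# Table 1 as a lookup from COMLEX adverb feature to admissible ARGMs.
COMPATIBLE = {
    "META-ADV:CONJ": {"ARGM-DIS"},
    "META-ADV": {"ARGM-ADV"},
    "MANNER-ADV": {"ARGM-MNR"},
    "DEGREE-ADV": {"ARGM-MNR"},
    "EVAL-ADV": {"ARGM-MNR"},
    "LOC&DIR-ADV": {"ARGM-LOC", "ARGM-DIR"},
    "TEMPORAL-ADV": {"ARGM-TMP"},
}

def likely_errors(propositions, adjadv):
    # Yield annotations whose ARGM tag is incompatible with every
    # ADJADV feature of the adjectival premodifier; 'adjadv' maps an
    # adjective lemma to its feature set (assumed interface).
    for prop in propositions:
        for adjective, argm in prop.argm_premodifiers():
            features = adjadv.get(adjective, set())
            allowed = set().union(*(COMPATIBLE.get(f, set()) for f in features))
            if allowed and argm not in allowed:
                yield prop, adjective, argm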
8. Concluding Remarks
Above, we have outlined major ways in which we have improved NomBank by evaluating the compatibility of the annotation with other resources. As a result of these and similar techniques, we have looked closely at over 30,000 of the 114,500 NomBank instances. We believe that these measures caused us to focus our efforts on the most likely causes of error, improving both the accuracy and the efficiency of quality control. Had we annotated NomBank twice and then adjudicated instead of using this methodology, it would clearly have been a more expensive undertaking. Furthermore, our attention would not have been as directed as it was using the error detection program.15 We have considered creating a degraded version of NomBank that consists of only pre-edited entries. We could then test whether an automatic role labeling system (Jiang and Ng, 2006) trained on that version performs less accurately than a system trained on the final version. Better performance of the system trained on the final version would confirm that we improved the annotation using our methods. However, this result would hardly be surprising, because our technique does involve a selective second pass over the annotation by an expert annotator, a methodology which is widely recognized to improve results. A clearer evaluation would require the annotation of additional data in a test setting in which double annotation plus adjudication could be fairly compared with the method described here. This will be possible should we have the opportunity to annotate a substantial amount of additional NomBank data. However, given our limited resources, we are confident that we took the best possible approach. This paper provides examples of how constraints on a new annotation scheme can be formulated in terms of previous annotation in order to provide quality control. Researchers who would like to take advantage of this methodology should consider annotating corpora that have already been annotated by other members of the annotation community.

15 Of course, we also fixed errors that we found that were not detected by the program.

Acknowledgments
This research was supported by the National Science Foundation, award CNS-0551615, entitled Towards a Comprehensive Linguistic Annotation of Language.

9. References
Z. P. Jiang and H. T. Ng. 2006. Semantic role labeling of NomBank: A maximum entropy approach. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), Sydney, Australia.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
A. Meyers, R. Grishman, M. Kosaka, and S. Zhao. 2001a. Covering Treebanks with GLARF. In ACL/EACL Workshop on Sharing Tools and Resources for Research and Education.
A. Meyers, M. Kosaka, S. Sekine, R. Grishman, and S. Zhao. 2001b. Parsing and GLARFing. In Proceedings of RANLP-2001, Tzigov Chark, Bulgaria.
A. Meyers, R. Reeves, C. Macleod, R. Szekely, V. Zielinska, B. Young, and R. Grishman. 2004a. The NomBank Project: An Interim Report. In NAACL/HLT 2004 Workshop Frontiers in Corpus Annotation, Boston.
A. Meyers, R. Reeves, C. Macleod, R. Szekely, V. Zielinska, and B. Young. 2004b. The Cross-Breeding of Dictionaries. In Proceedings of LREC-2004, Lisbon, Portugal.
E. Miltsakaki, A. Joshi, R. Prasad, and B. Webber. 2004. Annotating discourse connectives and their arguments. In A. Meyers, editor, NAACL/HLT 2004 Workshop: Frontiers in Corpus Annotation, pages 9–16, Boston, Massachusetts, USA, May 2 - May 7. Association for Computational Linguistics.
M. Palmer, D. Gildea, and P. Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.
J. Pustejovsky, B. Ingria, R. Sauri, J. Castano, J. Littman, R. Gaizauskas, A. Setzer, G. Katz, and I. Mani. 2004. The Specification Language TimeML. In I. Mani, J. Pustejovsky, and R. Gaizauskas, editors, The Language of Time: A Reader. Oxford University Press, Oxford.
J. Pustejovsky, A. Meyers, M. Palmer, and M. Poesio. 2005. Merging PropBank, NomBank, TimeBank, Penn Discourse Treebank and Coreference. In ACL 2005 Workshop: Frontiers in Corpus Annotation II: Pie in the Sky.
Y. Shinyama and S. Sekine. 2006. Preemptive information extraction using unrestricted relation discovery. In Proceedings of NAACL/HLT, New York, New York, USA. Association for Computational Linguistics.
T. Wilson and J. Wiebe. 2003.
Annotating Opinions in the World Press. In 4th SIGdial Workshop on Discourse and Dialogue (SIGdial-03).
S. Zhao, A. Meyers, and R. Grishman. 2004. Discriminative Slot Detection Using Kernel Methods. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-04), Geneva.

A Dictionary-based Model for Morpho-Syntactic Annotation
Cvetana Krstev¹, Svetla Koeva², Duško Vitas³
¹ University of Belgrade, Faculty of Philology, Studentski trg 3, RS-11000 Belgrade
² Bulgarian Academy of Sciences, 52 Shipchenski prohod, Bl. 17, BG-1113 Sofia
³ University of Belgrade, Faculty of Mathematics, Studentski trg 16, RS-11000 Belgrade
E-mail: cvetana@matf.bg.ac.yu, svetla@ibl.bas.bg, vitas@matf.bg.ac.yu

Abstract
The main goal of this paper is to establish a proper and flexible method for morpho-syntactic annotation that takes into consideration such language phenomena as multi-word units, complex word forms, regular and productive derivational processes, etc., which usually remain outside the scope of morpho-syntactic annotation. We present the first results in the development of a multilingual resource that should enable the exploration of the possibility of applying various lexical resources, such as inflectional dictionaries and multilingual lexical databases like Wordnet and Prolex, that were developed during the last decade. This paper is limited to two Balkan languages, Serbian and Bulgarian.

1. Introduction
The paper outlines an approach to morpho-syntactic annotation and the first results in the creation and exploitation of an aligned and annotated corpus for the Bulgarian-Serbian pair. The main goal of this effort is to establish a proper and flexible method for morpho-syntactic annotation that takes into consideration such language phenomena as multi-word units, complex word forms, regular and productive derivational processes, etc., which usually remain outside the scope of morpho-syntactic annotation. Some of the existing morpho-syntactic annotation schemes consider only tokens, thus neglecting the fact that a token is not always equal to a word form – namely, a word form can consist of several tokens (not necessarily contiguous) and several word forms can build a token. On the other hand, some of the proposed sets of morpho-syntactic attributes and their values are inconsistently composed, not taking into consideration the relative function of the chosen attributes or their relations with the higher language levels. In our approach we accept the assumption that morpho-syntactic annotation has to be assigned to word forms irrespective of their continuity and contiguity. Thus the term word form here means (following in general the MAF1 prescriptions) a contiguous or non-contiguous unit consisting of one or more tokens that refers to a single concept: a single word, a complex word form (i.e. complex tenses, mood, aspect) or a multi-word unit. We also agree that standardization in morpho-syntactic annotation has to cover both correspondences between different languages and language-specific features (Ide et al., 2003). That is why two Balkan and South Slavic languages are taken into focus: Bulgarian and Serbian, for which similar language resources have been developed recently.
The particular research aims stated in this paper are as follows:
• Briefly to show some of the gaps in the existing annotation schemes;
• To offer some techniques for handling the morpho-syntactic annotation of word forms rather than tokens;
• To exploit parallel language resources.
These research aims are directed towards the development of a complex method for morpho-syntactic annotation providing a uniform and flexible way of treating word forms. In the following section we present a short analysis of the related work. In the third section, we describe the different resources for Bulgarian and Serbian, developed during the past years in the same or comparable formats, which are used in the course of the work. The fourth and the fifth sections explain how we apply different techniques for the morpho-syntactic annotation of word forms corresponding to one or more tokens. Finally, we discuss the presented study and propose future work.2

1 ISO TC 37/SC 4 N225
2 The first version of this paper was presented to a very small audience at the Workshop 'A Common Natural Language Paradigm for Balkan Languages' that was held in conjunction with the RANLP 2007 Conference.
3 http://nl.ijs.si/ME/

2. Previous research
One of the basic common resources for European languages during the last decade was developed within the Multext-East project (Erjavec, 2004).3 This resource consists of three main components for each of the languages included in the project, namely a proposed standard for morpho-syntactic description (further on, MSD), the text of the translation of Orwell's novel 1984 in the corresponding language, and the application of the MSD to the annotation of the lemmatized version of the text of this novel. Besides these components, Multext-East encompasses the versions of 1984 aligned at the sentence level, by means of the Vanilla aligner, for all languages included (and also added later) in the project. The results of this project on the level of the description of morpho-syntactic parameters were refined and enhanced several times, and the project results found a wide use in the research community. At present, despite the success of this project, its shortcomings can be observed both regarding the content of the MSD and in the way this description has been applied to specific languages, as presented, for instance, in (Przepiórkowski & Woliński, 2003). First of all, the principles taken into consideration when particular attributes and values are included in the Multext-East MSD are not always clear and consistent. Some of the attributes are properties of the lemma, some of them properties of particular word forms only. The question is how the recommended attributes and values are chosen to be included in the MSD – are they those that determine the inflectional paradigm, or the agreement properties, or those relevant for the temporal and modal features, etc.? If we consider the inflectional paradigms in Bulgarian and Serbian, we can give examples showing that sets of categories other than those defined in the Multext-East MSD determine these paradigms. The attribute Animateness, with the values human, animate and non-animate, is not specified for Bulgarian, but it determines the vocative and count slots in the noun paradigm. The word form dvojica 'two men' in Serbian is the nominative singular of the noun dvojica that behaves on the inflectional level as a noun of feminine gender in singular, while it actually represents the natural masculine gender in plural.
This information has to be attached to the lemma, and it determines the complex agreement conditions in Serbian which cannot be expressed within the Multext-East MSD. Thus the criterion for the morpho-syntactic specifications of any language must not be based on the set of attributes shared with a group of other languages, but rather on the set describing the morpho-syntactic properties of the given language: a minimal set has to include those attributes and values that are relevant for the inflectional paradigms of single words; more descriptive sets have to include attributes and values relevant for complex word forms as well as for multi-word units, etc. The parallel processing of two or more languages must not be limited to a predefined set of attributes and values that the languages share, but should use a flexible set that can be relevant for a particular NLP task. Considering the application of the MSD to the text annotation of Orwell's novel, a unique method for obtaining annotated versions of the novel in different languages was not established, nor were the methods for producing the annotated text (automatically or manually) explicitly stated. This means that the information for resolving possible ambiguities in annotation is not explicitly represented, and especially that the manner of disambiguation is not explained. As a consequence of this inconsistency, the obtained annotated texts of Orwell's 1984 contain only the final result of the morphological and lexical analysis, where the mechanism of morphological analysis remains hidden, which means that the method of assigning the lemma and the MSD to a word form cannot be reproduced on a new text in the same manner. A possible application of a stochastic tagger trained on such a training set to a new text requires a thorough verification of the obtained results, which is in essence a more complicated task than the initial annotation of Orwell's text (because it has to be established whether the MSD attributed to a word in the text in such a manner is correct or false, instead of selecting the correct MSD among several possibilities). To a great extent, the ideas presented in this paper are in line with the proposal for ambiguity handling through lattices based on a two-level structuring of tokens and word forms involving the use of feature structures for morpho-syntactic content (Clément et al., 2005).

3. Parallel language resources
3.1. Parallel Bulgarian-Serbian corpus
The parallel corpus is compiled from the French text of Jules Verne's novel Around the World in 80 Days, which has been aligned with its translations into a number of languages including English, Bulgarian and Serbian.4 The alignment was accomplished using the Xalign system (Romary & Bonhomme, 2000).5 From the TEI format obtained in this way, several versions of the texts have been created in other formats such as TMX, a Vanilla-like format and HTML (Appendix 1). The alignment was performed at the paragraph and segment levels in a manner that established a one-to-one correspondence between the original and the translation by means of additional manual segmentation, while preserving the segmentation of the original. This enabled the maintenance of the one-to-one correspondence between all the language pairs processed (Figure 1).
Figure 1: Aligned Bulgarian-Serbian parallel corpus

             | Bulgarian | English | French | Serbian
# words      | 58 162    | 64 831  | 68 359 | 60 227
# sentences  | 4 435     | 4 435   | 4 435  | 4 435
# paragraphs | 1 963     | 1 960   | 1 963  | 1 963
Table 1: Statistical data for the parallel corpus

Although the aligned parallel corpus is relatively small at this stage (see the figures in Table 1), it is a part of the MaT6 project, whose aims are directed towards the compilation of a large multilingual parallel corpus of Balkan, South Slavic and bigger European languages that will be constituted of texts from different ranges (most of the existing parallel corpora of European languages, such as Acquis Communautaire,7 consist of legislation documents only). Starting from the text obtained in the above-mentioned manner, we analyze different issues of morpho-syntactic annotation. All examples in this paper are taken from the Bulgarian and Serbian subparts of the parallel corpus.

4 Besides these languages, the novel is fully or partially aligned in as many as nine other languages.
5 led.loria.fr/outils.php
6 SEE-ERANET project Building Language Resources and Translation Models for Machine Translation focused on South Slavic and Balkan Languages
7 http://langtech.jrc.it/JRC-Acquis.html

3.2. Bulgarian and Serbian e-dictionaries
Various formalisms for the representation of linguistic knowledge are available, first of all different types of morphological dictionaries and local grammars. The basic monolingual lexical resources for Bulgarian and Serbian considered in this paper are systems of morphological dictionaries in the so-called LADL format (Courtois & Silberztein, 1990). This format is compatible with the draft of the Lexical Markup Framework (LMF) standard,8 and automatic conversion from this format into LMF is enabled (Krstev et al., 2006b). Automatic conversion from this format into Multext-East has also been successfully performed for Serbian (Krstev et al., 2004). The dictionaries for Bulgarian are described in (Koeva, 2004; Koeva, 2005) and some samples are presently available in the NooJ format,9 whereas the dictionaries for Serbian, developed under both the Unitex system10 and NooJ, are outlined in (Vitas & Krstev, 2005).11 The common feature of these dictionaries is that they are developed within the same theoretical and methodological framework, which enables a multi-level application of the results of the theory of finite-state transducers to text processing. The basic form12 of an entry in the morphological dictionary is described by the following pattern:

(*) word_form,lemma.K+SynSem:(mc;)*

where word_form and lemma are simple words or continuous multi-word units, K is a code that contains the information on the part of speech and inflective properties of the lemma, usually in the form of a corresponding finite transducer, SynSem is the sequence of syntactic and semantic attributes attached to the lemma, and mc represents a sequence that describes the relation between the word form and the lemma by means of specified values of grammatical categories.

8 http://www.tc37sc4.org
9 http://www.nooj4nlp.net/
10 http://www-igm.univ-mlv.fr/~unitex/
11 Dictionaries in this format exist for several other Balkan and South Slavic languages: Greek (Kyriacopoulou et al., 2002), Romanian (Dumitriu, 2005), Macedonian, as well as Albanian and Croatian in an initial stage.
12 This is the format of the dictionary of inflected forms, the so-called DELAF, derivable from a dictionary of non-inflected forms called DELAS.
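Entries in this pattern are simple to parse mechanically. The following is a minimal sketch, not part of the LADL/Unitex/NooJ tooling; the split on ':' versus ';' between mc sequences simply follows the examples given below, which use both separators:

import re

# Parse a DELAF-style line of the pattern (*), e.g.
#   "lisca,lisac.N+Hum+Zool:ms2v;ms4v"
ENTRY = re.compile(r"(?P<form>[^,]+),(?P<lemma>[^.]*)\.(?P<k>[^:]+?)(?::(?P<mc>.+))?$")

def parse_entry(line):
    m = ENTRY.match(line.strip())
    if m is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    codes = m.group("k").split("+")
    return {
        "form": m.group("form"),
        # An empty lemma field (as in "gvineja,.N+Cur:...") means the
        # lemma equals the word form.
        "lemma": m.group("lemma") or m.group("form"),
        "k": codes[0],                 # POS and inflective properties
        "synsem": codes[1:],           # SynSem attributes (Hum, Zool, Cur, ...)
        "mc": re.split(r"[:;]", m.group("mc")) if m.group("mc") else [],
    }

print(parse_entry("lisca,lisac.N+Hum+Zool:ms2v;ms4v"))
# -> {'form': 'lisca', 'lemma': 'lisac', 'k': 'N',
#     'synsem': ['Hum', 'Zool'], 'mc': ['ms2v', 'ms4v']}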
For instance, the following entries from the Serbian and Bulgarian dictionaries

lisca,lisac.N+Hum+Zool:ms2v;ms4v
лисица,лисица.N+F:s0

establish lisca 'fox' in Serbian as the genitive (2) or accusative (4) singular (s) form of the masculine gender (m) noun (N) lisac, which is marked as animate (v) and can have the semantic features Hum (for humans) (e.g. seg 1915: Svoje brige poveri Fiksu, koji - prevejani lisac pokuša... 'He had confided his anxiety to Fix who--the sly rascal!--tried...') and Zool (for animals), and respectively лисица 'fox' in Bulgarian as a feminine (F) noun (N) whose form лисица is singular (s) and indefinite (0) (e.g. seg 1915: Той бе поверил притесненията си на Фикс, а той – хитрата лисица – се опитваше ...). The WS4LR tool (Krstev et al., 2006b) can be used to enrich the SynSem field in the pattern (*) by transferring information from semantic networks. For instance, information on currencies was transferred in this way from Wordnet to the e-dictionary:

gvineja,.N+Cur:fs1q:fp2q // guineas

Here, the marker Cur represents the names of currencies. The attributes from the lexical database Prolex (Vitas et al., 2007) can be transferred into the e-dictionaries by applying the same procedure:

Bombaj,.N+NProp+Top+Gr:ms1q:ms4q // Bombay
bombajskoj,bombajski.A+PosQ+NProp+Top+Gr:aefs3g

Here NProp represents a proper name, PosQ a relational adjective, Top a toponym, Gr a city.

4. Ambiguity
At least two types of POS ambiguity can be distinguished in Bulgarian and Serbian. Lexical ambiguity is observed when the ambiguous word forms pertain to different lemmas (usually with different POS), e.g. in Bulgarian the word form разходи may either be the plural indefinite form of the masculine noun разход 'expense' – разходи,разход.N+M:p0 – or the third person singular present tense, second person singular aorist, third person singular aorist, or second person singular imperative of the verb разходя 'to take for a walk' – разходи,разходя.V+F+T:P3s:R2s:R3s:I2s (i.e. seg 2345: Това влиза в общите разходи 'This enters into ... general expenses' and seg 1761: …госпожа Ауда, която бе проявила желание да се разходи '...Aouda, who betrayed a desire for a walk...'). Morphological ambiguity occurs when a given lemma has two or more identical distinct word forms, e.g. Bulgarian inanimate masculine nouns such as въпрос 'question', whose singular definite short-article form (sh) and counted form (c) coincide: въпроса,въпрос.N+M:sh:c (seg 1223: Сър Франсис Кромарти му постави открито въпроса. 'Sir Francis frankly put the question to him' vs. seg 1896: Зададе сто въпроса на капитана, офицерите, моряците 'He overwhelmed the captain'). Assume the processing of the words in the following sentence (seg 56):

(1-sr) Pojavi se momak tridesetih godina i pozdravi.
(1-bg) Влезе един млад мъж на около тридесет години и поздрави.
(1-en) A young man of thirty advanced and bowed.

All its possible morphological interpretations will then be listed, where, among other things, we can see that (a) the form pojavi can be interpreted as a form of the noun pojava 'appearance' or as a form of the verb pojaviti (se) 'to appear', and (b) the form pozdravi as a form of the noun pozdrav 'greeting' or a form of the verb pozdraviti (se) 'to greet'. At the same time, both forms realize several different values of morphological categories.
For example, if pozdravi is a form of the verb pozdraviti, then it can represent the third person of the present tense or the second person of the imperative or of the aorist singular. Similar ambiguity is observed in Bulgarian: появи is either the plural of the noun поява 'appearance' or one of four different forms of the verb появя се 'to appear'; поздрави is either the plural of the noun поздрав 'greeting' or one of four different forms of the verb поздравя 'to greet'. These different interpretations are represented for Serbian by the graph in Figure 2.

Figure 2: The sentence graph for the Serbian segment 56

This illustrates the problem of essential ambiguity in the interpretation of the incoming sentence. Let us compare the sentence (1-sr) with the result obtained by TnT (Brants, 2000), trained on the Serbian annotated text of 1984:

Pojavi Vm-p3s-an-n---e
se Q
(..............................................)
i C-s
pozdravi Ncmpn--n

Here the MSD values are incorrectly established for the forms of pojavi (present instead of aorist) and pozdravi (noun instead of verb), and the problem of ambiguity of forms, characteristic of Slavic languages, remains completely hidden in the method used by TnT.

5. Annotation refinement
The parallel Bulgarian-Serbian corpus is annotated with the grammatical information available from the Bulgarian and Serbian morphological dictionaries. After the annotation, most of the unrecognized words are foreign proper names, but there are also some words built by regular derivation rules. Bulgarian and Serbian are languages with highly productive derivation concerning diminutives, relative adjectives, negative adjectives, adjective and adverb comparative forms (we are not going to discuss here whether comparison results in different lexemes or different word forms), verb aspect pairs, etc. Some of the words built by the regular derivation rules which are not included in the electronic dictionaries might be recognized by means of the respective morphological grammars. On the other hand, neither multi-word units (MWUs) nor complex word forms (both continuous and discontinuous) are recognized by the traditional electronic dictionaries. Continuous MWUs and complex word forms might be handled in a uniform way in morphological dictionaries together with the simple words, while discontinuous word forms might be processed by means of local grammars. Providing these techniques for the morpho-syntactic annotation of word forms might bring new horizons to POS tagging – to the best of our knowledge there are no POS taggers available that handle MWUs or complex word forms. Thus the basic annotation assigned from morphological dictionaries can be refined in several ways. We shall indicate here only some of these techniques in order to show the directions towards a proper morpho-syntactic annotation.

5.1. Contiguous multi-word units
The issue of the morpho-syntactic specification of multi-word units (distributed among natural languages approximately equivalently, and covering one fourth of the lexis according to the data represented in the European wordnets and one tenth of the words used in real texts according to the data coming from the Bulgarian sense-tagged corpus) is very important.
A multi-word unit can correspond to a single word in another language, for example: the multi-word unit френско грозде 'red currants' in Bulgarian corresponds to the single word ribizlama in Serbian or groseilles in French (seg 154: … пълнен със стръкчета ревен и зелено френско грозде … kolača punjenih stabljikama ravente i zelenim ribizlama … un gâteau farci de tiges de rhubarbe et de groseilles vertes ... 'a rhubarb and gooseberry tart'). Consequently, multi-word units refer to a unique concept and have to be treated in a uniform way together with single words. Attempts towards a proper morpho-syntactic description of both single words and MWUs have been scarce so far. However, a description of the inflection of multi-word units based on dictionaries of simple words is given in (Vitas & Krstev, 2005), and further enhanced for some Slavic languages in (Koeva, 2004) and (Krstev et al., 2006a). An example of an MWU is the expression s vremena na vreme in Serbian or от време на време in Bulgarian, which represents an adverbial syntagma. On the level of simple word categories that sequence would be analyzed as Preposition Noun Preposition Noun, for instance by TnT:

s (Spsg) vremena (Ncnsg--n) na (Spsa) vreme (Ncnsa--n)

On the level of the dictionary of multi-word units such a sequence is described as an adverbial syntagma, and the result of the annotation would add the following information:

s vremena na vreme.ADV+C
от време на време.ADV+C (from time to time)

Here C indicates that a compound adverb is in question. Another example of an MWU is the sequence Hong Kong in the following sentence (seg 1963):

(2-sr) Hong Kong je ostrvce koje je ... pripalo Engleskoj
(2-bg) Хонконг е островче под английско владение...
(2-fr) Hong-Kong n'est qu'un îlot (...) assura la possession à l'Angleterre
(2-en) Hong Kong is an island which came into the possession of the English by the Treaty of Nankin

In Serbian, Hong Kong can be written in three different ways: Hongkong (as in Bulgarian), Hong Kong (as in English) or Hong-Kong (as in French). In the first case, it is a simple word (a contiguous sequence of alphabetic characters), whereas in the other two cases it is a multi-word unit or a compound word (composed of two simple words divided by a separator). As the components of the MWU Hong Kong do not exist in the dictionary (neither Hong nor Kong), the analysis on the level of simple words will mark this sequence as two unknown words. One solution to this problem is the construction of a dictionary of MWUs with a structure analogous to the structure described by the pattern (*). A formalism is presented in (Savary, 2005) that enables the formalization of the inflection of MWUs, analogous to the definition of the inflection of simple words.

Figure 3: The inflectional graph for a two-component compound in which the first component does not inflect, and the space between the components can be either omitted or replaced by a hyphen

The result of the application of this formalism is that, on the basis of the graph depicted in Figure 3, all forms of the inflectional paradigm will be generated for the three graphemic representations of the sequence Hong Kong. The analysis of the initial part of the sentence (2-sr) yields the graph given in Figure 4.

Figure 4: Sentence graph for the beginning of the sentence (2-sr)

In the interaction of the system of electronic dictionaries with the lexical database of proper names Prolex, the sequence Hong-Kong also obtains the attributes NProp (proper name), Top (toponym) and Gr (city).
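Purely as an illustration of what the graph in Figure 3 generates, the toy sketch below enumerates the paradigm of a two-component compound whose first component does not inflect; the three case forms of Kong are an assumed fragment, not taken from the actual dictionary:

def compound_paradigm(first, second_forms):
    # First component is invariant; the separator may be a space, a
    # hyphen, or omitted; the second component inflects.
    for form, mc in second_forms:
        for sep in (" ", "-", ""):
            yield f"{first}{sep}{form}", mc

# Hypothetical fragment of the Serbian paradigm of the second component.
KONG = [("Kong", "ms1q"), ("Konga", "ms2q"), ("Kongu", "ms3q")]
for surface, mc in compound_paradigm("Hong", KONG):
    print(f"{surface},Hong Kong.N+NProp+Top+Gr:{mc}")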
A comprehensive solution to the complex problem of numeral recognition is presented for Serbian in (Krstev & Vitas, 2007).

5.2. Regular productive derivation
Another issue arises on the level of regular derivation. Namely, in the recognition of the results of derivational processes, the meaning of a derived form that is usually not described in the dictionary of simple words is deduced from the meaning of the initial word (Vitas et al., 2007). In this way it is possible to associate the word forms that usually do not belong to the dictionary of simple words, and thus remain in the category of unrecognized words after text analysis, with a precise description (the level of precision is the same as that obtained for the word forms belonging to the dictionary). These processes are present both in Serbian and Bulgarian in deriving diminutives, possessive adjectives, negative adjectives, verb aspect pairs, etc. This issue is illustrated in example (2-sr) by the form ostrvce in Serbian and in (2-bg) by островче in Bulgarian, which are diminutive forms of the respective nouns ostrvo 'island' in Serbian and остров in Bulgarian. The productivity of certain derivational processes, such as the formation of diminutives, possessive and relational adjectives, etc., is characteristic of Bulgarian and Serbian. From the angle of the completeness of electronic dictionaries, it is clear that all results of such derivational processes, which we will call regular derivation, cannot be described in the dictionary of simple words. The forms generated by such processes can be described by a specific type of finite-state transducers, the so-called morphological grammars, which represent models of the respective derivational processes. Such grammars are applied to words that remained unrecognized in the process of analysis, and enable the reduction of the unrecognized form to a lemma form missing from the dictionary. Thus, by applying the appropriate morphological grammar, for the word ostrvce in Serbian and островче in Bulgarian the following sequences are generated on the output of the analyzer:

ostrvce,ostrvce.N+Dem+Sr:ns1q:ns4q:ns5q
островче,островче.N+NE+Dem:s0

where the attribute Dem, added by a morphological grammar, indicates that the word is a diminutive form. In the example of sentence (2), a problem in the multilingual context is also posed by the identification of proper names. Namely, in Serbian the toponym Engleska was used, whereas in (2-bg) the translation uses the adjective form английско. One solution that enables the linking of these two word forms in a multilingual context in a systematic way is analyzed in (Maurel et al., 2007).
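As a toy stand-in for such a morphological grammar (which in reality is a finite-state transducer, not the rule below), the sketch reduces a Serbian diminutive in -ce to a base noun found in the dictionary and adds the Dem attribute; the rule and the tiny dictionary are illustrative assumptions only:

def analyse_diminutive(form, dictionary):
    # Try to reduce an unrecognized -ce form to a base lemma present
    # in the dictionary of simple words.
    if not form.endswith("ce"):
        return None
    stem = form[:-2]
    for base in (stem + "o", stem + "a"):      # candidate base lemmas
        if base in dictionary:
            codes = dictionary[base] + ["Dem"]
            return f"{form},{form}.{'+'.join(codes)}"
    return None

dictionary = {"ostrvo": ["N"]}
print(analyse_diminutive("ostrvce", dictionary))   # -> ostrvce,ostrvce.N+Dem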
5.3. Complex word forms and discontinuous MWUs
The third question concerns complex morphological categories, which are usually excluded from morpho-syntactic specifications. But a synthetic form in one language might correspond to an analytical one in another language, e.g. ще чета 'will read' in Bulgarian corresponds to će čitati = čitaće in Serbian; consequently, they should also be treated in a uniform way. Most of the analytical forms are discontinuous – they allow other words, mainly clitics in Bulgarian, to interrupt their parts. Local grammars, as concepts defined in (Gross, 1993), enable the construction of finite transducers which recognize and tag different structures in a text, on the basis of the content of the dictionary (and other local grammars). One example of local grammars for Bulgarian and Serbian are the local grammars for the recognition of complex tenses (for Serbian see (Vitas & Krstev, 2003)). These grammars enable not only the recognition of a compound tense in a sentence, but also the transformation of the sequence of words, or the transformation of the tense.

5.4. Named entities
Local grammars can be applied in other ways as well. For example, let us observe the example of the annotation of named entities on the aligned texts of Verne's novel in the sense of (Chinchor et al., 1999). As a first example, consider the regular expression of the following form:

(<A+NProp+Top>+<E>) <N+Cur>

Its meaning is: extract from the text any sequence of tokens that can be interpreted as a numeral, simple or compound, expressed by digits or words, that is followed by an optional adjective derived from a toponym, followed by an obligatory noun that represents a currency. When this pattern is applied to Verne's text, the examples presented in Appendix 2 are obtained. The annotation of named entities for some measures on the aligned texts of Verne's novel is more complex. The general expression for a measure is depicted by the graph measure.grf in Figure 5, which describes it as a sequence of numbers written in words or digits followed by a measure indicator (kilometer, grade, mile, foot, etc.).

Figure 5: The graph measure.grf for the recognition of measure expressions

Examples of sequences which correspond to this graph are пет, шест или десет стъпки in Bulgarian or hiljadu tri stotine osamdeset i dve milje 'one thousand three hundred eighty-two miles' in Serbian. The graph refers to words that have the categories NUM (numbers) or N+NumN (number nouns) assigned in the dictionaries of Bulgarian and Serbian. In the subgraph digit, any sequence of digits is recognized. The difference between the Serbian and Bulgarian lexis of measures is described by the graph measure, where the units of measure are named. Some examples of concordances extracted by the automaton in Figure 5 are given in Appendix 3. The graph produces concordance lines that contain the number of the segment where an entity appeared as well as the measure entity itself. Certain differences in recognition are a consequence of the phenomenon of regular derivation: (seg 2256, seg 2280) bg. двадесеттонен кораб = sr. brod od dvadeset tona = en. craft of twenty tons, or of inconsistency in the translation: (seg 4397) bg. триста и шестдесет градуса = sr. tri stotine šezdeset meridijana = en. three hundred and sixty degrees.
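To make the mechanics of such patterns concrete, here is a toy matcher for the money pattern over a dictionary-annotated token stream; the token categories and the treatment of i inside compound numerals are simplifying assumptions, not the Unitex implementation:

from dataclasses import dataclass

@dataclass
class Token:
    text: str
    cats: set          # categories assigned by the dictionaries

def match_money(tokens):
    # Pattern: one or more NUM/NumN tokens, then an optional
    # toponym-derived adjective <A+NProp+Top>, then an obligatory
    # currency noun <N+Cur>.
    spans, i = [], 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and tokens[j].cats & {"NUM", "NumN"}:
            j += 1                                   # numeral, simple or compound
        if j > i:
            k = j
            if k < len(tokens) and {"A", "NProp", "Top"} <= tokens[k].cats:
                k += 1                               # optional adjective
            if k < len(tokens) and {"N", "Cur"} <= tokens[k].cats:
                spans.append((i, k + 1))             # obligatory currency noun
                i = k + 1
                continue
        i += 1
    return spans

# Toy annotation of 'pedeset i pet hiljada livara'; 'i' is tagged NUM
# here only so that the compound numeral stays contiguous.
sent = [Token("pedeset", {"NUM"}), Token("i", {"NUM"}), Token("pet", {"NUM"}),
        Token("hiljada", {"NumN", "N"}), Token("livara", {"N", "Cur"})]
print(match_money(sent))   # -> [(0, 5)]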
6. Conclusion and further work
We have presented some techniques directed towards the establishment of a flexible and uniform method for morpho-syntactic annotation concerning not only single words but also multi-word units, complex word forms and productive derivational rules. We have treated single words and continuous MWUs in a uniform way, presenting them in a common inflectional dictionary format. We have applied morphological grammars for the morpho-syntactic annotation of unknown words that are derived by productive derivational rules, and local grammars for the recognition of complex word forms and named entities. Further developments of the method include:
• Compilation of a large and range-balanced multilingual parallel corpus of Balkan and South Slavic languages;
• Development of large inflectional dictionaries including continuous multi-word units;
• Coverage of all productive and regular derivational rules by means of morphological grammars;
• Extensive coverage of complex word forms by means of local grammars;
• Analysis of similar language phenomena in Balkan and South Slavic languages.
The further extension of the research presupposes the development of equivalent language resources for other Balkan and South Slavic languages.

7. References
Brants, T. (2000) TnT – a statistical part-of-speech tagger. In Proceedings of the 6th Applied NLP Conference, ANLP-2000, April 29 – May 3, 2000, Seattle, WA, pp. 224–231
Chinchor, N., Brown, E., Ferro, L., Robinson, P. (1999) 1999 Named Entity Recognition Task Definition (version 1.4). Technical Report, SAIC, http://www.nist.gov/speech/tests/ie-er/er_99/doc/ne99_taskdef_v1_4.pdf
Clément, L., de la Clergerie, É. (2005) MAF: a morphosyntactic annotation framework. In Proc. of the Language and Technology Conference, Poznań, Poland, pp. 90–94
Courtois, B., Silberztein, M. (Eds.) (1990) Dictionnaires électroniques du français, Langue française 87, Paris: Larousse
Dumitriu, M. (2005) Grammaires de flexion du roumain en format DELA, Rapport interne 2005-02 de l'Institut Gaspard-Monge – CNRS
Erjavec, T. (2004) MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In Fourth International Conference on Language Resources and Evaluation, LREC'04
Gross, M. (1993) Local Grammars and Their Representation by Finite Automata. In Hoey, M. (ed.) Data, Description, Discourse, Papers on the English Language in honour of John McH Sinclair. London: Harper-Collins, pp. 26–38
Ide, N., Romary, L., Villemonte de la Clergerie, E. (2003) International standard for a linguistic annotation framework. In Proceedings of HLT-NAACL'03 Workshop on The Software Engineering and Architecture of Language Technology, Edmonton. http://www.cs.vassar.edu/~ide/papers/ide-romary-clergerie.pdf
Koeva, S. (2004) Contemporary language technologies. In: Laws of/for Language, Sofia, pp. 111–135
Koeva, S. (2005) Inflection Morphology of Bulgarian Multiword Expressions. In: Computer Applications in Slavic Studies – Proc. of Azbuki@net, Sofia, pp. 201–216
Krstev, C., Vitas, D., Erjavec, T. (2004) Morpho-Syntactic Descriptions in MULTEXT-East – the Case of Serbian, Informatica, No. 28, The Slovene Society Informatika, Ljubljana, pp. 431–436
Krstev, C., Vitas, D., Savary, A. (2006a) Prerequisites for a Comprehensive Dictionary of Serbian Compounds. FinTAL, LNCS 4139, pp. 552–563
Krstev, C., Stanković, R., Vitas, D., Obradović, I. (2006b) WS4LR – a Workstation for Lexical Resources. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, Italy, pp. 1692–1697
Krstev, C., Vitas, D. (2007) Treatment of Numerals in Text Processing. In Vetulani, Z. (ed.) Proceedings of the 3rd Language & Technology Conference, October 5-7, 2007, Poznań, Poland, pp. 418–422
Kyriacopoulou, T., Mrabti, S., Yannacopoulou, A. (2002) Le dictionnaire électronique des noms composés en grec moderne, Lingvisticæ Investigationes 25:1, Amsterdam/Philadelphia: John Benjamins, pp. 7–28
Maurel, D., Krstev, C., Vitas, D., Koeva, S. (2007) Prolex: a lexical model for translation of proper names. Application to French, Serbian and Bulgarian.
In Slavic Languages and French: Formal Approaches in Contrastive Studies, Bulag, 32, Besançon, pp. 55–72
Przepiórkowski, A., Woliński, M. (2003) A Flexemic Tagset for Polish. In Erjavec, T., Vitas, D. (Eds.) EACL Workshop on Morphological Processing of Slavic Languages, Budapest, pp. 33–40
Romary, L., Bonhomme, P. (2000) Parallel Alignment of Structured Documents. In Véronis, J. (Ed.) Parallel Text Processing: Alignment and Use of Translation Corpora, Kluwer Academic Press, pp. 211–218
Savary, A. (2005) Towards a Formalism for the Computational Morphology of Multi-Word Units. In Vetulani, Z. (ed.) Human Language Technologies as a Challenge for Computer Science and Linguistics, Proceedings of the 2nd Language & Technology Conference, Poznań, Poland, pp. 305–309
Vitas, D., Krstev, C. (2005) Regular derivation and synonymy in an e-dictionary of Serbian, Archives of Control Sciences, Volume 15(LI), No. 3, Polish Academy of Sciences, pp. 469–480
Vitas, D., Krstev, C., Maurel, D. (2007) A note on the semantic and morphological properties of proper names in the Prolex project. In Sekine, S., Ranchhod, E. (Eds.) Named Entities: Recognition, Classification and Use, Lingvisticæ Investigationes 30(1), pp. 115–133
Vitas, D., Krstev, C. (2003) Composite Tense Recognition and Tagging in Serbian. In Erjavec, T., Vitas, D. (Eds.) EACL Workshop on Morphological Processing of Slavic Languages, Budapest, pp. 55–62
Vitas, D. (2004) Morphologie dérivationnelle et mots simples: Le cas du serbo-croate. Syntax, Lexis & Lexicon-Grammar (Papers in honour of Maurice Gross), Lingvisticæ Investigationes Supplementa 24, Amsterdam/Philadelphia: John Benjamins Publishing Company, pp. 629–640

Appendix 1. Text fragment from Figure 1 in TMX format
<tu>
<tuv xml:lang="BG" creationid="n506" creationdate="20070801T123334Z">
<seg>Сигурно добре знае, че в Индия, която е английска земя, няма да е в безопасност.</seg>
</tuv>
<tuv xml:lang="SR" creationid="n506" creationdate="20070801T123334Z">
<seg>On dobro zna da neće biti siguran u Indiji jer je to engleska zemlja.</seg>
</tuv>
<tuv xml:lang="FR" creationid="n506" creationdate="20070801T123334Z">
<seg>Il doit bien savoir qu'il ne serait pas en sûreté dans l'Inde, qui est une terre anglaise.</seg>
</tuv>
<tuv xml:lang="EN" creationid="n506" creationdate="20070801T123334Z">
<seg>He ought to know that he would not be safe an hour in India, which is English soil.</seg>
</tuv>
</tu>

Appendix 2. Some of the entities extracted by the graph money.grf
n175: Една пачка банкноти, възлизаща на огромната стойност от <b_numex type="money">петдесет и пет хиляди лири<e_numex>, бе взета от масата на главния касиер на Банк ъф Ингланд.
n175: Svežanj novčanica u iznosu od <b_numex type="money">pedeset i pet hiljada livara<e_numex> iščezao je sa stola glavnog blagajnika Engleske banke.
n175: Une liasse de bank-notes, formant l'énorme somme de <b_numex type="money">cinquante-cinq mille livres<e_numex>, avait été prise sur la tablette du caissier principal de la Banque d'Angleterre.
n176: ... в същия момент касиерът се е занимавал с вписването на приходи от <b_numex type="money">три шилинга и шест пенса<e_numex> и че човек не може да държи всичко под око.
n176: ... u tom trenutku blagajnik beležio primanje <b_numex type="money">tri šilinga i šest penija<e_numex> i da se ne može na sve obratiti pažnja.
n176: ... à ce moment même, le caissier s'occupait d'enregistrer une recette de <b_numex type="money">trois shillings six pence<e_numex>, et qu'on ne saurait avoir l'oeil à tout.
n2009: ... меркантилна Англия продава годишно за <b_numex type="money">двеста и шестдесет милиона франка<e_numex> от тази смъртоносна дрога, която се нарича опиум!
n2009: ... trgovačka Engleska prodaje godišnje onu kobnu drogu nazvanu opijum za <b_numex type="money">dve stotine šezdeset hiljada franaka<e_numex>!
n2009: ... la mercantile Angleterre vend annuellement pour <b_numex type="money">deux cent soixante millions de francs<e_numex> de cette funeste drogue qui s'appelle l'opium!
n4342: – Няма да си дам моята част от четири хиляди лири в облога – каза Андрю Стюарт, сядайки, – все пак ще получа <b_numex type="money">три хиляди деветстотин деветдесет и девет лири<e_numex>.
n4342: -- Ja svoj deo u opkladi ne bih dao pa da mi ko za njega daje <b_numex type="money">tri hiljade devet stotina i devedeset i devet livara<e_numex> - reče Endrju Stjuart sedajući.
n4316: Je ne donnerais pas ma part de quatre mille livres dans le pari, dit Andrew Stuart en s'asseyant, -- quand même on m'en offrirait <b_numex type="money">trois mille neuf cent quatre-vingt-dix-neuf<e_numex>!

Appendix 3. Some of the entities extracted by the graph measure.grf
<seg id="n50">... osamdeset četiri stepena Farenhajtovih ... осемдесет и четири градуса по Фаренхайт // 84°F
<seg id="n449">... dve hiljade osam sto tona ... две хиляди и осемстотин тона // 2 800 t
<seg id="n464">... sto šezdeset kilometara ... сто и шестдесетте километра // 160 km
<seg id="n493">... dve hiljade metara ... две хиляди метра // 2 000 m
<seg id="n839">... hiljadu do hiljadu sto milja ... хиляда до хиляда и сто мили // 1 000 - 1 100 knots
<seg id="n969">... sedamdeset i sedam stepeni ... седемдесет и седем градуса // 77°
<seg id="n2689">... pet, šest, deset stopa ... пет, шест или десет стъпки // 5, 6, 10 feet
<seg id="n2961">... tri hiljade sedam stotina osamdeset šest milja ... три хиляди седемстотин осемдесет и шест мили // 3786 knots
<seg id="n3216">... sedam hiljada pet stotina dvadeset četiri engleske stope ... седем хиляди петстотин и осемдесет английски стъпки // 7524 feet
<seg id="n3664">... pola milje ... половин миля // 1/2 mile

Using inheritance and coreness sets to improve a verb lexicon harvested from FrameNet
Mark McConville and Myroslava O. Dzikovska
School of Informatics, University of Edinburgh
2 Buccleuch Place, Edinburgh EH8 9LW, Scotland
{Mark.McConville,M.Dzikovska}@ed.ac.uk

Abstract
We investigate two aspects of the annotation scheme underlying the FrameNet semantically annotated corpus — the inheritance relation on semantic types with its corresponding links between semantic roles of increasing granularity, and the specification of coreness sets of related semantic roles — against the background of our ongoing effort to harvest a lexicon of verb entries for deep parsing. We conclude that these aspects of the FrameNet annotation scheme do prove useful for reducing the complexity and ambiguity of verb entries, allowing for semantic roles of lower granularity for purposes of deep parsing, but need to be applied more systematically to make the lexicon usable in a practical parsing system.

1 Introduction
Semantically annotated corpora and wide-coverage semantic lexicons are an important resource for building NLP systems. They have been used to train shallow semantic parsers (Gildea and Jurafsky, 2002), provide paraphrases in question answering (Kaisser and Webber, 2007), and extend lexicons for deep parsing (Crabbé et al., 2006).
All these applications use a 'frame-based' representation to express sentence semantics, where the semantic type corresponding to the meaning of a verb is related to its dependents by means of semantic roles. An essential task in building this representation is to make a connection between the surface form of the utterance and its semantics, usually by linking syntactic and semantic structure. This linking can be facilitated by a computational lexicon that describes the possible mappings. McConville and Dzikovska (2007) report on an attempt to harvest a verb lexicon for deep linguistic processing from the FrameNet 1.3 semantically annotated corpus. We demonstrated that harvesting verb entries directly from annotations, as is done in the lexical entry files currently distributed with FrameNet, results in a number of subcategorisation frames which are unsuitable for inclusion in a computational lexicon used by a deep parser. We proposed a set of filtering rules to reduce the number of spurious subcategorisation frames generated by syntactic phenomena not directly captured in the FrameNet annotation.

In this paper we evaluate how this lexicon can be further improved by using two other aspects of the linguistic annotation underlying the corpus — the organisation of the semantic types (a.k.a. 'frames') and roles ('frame elements') into a hierarchy, and the specification of certain 'coreness sets' of related roles. The FrameNet ontology is very expressive and richly structured, with the aim of simplifying a number of reasoning tasks. However, we argue that FrameNet's level of role-name granularity creates problems from the perspective of parsing, since it is traditionally assumed that verbs subcategorise for a relatively small number of arguments. We first demonstrate that it is possible to use role inheritance to reduce the size of the role set (and hence the lexicon as a whole) without losing information, thus restricting the granularity of the semantic roles used in the output representation. We then describe an attempt to apply the coreness sets defined in the FrameNet ontology to eliminate ambiguity in lexical entries, making the FrameNet-based lexicon easier to use in a parsing system. We conclude that the FrameNet annotation scheme provides useful mechanisms for reducing the complexity and ambiguity of verb entries, but that they need to be applied more systematically to make the lexicon usable in a practical parsing system.

Section 2 provides some necessary background. Section 3 discusses our investigations into the use of semantic role inheritance to reduce the vocabulary of roles invoked by arguments in verb entries. Section 4 then turns to the topic of coreness sets in FrameNet, and the extent to which they can be used to eliminate redundancy in the harvested lexicon. Finally, Section 5 discusses how our algorithms could be used in the future to benefit applications other than deep parsing.

2 Background
Regardless of the particular grammar formalism which they presuppose, lexicons used for parsing and semantic interpretation contain representations that map syntactic structure (a subcategorisation frame or a set of syntactic roles) to semantic structure (a predicate name and a set of arguments). For example, a lexical entry for the verb move would specify that: (a) the verb invokes a predicate which we might call 'motion'; (b) it subcategorises for a noun phrase subject which denotes the 'theme'
(i.e. the object undergoing movement); and (c) it also subcategorises for a prepositional phrase complement headed by the preposition to which denotes the 'goal' (i.e. the endpoint of the trajectory). This kind of information can be harvested automatically from semantically annotated corpora such as FrameNet (Baker et al., 1998), PropBank (Palmer et al., 2005) or OntoNotes (Hovy et al., 2006).

The ultimate goal of our project is to create a wide-coverage lexicon yielding representations that can be connected to the reasoning engine of a dialogue system. Thus, we chose FrameNet as our source for extracting lexical entries, since it includes an ontology which has already proved useful for information retrieval and question answering tasks (Surdeanu et al., 2003; Kaisser and Webber, 2007). The FrameNet annotation scheme allows one to harvest a lexicon by reading the subcategorisation frames and their corresponding role assignments directly off the annotated sentences. The resulting lexicon contains 2,770 verb entries, each specifying a semantic type, an orthographic form, and a set of subcategorisation frames. Subcategorisation frames are sets of arguments, each of which specifies a syntactic role, syntactic category and semantic role.[1] Here is an example lexical entry for the verb fry, derived from an annotated sentence like Matilde fried the catfish:

    ORTH  <fry>
    CAT   V
    TYPE  Apply heat
    ARGS  { [ROLE Ext, CAT NP, ROLE Cook], [ROLE Obj, CAT NP, ROLE Food] }

The subcategorisation frame lists two arguments, one for each annotated dependent in the sentence.

[1] We extracted this lexicon independently, but FrameNet contains an analogous set of lexical entries as part of the distribution, which we could have used as a starting point in the same way.
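To make the shape of such harvested entries concrete, here is a minimal sketch (ours, not the authors' code) of how the fry entry above could be represented; the class and field names are illustrative assumptions.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Argument:
        syn_role: str   # grammatical function, e.g. "Ext" (subject) or "Obj"
        syn_cat: str    # syntactic category, e.g. "NP"
        sem_role: str   # FrameNet frame element, e.g. "Cook"

    @dataclass(frozen=True)
    class SubcatFrame:
        args: tuple     # tuple of Argument

    @dataclass
    class VerbEntry:
        orth: str       # orthographic form
        sem_type: str   # FrameNet frame, e.g. "Apply_heat"
        frames: list    # list of SubcatFrame

    # The entry harvested from "Matilde fried the catfish":
    fry = VerbEntry(
        orth="fry",
        sem_type="Apply_heat",
        frames=[SubcatFrame(args=(
            Argument("Ext", "NP", "Cook"),
            Argument("Obj", "NP", "Food"),
        ))],
    )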
While collecting such entries is straightforward on the surface, not all of them would be usable with a deep parser. To begin with, all entries have to correspond to "canonical" syntactic subcategorization frames, i.e. indicative-mood, direct-word-order entries that include only syntactic complements and not modifiers. Entries for other constructions, such as passives and clefts, are normally derived by syntactic transformations and are not included in the lexicon. We addressed these issues previously (McConville and Dzikovska, 2007; McConville and Dzikovska, 2008), developing methods to remove such spurious entries from the lexicon.

Secondly, we need to consider how well the syntax-semantics mappings harvested from the corpus fit with the representations traditionally used for parsing. We observed that the representations in the extracted entries manifest at least one significant difference in this respect. While there is no easily definable "canonical" representation for semantic roles, deep parsers, generally speaking, assume that the target semantic representation utilises a relatively small set of roles. There are several reasons for this. Firstly, restricting the vocabulary of semantic roles is convenient from a representational perspective — many existing lexicons are hierarchical (Copestake and Flickinger, 2000; McConville, 2006), and having a large number of distinct roles may make the lexicon less compact because it offers fewer opportunities for re-use through inheritance. Secondly, it has been proposed that the syntactic and semantic behaviour of verbs is correlated (Levin, 1993), and can be mediated through a small set of 'thematic roles', as for example encoded in the VerbNet lexicon (Kipper et al., 2000). Finally, disambiguating between a large number of roles may require world knowledge and pragmatic information which is difficult to obtain and integrate in a domain-independent way. For example, the FrameNet semantic type Closure defines two distinct roles which can be denoted by the direct object of a transitive verb: Container portal (e.g. John closed the tent flap), and Containing object (e.g. Mary buttoned her coat). Human annotators are able to distinguish these roles based on common-sense knowledge, and whilst it is true that such distinctions may be important for certain reasoning tasks, a deep parser would find this kind of ambiguity extremely difficult to resolve. Thus, a more compact role set may be necessary to reduce the ambiguity in parsing and semantic interpretation.[2]

[2] Additional information can be brought in at a post-processing stage, linking the more generic semantic representation with a more specific knowledge representation (Dzikovska et al., 2007).

The importance of having a relatively small set of basic semantic roles has not been lost on the creators of FrameNet. Indeed, a lot of recent effort (between versions 1.1 and 1.3) has gone into organising the semantic types in the FrameNet ontology into an inheritance hierarchy and, in particular, into linking the fine-grained roles of child types with the more generic roles of their parent types. In addition, a number of 'coreness sets' of semantic roles have been specified, the idea being that only one member of a coreness set need be explicitly invoked in a well-formed, non-elliptical sentence, and hence that these roles are equivalent in some way. In the rest of this paper we describe how we used inheritance and coreness sets to eliminate redundancy both in the vocabulary of semantic roles and in the verb entries themselves.

As our general evaluation metric, we take the reduction in the number of individual roles and the reduction in the number of subcategorisation frames per verb entry in the lexicon. For comparison, we looked at two other lexicons: VerbNet, a lexicon of English verbs that aims to have complete coverage of syntactic alternations for each verb covered, and the TRIPS lexicon (Allen et al., 2007) — a multi-domain lexicon used with a wide-coverage deep grammar. These lexicons were developed independently, but share the aim of explicitly representing the connections between syntax and semantics, with VerbNet focusing more on complete coverage, and TRIPS focusing on practical parsing applications that require syntactic and semantic disambiguation. Thus, while there is no way of determining the 'ideal' number of roles per se, comparison with these lexicons can give us some insight into the complexity or redundancy of the FrameNet-based lexicon compared to lexicons intended for parsing.[3]

[3] The various lexicons are not completely independent, in the sense that TRIPS contains an ontology of concepts inspired by an early version of FrameNet (Dzikovska et al., 2004), and it contains entries extracted from VerbNet (Crabbé et al., 2006). However, all entries were hand-edited to ensure that they conform to the independently developed lexicon design.

The initial lexicon harvested from FrameNet (McConville and Dzikovska, 2007) contains 9,180 subcategorization frames, invoking 362 distinct semantic types, and arguments invoking 441 distinct semantic role labels, an average of 1.2 semantic role labels per semantic type. In comparison with other deep verb lexicons, this ratio of roles to types is quite high.
The TRIPS lexicon contains verb entries invoking 284 distinct semantic types and arguments invoking 48 distinct semantic roles, yielding a ratio of 0.17 roles per semantic type. Similarly, the VerbNet lexicon has 395 verb classes, with arguments instantiating just 33 distinct semantic/thematic roles, giving a ratio of 0.084 roles per verb class. In addition, the FrameNet-based lexicon contains 3.3 subcategorisation frames per verb entry, compared to 2.8 in VerbNet and 1.3 in TRIPS.[4]

[4] Note that the TRIPS figure is significantly lower in part because the TRIPS lexicon has been built based on the subcategorisation frames attested in spoken dialogue corpora, so it does not contain many frames that are included in VerbNet but only rarely appear in speech and dialogue.

3 Using inheritance to reduce the role set
We first consider how the inheritance relation encoded in the FrameNet ontology can be used to reduce the size of the vocabulary of semantic roles. The FrameNet ontology of semantic types is organised into an inheritance hierarchy, where child types are connected to their parents by means of an Inheritance relation. For example, this relation partitions the Motion semantic type (encoding events involving a theme traversing a path) into a number of more specific subtypes such as Self motion (the theme is a living being, acting under its own volition), Fluidic motion (the theme is a fluid), etc. All the semantic roles associated with a parent type must be implemented by some role of each child type. For example, two of the roles associated with Motion are Source (start of the trajectory) and Goal (end of the trajectory). These roles are implemented directly by all child types of Motion using roles of the same name. On the other hand, the Theme role associated with the Motion type is implemented by different roles in subtypes: in Self motion it is implemented by Self mover, in Fluidic motion by Fluid, and so on. In addition, child types can introduce new roles which are not linked to roles of parent types.

The existence of this inheritance relation and its associated links between parent and child roles has important implications for the vocabulary of semantic roles in the lexicon we harvested from FrameNet. For example, the transitive verb dismiss invokes the FrameNet semantic type Firing, and its subject and object instantiate the associated semantic roles Employer and Employee respectively, hence the following subcategorisation frame:

(1) Sbj:Employer Obj:Employee

However, the semantic type Firing is subsumed by the parent type Intentionally affect in the FrameNet ontology, with the Employer role linked to the superrole Agent and the Employee role linked to the Patient superrole. Thus, an alternative way of representing the transitive subcategorisation frame for dismiss, using the information contained in the inheritance hierarchy, is:

(2) Sbj:Agent Obj:Patient

Note that the semantic roles specified in this version are much more generic, and are similar to the kinds of role names used in the VerbNet and TRIPS lexicons. The aim of the first part of our project was to investigate the extent to which we can use information about supertypes and 'superroles' in the FrameNet 1.3 ontology to decrease the number of distinct semantic roles invoked by arguments in the harvested lexicon, thus creating a less redundant verb lexicon for deep parsing.
3.1 Methodology
We went through each argument of each subcategorisation frame of each verb entry in the harvested lexicon and, where the entry's semantic type was linked to some parent type in the FrameNet ontology and the argument's semantic role was linked to some role of the parent type, we replaced the original role with the superrole. We repeated this until we reached the root type in the ontology, which in this case involved five cycles (i.e. the maximum depth of the relevant part of the inheritance hierarchy is 5). In the cases where a role is linked to two or more distinct superroles (because of multiple inheritance in the FrameNet ontology), we included all of them.
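A minimal sketch of this generalisation step (ours, not the authors' released code), under the simplifying assumption of single inheritance; with multiple inheritance, each role would map to a set of superroles instead. The SUPERROLE table holds only the dismiss/Firing example discussed above, where real FrameNet inheritance data would be loaded.

    # (child type, child role) -> (parent type, parent role)
    SUPERROLE = {
        ("Firing", "Employer"): ("Intentionally_affect", "Agent"),
        ("Firing", "Employee"): ("Intentionally_affect", "Patient"),
    }

    def generalise(sem_type, role, cycles=5):
        """Climb the Inheritance relation for up to `cycles` steps (the
        maximum depth of the relevant part of the hierarchy), rewriting
        the role at each step."""
        for _ in range(cycles):
            link = SUPERROLE.get((sem_type, role))
            if link is None:          # orphan type, or role with no superrole
                break
            sem_type, role = link
        return role

    # Sbj:Employer Obj:Employee  ->  Sbj:Agent Obj:Patient
    print([generalise("Firing", r) for r in ("Employer", "Employee")])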
3.2 Results
The results are presented in Table 1 in the 'full lexicon' columns. Each row represents a level of recursion, i.e. '0' means that no supertypes are taken into account, '1' means that we move one level up the hierarchy, etc. The first column represents the number of distinct semantic role labels across the entire lexicon at each cycle, and the second column represents the number of distinct types of subcategorisation frame in the lexicon (where a subcategorisation frame is abstracted to a set of semantic roles). Thus, taking the lexicon we harvested from FrameNet as a whole, we can reduce the number of distinct semantic role labels by 21%, from 441 to 347.

          full lexicon        restricted lexicon
  cycle   roles   frames      roles   frames
  0       441     1256        289     807
  1       364     1129        196     653
  2       348     1083        177     596
  3       347     1083        176     596
  4       347     1083        176     596
  5       347     1083

  Table 1: Results of the inheritance experiments

The five most common roles which are the beneficiaries of this process are presented in Table 2. Note that the number of distinct role labels, 347, still appears to be very high in comparison with the selection found in other deep verb lexicons like TRIPS and VerbNet. In addition, Table 2 demonstrates that, although the three most popular roles to be introduced are the generic roles Theme, Patient and Agent, familiar from both the VerbNet and TRIPS lexicons and from mainstream theories of thematic roles, there are still some overly specific roles in evidence, for example Communicator and Sought entity.

  full lexicon                 restricted lexicon
  frequency   role             frequency   role
  1254        Theme            1843        Agent
  1150        Patient          1486        Theme
  777         Agent            1189        Patient
  709         Communicator     827         Communicator
  225         Sought entity    591         Goal

  Table 2: Most common role labels in the resulting lexicon

We hypothesised that the very small reduction in the number of semantic roles is a function of the incomplete nature of the inheritance relation in the FrameNet ontology. Recall that the FrameNet 1.3 ontology contains 362 verbal types. However, a large proportion of these, 145, are 'orphan types', in the (strong) sense that they are not linked to any other type in the ontology, neither as child nor as parent. In order to determine whether the disappointingly small reduction in distinct semantic roles as we climb the hierarchy is a result of the existence of these orphan types, we eliminated all verb entries from the harvested lexicon which invoke one of the 145 orphan types, and repeated the process. Our restricted lexicon now contains 1,729 verb entries invoking 217 distinct semantic types. There are 6,253 subcategorisation frames distributed across these entries. The results of substituting more general roles for more specific ones, according to the inheritance relation underpinning the FrameNet 1.3 ontology, are presented in the 'restricted lexicon' half of Table 1. The five most common roles which are now the beneficiaries of this process are presented on the right-hand side of Table 2.

Thus, assuming the subset of the FrameNet-harvested lexicon which only includes types which are incorporated into the inheritance relation underpinning the FrameNet 1.3 ontology, we can reduce the number of distinct semantic role labels by 39%, from 289 to 176. This is significantly higher than the 21% reduction we managed using the full lexicon, thus supporting our hypothesis that the more 'connected' the FrameNet inheritance relation is, the more useful it will be in allowing us to harvest a deep verb lexicon with a manageable set of semantic roles. The fact that only 975 of the 2,770 verb entries in the harvested lexicon have semantic types which are rooted in either the State or Event supertypes shows that the FrameNet ontology still has a way to go in this respect.

4 Using coreness sets to filter subcategorisation frames
As discussed in the introduction, after filtering out modifiers and frames derived from non-canonical usages of target verbs, the lexicon we harvested from FrameNet contained 9,180 subcategorisation frames, distributed among 2,770 verb entries. One interesting feature of the FrameNet ontology which we have not considered until now involves the specification of certain kinds of dependency between the semantic roles associated with a particular semantic type. For example, in certain semantic types, a particular subset of the semantic roles may be grouped together in a 'coreness' set, only one of which need be expressed in order to produce a complete, non-elliptical sentence. The most prevalent example of this involves the following semantic roles within the Motion semantic type and its subtypes:

• Source (e.g. from Cairo)
• Goal (to Khartoum)
• Path (down the Nile)
• Area (around the country)
• Direction (towards Alexandria)

That these five roles are grouped together into a coreness set captures the fact that they are in some sense equivalent, or that they instantiate the same underlying role, that of "trajectory". The existence of coreness sets has implications for lexical concision. For example, the harvested lexicon contains 115 entries invoking the Self motion semantic type, and these entries involve eleven distinct types of subcategorisation frame (ignoring syntactic categories) with the Self mover role as subject and these 'trajectory' roles as oblique dependents, for example:

(3) Sbj:Mover Dep:Source
    Sbj:Mover Dep:Goal
    Sbj:Mover Dep:Source Dep:Goal
    ...

However, if we assume that the trajectory roles are actually just alternative realisations of the same underlying semantic role, then we can condense all these frames into just one, where the Kleene star denotes an unbounded number of instances of the specified argument type:

(4) Sbj:Theme Dep:Trajectory*

The FrameNet 1.3 ontology specifies 210 coreness sets for 174 verbal semantic types. Each coreness set brings together an average of 2.5 semantic roles. The aim of the second part of our project was thus to investigate to what extent we can use the coreness sets defined in the ontology to consolidate the harvested lexicon, in terms of reducing the number of subcategorisation frames that need to be specified.

4.1 Methodology
We proceeded in two stages.
First of all, we went through every argument of every subcategorisation frame of every verb entry and, where the argument's semantic role was part of some relevant coreness set, we replaced the semantic role name with the coreness set name. Then we went through every verb entry and eliminated duplicate frames, assuming that two frames are identical if and only if they have the same arguments, and that two arguments are identical just in case they have the same syntactic role, syntactic category and semantic role/coreness set.
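The following minimal sketch (ours, not the authors' implementation) illustrates the two stages just described, together with the duplicate-argument elimination reported in the results below; the data structures and names are illustrative assumptions.

    # coreness-set name for a (semantic type, role) pair, e.g. the
    # 'trajectory' set within Motion noise discussed in the evaluation
    CORENESS = {
        ("Motion_noise", "Goal"): "Goal/Path",
        ("Motion_noise", "Path"): "Goal/Path",
    }

    def consolidate(sem_type, frames):
        """frames: iterable of tuples of (syn_role, syn_cat, sem_role)."""
        seen, out = set(), []
        for frame in frames:
            # stage 1: rewrite role labels to their coreness-set name
            renamed = tuple((sr, sc, CORENESS.get((sem_type, role), role))
                            for sr, sc, role in frame)
            # follow-up step: drop duplicate arguments, preserving order
            args = tuple(dict.fromkeys(renamed))
            # stage 2: drop duplicate frames within the verb entry
            if args not in seen:
                seen.add(args)
                out.append(args)
        return out

    buzz = [(("Sbj", "NP", "Theme"), ("Dep", "PP", "Goal")),
            (("Sbj", "NP", "Theme"), ("Dep", "PP", "Path"))]
    # Both frames collapse to Sbj:NP:Theme Dep:PP:Goal/Path
    print(consolidate("Motion_noise", buzz))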
4.2 Results
The first stage of the procedure, where we replaced semantic role labels with relevant coreness sets, affected 1,542 of the 2,770 verb entries in the lexicon, and 5,954 of the subcategorisation frames found in these entries. After eliminating duplicate subcategorisation frames, we were left with 7,804 frames across the lexicon as a whole (down from 9,180). Of the 7,804 subcategorisation frames left in the lexicon, 1,253 have potentially duplicate arguments, i.e. where two or more arguments have semantic roles from the same coreness set. Thus, we next eliminated all duplicate arguments from individual subcategorisation frames, resulting in a decrease in the total number of arguments across all extant subcategorisation frames, from 16,795 to 16,406. Finally, after again eliminating duplicate subcategorisation frames from within each verb entry, the lexicon contained 7,672 frames across the 2,770 verb entries. This constitutes an average of 2.8 subcategorisation frames per entry and a reduction of 16% on the original number of 9,180.

4.3 Evaluation
We wanted to evaluate whether the use of coreness sets to consolidate pairs of subcategorisation frames corresponds with linguistic intuitions about which subcategorisation frames in a verb entry are really 'equivalent' and hence 'collapsible'. To this end, we selected 100 random cases where our procedure had used coreness sets to make a judgment that two distinct subcategorisation frames were essentially the same. We ensured that our sample contained only one instance from each semantic type, so as to counteract the bias in the FrameNet corpus whereby certain types include more verbs than others and certain verbs have been more fully annotated. Where necessary, we referred to the equivalent verb entries in VerbNet and the TRIPS lexicon.

Of the 100 entries chosen, 17 involved variations of the 'trajectory' coreness set discussed above, associated with an assortment of motion, orientation and spatial extension predicates. It is important to note, first of all, that this coreness set is independently motivated, for example in the ontology of paths outlined in Jackendoff (1983), where source, goal, and other unbounded path expressions are treated as equivalent in the sense that they are alternative realisations of one and the same thematic role in conceptual structure. We verified that in all 17 cases, the coreness set did in fact correlate with this linguistic intuition, and hence that combining the two subcategorisation frames was valid. Take, for example, the following subcategorisation frames of the verb buzz from the Motion noise semantic type:

(5) Sbj:NP:Theme Dep:PP:Goal
    Sbj:NP:Theme Dep:PP:Path

The first of these includes a Goal argument (e.g. buzz into the room) and the second a Path (e.g. buzz across the room). Since the FrameNet ontology lists these in a coreness set for Motion noise, the two subcategorisation frames are combined into the following unified representation:

(6) Sbj:NP:Theme Dep:PP:Goal/Path

This decision corresponds with our linguistic intuitions about the argument structure of the verb buzz, which subcategorises for an unbounded number of trajectory expressions (e.g. The fly buzzed from the doorway across the room to the window). We used similar reasoning with the other 16 instances involving the 'trajectory' coreness set in our sample.

Of the remaining cases in our sample, a substantial number (around 40) involve what can loosely be termed 'part-whole' alternations in the relevant argument. For example, the verb claw from the Manipulation type subcategorises for subjects with two distinct semantic roles, Agent and Bodypart of agent, related through a coreness set. These two usages are exemplified in the following two sentences:

(7) Jane clawed at his back
    Fingers clawed at his back

Other examples are somewhat more abstract. For example, the verb eclipse from the Surpassing type subcategorises for two kinds of subject in the FrameNet lexicon, Profiled item and Profiled attribute, again related through a coreness set, and where the latter can be approximated as a 'part' (or possibly 'feature') of the former:

(8) John eclipsed Mary
    John's talent eclipsed Mary's

Again, the consolidation of these arguments was judged to be linguistically valid, in part because VerbNet treats them as encoding the same thematic role (i.e. Theme1). Other coreness sets which occurred repeatedly throughout our sample involved agent-cause alternations (e.g. John/The blackout disabled the alarm system) and speaker-medium alternations (e.g. The critics/survey labelled her a has-been). Again, the intuitiveness of these coreness sets is supported by VerbNet thematic roles.

However, there were at least ten cases where the coreness sets led to an invalid consolidation of arguments, in general caused by the fact that FrameNet syntactic information, and hence our lexical entry extraction procedure, does not distinguish between prepositional phrases headed by different prepositions. For example, consider the following two example sentences involving the verb jab from the Cause impact type:

(9) Mary jabbed John with a bayonet
    Mary jabbed a bayonet at John

In both these sentences, John would be annotated as an Impactee and a bayonet as an Impactor. Since these two roles are part of the same coreness set, the subcategorisation frames underlying both sentences are consolidated into the following unified representation:

(10) Sbj:NP:Agent Obj:NP:Impactor/Impactee Dep:PP:Impactor/Impactee

This is clearly undesirable, since it leads to an unnecessary level of ambiguity for a parser, a conclusion reinforced by the fact that VerbNet treats the impactee and impactor arguments with distinct thematic roles (i.e. Destination and Instrument respectively). It is worth dwelling a little on the possible reasons for FrameNet annotators formulating such an obviously unintuitive coreness set. In previous work (McConville and Dzikovska, 2007), we have noted the tendency to incorporate all uses of a particular verb into the same frame, even when syntax disagrees.
For example, take the two uses of the verb rip in the following sentences:

(11) John ripped his trousers below the knee
     John ripped the top off his packet of cigarettes

In both sentences, annotators have judged that the target verb rip evokes the Damaging frame, which has two important 'core' roles — Agent (i.e. the 'ripper') and Patient (the object that gets ripped). In this respect, annotation of the first sentence is simple — John is the Agent, his trousers is the Patient, and the prepositional phrase below the knee is assigned to a 'non-core', locative role called Subregion. Assuming that the use of the target verb rip in the second sentence also involves the Damaging frame causes problems, however — the top is assigned to the non-core Subregion role, despite being realised as a (syntactically obligatory) direct object. Thus, in this case the syntactic generalisation that subjects and objects realise core roles is overruled in favour of keeping all uses of the target verb within the same frame. A more appropriate analysis would have been to assign the use of the target verb in the second sentence to the Removing frame.

Considering again the examples involving the target verb jab in (9), we see that similar forces are at work. The hypothesised reason for grouping roles into coreness sets is where a number of distinct roles are realised by the same syntactic role — in this case, the direct object can realise either the Impactee (i.e. John) or the Impactor (i.e. a bayonet), so the formulation of a coreness set Impactor/Impactee makes sense. Note however that this is purely an artifact of the decision to treat both uses of the target verb jab as evoking the same frame. If the second sentence were treated as involving the Cause motion frame, the undesirable coreness set would not have been formulated. Therefore, we can conclude that, although the FrameNet coreness sets correspond in the vast majority of cases with valid underlying thematic roles, there are a number of problematic cases, at least some of which involve target verbs being assigned to suboptimal frames by annotators.

Note that information about the particular kind of preposition which can head a given PP argument is often considered to be part of a subcategorisation frame, especially for deep parsers (cf. the commonly used PFORM feature). If such information were available in FrameNet annotation, this would have the side effect of avoiding some of the problems caused by this kind of unintuitive coreness set, since the argument structures derived from the two sentences in (9) would not be identical — the first would have a PP[with] dependent, whereas the second would have a PP[at]. However, it would also make it more difficult to merge arguments from some of the intuitive coreness sets such as that involving trajectory arguments, since these can be introduced by a large variety of prepositions. In the future, we are planning to improve our lexicon extraction algorithm so that prepositions are taken into account in extracting and differentiating subcategorisation frames. This would require a more detailed investigation of which arguments can be merged despite using different prepositions, and in which cases they should be kept separate.

One possible solution is suggested by the approach taken in VerbNet. The arguments in a VerbNet subcategorisation frame can either be associated with a single preposition (such as with), or with a class of prepositions (such as P:loc, corresponding to a set of locative prepositions).
This encodes the intuition that in some cases the preposition is fixed by the verb, and therefore 'meaningless', while in other cases the preposition is 'meaningful' in that it corresponds to a specific predicate (on, in, under) and can be drawn from a large set of possibilities. We are therefore considering using the FrameNet corpus data to see if a preposition associated with a given role appears to be fixed, or can be drawn from a larger set, and using this as a basis for making the distinction between meaningless and meaningful prepositions associated with coreness sets.
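As an illustration of the kind of test we have in mind, the following sketch (hypothetical, not an implemented component) counts how many distinct prepositions introduce a given (semantic type, role) pair in corpus data; a pair attested with a single preposition looks 'fixed' and hence meaningless, while one introduced by many looks meaningful. The input triples are invented for illustration.

    from collections import defaultdict

    def preposition_profile(pp_instances):
        """pp_instances: iterable of (sem_type, sem_role, preposition)
        triples, one per annotated PP argument in the corpus."""
        preps = defaultdict(set)
        for sem_type, role, prep in pp_instances:
            preps[(sem_type, role)].add(prep)
        return {key: ("fixed" if len(ps) == 1 else "variable")
                for key, ps in preps.items()}

    instances = [
        ("Cause_impact", "Impactor", "with"),   # jab ... with a bayonet
        ("Cause_impact", "Impactor", "with"),
        ("Motion", "Goal", "to"),               # trajectory roles vary
        ("Motion", "Goal", "into"),
        ("Motion", "Goal", "onto"),
    ]
    print(preposition_profile(instances))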
5 Discussion
In this paper, we argued that for purposes of parsing and semantic interpretation, a less specific set of semantic roles would ease lexicon construction and disambiguation. Consider an analogy with word sense distinctions: Palmer et al. (2004) argue that different levels of granularity are needed for different applications. For example, information retrieval may require coarser distinctions, at the level of PropBank sense groupings, while machine translation may require much more fine-grained distinctions, such as those found in WordNet (Miller, 1995). Similar reasoning can be applied to semantic roles: coarser distinctions, such as the argument labelling assumed in PropBank (i.e. ARG0, ARG1, etc.), may be the easiest to disambiguate and annotate; thematic roles as used in VerbNet (i.e. AGENT, THEME, etc.) may provide an appropriate level of generalisation when linking syntactic and semantic structure; and the fine distinctions encoded in FrameNet (i.e. COOK, FOOD, etc.) may be useful for reasoning. Ideally, these different levels could be mapped to each other, similarly to the way WordNet senses are linked to VerbNet and PropBank entries. Our study is a first step in evaluating to what extent the different levels of generalisation could be linked in FrameNet through the use of features defined in its ontology, and in attempting to automatically derive a set of semantic roles and lexical entries at lower granularity.

While our research is primarily centered on the needs of a deep parser and lexicon, the algorithms we developed could also contribute to ongoing research on linking various lexical resources and annotated corpora, for both manual and automatic linking approaches. In the case of manual linking, the SemLink project[5] aims to develop correspondences between the semantic types and roles underlying PropBank, VerbNet and FrameNet. In the future, we plan to compare results of our automatic procedure with the correspondences made by human coders. Assuming that there is sufficient agreement, this automatic approach could be adapted in the future to reduce the need for manual linking.

[5] http://verbs.colorado.edu/semlink

For automatic linking, Kwon and Hovy (2006) propose an automatic algorithm for aligning role names between semantic lexicons, which achieves around 78% accuracy in aligning FrameNet and PropBank roles based on corpus evidence. It may be interesting to consider whether using either inheritance or coreness set information could improve the accuracy of the alignment algorithm. Finally, statistical parsers and semantic role labellers (Gildea and Jurafsky, 2002) could benefit from having a smaller set of semantic roles, because this would reduce the data sparsity problem. Using the hierarchy to reduce the role set could be useful here, without loss of data. It is admittedly less clear how the coreness set information could be used, but this too may be worth exploring if it could be utilised as a way of backing off to more general role names in a statistical model.

6 Conclusion
The aim of the project reported in this paper was to take a verb lexicon harvested fairly directly from the FrameNet semantically annotated corpus, and to apply some of the mechanisms within the FrameNet ontology to make this lexicon more effective for use with a deep parser. We argued that the lexicon would be improved by a more concise and generic role set, because that would simplify making links between syntax and semantics in the lexical entries. We examined: (a) the inheritance relation on semantic types, and the corresponding links between semantic roles of increasing granularity, as a means of reducing the size of the vocabulary of roles across the lexicon as a whole; and (b) the coreness sets of related semantic roles specified within the FrameNet ontology, with the aim of consolidating subcategorisation frames within individual verb entries. In both cases, we concluded that the annotation scheme provides useful, though not perfect, mechanisms for our purposes. This is in part due to the fact that the relevant aspects of the scheme are not always applied in a systematic manner across the FrameNet ontology. Making this part of the FrameNet annotation more consistent could benefit not only our application, but also applications that support linking between different resources, and potentially semantic role labelling applications.

Acknowledgements
The work reported here was supported by grants N000140510043 and N000140510048 from the Office of Naval Research.

7 References
James Allen, Myroslava Dzikovska, Mehdi Manshadi, and Mary Swift. 2007. Deep linguistic processing for spoken dialogue systems. In Proceedings of the ACL'07 Workshop on Deep Linguistic Processing, pages 49–56.
C. F. Baker, C. Fillmore, and J. B. Lowe. 1998. The Berkeley FrameNet Project. In Proceedings of COLING-ACL'98, Montreal, pages 86–90.
Ann Copestake and Daniel Flickinger. 2000. An open-source grammar development environment and broad-coverage English grammar using HPSG. In Proceedings of LREC'00, Athens, Greece, pages 591–600.
Benoit Crabbé, Myroslava O. Dzikovska, William de Beaumont, and Mary D. Swift. 2006. Increasing coverage of a domain independent dialogue lexicon with VerbNet. In Proceedings of the Third International Workshop on Scalable Natural Language Understanding (ScaNaLU 2006).
Myroslava O. Dzikovska, Mary D. Swift, and James F. Allen. 2004. Building a computational lexicon and ontology with FrameNet. In Proceedings of the LREC'04 Workshop on Building Lexical Resources from Semantically Annotated Corpora.
Myroslava O. Dzikovska, Mary D. Swift, and James F. Allen. 2007. Linking semantic and knowledge representations in a multi-domain dialogue system. Journal of Logic and Computation, Special Issue on Natural Language and Knowledge Representation.
Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.
Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: The 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 57–60.
Ray Jackendoff. 1983. Semantics and Cognition. MIT Press.
Michael Kaisser and Bonnie Webber. 2007. Question answering based on semantic roles. In Proceedings of the ACL'07 Workshop on Deep Linguistic Processing.
Karin Kipper, Hoa Trang Dang, and Martha Palmer. 2000. Class-based construction of a verb lexicon. In Proceedings of AAAI'00.
Namhee Kwon and Eduard Hovy. 2006. Integrating semantic frames from multiple sources. In Proceedings of CICLing'06.
Beth Levin. 1993. English Verb Classes and Alternations. The University of Chicago Press.
Mark McConville and Myroslava O. Dzikovska. 2007. Extracting a verb lexicon for deep parsing from FrameNet. In Proceedings of the ACL'07 Workshop on Deep Linguistic Processing, pages 112–119.
Mark McConville and Myroslava O. Dzikovska. 2008. Evaluating complement-modifier distinctions in a semantically annotated corpus. In Proceedings of LREC'08.
Mark McConville. 2006. Inheritance and the CCG lexicon. In Proceedings of EACL'06, pages 1–8.
G. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(5).
Martha Palmer, Olga Babko-Malaya, and Hoa Trang Dang. 2004. Different sense granularities for different applications. In HLT-NAACL 2004 Workshop: 2nd Workshop on Scalable Natural Language Understanding, pages 49–56, Boston, Massachusetts, USA, May 2 - May 7.
Martha Palmer, Paul Kingsbury, and Daniel Gildea. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.
Mihai Surdeanu, Sanda M. Harabagiu, John Williams, and Paul Aarseth. 2003. Using predicate-argument structures for information extraction. In Proceedings of ACL'03, pages 8–15.

An Entailment-based Approach to Semantic Role Annotation

Voula Gotsoulia
Department of Language and Linguistics, University of Essex
Colchester, United Kingdom
pghots@essex.ac.uk

Abstract
In this paper, we consider an entailment-based view of the notion of semantic role and propose an annotation scheme for associating arguments with fine-grained, prototypical properties entailed by the semantics of predicators. We empirically investigate the potential of incorporating an entailment-based layer to semantic role corpus annotation for acquisition and formalization of linguistic knowledge at a general syntax-semantics interface.
;"%01.72%'1"# ,$%-"B12$9")9"702$9)1"'$(&02)%"1*.%2"1)&:$&)#%*40/")%"9$B &0*($9)0(+*%'$&0*()$C*.&)9"702$9)0&"'1)$%")$&)&:"):"$%&)*+) 2.%%"(&)%"1"$%2:)0()($&.%$9)9$(-.$-")#%*2"110(-)GH,IJ>)@() #$%&02.9$%) 2*%#*%$) A0&:) #%"/02$&"B$%-.'"(&) 1&%.2&.%") $(B (*&$&0*()2*(1&0&.&")&:")C$101)+*%)/"4"9*#'"(&)*+)1"'$(&02) #$%10(-)$9-*%0&:'1)&:$&)$.&*'$&02$995)0/"(&0+5)&:")1"'$(B &02)%*9"1)2*(4"5"/)C5)1"(&"(&0$9)2*(1&0&."(&1)KLM)+.%(01:B 0(-)$)1:$99*A)1"'$(&02)9"4"9)*+)&"7&)0(&"%#%"&$&0*(>)N"&;)*() $)#$%$99"9)&%$2?;)2*%#*%$)A0&:)1"'$(&02)%*9")$((*&$&0*()$%") "11"(&0$9)/$&$)+*%)$2E.010&0*()*+)90(-.01&02)?(*A9"/-")$&)$) #%0(20#9"/) 15(&$7B1"'$(&021) 0(&"%+$2">) O*%'$90F$&0*() *+) 2*%#.1B0(/.2"/)90(?0(-)0(+*%'$&0*()$&)$)1.0&$C9")9"4"9)*+) -"("%$90&5)*%)$C1&%$2&0*()2$()C").1"+.9)0()$)4$%0"&5)*+)A$51>) P"10/"1)#%*40/0(-)0(10-:&)0(&*)9"702$9)1"'$(&02)#:"(*'B "($;)-"("%$90F$&0*(1)*4"%)1#"20+02)'$##0(-1)*+)$%-.'"(&) 1&%.2&.%") &*) 15(&$2&02) +*%') 2$() C") 0(2*%#*%$&"/) 0(&*) $9B &"%($&04") 151&"'1) ">->) $##950(-) '"&$B9"$%(0(-) 1&%$&"-0"1) 1.2:)$1)$2&04")9"$%(0(-)+*%)1"'0B$.&*'$&02)$2E.010&0*()*+) "7&"(/"/)1"&1)*+)$((*&$&"/)/$&$)KQM>)67&%$2&0*()*+)90(?0(-) %"-.9$%0&0"1)$2%*11)4$%0*.1)#%"/02$&")1"(1"1)$(/)2*(1&%.2B &0*(1)2$()&:.1)C").1"/)$1)$)%"'"/5)+*%)&:")1"4"%")#%*C9"') *+)1#$%1")/$&$)0()9"702$9)1"'$(&02)2*%#.1)$((*&$&0*()G0>">) &:")0(1.++020"(&)2*4"%$-")*+)1#"20+02)1"(1"1)$(/)2*(1&%.2B &0*(1)A0&:0()1"(10C9")10F"1)*+)'$(.$995)$((*&$&"/)/$&$J>)) @()&:01)2*(&"7&;) A")2*(10/"%)&:")0'#902$&0*(1)*+)/0+B +"%"(&)&:"*%"&02$9)$##%*$2:"1)/"&"%'0(0(-)"11"(&0$9)/"10-() $1#"2&1) *+) 1"'$(&02) %*9") $((*&$&0*(>) R"950(-) *() &:") 0(B 10-:&1)*+)!*A&5S1)KTM)&:"*%5)*+)#%*&*B%*9"1)A")#%*#*1")$() $((*&$&0*()12:"'")&:$&)$11*20$&"1)$%-.'"(&1)A0&:)#%*&*B &5#02$9)#%*#"%&0"1)"(&$09"/)C5)&:")1"'$(&021)*+)#%"/02$&"1>) D") /012.11) &:") #*&"(&0$9) *+) 0'#9"'"(&0(-) $() "(&$09B '"(&BC$1"/) $##%*$2:) +*%) "7&%$2&0*() *+) -"("%$9) 0(+*%'$B &0*()$C*.&)#*110C9")15(&$7B1"'$(&021)'$##0(-1>)) <:! =10/10&#&".#4*)&"%'2#51(*-# 8*%#*%$)A0&:)1"'$(&02)%*9")$((*&$&0*()$4$09$C9")+*%)6(-B 901:)%"#%"1"(&)/01&0(2&)$##%*$2:"1)&*)&:")(*&0*()*+)1"'$(B &02)%*9">) U:")I%*#*10&0*()P$(?)GI%*#P$(?J)KVWM)01)$)*(")'09B 90*() A*%/) 2*%#.1) 0() A:02:) #%"/02$&"B$%-.'"(&) %"9$&0*(1) $%") $((*&$&"/) +*%) "4"%5) *22.%%"(2") *+) "4"%5) 4"%C) 0() &:") D$99)X&%""&)Y*.%($9)#$%&)*+)&:")I"(()U%""C$(?)KVZM>)!0+B +"%"(&) 4"%C) 1"(1"1) $%") /01&0(-.01:"/) '*1&95) *() 15(&$2&02) -%*.(/1>) O*%) "$2:) 1"(1";) $%-.'"(&1) $%") (.'C"%"/) 1"B E."(&0$995>) [9&:*.-:) &:") 1$'") $%-.'"(&) 9$C"91) $%") .1"/) +*%)$99)4"%C1)G[R\];)[R\V;)^;)[R\ZJ;)&:"1")9$C"91)$%") /"+0("/)*()$)#"%B4"%C)C$101;)0>">)&:"5):$4")$)4"%CB1#"20+02) '"$(0(-)$(/)$%")*(95)2*(101&"(&)$2%*11)15(&$2&02)$9&"%($B &0*(1)*+)$)10(-9")4"%C)1"(1">)67$'#9")I%*#P$(?)$((*&$B &0*(1_)) GVJ! K[R\]) P9."B2:0#) 2*(1.'"%) 1&*2?1M) K!"#) #%*40/"/M) K[R\V)$)90+&M)&*)K[R\WBU`)&:")0(/.1&%0$9)$4"%$-"M>))) GWJ! 
@()$//0&0*(;)K[R\]) &:")C$(?M):$1)$()*#&0*()&*)K!"#) C.5M) K[R\V) $) T]a) 1&$?") 0() P@IM) +%*') K[R\WBOR`b)X*20"&")\"("%$9"M)K[R\bBUbIV)$+B &"%)Y$(>V;)Vcc]M)$&)K[R\TB[U)V;]VZ)+%$(21)$)1:$%"M>)) ) I%*#P$(?)'$?"1)(*)$&&"'#&)&*)+*%'$90F")&:")1"'$(B &021)*+)&:")%*9")9$C"91)0&)"'#9*51>)U:01)01)#$%&02.9$%95)29"$%) A0&:) :0-:"%B(.'C"%"/) 9$C"91_) [R\W;) +*%) 0(1&$(2";) 0(/0B 2$&"1)$"%"&'()*+")A0&:)&:")4"%C),!-+*.")GVJ;)A:09")A0&:)&:") 4"%C)$/0)GWJ10&)0(/02$&"1)2-/!(">),*A"%B(.'C"%"/)9$C"91) /"(*&")4$%0*.1)%*9"1)$1)A"99;)$9&:*.-:)&:"5)$%")9"11)$%C0B &%$%5)$2%*11)4"%C1_)[R\])-"("%$995)2*%%"1#*(/1)&*)&%$/0B &0*($9) $-"(&1;) "7#"%0"(2"%1;) 2"%&$0() &5#"1) *+) &:"'";) "&2>) &:$&)1.%+$2")$1)1.Cd"2&1)*+)&%$(10&04")4"%C1)$(/)$)29$11)*+) 0(&%$(10&04"1) 2$99"/) .("%-$&04"1e) [R\V) 01) $110-("/) &*) *Cd"2&1)*+)&%$(10&04")4"%C1)$(/)1.Cd"2&1)*+).($22.1$&04"1) $(/)01)&:")"E.04$9"(&)*+)#$&0"(&;)&:"'";)"&2>)H*("&:"9"11;) &:"%")$%")1&099)0(2*(101&"(20"1)"4"()+*%)[R\])$(/)[R\V>) @()"++"2&;)10(2")(*)2*(101&"(&)'$##0(-)01)"(1.%"/)C"&A""() $)9$C"9)$(/)$)1"'$(&02)%*9";)I%*#P$(?)9$C"91)/*)(*&)9"(/) &:"'1"94"1)&*)$(5)+*%'$90F$&0*()*+)90(-.01&02)?(*A9"/-">) 8.%%"(&95;)$()$&&"'#&W)01)'$/")&*)'$#)$%-.'"(&)9$C"91)&*) 1"'$(&02$995) '*%")2*:"%"(&)%*9"1)C5)"(1.%0(-)&:"0%)2*(B 101&"(25)A0&:0()4"%C)29$11"1)/"+0("/)C5)f"%CH"&T>) )))))))))))))))))))))))))))))))))))))))) ))))))))))))))))))) V )[R\b)0(/02$&"1)$/d.(2&1>)@&)01)-"("%$995)+*99*A"/)C5)*(")*+)$) 1"&)*+)+.(2&0*($9)&$-1)/"(*&0(-)&:")%*9")*+)&:")"9"'"(&)0()E."1&0*(;) ">->)[R\bB,`8)+*%)9*2$&04"1;)[R\bBUbI)+*%)&"'#*%$91;)"&2>) W ):&&#_gg4"%C1>2*9*%$/*>"/.g1"'90(?g) T )f"%CH"&)KVTM)01)$()0'#9"'"(&$&0*()*+),"40(S1)KVhM)4"%C)29$11"1) /"+0("/)*()&:")C$101)*+)&:")$C090&5)*+)4"%C1)&*)#$%&020#$&")*%)(*&)0() #$0%1) *+) 15(&$2&02) +%$'"1) %"#%"1"(&0(-) $9&"%($&0*(1) 0() &:") "7B #%"110*()*+)&:"0%)$%-.'"(&1)G1*B2$99"/).*')3"2*21'#)"!%')*-%2J>) ) O%$'"H"&h;) *() &:") *&:"%) :$(/;) 01) 2%"$&0(-) $() *(90(") 1"'$(&02)9"702*()C$1"/)*()O099'*%"S1)KhM)&:"*%5)*+)+%$'") 1"'$(&021>)@&)/"12%0C"1)A*%/)'"$(0(-)0()&"%'1)*+).(/"%B 950(-)2*(2"#&.$9)1&%.2&.%"1)"(2*/"/)0()&:")+*%')*+)&!'4"2;) 0>">) 12:"'$&02) %"#%"1"(&$&0*(1) *+) 1&"%"*&5#"/) 10&.$&0*(1) 2$#&.%0(-)%"$9BA*%9/)?(*A9"/-">)6$2:)+%$'")01)$11*20$&"/) A0&:)$)1"&)*+)9"702$9)0&"'1)&:$&)"+-5")0&)$(/)$)1"&)*+)%*9"1) G&!'4"1"#"4"%)2J)2*%%"1#*(/0(-)&*)&:")#$%&020#$(&1)0()&:") /"10-($&"/) #%*&*&5#02$9) 10&.$&0*(>) [) /01&0(2&0*() 01) '$/") C"&A""()(-!")$(/)%-%6(-!")G'$%-0($9J)%*9"1>) O%$'"H"&) 0(29./"1) '$(.$995) $((*&$&"/) "7$'#9") 1"(&"(2"1) +%*') &:") P%0&01:) H$&0*($9) 8*%#.1) #%*40/0(-) $//0&0*($9) 9$5"%1) *+) #:%$1") 1&%.2&.%") $(/) -%$''$&02$9) +.(2&0*()$((*&$&0*()KZM>)@&)$91*)0(29./"1)&A*)1'$99)2*%#*%$) *+) +.99B&"7&) $((*&$&0*() 0(&"(/"/) &*) +$2090&$&") 1&$&01&02$9) $($95101) *+) +%$'"B1"'$(&02) 1&%.2&.%"1>) 8.%%"(&95;) 0&) 2*(B &$0(1) '*%") &:$() QWZ) +%$'"1) 2*4"%0(-) '*%") &:$() L;c]]) 9"702$9) .(0&1>) U:") +*99*A0(-) 1"(&"(2") "7"'#90+0"1) &:") X3II,N)+%$'")0()A:02:)i$)X3II,@6R)-04"1)$)Uj6b6)&*)$) R68@I@6HU) &*) +.9+099) $) (""/) *%) #.%#*1") GI3RB I`X6k`OkR68@I@6HUJ)*+)&:")R68@I@6HUl>) GTJ! 
KX3II,@6R)R.110$M)A099),!-+*.")KR68@I@6HU)X5%0$M) KUj6b6) A0&:) "E.0#'"(&) $(/) :0-:) &"2:(*9*-5M) KI3RI`X6k`OkR68@I@6HU) +*%) &:01) #"$2"+.9) #.%B #*1"M>) ) O%$'"H"&)$4*0/1)&:")/0++02.9&0"1)*+)$&&"'#&0(-)&*)#0() /*A()$)1'$99)1"&)*+)-"("%$9)%*9"1>)@(1&"$/;)+%$'")"9"'"(&1) $%")/"+0("/)9*2$995;)0>">)0()&"%'1)*+)+%$'"1)&:$&)$%")10&.$&"/) 0() 1"'$(&02) 1#$2") C5) '"$(1) *+) /0%"2&"/) G$15''"&%02J) %"9$&0*(1>)6$2:)+%$'"B&*B+%$'")%"9$&0*()-"("%$995)$11*20B $&"1)$)9"11)/"#"(/"(&)*%)'*%")-"("%$9)+%$'")GX.#"%k+%$'"J) A0&:)$)'*%")/"#"(/"(&)*%)9"11)-"("%$9)*(")GX.Ck+%$'"JZ>) U:") +*%'.9$&0*() *+) -"("%$90F$&0*(1) $C*.&) #*110C9") '$#B #0(-1) *+) +%$'") "9"'"(&1) &*) -%$''$&02$9) +.(2&0*(1) G0>">) 90(?0(-) -"("%$90F$&0*(1J) "11"(&0$995) %"90"1) *() &:") "1&$CB 901:'"(&)*+)$)+%$'"):0"%$%2:5)$(/)$)&:"*%5)*+)+%$'")"9"B '"(&)0/"(&0&0"1)*%)$($9*-1)$2%*11)+%$'"1>) >:! !"#$"%&'()*"%+,&-*.#!//01&23#%1# 4*)&"%'2#51(*#!""1%&%'1"# >:9#?&2@A017".# [)1.C1&$(&0$995)/0++"%"(&)$##%*$2:)&*)1"'$(&02)%*9"1)01)#.&) +*%&:)C5)!*A&5)KTM>)R"+%$0(0(-)+%*')&:")0/"$)*+)1"'$(&02) %*9"1) $1) /012%"&") 2$&"-*%0"1 Q )!*A&5) /"12%0C"1) $%-.'"(&) 1"9"2&0*() G0>">) &:") E."1&0*() *+) A:$&) #%0(20#9"1) /"&"%'0(") A:02:) $%-.'"(&) *+) $() %B#9$2") %"9$&0*() /"(*&"/) C5) $) #%"/02$&")01)"7#%"11"/)C5)A:02:)-%$''$&02$9)%"9$&0*(J)0() &"%'1)*+)+0("B-%$0("/;)#%*&*&5#02$9)#%*#"%&0"1)"(&$09"/)C5) &:")1"'$(&021)*+)#%"/02$&"1>)j")-04"1)&:")+*99*A0(-)901&1)*+) "(&$09'"(&1)29$110+0"/)0()&A*)29.1&"%)2*(2"#&1)&:$&):")2$991) I%*&*B[-"(&)$(/)I%*&*BI$&0"(&_) )))))))))))))))))))))))))))))))))))))))) ))))))))))))))))))) h ):&&#_gg+%$'"("&>0210>C"%?"9"5>"/.g) )O*%)$)/"&$09"/)/"12%0#&0*()*+)&:"1")%"9$&0*(1)1"")&:")O%$'"H"&) P**?)KVmM)##>)V]hBVVV>)) Q )!*A&5)KWM):$1)$%-."/)&:$&)%*9")&5#"1)90?")$-"(&;)#$&0"(&;)&:"'";) "&2>)$%")099B+*.(/"/)0($1'.2:)$1)0&)01)/0++02.9&)&*)"1&$C901:)1"&1)*+) #%*#"%&0"1)&:$&)#02?)*.&).(0+0"/)G.(/"2*'#*1$C9"J)(*&0*(1>)) Z GhJ! ) GZJ! 8*(&%0C.&0(-)#%*#"%&0"1)+*%)&:")[-"(&)I%*&*BR*9"_) $>! 4*90&0*($9)0(4*94"'"(&)0()&:")"4"(&)*%)1&$&") C>! 1"(&0"(2")G$(/g*%)#"%2"#&0*(J) 2>! 2$.10(-)$()"4"(&)*%)2:$(-")*+)1&$&")0()$(*&:"%) #$%&020#$(&) />! '*4"'"(&)G%"9$&04")&*)&:")#*10&0*()*+)$(*&:"%) #$%&020#$(&J) ">! G"701&1) 0(/"#"(/"(&95) *+) &:") "4"(&) ($'"/) C5) &:")4"%CJ) 8*(&%0C.&0(-)#%*#"%&0"1)+*%)&:")I$&0"(&)I%*&*BR*9"_) $>! .(/"%-*"1)2:$(-")*+)1&$&") C>! 0(2%"'"(&$9)&:"'") 2>! 2$.1$995)$++"2&"/)C5)$(*&:"%)#$%&020#$(&) />! 1&$&0*($%5) %"9$&04") &*) '*4"'"(&) *+) $(*&:"%) #$%&020#$(&) ">! 
G/*"1)(*&)"701&)0(/"#"(/"(&95)*+)&:")"4"(&;)*%) (*&)$&)$99J) ) @() !*A&5S1) '*/"9;) I%*&*B[-"(&) $(/) I%*&*BI$&0"(&) $%") 2*(2"#&.$90F"/) $1) 1.0&$C9") $C1&%$2&0*(1) &:$&) /"+0(") $) 1"B '$(&02) 2*(&0(..') '$##0(-) /0%"2&95) *(&*) 15(&$7) *() &:") C$101)*+)$)(.'"%02$9)2*'#$%01*()90(?0(-)#%0(20#9"m>)U:$&) 01;)(*).(0+50(-)1"'$(&021)01)0'#90"/)+*%)"0&:"%)*+)&:")901&1) 0() GhJBGZJ>) X"'$(&02) #%*#"%&0"1) $%") $11*20$&"/) A0&:) -%$''$&02$9) 2$&"-*%0"1) 0() $) '$(5B&*B*(") +$1:0*() $(/) $%-.'"(&1) $%") $99*A"/) &*) :$4") /0++"%"(&) /"-%""1) *+) '"'C"%1:0#) &*) C*&:) #%*&*B%*9"1>) @&) &:.1) '010(&"%#%"&1) !*A&5)&*)1#"$?)*+)$)#$%&02.9$%)$%-.'"(&)*+)$)#%"/02$&")$1) )3")I%*&*B[-"(&)*%))3")I%*&*BI$&0"(&>)) @() $) %"9$&"/) 4"0(;) D"2:19"%) KVc;) W]M) $($95F"1) $%-.B '"(&)1&%.2&.%")0()&"%'1)*+).(04"%1$9)1"'$(&02)#%0'0&04"1;) 0>">)2*(2"#&1)0(/"#"(/"(&95)%"E.0%"/)C5)&:")1"'$(&021)*+) ($&.%$9) 9$(-.$-">) `(") 1.2:) #%0'0&04") 01) H*&0*(;) A:02:) %"2*(1&%.2&1)!*A&5S1)"(&$09'"(&)*+)1"(&0"(2")G*%)#"%2"#B &0*(J) C5) $11.'0(-) $() $15''"&%02) %"9$&0*() *+) (*&0*(>) ,0(?0(-)+*%)#%"/02$&"1)&:$&)"(&$09)(*&0*()01)2*(1&%$0("/)0() $22*%/$(2") A0&:) $) 2*(101&"(&) #$&&"%() 1.--"1&0(-) &:$&) &:") 0(/040/.$9) /"(*&"/) C5) &:") 1.Cd"2&) *+) $) &%$(10&04") 4"%C) 01) "(&$09"/)&*):$4")$)(*&0*()*+)&:")0(/040/.$9)/"(*&"/)C5)&:") *Cd"2&;)A:09")&:")%"4"%1")"(&$09'"(&)/*"1)(*&)("2"11$%095) :*9/>) D:09") !*A&5S1) #%*&*B%*9") $($95101) 01) %"1&%02&"/) &*) &:")/*'$0()*+) '*(*&%$(10&04")4"%C1;)D"2:19"%)2*(10/"%1) #%"#*10&0*($9)2*'#9"'"(&1)$1)A"99>)j")$%-."1)0()+$4*.%)*+) &:")40"A)&:$&)'$(5)#%"#*10&0*(1):"$/0(-)2*'#9"'"(&)II1) $%") 1"'$(&02$995) 2*(&"(&+.9) G%$&:"%) &:$() 15(&$2&02$995) &$--0(-)$)2*'#9"'"(&)*+)&:")4"%CJ)$(/)&:$&)&:"0%)1"'$(B &021)'.1&).(0+5)A0&:)&:")1"'$(&021)*+)&:")#%"/02$&"L>)) )))))))))))))))))))))))))))))))))))))))) ))))))))))))))))))) m )[%-.'"(&)1"9"2&0*()#%0(20#9"_)@()#%"/02$&"1)A0&:)-%$''$&02$9) 1.Cd"2&)$(/)*Cd"2&;)&:")$%-.'"(&)+*%)A:02:)&:")#%"/02$&")"(&$091) &:") -%"$&"1&) (.'C"%) *+) I%*&*B[-"(&) #%*#"%&0"1) A099)C") 9"702$9B 0F"/) $1) &:") 1.Cd"2&) *+) &:") #%"/02$&"e) &:") $%-.'"(&) :$40(-) &:") -%"$&"1&)(.'C"%)*+)I%*&*BI$&0"(&)"(&$09'"(&1)A099)C")9"702$90F"/) $1)&:")/0%"2&)*Cd"2&>)) L )U:01) 40"A) *+) #%"#*10&0*($9) 2*'#9"'"(&1) :$1) *%0-0($995) C""() /"4"9*#"/)C5)\$A%*()KmM;)A:*)#*0(&1)*.&)&:$&)#%"#*10&0*(1)&:$&) *22.%) +"9020&*.195) A0&:) $) #$%&02.9$%) 4"%C) $%") (*&) %$(/*'>) [2B 2*%/0(-)&*)\$A%*(;)$)("2"11$%5)G&:*.-:)(*&)1.++020"(&J)2*(/0&0*() +*%)$)#%"#*10&0*()&*)C")1"9"2&"/)+*%)$)-04"()2*'#9"'"(&)*+)$)4"%C) 01)&:$&)&:")#%"#*10&0*($9)1"'$(&021)C")2*'#$&0C9")G*%)$)2*'#*B ("(&)*+J)&:")1"'$(&021)*+)&:")4"%C>)8"%&$0()/"-%"")*+)$%C0&%$%0("11) ) [)'*%")*%)9"11)10'09$%)40"A)*+)$%-.'"(&)1&%.2&.%")01) "1#*.1"/)C5)!$401)KVM>)j")%"0+0"1)#%*&*B%*9"1)$1)$&&%0C.&"1) 0(&*)9"702$9)1"'$(&02)1&%.2&.%"1)&:$&)2$#&.%")90(-.01&02$995) %"9"4$(&)$1#"2&1)*+)$)4"%CS1)'"$(0(-)0()&:")+*%'$901')*+) j"$/B/%04"()I:%$1")X&%.2&.%")\%$''$%>)6$2:)$&&%0C.&")01) $11*20$&"/)A0&:)*(")*%)'*%")*+)$)1#"20+0"/)1"&)*+)"(&$09B '"(&1):*9/0(-)*+)&:")/"(*&"/)#$%&020#$(&c>)!$401S) '*/"9) C.09/1).#*()!*A&5S1)$(/)D"2:19"%S1)1"&1)*+)"(&$09'"(&1>) P5) 0'#*10(-)1*'")0(&"%($9)1&%.2&.%")&*)9"702$9)1"'$(&02) %"#%"1"(&$&0*(1)0&)#%*40/"1)$()$22*.(&)+*%)&:")2$.1$9)%"9$B &0*(1:0#1)C"&A""()"4"(&1)G$(/)&:"0%)#$%&020#$(&1J)$(/)&:") 2*'C0("/)"++"2&)*+)"(&$09'"(&1)A0&:)%"1#"2&)&*)90(?0(-V]>) >:<#!""1%&%'"A#%3*#$"%&'()*"%-# X*)+$%;)#%*&*B%*9")#%*#"%&0"1):$4")C""().1"/)0()1"'$(&02) %*9") $((*&$&0*() &*) 2:$%$2&"%0F") &:") .(/"%950(-) 1"'$(&021) 
*+)&:")%*9"1).1"/)+*%)&:")'$%?.#>)I%*#P$(?;)+*%)"7$'#9";) /"+0("1)&:")1"'$(&02)2*(&"(&)*+)[R\])$(/)[R\V)0()&"%'1) *+) !*A&5S1) I%*&*B[-"(&) $(/) I%*&*BI$&0"(&) "(&$09'"(&1;) %"1#"2&04"95e)[R\])$(/)[R\V)$%";)0()"++"2&;)29.1&"%1)*+) 4$%0*.1)&5#"1)*+)#$%&020#$(&1)/"+0("/)15(&$2&02$995)$1)A"99) $1)1"'$(&02$995;)10'09$%)&*)!$401S)#%*&*B%*9")$&&%0C.&"1)G5"&;) 2%.20$995)9"11)2*:"%"(&J>) O%$'"H"&;)*()&:")*&:"%):$(/;)/"+0("1)1#"20+02)+%$'"1) $(/) +%$'") "9"'"(&1) 0() &"%'1) *+) +0("B-%$0("/) 9"702$9) "(B &$09'"(&1) 1:$%"/) C5) 0(/040/.$9) 9"702$9) .(0&1>) I%*&*B%*9") "(&$09'"(&1)1.2:)$1)(*&0*(;)2$.1$&0*(;)4*90&0*(;)"&2>)+*%') &:")C$101)+*%)&:")/"+0(0&0*()*+)$C1&%$2&)+%$'"1)+%*')A:02:) '*%")1#"20+02)*("1)0(:"%0&)A0&:0()&:")+%$'"):0"%$%2:5)G">->) [D[R6H6XX;)@HU6HU@`H[,,Nk[8U;)"&2>J>) @()&:")%"1&)*+)&:01)1"2&0*(;)A")#%*#*1")&*)'$%?)$%-.B '"(&1)*+)#%"/02$&"1)A0&:)#%*&*B%*9")#%*#"%&0"1)"7,#*(*)#0;) 0>">)0()$().('"/0$&"/)A$5>)D")/"12%0C")$)&"(&$&04")1"&)*+) "(&$09'"(&1) G2*(2"#&.$90F"/) A0&:0() $C1&%$2&) 9"702$9) 1"B '$(&02)%"9$&0*(1J)0(&"(/"/)&*)2*4"%)$)C%*$/)%$(-")*+)4"%C1) A0&:) 4$%0*.1) 15(&$2&02) #$&&"%(1) C"5*(/) &%$(10&040&5) $(/) 0(29./") 1*'") $((*&$&0*(1) &:$&) "7"'#90+5) *.%) 12:"'">) I%"#*10&0*($9) 2*'#9"'"(&1) +0990(-) ("2"11$%5) 19*&1) *+) #%"/02$&") 1"'$(&021) $%") '$%?"/) A0&:) &:") 2*%%"1#*(/0(-) "(&$09'"(&1;) &*) A:02:) &:"5) '0-:&) C") 2*(&%0C.&0(-) $//0B &0*($9)0(+*%'$&0*(>)I%"#*10&0*($9)1"'$(&021)01)&:.1)%"#%"B 1"(&"/) 0() &"%'1) *+) &:") 2*''*() C$101) 0&) 1:$%"1) A0&:) #%"/02$&") 1"'$(&021>) U:") 0'#902$&0*(1) *+) 1.2:) $() $#B #%*$2:)$%")/012.11"/)0()&:")("7&)1"2&0*(>)) ) 0>)[)"1%'1")%"9$&0*()10'09$%)&*)&:$&)#%*#*1"/)C5)D"2:19"%) )))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))) ))))))))))))))) 01)0'#*1"/)C5)0(/040/.$9)9"702$9)1&0#.9$&0*(1>) c )I%*&*B%*9")$&&%0C.&"1)$%").1"/)$1)$()$##%*#%0$&")9"4"9)*+)%"#%"B 1"(&$&0*()+*%)1&$&0(-)$)1'$99)(.'C"%)*+)90(?0(-)#%0(20#9"1>) V] )!$401)'*/"91)&:")+$2&)&:$&)&:")2$.1$9)1&%.2&.%")*+)&:")1"'$(B &021)*+)$)#%"/02$&")&$?"1)#%"2"/"(2")*4"%)$99)*&:"%)"(&$09'"(&1)+*%) #.%#*1"1)*+)90(?0(->)H*&")&:$&)&:")9"702$9)1"'$(&02)%"#%"1"(&$B &0*(1):")#%*#*1"1)0'#9020&95)%"95)*()#%"40*.1)A*%?)C5)\%.C"%)KcM;) Y$2?"(/*++) KV];) VVM;) U$9'5) KVLM) $(/) I0(?"%) KVQM>) [99) *+) &:"1") A*%?1)$/4*2$&")$)%"9$&0*($9)40"A)*+)&:"'$&02)%*9"1)/"+0(0(-)&:"') 0()&"%'1)*+)1"&1)*+)#*10&0*(1)0()1"'$(&02)1&%.2&.%"1;)&:"'1"94"1) C$1"/) .#*() 1"&1) *+) -%$''$&02$995) %"9"4$(&) "9"'"(&1)*+) 9"702$9) 1"'$(&021;) 0>">) %"2.%%0(-) '"$(0(-) 2*'#*("(&1) &:$&) :$4") 1*'") "++"2&)*()$)4"%CS1)-%$''$&02$9)C":$40*%>)[)+$0%95)10'09$%)0(&.0B &0*().(/"%90"1)&:")1"'$(&02)#%"/02$&"1)G">->)4-)*-%;)(-%)'();)*%) ('/2"J)$11*20$&"/)A0&:)&:")1"'$(&021)*+)4"%C)29$11"1)0()f"%CH"&>)) 29$110+0"1) $%-.'"(&1) G'*%") $22.%$&"95;) &:") #$%&020#$(&1) /"(*&"/)C5)&:"'J)0(&*)(-%("*+"!2)G"(&$09"/)&*)("2"11$%095) :$4")1*'")(*&0*()*%)#"%2"#&0*()*+)*&:"%1J)$(/)(-%("*+".) *("1>)) GQJ! K8`H86@f6R) @M) (*&02"/) K8`H86@f6!) &:"0%) $#B #"$%$(2"M)$(/)$91*)(*&02"/)K8`H86@f6!)&:$&;)9"+&) $9*(";)&:"5)/01$##"$%"/)&**M>) GmJ! @&)$##"$%1)&:$&)K8`H86@f6R)X2:A$%F"("--"%M)A099) %"("-") K8`H86@f6!) 
*() $() $-%""'"(&) :") '$/") A0&:)&"$2:"%1M>)) ) @()$22*%/$(2")A0&:)!*A&5S1)&:"*%"&02$9)$11.'#&0*(1;) 1"'$(&02)#%*#"%&0"1)$%")-"("%$995)'"$(&)&*)C")$11*20$&"/) A0&:) $%-.'"(&1) 0() $) '$(5B&*B*(") +$1:0*(>) U:.1) #$%&020B #$(&1)&:$&)1:$%")&:")1$'")"(&$09'"(&)$%")/01&0(-.01:"/)0() &"%'1)*+)$(5)$//0&0*($9)"(&$09'"(&1)&:"5)'0-:&):$4">)O*%) 0(1&$(2";)4"%C1)"(&$090(-)&:$&)$)2*(2"04"%):$1)$)(*&0*()*+) '*%") &:$() *(") "(&0&5) '0-:&) 0(4*94") $) 2*'#9"7) (*&0*() %"9$&0*()0()A:02:)2*(2"04"/)$%-.'"(&1)$%")%"9$&"/)&*)"$2:) *&:"%)C5)'"$(1)*+)$()0(&"%($9),!".*(')*+")%"9$&0*()GA0&:0() &:")2*(2"04"%S1)'"(&$9)*%)#"%2"#&0*($9)1&%.2&.%"1J>)U:"1") $%-.'"(&1)$%")&*)C")'$%?"/)A0&:)2*%%"1#*(/0(-)$//0&0*($9) "(&$09'"(&1)1.2:)$1)"%)*)01$(/),!".*(')">) GLJ! K8`H86@f6R) @M) 1**() 2*(10/"%"/) K8`H86@f6!;) 6HU@UN) :0'M) K8`H86@f6!;) IR6!@8[U6) #$%&) *+) '5)+$'095M>) GcJ! K8`H86@f6R) U:") #*902"M) 1.1#"2&) K8`H86@f6!;) 6HU@UN)H*$:)R*-"%1M)K8`H86@f6!;)IR6!@8[U6) *+)$0/0(-)&:")%*CC"%5)9$1&)(0-:&M>) GV]J! K8`H86@f6R) X*'"M) 9$C"9"/) K8`H86@f6!;) 6HB U@UN) :0'M) K8`H86@f6!;) IR6!@8[U6) $) A*'$(B 0F"%M>) ) @()"4"(&)&5#"1;)*()&:")*&:"%):$(/;)0()A:02:)(*)0(&"%($9) %"9$&0*()01)"(&$09"/)&*):*9/)*+)#$%&020#$(&1)*+)A:*')$)2*(B 2"04"%):$1)$)(*&0*(;)A")/01&0(-.01:)$%-.'"(&1)&:$&)1:$%") &:01)"(&$09'"(&)0()&"%'1)*+)&:"0%)1"'$(&02)1$90"(2"_)$%-.B '"(&1) &:$&) $%") 1"2*(/$%5) *%) 9"11) 1$90"(&) 0() &"%'1) *+) &:") "11"(&0$9)9"702$9)1"'$(&02)%"9$&0*()/"(*&"/)C5)$)#%"/02$&") $%")$11*20$&"/)A0&:)$)'*%")1#"20+02)#%*#"%&5)&"%'"/)(-%6 ("*+".8$'(59!-/%.82)')"8-&8'&&'*!2) G(-%("*+".8$2-'JVV >) O*%)"7$'#9";)&:")#%0'$%5)+*2.1)0()&:")1"'$(&021)*+)$)4"%C) 90?")2(-/!)GVVJ)01)0(&.0&04"95)*()&:")12*.%"/)"(&0&5;)(*&)&:") 1*.-:&)"(&0&5)G2*(&%$%5)&*)4"%C1)90?")2"'!(3)*%)#--51&-!J>))) GVVJ! K8`H86@f6R) X20"(&01&1M) 12*.%"/) K8`H86@f6!) &:")02")1$'#9"1M)K8`H86@f6!kPX`[)+*%)10-(1)*+) 90+"M>) GVWJ! K8`H86@f6R)U:")2*$2:M)2*.9/)/01&0(-.01:)K8`HB 86@f6!) &:") &A0(1M) K8`H86@f6!kPX`[) C5) &:"0%) :$0%M>)) ) 00>) [() $C1&%$2&) %"9$&0*() *+# 2&7-&%'1") 01) "(&$09"/) C5) &:") 1"'$(&021)*+)#%"/02$&"1)&:$&)0(4*94")$++"2&"/)#$%&020#$(&1>) @()&:")/"(*&"/)"4"(&1;)$)('/2"!)01).1.$995)"(&$09"/)&*)$++"2&) $)('/2"")0()$)#:5102$9)*%)'"(&$9)'$(("%>)8$.1$995)$++"2&"/) #$%&020#$(&1)'$5)$//0&0*($995):$4")'*%")1#"20+02)#%*#"%B )))))))))))))))))))))))))))))))))))))))) ))))))))))))))))))) VV )X.2:)$%-.'"(&1)$%")*+&"()15(&$2&02$995)*#&0*($9>)H"4"%&:"9"11;) 15(&$2&02)#%*#"%&0"1)/*)(*&)0(/"#"(/"(&95)50"9/)1.++020"(&)/0$-B (*1&021)+*%)0/"(&0+50(-)&:")1"'$(&02)1&$&.1)*+)$%-.'"(&1>) ) &0"1) 1.2:) $1)(3'%9"6-&62)')") *%) *%(!"4"%)'#1 )3"4"3--.VW) %"+"%%0(-)&*)%"$/095)*C1"%4$C9")2:$(-"1)0()&:"0%)G#:5102$9) *%) '"(&$9J) 1&$&"1e) +.%&:"%'*%";) #%"/02$&"1) &:$&) "(&$09) $) (3'%9"6-&62)')") '$5) $91*) 9"702$90F") $)2-/!(") $(/g*%) "%.) 1&$&")*+)&:")$++"2&"/)"(&0&5>) GVTJ! K8[3X66)j01):*'")$(/)2$%M):$/)C""()$&&$2?"/)0() &:")#$1&>) GVhJ! K8[3X6R) b5) /$/M) 2:$(-"/) K8[3X66;) 8jk`OkXU[U6) :01) :$0%) 2*9*%M) KX`3R86kXU[U6) +%*')%"/M)K6H!kXU[U6)&*)C9."M)&*/$5>) GVZJ! K8[3X6R) X$'$(&:$M) &"%%*%0F"/) K8[3X66;) 8jk`OkXU[U6) &:") 2:09/%"(M) K6H!kXU[U6) 0(&*) 12%"$'0(-M>) GVQJ! K8[3X66) U:01) 1"%402"M) A099) /0'0(01:) K8[3X66;) 8jk`OkXU[U6)0()E.$90&5M>) GVmJ! 
K8[3X6R)j"M)2*$&"/)K8[3X66;)@H8RkUj6b6)&:") A$99M)K8[3X66)A0&:)#$0(&M>) ) [)2$.1$9)"4"(&)'$5)$9&"%($&04"95)0(4*94")$)#$%&020#$(&) &:$&)"0&:"%)C%0(-1)$C*.&)*%)$++"2&1)$)!"2/#)*%91"+"%)1-!12)')">) I%"1"%40(-)'.2:)*+)A:$&)01)4$9.$C9")0()!$401S)'*/"9;)A") ?""#)&%$2?)*+)&:")0(&"%($9)1&$&.1)*+)"(&$09'"(&1)A0&:0()&:") 2$.1"/)*%)$++"2&"/)"4"(&)C5)%"#%"1"(&0(-)&:"')0()1E.$%") C%$2?"&1>)R*.-:95;)$)(-%("*+"!)/0++"%1)+%*')$):(-%("*+"!;) 0()&:$&)&:")9$&&"%)01)$()"(&0&5)&:$&)01)('/2".1)-1(-%("*+">)O*%) 0(1&$(2";)$)2*''.(02$&0*()4"%C)90?")!",-!))01)%"#%"1"(&"/) 0() $22*%/$(2") A0&:) &:") "(&$09'"(&) &:$&) $) 1#"$?"%) ("2"1B 1$%095)2*(2"04"1)*+)$()$//%"11"")$(/)2"%&$0()'"11$-")$(/) 2$.1"1)&:$&)&:")$//%"11"")2*(2"04"1)*+)&:$&)'"11$-")&**>) X0'09$%95;)4"%C1)1.2:)$1)'*.)$(/)(-%2)!'*%10(4*94")2$.1B $995) $++"2&"/) 1&$&"1) *+) $++$0%1) /"12%0C"/) 0() &"%'1)*+) &:"0%) 0(&"%($9) "(&$09'"(&1) G2+>) 4"%C1) *+) 2$.1"/) '*&0*() $(/) #*11"110*() /012.11"/) C"9*AJ>) U:") %*9"1) $11*20$&"/) A0&:) '*.;)+*%)0(1&$(2";)$%")%"#%"1"(&"/)C5)'"$(1)*+)&:").(/"%B 950(-)"(&$09'"(&)&:$&)1*'")"(&0&5)G.*2(/22*-%1-&1)3"1&*#4J) $++"2&1) &:") 0(&"%($9) %"9$&0*() C"&A""() $() 0(&"(&0*($9) 2*(B 2"04"%) G)3"1 )"'(3"!J) $(/) $) 2*(2"04"/) #$%&020#$(&) G&:") "+'#/')*-%1-&1)3"1"&&"()*+"%"221-&1)3"1&*#4J>) GVLJ! K8[3X6R;) 8`H86@f6R) j"M) %"#*%&"/) KK8`HB 86@f6!M;)8`H86@f6!)&:")'$&&"%M)KK8`H86@f6RM;) 8`H86@f6!)&*)&:")1"2.%0&5M>) GVcJ! K8[3X6R)U:")#$0(&0(-M)0(1#0%"/)K8[3X66;)K8`HB 86@f6R;) @HU6HU@`H[,M) '"M) KK8`H86@f6!M) &*) &$?")&:")%01?)$(/).1")&:")0(&"(1")-%""()+*%)&:")1?5M>) GW]J! K8[3X6R) !012.110*() *+) &:") +09'M) 2$() $0/) KK8`HB 86@f6R;) @HU6HU@`H[,M) &:") &"$2:"%M) KK8`HB 86@f6!M) 0() "4$9.$&0(-) &:") "++"2&04"("11) *+) &:") +09'M>) GWVJ! K8[3X6R) U:") #%"1"(2") *+) "7$'1M) 1""'1) &*) 2*(B 1&%$0() K8[3X66;) K8`H86@f6R;) @HU6HU@`H[,M) &:"'M) KK8`H86@f6!M) 0() &:"0%) $##%*$2:) &*) 29$11B %**')&"$2:0(-M>) ) 000>) !*A&5S1) "(&$09'"(&) *+) 4*90&0*($9) 0(4*94"'"(&) 0() $() "4"(&)*%)1&$&")01)%"#9$2"/)A0&:)&:")'*%")29"$%2.&)#%*#"%&5) )))))))))))))))))))))))))))))))))))))))) ))))))))))))))))))) VW )@()&:")&"%'0(*9*-5)*+)!*A&5)KTM;)*%(!"4"%)'#1)3"4"2)0/"(&0+5) %*9"1)+*%)A:02:)$)2:$(-")*+)1&$&")0()&:")#$%&020#$(&1)+0990(-)&:"') %"+9"2&1) &:") &"'#*%$9) 1&%.2&.%") G0>">) &:") #%*-%"110*(J)*+) &:") /"B (*&"/)"4"(&)0()E."1&0*(>) *+) '"%*"%'1"&('%B>) <%)"%)*-%'#) #$%&020#$(&1) $%") 2:$%$2&"%B 0F"/) C5) 2*(120*.1) 2:*02";) /"2010*() *%) 2*(&%*9) *4"%) &:") 2*.%1")*+)$2&0*()/"(*&"/)C5)&:")4"%Ce)1#"20+02)*%)"%)*-%2) '0-:&)$91*)C")9"702$90F"/>) GWWJ! K@HU6HU@`H[,;) 8[3X6R) U:") 2*'#$(5M) '$(.B +$2&.%"/)K8[3X66;)8jk`OkXU[U6)UB1:0%&1M>) GWTJ! K@HU6HU@`H[,;) 8`H86@f6R) j"M) .1"/) K8`HB 86@f6!) :01) 0(+9."(2"M) K8`H86@f6!kPX`[;) @HB U6HU@`H)&*)+$4*.%)$)2*(&"'#*%$%5)*+)<""&*(S1M>) GWhJ! K@HU6HU@`H[,) X20"(2"M) $0'1) K@HU6HU@`H) $&) &:"*%0"1)A0&:)$)9$%-")0(+*%'$&04")2*(&"(&M>)) ) 04>)[)%"9$&0*()*+))1%'1")-"("%$995)0(4*94"1)$)4-+*%91"%6 )*)0)$(/)$)1&$&0*($%5)%"+"%"(2")+%$'")G,')3J)A0&:0()A:02:) 4$%0*.1)#*0(&1)G1&$%&;)"(/;)*%)0(&"%'"/0$&"J)'$5)C")+.%&:"%) 1#"20+0"/>)f"%C1)*+)2$.1"/) '*&0*(;)0()#$%&02.9$%;)0(4*94") #$%&020#$(&1)&:$&)$%")C*&:)'*40(-)$(/)$%")2$.1$995)$++"2&"/) G0>">)1"&)&*)'*&0*(J>) GWZJ! Kb`f@H\)[)A*'$()0().(0+*%'M)"(&"%"/)KI[Uj)&:") %**'M>) GWQJ! Kb`f@H\) U:") .%$(0.') #$%&029"1M) %$/0$&") KI[UjkX`3R86)+%*')&:")(.29"$%)#9$(&M>) GWmJ! Kb`f@H\) D"M) $##%*$2:"/) KI[Ujk\`[,) &:") :*.1"M>) GWLJ! K8[3X6R) N*.M) 2$() .1") 0&) &*) 1:**&) K8[3X66;) Kb`f@H\M):"$45)C$991)*+)'"&$9M)KKI[UjkX`3R86M) +%*')9$%-")-.(1M>) GWcJ! 
K8[3X6R)Y*:(M)%$()K8[3X66;)Kb`f@H\M)&:")2$%M) KKI[Ujk\`[,M)0(&*)&:")+0"9/M>) GT]J! Kb`f@H\) U:") %*2?M) :0&) KI[Ujk\`[,) &:") 1$(/M) A0&:)$)&:.'#>) GTVJ! Kb`f@H\;) 8`H86@f6R) U:") 1E.0%%"9M) 2:$1"/) K8`H86@f6!)&:")(.&M)KI[Uj)$2%*11)&:")%*$/M>) GTWJ! K8`H86@f6R;) @HU6HU@`H[,;) b`f@H\) X"4"%$9) @(/0$() #"$1$(&) 9"$/"%1M) +9"/) K8`H86@f6!;) I[UjkX`3R86) &:") 2*.(&%5M) 0() &:") "$%95) :*.%1) *+) &:")2*.#>) ) 4>) ;"2(7-'1") %"90"1) *() *(") *+) D"2:19"%S1) #%0'0&04") %"9$B &0*(1) $(/) 2$#&.%"1) &:") "(&$09'"(&) *+) 1*'") "(&0&5) ("2"1B 1$%095)C"0(-)$)2*(1&0&."(&),'!))*%)'"'C"%)*+)$)=3-#")G0>">)$) #:5102$9;)1*20$9;)*%)'"(&$9)"(&0%"&5J>)) GTTJ! KDj`,6) U:") C*7M) :*9/1) KI[RU) &:%"") :.(/%"/) #02&.%"1M>) GThJ! KDj`,6) U:") 2*99$C*%$&0*(M) 0(2*%#*%$&"1) KI[RU) '*4"'"(&;)/$(2";)'.102)$(/)4*2$9)&"2:(0E."1M)&*) "7#9*%")8:"?:*4S1)&"7&>) GTZJ! KI[RU) X"4"%$9) *+) &:") 2*.(&%0"1M) A"%") .($C9") &*) #$%&020#$&")KDj`,6)0()&:")'$%?"&M>) GTQJ! K8[3X6R) j"M) :$1) '"%-"/) K8[3X66;) KI[RUM) &:") &A*)2*'#$(0"1M)KKDj`,6M)0(&*)$)10(-9")*%-$(0F$B &0*(M>) ) 40>) O0($995;) $) %"9$&0*() *+)/1--*--'1") $22*.(&1) +*%) &:") 1"B '$(&021)*+)&%$(10&04")4"%C1)90?")3'+";)-=%;)'(>/*!";)*%3"!*);) #'(5)$(/)/0&%$(10&04"1)90?")9*+">)U:")9$&&"%)$%")%"#%"1"(&"/) $1) '"$(0(-) ('/2"6)-6,-22"22;) 0>">) 0() &"%'1) *+) 2$.1$&0*() $(/)#*11"110*(>)@()$//0&0*()&*)$),-22"22-!)$(/)$),-22"22".) "(&0&5;) $) 2-/!(") *+) #*11"110*() G*%) &:") 0(0&0$9) #*11"11*%J) ) '0-:&)$91*)C")9"702$90F"/VT>) GTmJ! KI`XX6XX`R)@%$(M):$/)$2E.0%"/)KI`XX6XX6!)+*.%) (.29"$%) A"$#*(1M) KX`3R86) +%*') +*%'"%) X*40"&) b*19"')%"#.C9021M>) GTLJ! K8[3X6R) j.(&0(-M) #%*40/"1) KKI`XX6XX`RM) &:") '"(M) KKI`XX6XX6!M) A0&:) $) #.C902) 1&$-"M) +*%) &:") 1&590F"/)/01#9$5)*+)40%090&5>) GTcJ! K8[3X6R) U:"5M) 1.C'0&&"/) KKI`XX6XX6!M) &:"0%) "40/"(2"M)KKI`XX6XX`RM)&*)&:")2*''0&&""M>) ) I*11"110*()01)$91*)"(&$09"/)C5)#%"/02$&"1)*+)2*''"%B 20$9) &%$(1$2&0*() &:$&) 0(29./") &A*) &%$(1+"%) "4"(&1) G0>">) &:") &%$(1+"%) *+) -**/1) $(/) &:") &%$(1+"%) *+) '*("5J;) "0&:"%) *+) A:02:)'0-:&)C"):0-:90-:&"/)$1)&:")'$0()"4"(&>) Gh]J! KX`3R86;) 8[3X6R) P"(M) 1*9/) KKI`XX6XX6!M) &:") 2$%M)KKI`XX6XX`RM)&*),01$M>) GhVJ! KX`3R86;) 8[3X6R;) 8`H86@f6R) ,01$M) #$0/) KKI`XX6XX`RM)P"(M)KKI`XX6XX6!M)VZ;]]])/*99$%1M) K8`H86@f6!kPX`[)+*%)&:")2$%M>) )) U:")901&)*+)#%*&*B%*9")#%*#"%&0"1)/"12%0C"/)$C*4")01)C5) (*)'"$(1)2*'#9"&">)[//0&0*($9)$C1&%$2&)9"702$9)1"'$(&02) %"9$&0*(1)$(/)2*%%"1#*(/0(-)"(&$09'"(&1)'0-:&)C")("2"1B 1$%5) &*) %"#%"1"(&) &:") 1"'$(&021) *+) 29$11"1) *+) #%"/02$&"1) &:$&):$4")(*&)C""()/012.11"/):"%"Vh>) N"&;)2"%&$0()#%"/02$&"1)%$01")0(&"%"1&0(-)E."1&0*(1)+*%) $()"(&$09'"(&BC$1"/)40"A)*+)1"'$(&02)%*9"1>)O*%)0(1&$(2";) 1*B2$99"/)15''"&%02)4"%C1):$4")$%-.'"(&1)&:$&)$%")0(/01B &0(-.01:$C9")0()&"%'1)*+)"(&$09'"(&1)/01#9$50(-)10-(0+02$(&) 4$%0$C090&5)*+)15(&$2&02)#$&&"%(1)GhWJBGhTJ>)X.2:)$%-.'"(&1) $%") &%$/0&0*($995) /"12%0C"/) C5) %*9"1) G">->) +0-.%";) -%*.(/J) &:$&)!*A&5)%"+"%1)&*)$1)#"%1#"2&04"B/"#"(/"(&)G2*(&%$%5)&*) "4"(&B/"#"(/"(&) %*9"1J>) U:") 1"'$(&02) #%*#"%&0"1) *+) #$%B &020#$(&1)&:$&)1""')&*)4$%5)$2%*11)/0++"%"(&)#"%1#"2&04"1)*+) 40"A0(-)$()"4"(&)$%")'*1&)#%*C$C95)*.&10/")&:")12*#")*+)$) #%*&*B%*9") &:"*%"&02$9) $##%*$2:) &:$&) 0(:"%"(&95) 0(4*94"1) $15''"&%02)%"9$&0*(1)C"&A""()"(&0&0"1>)) GhWJ! KO@\3R6)U:"):*.1"M)01)("$%)K\R`3H!)&:")1"$M>) GhTJ! KO@\3R6)U:")1"$M)01)("$%)K\R`3H!)&:"):*.1"M>) C:! 
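Read as a data format, the scheme illustrated throughout this section reduces to argument spans, each carrying a set of entailment labels, with square brackets marking event-internal status. The following is a minimal sketch of such a representation, re-encoding example (19); the class and field names are illustrative only and are not part of the paper's scheme.

from dataclasses import dataclass, field

@dataclass
class Entailment:
    name: str               # e.g. "CAUSER", "PATH_GOAL", "CH_OF_STATE"
    internal: bool = False  # True for square-bracketed (event-internal) status

@dataclass
class Argument:
    text: str
    entailments: list[Entailment] = field(default_factory=list)

# Example (19): "[CAUSER The painting] inspired
#   [CAUSEE, [CONCEIVER, INTENTIONAL] me] [[CONCEIVED] to take the risk ...]"
example_19 = [
    Argument("The painting", [Entailment("CAUSER")]),
    Argument("me", [Entailment("CAUSEE"),
                    Entailment("CONCEIVER", internal=True),
                    Entailment("INTENTIONAL", internal=True)]),
    Argument("to take the risk and use the intense green for the sky",
             [Entailment("CONCEIVED", internal=True)]),
]

def render(arg: Argument) -> str:
    """Render an argument back into the bracket notation (simplified:
    internal labels are bracketed individually, where the paper groups them)."""
    labels = ", ".join(
        f"[{e.name}]" if e.internal else e.name for e in arg.entailments)
    return f"[{labels} {arg.text}]"

for a in example_19:
    print(render(a))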
$D&(7&%'1"#&".#E7%70*#F'0*2%'1"-# [1)$)#%"90'0($%5)"4$9.$&0*()*+)*.%)$##%*$2:)A")$//%"11"/) $)2$1")1&./5)2*'#$%0(-)0&)A0&:)*(")*+)&:")"701&0(-)$((*B &$&0*()$##%*$2:"1>)D")2*(2"(&%$&"/)*()$)#*%&0*()*+)6(-B 901:)9"702$9)0&"'1)G4"%C1)+*%)&:")'*'"(&J)"(1.%0(-)$)90(B -.01&02$995) %"#%"1"(&$&04") /$&$1"&) +*%) "$2:) *+) &:"'>) D") )))))))))))))))))))))))))))))))))))))))) ))))))))))))))))))) VT )U:")"(&$09'"(&1)2-/!(";),')382-/!(")$(/)2-/!("82)')")2$()C") %"#9$2"/)C5)$)'*%")-"("%$9)#%*#"%&5)&"%'"/)2-/!(")"(&$09"/)+*%) #$%&020#$(&1)&:$&).(/"%-*)1*'")?0(/)*+)2:$(-")G0>">)*+)#*11"110*(;) 9*2$&0*(;) *%) 1&$&"J>) Y$2?"(/*++) KV];) VVM)C.09/0(-) *() \%.C"%) KcM) :$1)#.&)+*%&:)$)%"9"4$(&)$($95101)&:$&)&%"$&1)90(-.01&02).(0+*%'0&0"1) $2%*11)4$%0*.1)1"'$(&02)+0"9/1)0()&"%'1)*+)"7&"(10*(1)*+)'*&0*($9) $(/)9*2$&0*($9)G2*(2"#&.$9J)1&%.2&.%"1)G$()$($95101)?(*A()$1)&:") )3"4')*(1!"#')*-%2130,-)3"2*2J>)) Vh )!$401)$/4*2$&"1)$)%"9$&0*()*+)2/!,'22*%9)+*%)&:")1"'$(&021)*+) 4"%C1)1.2:)$1)"7("".;).='!&;)-/)2(-!";)-/),#'0;)"&2>)&:$&)"(&$09) &:$&)$)1.#"%0*%)"(&0&5)*.&%$(?1)$()0(+"%0*%)*(">)`()&:")*&:"%):$(/;) :")/"12%0C"1)4"%C1)90?")3*);)2)!*5";),-5";))',;),!"22;)"&2>)0()&"%'1) *+)$()"(&$09'"(&)*+)0'#0(-"'"(&)*%)+*%2"+.9)2*(&$2&>)D")%"+%$0() +%*') $/*#&0(-) &:") 9$&&"%) $1) $() 0(/"#"(/"(&95) '*&04$&"/) "(&$09B '"(&)$(/)%"#%"1"(&)4"%C1)*+)+*%2"+.9)2*(&$2&)0()&"%'1)*+)'*&0*(>)) .1"/)&:")O%$'"H"&)+.99B&"7&)2*%#*%$)$1)1*.%2")*+)*.%)/$&$>) [//0&0*($995;)+*%)"$2:)4"%C)+*.(/)0()&:")2*%#*%$)$(/)&:") 1"'$(&02$995)%"9$&"/)*("1)C"9*(-0(-)&*)&:")2*%%"1#*(/0(-;) 0(4*?"/)+%$'"1) A")"7&%$2&"/)2*99"2&0*(1)*+)"7$'#9")$(B (*&$&"/)1"(&"(2"1)+%*')&:")O%$'"H"&)9"702*(>)) [)("A)$((*&$&0*()9$5"%)A$1)$//"/)&*)&:01)/$&$)'$#B #0(-)#%*&*B%*9")#%*#"%&0"1)&*)O%$'"H"&)%*9"1)G+%$'")"9"B '"(&1J)0/"(&0+0"/)0()&:")1"(&"(2"1)0()$22*%/$(2")A0&:)&:") #%"40*.195)/"12%0C"/)12:"'">)D")$.&*'$&02$995)#%*/.2"/) $((*&$&0*(1)C5)'$##0(-)+%$'")"9"'"(&1)&*)"(&$09'"(&1)$&) $) +%$'") 9"4"9) $(/) &:"() '$(.$995) 2:"2?"/) &:"1") $((*&$B &0*(1) +*%) 2*(101&"(25) 0() &"%'1) *+) &:") 1"'$(&021) *+) 0(/0B 40/.$9)4"%C1>)@()-"("%$9;)'"'C"%1)*+)&:")1$'")+%$'")1:$%") $)'0(0'.')*+)2*''*()#%*#"%&0"1>)N"&;)&:"5)'0-:&)/0++"%) 0()1#"20+02)$1#"2&1)*+)&:"0%)'"$(0(->)O*%)0(1&$(2";)-$)'*%) /0++"%1)+%*')$)#$1104")1"(1")*+)&:")4"%C)'(>/*!")0()&:$&)0&) "(&$091)$2&0*()*()&:")#$%&)*+)&:")"4"(&.$9)#*11"11*%>)) GhhJ! iKI`XX6XX`R) @&M) (""/1) &*) '(>/*!") KI`XX6XX6!) 1*'")&""&:M)KX`3R86)+%*')1*'"A:"%"MSS;):")1$0/>) GhZJ! @() 1*'") 2$1"1;) K8[3X6R;) KI`XX6XX`RM) &:") P\X) 90C%$%0"1M)-$)'*%".)KKI`XX6XX6!M)2*#0"1)*+)&:"1"M) KKX`3R86M)+%*')&:")$.&:*%1M>) ) D")2*(10/"%"/)$)&*&$9)*+)WhV)+%$'"1>)6$2:)*+)&:")2*%") +%$'") "9"'"(&1) A0&:0() $) -04"() +%$'") A$1) '$##"/) &*) $) .(0E.") "(&$09'"(&) *%) $) 2*'C0($&0*() *+) "(&$09'"(&1 VZ >) H*(B2*%")+%$'")"9"'"(&1)G1.2:)$1)U0'";)I9$2";)I.%#*1";) R"$1*(;) b$(("%;) "&2>J) A:*1") 1"'$(&021) $%") 0(/"#"(/"(&) *+)0(/040/.$9)+%$'"1)$(/)#%"/02$&"1)A"%")/01'011"/)+%*') 2*(10/"%$&0*(>)X#"20+02)+%$'")"9"'"(&1)#*10&"/)0(&"%"1&0(-) 011."1) +*%) %"+0("'"(&) *+) &:") 1"'$(&02) 2*(&"(&) *+) "(&$09B '"(&1_) GhQJ! 
U:") #*9025) A099) C") 0'#9"'"(&"/) K@HU6HU@`H[,) C5)$)("A)2*'#.&"%)151&"'MVQ>)) ) [)+0%1&)011.")01)&*)/*)A0&:)&:")2*4"%$-")*+)&:")1"&)*+) #%*#"%&0"1)A")$11.'"/>)D")0/"(&0+0"/)Vc)+%$'"1)+*%)A:02:) (*(")*+)&:")"(&$09'"(&1)*+)&:")#%"40*.1)1"2&0*()1""'"/)&*) :*9/>)U:")'$d*%0&5)*+)&:"1")0(29./"1)1&$&04")4"%C1)1.2:)$1) "7*2);)3',,"%;)-((/!;)!"4'*%;)(-%)*%/";)#*";)2)'%.;).","%.;) !'%5;)!"2"4$#";)4')(3;)"&2>;)1*'")*+) A:02:)0(4*94")#"%B 1#"2&04"B/"#"(/"(&) %*9"1>) U:") %"1&) *+) #%*C9"'$&02) 2$1"1) 0(29./")4"%C1)90?")('%;)23-/#.;)4/2);)&-##-=;),!"(".";)!*+'#;) ">/'#;)!"2,-%.;)."4'%.;)."2"!+";)"&2>)O.%&:"%)$($95101)*+) &:"1")01)("2"11$%5)&*)1:"/)90-:&)&*)&:")12*#")*+)$()"(&$09B '"(&BC$1"/)$##%*$2:>)@()-"("%$9;)#%"/02$&"1)A:*1")90(?B 0(-)#$&&"%(1)/"#"(/)*()#%$-'$&02)*%)2*(&"7&.$9)0(+*%'$B &0*()$%")"7#"2&"/)&*)%"E.0%")$)/0++"%"(&)&%"$&'"(&>))) O%*')&:")("A)$((*&$&"/)/$&$1"&)A")$//0&0*($995)"7B &%$2&"/)'$##0(-1)*+)"(&$09'"(&1)&*)15(&$2&02)2$&"-*%0"1;)$) #*%&0*()*+)A:02:)01)1.''$%0F"/)0()U$C9")V)$9*(-)A0&:)&:") )))))))))))))))))))))))))))))))))))))))) ))))))))))))))))))) VZ )[()"72"#&0*()0(4*94"1)'"&*(5'02$995)%"9$&"/)+%$'")"9"'"(&1;) 0>">)29*1"95)%"9$&"/;)'.&.$995)"729.104")%*9"1)/01&0(-.01:"/)1*9"95) C5)*(&*9*-02$9)2%0&"%0$>)X.2:)+%$'")"9"'"(&1)$%")'$##"/)&*)&:") 1$'")"(&$09'"(&1>)) VQ )@()$22*%/$(2")A0&:)&:")/"+0(0&0*()*+)0(&"(&0*($90&5;)$)2*'#.&"%) '0-:&)C")0(&"(&0*($9)0()$)1"(1")&:$&)0&)2*(&%*91)&:")2*.%1")*+)$() 0(:"%"(&95) 0(&"(&0*($9) $2&0*() G0>">) 0&1) 1&$%&;) 0(&"%'0110*(;)*%) "(/) #*0(&1J>))))) ) O%$'"H"&)+%$'"1)+%*')A:02:)0&)A$1)$2E.0%"/Vm>)[1)2$()C") 1""() 0() &:01) &$C9";) &:") /01&%0C.&0*() *+) %"$90F$&0*(1) *+) "(B &$09'"(&1)G*%)2*'C0($&0*(1)*+)"(&$09'"(&1J)0()&:")/$&$1"&) %"$/095)$C1&%$2&1)*4"%)$)A0/")%$(-")*+)1"'$(&02)$(/)15(B &$2&02) 2*'C0($&*%0$9) #%*#"%&0"1) *+) 0(/040/.$9) 4"%C1) C"B 9*(-0(-)&*)1"'$(&02$995)%"9$&"/)*%)"4"().(%"9$&"/)+%$'"1>) 8*(&%$%5)&*)+0("B-%$0("/)9"702$9)1"'$(&02)/01&0(2&0*(1)&:$&) .(/"%90")&:")O%$'"H"&)+%$'"1;)$()$11"&)*+)#%*&*B%*9")"(B &$09'"(&1) 01) &:$&) &:"5) 0/"(&0+5) -%$''$&02$995) #"%4$104") 1"'$(&02)"9"'"(&1)1.0&$C9")+*%)/"+0(0(-)$()$C1&%$2&;)#%0(B 20#9"/) 15(&$7B1"'$(&021) 0(&"%+$2">) @() $) +%$'"BA01") $#B #%*$2:;)-"("%$90F$&0*(1)$C*.&)'$##0(-1)&*)15(&$2&02)+*%') 1:*.9/) "'"%-") C5) $($95F0(-) &:") /01&%0C.&0*() *+) %*9") $1B 10-('"(&1) +*%) "$2:) +%$'") 1"#$%$&"95;) $C1&%$2&0(-) *4"%) 1#"20+02)'$##0(-1)*+)&:")2*%%"1#*(/0(-)9"702$9).(0&1)$(/) $&&"'#&0(-)&*).(0+5)$C1&%$2&0*(1)$2%*11)&:")+%$'"):0"%$%B 2:5;) A:02:) 01) (*&) $) &%040$9) &$1?>) I%*&*B%*9") #%*#"%&0"1) #%$2&02$995)/"2*.#9")90(?0(-)0(+*%'$&0*()+%*')$1#"2&1)*+) 9"702$9) 1"'$(&021) &:$&) :$4") (*) "++"2&) *() 0&>) U:"5) &:.1) 2$#&.%") 151&"'$&02) 90(?0(-) #$&&"%(1) G0(29./0(-) "(&$09B '"(&B#%"#*10&0*()2*%%"1#*(/"(2"1J)&:$&)2$()C")+*%'$90F"/) 0(&*)29$11"1)*+)(*(B9"702$90F"/)+%$'"1>) @() 1.';) #%*&*B%*9") "(&$09'"(&1) $%") 1"'$(&02) (*&0*(1) +0%'95) -%*.(/"/) 0() 90(-.01&02) 0(&.0&0*() &:$&) :$4") $) A0/") 2*4"%$-") *4"%) 9"702$9) 1"'$(&02) %"9$&0*(1) &:$&) :.'$(1) "7#%"11)202)"4')*('##0>)8*(&%$%5)&*)2$&2:B$99)9$C"91;)&:"5) #0() /*A() #%"/02$&"B$%-.'"(&) 1&%.2&.%") %"9$&0*(1) 0() $) -"("%$9;)5"&)2*:"%"(&)A$5>)U:")901&)*+)"(&$09'"(&1)0()T>W)01) #%"1"(&"/) $1) $) +0%1&) $&&$2?) 
&*) &:") '"&:*/*9*-02$9) 011."1) %"9$&"/)&*)$((*&$&0*()*+)1.2:)#%*#"%&0"1>)O.%&:"%)$($95101) 1:*.9/)%"+0(")$(/)"7&"(/)&:")2.%%"(&)1"&)&*)2*4"%)$()"4"() A0/"%)%$(-")*+)#%"/02$&")&5#"1>)) @()$)1.C1"E."(&)#:$1")*+)"4$9.$&0(-)*.%)#%*#*1$9;)A") 0(&"(/)&*)"'#9*5)$()"(&$09'"(&BC$1"/)$((*&$&0*()9$5"%)+*%) 1"'$(&02)#$%10(->)310(-)&:")$C*4")$((*&$&"/)/$&$)A")#9$() &*) 0'#9"'"(&) $) 1"'$(&02) $($95F"%) &:$&) 0/"(&0+0"1) &:") "(B &$09'"(&1) 0(1&"$/) *+) &:") +%$'") "9"'"(&1) $11*20$&"/) A0&:) $%-.'"(&1>)X0(2")+%$'")"9"'"(&1)$%").(0E."95)'$##"/)&*) "(&$09'"(&1)A0&:0()"$2:)+%$'";)A")2$()"4"(&.$995)"4$9.B $&") &:") "(&$09'"(&BC$1"/) 151&"') 2*'#$%0(-) 0&1) #"%+*%'B $(2")&*)1&$(/$%/)+%$'"B1"'$(&02)'*/"91>)) G:! !2@"1H(*.A*)*"%-# b.2:)*+)&:01)A*%?):$1)C""()1.##*%&"/)C5)&:")\%""?)X&$&") X2:*9$%1:0#1) O*.(/$&0*(>) @) $') 0(/"C&"/) &*) [(-"9*1) H0B ?*9$*.)+*%):01)4$9.$C9")1.##*%&)0()#%*-%$''0(->)@)$')$91*) -%$&"+.9) &*) !*.-) [%(*9/) $(/) b$110'*) I*"10*) +*%) &:"0%) 0(10-:&+.9)2*''"(&1)*()#$%&1)*+)&:01)A*%?>) I:! 5*J*0*"2*-# KVM)!$401;)[(&:*(5>)W]]V>),0(?0(-)C5)&5#"1)0()&:"):0"%$%2:02$9) 9"702*(>)8X,@)I.C902$&0*(1>)) KWM)!*A&5;)!$40/>)VcLQ>)`()&:")1"'$(&02)2*(&"(&)*+)&:")(*&0*() )))))))))))))))))))))))))))))))))))))))) ))))))))))))))))))) Vm )I:%$1") 1&%.2&.%") $(/) -%$''$&02$9) +.(2&0*() &$-1) $%") &:")*("1) .1"/)C5)O%$'"H"&e)&:")&A*)$((*&$&0*()9$5"%1)$%")%"#%"1"(&"/)$1)$) 2*'#*.(/)&$-)A0&:)$)/*&)1"#$%$&0(-)&:"'>)U:")&$-)67&)01).1"/)+*%) "7&"%($9)$%-.'"(&1)A:02:)0(29./")1.Cd"2&1)*+)+0(0&")4"%C1;)`Cd) %"+"%1) &*) *Cd"2&1;) A:09") !"#) 01) $110-("/) &*) /"#"(/"(&1) *+) &:") -*4"%(0(-)4"%C>)I$%"(&:"1"1)/"(*&")*#&0*($9)#%*#"%&0"1>) &:"'$&02) %*9">) @() \"(($%*) 8:0"%2:0$;) P$%C$%$) I$%&"") $(/) R$5)U.%("%;)"/1>)I%*#"%&5)&:"*%5;)U5#")&:"*%5;)$(/)H$&.%$9) 9$(-.$-")1"'$(&021>)!*%&%"2:&_)R"0/"9>) KTM) !*A&5;) !$40/>) VccV>) U:"'$&02) I%*&*BR*9"1) $(/) [%-.'"(&) X"9"2&0*(>),$(-.$-")Qm>T>)ZhmBQVc>) KhM) O099'*%";) 8:$%9"1) Y>) VcLZ>) O%$'"1) $(/) &:") 1"'$(&021) *+) .(/"%1&$(/0(->)n.$/"%(0)/0)X"'$(&02$)Q>W>)WWWBWZh>) KZM) O099'*%";) 8:$%9"1) Y>;) 8:%01&*#:"%) R>) Y*:(1*(;) b0%0$') R>) I"&%.2?>) W]]T>) P$2?-%*.(/) &*) O%$'"H"&>) @(&"%($&0*($9) d*.%($9)*+)9"702*-%$#:5;)VQ>)WTZBWZ]>) KQM)O%$(?;)[(("&&">)W]]h>)\"("%$90F$&0*(1)*4"%)2*%#.1B0(/.2"/) +%$'") $110-('"(&) %.9"1>) @() 8:$%9"1) O099'*%";) b$(+%"/) I0(?$9;)8*990()P$?"%)$(/)<$&%0()6%?)G"/1>J_)I%*2""/0(-1)*+) &:"),R68)W]]h)D*%?1:*#)*()P.09/0(-),"702$9)R"1*.%2"1) +%*') X"'$(&02$995) [((*&$&"/) 8*%#*%$>) ,01C*(;) I*%&.-$9>) TVBTL>) KmM) \$A%*(;) Y"$() b$%?>) VcLQ>) X0&.$&0*(1) $(/) #%"#*10&0*(1>) ,0(-.01&021)$(/)#:09*1*#:5)c>)TWmBTLW>) KLM)\09/"$;)!$(0"9)$(/)!$(0"9)Y.%$+1?5>)W]]W>)[.&*'$&02)9$C"9B 0(-) *+) 1"'$(&02) %*9"1>) 8*'#.&$&0*($9) ,0(-.01&021) WL) GTJ) WhZBWLL>) KcM)\%.C"%;)Y"++"%"5>)VcQZ>)X&./0"1)0(),"702$9)R"9$&0*(1>)I:>!>) /011"%&$&0*(;)b@U)G%"#%0(&"/)0(),"702$9)X&%.2&.%"1)0()X5(&$7) $(/)X"'$(&021>)['1&"%/$';)H*%&:Bj*99$(/;)VcmQJ>) KV]M) Y$2?"(/*++;) R$5>) VcLT>) X"'$(&021) $(/) 8*-(0&0*(>) 8$'B C%0/-";)b[;)b@U)I%"11>) KVVM) Y$2?"(/*++;) R$5>) Vcc]>) X"'$(&02) X&%.2&.%"1>) 8$'C%0/-";) b[;)b@U)I%"11>) KVWM)<0(-1C.%5;)I$.9)$(/)b$%&:$)I$9'"%>)W]]W>)O%*')U%""C$(?) 
&*) I%*#P$(?>) @() I%*2""/0(-1) *+) &:") ,R68;) ,$1) I$9'$1;) 8$($%5)@19$(/1;)X#$0(>) KVTM)<0##"%;)<$%0(;)j*$)U%$(-)!$(-;)$(/)b$%&:$)I$9'"%>)W]]]>) 89$11BC$1"/)2*(1&%.2&0*()*+)$)4"%C)9"702*(>)@()I%*2""/0(-1) *+) &:") X"4"(&:) H$&0*($9) 8*(+"%"(2") *() [%&0+020$9) @(&"990B -"(2")G[[[@BW]]]J;)[.1&0(;Uo;)Y.95B[.-.1&>) KVhM),"40(;)P"&:>)VccT>)6(-901:)4"%C)29$11"1)$(/)$9&"%($&0*(1_)[) #%"90'0($%5)0(4"1&0-$&0*(>)3(04"%10&5)*+)8:02$-*)I%"11>) KVZM) b$%2.1;) b0&2:"99;) \%$2") <0';) b$%5) [(() b$%20(?0"A02F;) R*C"%&)b$2@(&5%";)[(()P0"1;)b$%?)O"%-.1*(;)<$%"()<$&F;) $(/) P%0&&$) X2:$1C"%-"%>) Vcch>) U:") I"(() U%""C$(?_) [((*B &$&0(-) #%"/02$&") $%-.'"(&) 1&%.2&.%">) @() I%*2""/0(-1) [RB I[j,U)D*%?1:*#>) KVQM) I0(?"%;) X&"4"(>) VcLc>) ,"$%($C090&5) $(/) 8*-(0&0*(>) 8$'B C%0/-";)b[;)b@U)I%"11>) KVmM)R.##"(:*+"%;)Y*1"+;)b02:$"9)6991A*%&:;)b0%0$')R>)I"&%.2?;) 8:%01&*#:"%)R>)Y*:(1*(;)Y$()X2:"++2F5?>)O%$'"H"&)@@_)67B &"(/"/)&:"*%5)$(/)#%$2&02"_) ))))))):&&#_ggAAA>0210>C"%?"9"5>"/.g+%$'"("&gC**?gC**?>:&'9) KVLM) U$9'5;) ,"*($%/>) VcLZ>) ,"702$90F$&0*() #$&&"%(1_) 1"'$(&02) 1&%.2&.%")0()9"702$9)+*%'>)@(),$(-.$-")U5#*9*-5)$(/)X5(B &$2&02)!"12%0#&0*(;)4*9>T;)"/>)U0'*&:5)X:*#"(>)8$'C%0/-";) 3<;)8$'C%0/-")3(04"%10&5)I%"11>) KVcM)D"2:19"%;)X&"#:"(>)VccV>))[%-.'"(&)X&%.2&.%")$(/),0(?0(->) I:>!>)!011"%&$&0*(;)X&$(+*%/)3(04"%10&5>) KW]M)D"2:19"%;)X&"#:"(>)VccZ>)U:")1"'$(&02)C$101)*+)$%-.'"(&) 1&%.2&.%">)X&$(+*%/;)8[>)8X,@)I.C902$&0*(1>) ) ) ) K*L'2&(#-*)&"%'2# 0*(&%'1"# K'"@'"A#A*"*0&('M&%'1"-# E0&)*N*%#J0&)*-# H*&0*() ) 8`H86@f6R_)HI>67&) 8`H86@f6!_)HI>`Cd) )))))))))))))))))))))))))IIK*(M>!"#) )))))))))))))))))))))))))IIK.#*(M>!"#) )))))))))))))))))))))))))IIK*+M>!"#) )))))))))))))))))))))))))IIK*4"%M>!"#) )))))))))))))))))))))))))IIK$C*.&M>!"#) )))))))))))))))))))))))))IIK+*%M>!"#) )))))))))))))))))))))))))IIK$+&"%M>!"#) )))))))))))))))))))))))))IIK0(M>!"#) ,*2$&0(-;)!"10%0(-;)P"2*'0(-k$A$%";)[2&04B 0&5k*(-*0(-;)8*'0(-k&*kC"90"4";)8"%&$0(&5;)67B #"%0"(2"%k1.Cd;)O""90(-;)[A$%"("11;)67#"2&$&0*(;) I"%2"#&0*(k"7#"%0"(2";)D$0&0(-;)`#0(0*(;) P"k0(k$-%""'"(&k*(k$11"11'"(&;)U%.1&;)8*-0&$&0*(;) 6'*&0*(k$2&04";)R"90-0*.1kC"90"+;)"&2>) 8`H86@f6R_)HI>67&) 8`H86@f6!_)HI>`Cd) 8`H86@f6!kPX`[_)IIKC5M>!"#) ))))))))))))))))))))))))))))))))))))))IIK+*%M>!"#) 8$&"-*%0F$&0*(;)!0++"%"(&0$&0*(;)Y./-'"(&;)"&2>)) 8`H86@f6R_)HI>67&) 8`H86@f6!;)6HU@UN_)HI>`Cd) 8`H86@f6!;)IR6!@8[U6_)IIK*+M>!"#) )))))))))))))))))))))))))))))))))))))))))))))))))IIK$1M>!"#) )))))))))))))))))))))))))))))))))))))))))))))))))HI>!"#) )))))))))))))))))))))))))))))))))))))))))))))))))[YI>!"#) 8$&"-*%0F$&0*(;)Y./-'"(&;)Y./-B '"(&k2*''.(02$&0*(;)X.1#020*(;),$C"90(-;)R"+"%B %0(-kC5k($'";)"&2>) 8`H86@f6R;)@HU6HU@`H[,_)HI>67&) 8`H86@f6!_)HI>`Cd) )))))))))))))))))))))))))IIK+*%M>!"#) )))))))))))))))))))))))))IIK*(M>!"#) )))))))))))))))))))))))))IIK.#*(M>!"#) ) U$?0(-k10/"1;)I9$2"kA"0-:&k*(;)310(-;)R$&0+02$&0*(;) X0-(k$-%""'"(&;)67"2.&"k#9$(;)!"20/0(-;)\*B 0(-kC$2?k*(k2*''0&'"(&;),"$/"%1:0#;)@(&"(&0*(B $995k$2&;))[&&"'#&;)6'#9*50(-;)`#"%$&0(-k$k151&"';) [22*'#901:'"(&;)8:$(-"k*+k9"$/"%1:0#;)I0%$25;) 67$'0($&0*(;)X""?0(-;)89$0'k*A("%1:0#;)[2&04B 0&5k#%"#$%";)8*99$C*%$&0*(;)!012.110*(;) b$?"k$-%""'"(&k*(k$2&0*(;)j0%0(-;)[4*0/0(-;)"&2>) 8`H86@f6R;)@HU6HU@`H[,_)HI>67&) 8`H86@f6!_)HI>`Cd) 8`H86@f6!kPX`[_)IIK+*%M>!"#) ))))))))))))))))))))))))))))))))))))))IIK*(M>!"#) ))))))))))))))))))))))))))))))))))))))IIK*4"%M>!"#) X2%.&0(5;)[%%"1&;)[11"110(-;)@(1#"2&0(-;)R"4"(-";) `#"%$&0*($9k&"1&0(-;)X2*.%0(-;)j*1&09"k"(2*.(&"%;) Y.1&0+50(-;)"&2>)) 8`H86@f6R;)@HU6HU@`H[,_HI>67&) 
Notion
  CONCEIVER: NP.Ext; CONCEIVED: NP.Obj, PP[on].Dep, PP[upon].Dep, PP[of].Dep, PP[over].Dep, PP[about].Dep, PP[for].Dep, PP[after].Dep, PP[in].Dep
  Frames: Locating, Desiring, Becoming_aware, Activity_ongoing, Coming_to_believe, Certainty, Experiencer_subj, Feeling, Awareness, Expectation, Perception_experience, Waiting, Opinion, Be_in_agreement_on_assessment, Trust, Cogitation, Emotion_active, Religious_belief, etc.

  CONCEIVER: NP.Ext; CONCEIVED: NP.Obj; CONCEIVED_BSOA: PP[by].Dep, PP[for].Dep
  Frames: Categorization, Differentiation, Judgment, etc.

  CONCEIVER: NP.Ext; CONCEIVED, ENTITY: NP.Obj; CONCEIVED, PREDICATE: PP[of].Dep, PP[as].Dep, NP.Dep, AJP.Dep
  Frames: Categorization, Judgment, Judgment_communication, Suspicion, Labeling, Referring_by_name, etc.

Notion, Intentionality
  CONCEIVER, INTENTIONAL: NP.Ext; CONCEIVED: NP.Obj, PP[for].Dep, PP[on].Dep, PP[upon].Dep
  Frames: Taking_sides, Place_weight_on, Using, Ratification, Sign_agreement, Execute_plan, Deciding, Going_back_on_commitment, Leadership, Intentionally_act, Attempt, Employing, Operating_a_system, Accomplishment, Change_of_leadership, Piracy, Examination, Seeking, Claim_ownership, Activity_prepare, Collaboration, Discussion, Make_agreement_on_action, Hiring, Avoiding, etc.

  CONCEIVER, INTENTIONAL: NP.Ext; CONCEIVED: NP.Obj; CONCEIVED_BSOA: PP[for].Dep, PP[on].Dep, PP[over].Dep
  Frames: Scrutiny, Arrest, Assessing, Inspecting, Revenge, Operational_testing, Scouring, Hostile_encounter, Justifying, etc.

  CONCEIVER, INTENTIONAL: NP.Ext; CONCEIVED: NP.Obj; CONCEIVED_BSOA, INTENTION: PP[for].Dep, VPto.Dep
  Frames: Using, Practice, Employing, Hiring, Needing, etc.

Causation
  CAUSER: NP.Ext; CAUSEE: NP.Obj; (CAUSEE, CH_OF_STATE): PP[in].Dep
  Frames: Objective_influence, Causation, Attack, Experiencer_obj, Change_event_time, Hindering, Preventing, Thwarting, Cause_to_continue, Cause_harm, Cause_to_experience, Eclipse, Cause_change_of_position_on_a_scale, etc.

  CAUSER: NP.Ext; CAUSEE, CH_OF_STATE: NP.Obj; (SOURCE_STATE): PP[from].Dep; (END_STATE): PP[to].Dep, PP[into].Dep
  Frames: Creating, Cause_to_end, Cause_expansion, Render_nonfunctional, Destroying, Cause_change_of_position_on_a_scale, Resolve_problem, Cause_to_resume, Killing, Cause_to_start, Damaging, Grinding, Cause_to_fragment, Cause_change, Reshaping, etc.

Causation, Notion, (Intentionality)
  CAUSER, CONCEIVER, (INTENTIONAL): NP.Ext; CAUSEE, (CH_OF_STATE), CONCEIVED: NP.Obj
  Frames: Cause_to_make_progress, Intercepting, Processing_materials, Activity_resume, Committing_crime, Activity_stop, Activity_pause, Activity_finish, etc.

  CAUSER, (INTENTIONAL): NP.Ext; (CAUSEE), [CONCEIVER, (INTENTIONAL)]: NP.Obj; [CONCEIVED]: PP[in].Dep, PP[into].Dep
  Frames: Assistance, Subjective_influence, Manipulate_into_doing, Suasion, Hindering, etc.

Table 1: Entailment-based linking generalizations and corresponding FrameNet frames

A French Corpus Annotated for Multiword Expressions with Adverbial Function
Eric Laporte, Takuya Nakamura, Stavroula Voyatzi
Université Paris-Est, Institut Gaspard-Monge - LabInfo, 5, Boulevard Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2 (France)
E-mail: eric.laporte@univ-paris-est.fr, nakamura@univ-mlv.fr, voyatzi@univ-mlv.fr

Abstract
This paper presents a French corpus annotated for multiword expressions (MWEs) with adverbial function. This corpus is designed for investigation on information retrieval and extraction, as well as on deep and shallow syntactic parsing. We delimit which kinds of MWEs we annotated, we describe the resources and methods we used for the annotation, and we briefly comment on the results. The annotated corpus is available at http://infolingu.univ-mlv.fr/ under the LGPLLR license.

1. Introduction
Recognising multiword adverbs such as à long terme 'in the long run' in texts is likely to be useful for information retrieval and extraction because of the information that such adverbials can convey. In addition, it is likely to help resolve prepositional attachment during shallow or deep parsing: most multiword adverbs have the superficial syntax of prepositional phrases; in many cases, recognising them rules out analyses where they are arguments or noun modifiers. The quality of the recognition of multiword adverbs depends on algorithms, but also on resources. We created a corpus of French texts annotated with multiword adverbs. In this article, we survey related work, we define the target of our annotation effort, we describe the method we have implemented and we analyse the corpus obtained. This corpus will be made freely available on the web under the LGPLLR license when this article is published.

2. Related work
Corpora annotated with multiword adverbs are rare and small.1 In the Grace corpus (Rajman et al., 1997), most multiword units are ignored.
In the French Treebank (Abeillé et al., 2003), prepositional phrases and adverbs are annotated with a binary feature ('compound') which indicates whether they are multiword units; the distinction between verb-modifier, noun-modifier and object prepositional phrases appears only in the function-annotated part of the Treebank (350 000 words). We are not aware of other available French corpora annotated with multiword adverbs. In other languages, including English, corpora annotated with multiword units are rare and small as well.

1 Several reasons explain this lack of interest. Firstly, adverbials are usually felt as less useful than nouns for information retrieval and extraction. Secondly, many multiword adverbs are difficult to distinguish from prepositional phrases assuming other syntactic functions, such as arguments or noun modifiers: the distinction is hardly correlated with any material markers in texts and lies in complex linguistic notions (Villavicencio, 2002; Merlo, 2003). The task is therefore felt as too difficult by most researchers in language processing, whose main background is in information technology. However, the distinction in question is essential to identifying the semantic core of a sentence, and the availability of a larger corpus of annotated text is likely to shed light on the problems posed by this task.

3. Target of annotation
The target of our annotation effort is defined by the intersection of two criteria: (i) multiword expressions and (ii) adverbial function. In this section, we define both criteria in more detail, we define the features that we included in the annotations, and we describe the corpus.

3.1 Multiword expression criterion
For this work, we considered a phrase composed of several words to be a multiword expression if some or all of its elements are frozen together in the sense of (Gross, 1986), that is, if their combination does not obey productive rules of syntactic and semantic compositionality. In the following example, de nos jours ('nowadays', lit. 'of our days') is a multiword adverb:

(1) Il est facile de nos jours de s'informer 'It is easy to get informed nowadays'

This criterion ensures a complementarity between lexicon and grammar. In other words, it tends to ensure2 that any combination of linguistic elements which is licit in the language, but is not represented in syntactic-semantic grammars, will be stored in lexicons. Syntactic-semantic compositionality is usually defined as follows (Freckleton, 1985; Machonis, 1985; Silberztein, 1993; Lamiroy, 2003): a combination of linguistic elements is compositional if and only if its meaning can be computed from its elements. This is also our conception. However, in this definition, we consider that the possibility of computing the meaning of phrases from their elements is of any interest only if it is a better solution than storing the same phrases in lexicons, i.e. if they rely on grammatical rules with sufficient generality. In other words, we consider a combination of linguistic elements to be compositional if and only if its meaning can be computed from its elements by a grammar.

2 That can be empirically checked only after a lexicon and a grammar for the same language are complete and compatible.
In example (1) above, the lack of compositionality is apparent from distributional restrictions3 such as:

* Il est facile de nos semaines de s'informer *'It is easy to get informed nowaweeks'

3 The point is that this blocking of distributional variation (and other syntactic constraints) cannot be predicted on the basis of general grammar rules and independently needed lexical entries. Therefore, the acceptable combinations are meaning units and have to be included in lexicons as multiword lexical items.

Multiword expressions include many different subtypes, varying from entirely fixed expressions to syntactically more flexible expressions (Sag et al., 2002). We annotated expressions undergoing variations.4 In (2), the possessive adjective agrees obligatorily in person and number with the subject of the sentence:

(2) De (ses + *mes) propres mains, il a construit une maison 'With (his + *my) own hands, he built a house'

4 We annotated phrases which comprise a frozen part and a free part, e.g. au moyen de ce bouton 'with the aid of this switch', in which au moyen de 'with the aid of' is frozen, and ce bouton 'this switch' is a distributionally free noun phrase embedded in the global phrase. In such cases, we delimited the embedded free part with tags (cf. section 4.2).

Finally, we annotated named entities (NEs) of date and duration. The status of named entities with respect to compositionality is not fully consensual; however, we complied with the usual view that, since they follow quite specific grammatical rules, they should be considered as multiword expressions.

3.2 Adverbial function
We annotated only expressions with adverbial function, or circumstantial complements, i.e. complements which are not objects of the predicate of the clause in which they appear. We recognised them through criteria (Gross 1986, 1990a, 1990b) involving the fact that they are optional, they combine freely with a wide variety of predicates, and some of them pronominalize with specific forms. Phrases with adverbial function are often called 'circumstantial complements', 'adverbials', 'adjuncts', or 'generalised adverbs'. They assume several morphosyntactic forms: underived (demain 'tomorrow') or derived adverbs (prochainement 'soon'), prepositional phrases (à la dernière minute 'at the last minute') or circumstantial clauses (jusqu'à ce que mort s'ensuive 'until death comes'), and special structures in the case of named entities of time (lundi 20 'on Monday 20'). We annotated NEs only when they have an adverbial function, as in: Jean arrive lundi 20 'John arrives on Monday 20'. NEs of other categories, such as places, persons, events, etc., are usually not adverbials.

3.3 Features
Two types of features were included in the annotations. (i) Each occurrence of a multiword adverb was assigned one internal morphosyntactic structure or semantic type among 19. The definition of the morphosyntactic structures is based on the number, category and position of the frozen and free components of the adverbial. They are described as a sequence of parts of speech and syntactic categories. For example, à la nuit tombante 'at nightfall' is assigned a structure identified by the mnemonic acronym PCA, and defined as Prép Dét C (MPA) Adj, where C stands for a noun frozen with the rest of the adverbial, Adj for a post-posed noun modifier (e.g. an adjectival phrase or a relative clause), and MPA for a pre-adjectival modifier, empty in this lexical item.
For named entities, this feature encodes the semantic type: date, duration, time or frequency, in conformity with the typology of the Infom@gic project (Martineau et al., 2007). The 19 structures and semantic types are listed in Table 1. In this table, N stands for a free noun phrase, and W for a variable ranging over verb complements. Other symbols are easy to interpret: Prép, Dét, Adj, V, Conj...

Table 1: Morphosyntactic structures and semantic types of MWEs with adverbial function

(ii) The second feature is binary and encodes whether the adverbial assumes a conjunctive function in discourse, i.e. whether it connects the clause in which the adverbial occurs with the previous clause, as en dernier lieu 'finally'. The positive value is indicated by the identifier 'Conj' in attribute 'fs'. Example: <ADV fs='PAC Conj'>.

3.4 The corpus
The corpus we annotated includes: (a) the complete minutes of the sessions of the French National Assembly on October 3-4, 2006, transcribed into written style from oral French (hereafter AS)5 and (b) Jules Verne's novel Le Tour du monde en quatre-vingts jours, 1873 (hereafter JV). Errors (e.g. mis enoeuvre for mis en oeuvre 'implemented') have not been corrected. Statistics on the corpus are displayed in Table 2.

5 http://www.assemblee-nationale.fr/12/documents/index-rapports.asp.

              corpus AS   corpus JV     total
size (Kb)           824       1 231     2 055
sentences         5 146       3 648     8 794
tokens           98 969      69 877   168 846
types            18 028      19 828    37 856

Table 2: Size of the corpus

4. Methodology
In order to annotate the corpus, we tagged the occurrences of the expressions described in a syntactic-semantic lexicon of adverbials, as Abeillé et al. (2003), Baptista (2003) for Portuguese, and Català & Baptista (2007) for Spanish; we tagged NEs of date, duration, time, and frequency through a set of local grammars, as Friburger & Maurel (2004); then, we revised the annotation manually.

4.1 The lexicon
We used the same syntactic-semantic lexicon (Gross, 1990a) as Abeillé et al. (2003), so that the two corpora can be used jointly for further research. This lexicon has 6 800 entries. It is freely available6 for research and business under the LGPLLR license. It was constructed on the basis of conventional dictionaries, grammars, corpora and introspection, within the Lexicon-Grammar methodology (Gross, 1986; 1994). It takes the form of a set of Lexicon-Grammar tables such as that of Table 3, which displays a sample of the lexical items with the PCA morphosyntactic structure.

Table 3: Sample of the table of entries with the PCA morphosyntactic structure

In this table, each row describes a lexical item, and each column corresponds:
- either to one of the elements in the morphosyntactic structure of the items (columns with identifiers 'Prép', 'Dét', 'C', 'Modif pré-adj' and 'Adj');
- or to a syntactic-semantic feature (columns with binary values), for example the conjunctive function of the adverbial in discourse (column with identifier 'Conjonction'), or the constraint that the adverbial obligatorily occurs in a negative clause (column with identifier 'Nég obl');
- or to illustrative information provided as an aid for the human reader to find examples of sentences containing the adverbial (e.g. columns D and E giving an example of a verb compatible with the adverb).

There are 15 such tables, one for each of the morphosyntactic structures. The features provided by the lexicon were used to annotate the occurrences.

6 http://infolingu.univ-mlv.fr/english/DonneesLinguistiques/Lexiques-Grammaires/View.html.

4.2 Tagging
We tagged the corpus with the Unitex system (Paumier, 2006). Many multiword adverbs are entirely fixed expressions, but others present variations, such as grammatical agreement (cf. example (2), section 3.1), permutations and omissions. Due to these variations, we tagged them with finite-state transducers (FST): the input part of these transducers recognises the expressions and their variants, and the output part inserts the tags. Like Català & Baptista (2007), we used lexicalised transducers, i.e. one for each lexical item, and we generated them with the technique of parameterised graphs (Roche, 1999) modified by Silberztein (1999).
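Unitex transducers are graphs rather than code, but the input/output behaviour described here, recognise an expression and its variants and emit the enclosing tags, can be approximated with an ordinary regular expression per lexical item. The sketch below is a toy stand-in for one lexicalised transducer (for the expression of example (2)), not a rendering of the actual Unitex graphs; the PCA structure code is assumed for illustration only.

import re

# Toy stand-in for one lexicalised transducer: the "input part" is the
# pattern (covering agreement variants), the "output part" wraps the
# match in <ADV> tags carrying an assumed morphosyntactic structure code.
PATTERN = re.compile(r"\bde (ses|leurs) propres mains\b", re.IGNORECASE)

def tag(sentence: str, fs: str = "PCA") -> str:
    """Insert <ADV fs='...'> tags around every recognised variant."""
    return PATTERN.sub(lambda m: f"<ADV fs='{fs}'>{m.group(0)}</ADV>", sentence)

print(tag("De ses propres mains, il a construit une maison"))
# <ADV fs='PCA'>De ses propres mains</ADV>, il a construit une maison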
Multiword adverbs with a free prepositional phrase modifier (morphosyntactic structures PCDN and PCPN) were annotated semi-automatically as follows ('N' if the free complement is occupied by a noun phrase, 'S' if it is occupied by a clause):

(i) <ADV fs='PCDN'>compte tenu de <NP>vos ambitions</NP></ADV> 'taking into account your ambitions'
(ii) <ADV fs='PCDN'>compte tenu de <S>ce que tout va bien</S></ADV> 'taking into account that everything is OK'

Named entities with temporal value (cf. section 3.2) were automatically tagged by using FST methods similar to those applied for multiword adverbs.

4.4 Manual revision
The annotation was manually reviewed by three experts. This validation followed guidelines, which are available along with the corpus. It involved two operations.
(i) The sequences tagged with the aid of the lexicon and Unitex were checked in order to detect cases where the recognised sequence is in fact part of a larger MWE. For instance, when de force 'forcibly' occurred within the compound noun ligne de force 'thrust', the tags around de force were deleted. When the embedded free part of a multiword adverb is a coordination, we tagged it manually: <ADV fs='PCDN'>en termes de <NP>santé</NP> et d'<NP>éducation</NP></ADV> 'in terms of health and education'
(ii) The text was integrally reviewed in search of multiword adverbs absent from the lexicon, and thus undetected by Unitex, e.g. de plus 'moreover' or pour le moins 'at least'. This required the annotators to identify the syntactic structure of each sentence in the corpus. We had meetings during the annotation process in order to make it consistent.

5. Results
This corpus is annotated with 4 247 occurrences of MWEs with adverbial function. They represent about 6% of the simple-word occurrences in the whole corpus. Table 4, below, shows the number of occurrences of annotated MWEs. The lines of the table correspond to the morphosyntactic structures and semantic types.

Table 4: Annotated occurrences of MWEs with adverbial function in the corpus

6. Conclusion
This paper described the design of a French corpus annotated for MWEs with adverbial function. Various types of features are included in the annotations: the morphosyntactic structure, special functions in discourse (e.g. the conjunctive function) and the semantic types of named entities of time. This annotated corpus can be used jointly with the French Treebank (Abeillé et al., 2003) for research on information retrieval and extraction, automatic lexical acquisition, as well as on deep and shallow syntactic parsing.

7. Acknowledgment
This task has been partially financed by CNRS and by the Cap Digital business cluster. We thank Anne Abeillé for making the French Treebank available to us.
8. References
Abeillé, A., Clément, L., and Toussenel, F. (2003). Building a Treebank for French. In A. Abeillé (Ed.), Building and Using Parsed Corpora, Text, Speech and Language Technology, 20, Kluwer, Dordrecht, pp. 165--187.
Baptista, J. (2003). Some Families of Compound Temporal Adverbs in Portuguese. In Proceedings of the Workshop on Finite-State Methods for Natural Language Processing, EACL 2003, Budapest, Hungary, pp. 97--104.
Català, D., Baptista, J. (2007). Spanish Adverbial Frozen Expressions. In Proceedings of the Workshop on a Broader Perspective on Multiword Expressions, ACL 2007, Prague, Czech Republic, pp. 33--40.
Freckleton, P. (1985). Sentence idioms in English, Working Papers in Linguistics, University of Melbourne, pp. 153--168 & appendix (196 p.).
Friburger, N., Maurel, D. (2004). Finite-state transducer cascades to extract named entities in texts. Theoretical Computer Science, 313(1), pp. 93--104.
Gross, M. (1986). Lexicon-Grammar. The representation of compound words. In Proceedings of the Eleventh International Conference on Computational Linguistics, Bonn, West Germany, pp. 1--6.
Gross, M. (1990a). Grammaire transformationnelle du français: 3. Syntaxe de l'adverbe. Paris, ASSTRIL.
Gross, M. (1990b). La caractérisation des adverbes dans un lexique-grammaire. Langue Française, 86, pp. 90--102.
Gross, M. (1994). Constructing Lexicon-Grammars. In Atkins & Zampoli (Eds.), Computational Approaches to the Lexicon, Oxford University Press, pp. 213--263.
Lamiroy, B. (2003). Les notions linguistiques de figement et de contrainte, Lingvisticae Investigationes, 26:1, Amsterdam/Philadelphia: John Benjamins, pp. 1--14.
Machonis, P. (1985). Transformations of verb phrase idioms: passivization, particle movement, dative shift. American Speech, 60:4, pp. 291--308.
Martineau, C., Tolone, E., Voyatzi, S. (2007). Les Entités Nommées : usage et degrés de précision et de désambiguïsation. In Proceedings of the Twenty Sixth International Conference on Lexis and Grammar, Bonifacio, Corse du Sud, pp. 105--112.
Merlo, P. (2003). Generalised PP-attachment Disambiguation using Corpus-based Linguistic Diagnostics. In Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, pp. 251--258.
Paumier, S. (2006). Unitex Manual. Université Paris-Est. http://igm.univ-mlv.fr/~unitex/manuel.html.
Rajman, M., Lecomte, J., Paroubek, P. (1997). Format de description lexicale pour le français. Partie 2 : Description morpho-syntaxique. Rapport GRACE GTR-3--2.1.
Roche, E. (1999). Finite-state transducers: parsing free and frozen sentences. In Kornai (Ed.), Extended finite-state models of language, Cambridge University Press, pp. 108--120.
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., Flickinger, D. (2002). Multiword Expressions: A Pain in the Neck for NLP. In A. Gelbuk (Ed.), Computational Linguistics and Intelligent Text Processing: Proceedings of the Third International Conference CICLing 2002, Springer-Verlag, Heidelberg/Berlin, pp. 1--15.
Silberztein, M. D. (1993). Les groupes nominaux productifs et les noms composés lexicalisés. Lingvisticae Investigationes, 17:2, Amsterdam/Philadelphia, John Benjamins, pp. 405--426.
Silberztein, M. (1999). Manuel d'utilisation d'Intex version 4.12.
Villavicencio, A. (2002). Learning to distinguish PP arguments from adjuncts. In Proceedings of the Sixth Conference on Natural Language Learning, Taipei, Taiwan, pp. 84--90.
On Construction of Polish Spoken Dialogs Corpus
Agnieszka Mykowiecka∗†, Krzysztof Marasek†, Małgorzata Marciniak∗, Joanna Rabiega-Wiśniewska∗, Ryszard Gubrynowicz†
∗ Institute of Computer Science, Polish Academy of Sciences, J. K. Ordona 21, 01-237 Warsaw, Poland
† Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warsaw, Poland

Abstract
The paper concerns the construction of the Polish spontaneous spoken dialogs corpus built within the LUNA project. It elaborates on the process of collecting conversations, their transcription and annotation at the morpho-syntactic and concept levels. Corpus annotation is performed using a mixture of manual and automated techniques.

1. Introduction
In this paper we describe the process of construction and annotation of a Polish spoken dialogs corpus. Collecting corpora of spontaneous speech in French, Italian and Polish is one of the goals of the LUNA (spoken Language UNderstanding in multilinguAl communication systems) 6th Framework 33549 project. The general assumptions of this task are described in (Raymond et al., 2007). The Polish corpus is maintained by two partners: the Institute of Computer Science, Polish Academy of Sciences and the Polish-Japanese Institute of Information Technology. The task of building the corpus is twofold. First, a sub-corpus of 500 human-human conversations has been collected and annotated at the obligatory levels (agreed by the project partners). The annotation scheme takes advantage of previous work and experience (Mengel et al., 2000; Cattoni et al., 2001). Now, a sub-corpus of 500 human-machine dialogs is being collected, see (Koržinek et al., 2008), and will be annotated with the methods elaborated for the first sub-corpus. The chosen domain of conversations is public transport in Warsaw. A brief description of the corpus recordings of the first sub-corpus is presented in section 2. In section 3 we present the rules of dialog transcription that are common to all three languages. Some of them were introduced in order to preserve phenomena occurring in Polish dialogs. An account of morpho-syntactic annotation is given in section 4. The concept annotation level is presented in section 5.

2. The Polish corpus collection
The corpus of human-human dialogs contains spontaneous dialogs recorded at the Warsaw City Transportation Information Center in spring 2007. We have selected about 500 dialogs from around 12 thousand collected calls. The call center receives about 250 calls every day, but not all of them are relevant to our project's scope, and a part of them is of very low signal quality. An average conversation lasts 2 minutes. Before further processing, the chosen calls were classified according to the main dialog topic. There are five classes of dialogs:

• STOPS – A caller asks about:
  – a stop nearest to a given point in the city,
  – the name of the stop to get on or to get off,
  – a means of transportation (a bus or a tram) that stops at a given stop,
  – a stop appropriate for transfer between means of transportation.
• FARES_REDUCTION – A caller asks about rules concerning fare reductions in Warsaw.
• WHEN – A caller asks about timetables and travel duration.
• HOW_TO_GET – A caller asks how to get to a given place in the city or about details of transportation routes.
• DOES_IT_GO_TO – A caller asks if any means of transportation goes to a particular place in the city.

Naturally, many conversations refer to more than one of the described topic classes. The classification was done in view of the dominating subject of the dialog.
Still, the whole dialog is transcribed and annotated.

Topic class          Nb of dialogs   Nb of turns
DOES_IT_GO_TO                   91          2667
HOW_TO_GET                     140          4694
WHEN                            99          2512
STOPS                           51          1383
FARES_REDUCTION                 83          1868
All dialogs                    464         13124

Table 1: The distribution of the dialogs' topics

3. Data transcription
After the dialogs were chosen, an annotator converted them into texts using Transcriber (Barras et al., 1998). Every conversation was divided into turns attributed to a caller and an operator, respectively. The transcription output is an XML file which includes the dialog text and some metadata referring to articulation distortions, speaker and non-speaker noises, and time-stamps of the beginning and the end of each turn. General rules of transcription were agreed by all the project partners (Rodriguez et al., 2007) and they are presented in the context of Polish data in (Mykowiecka et al., 2007). To cover some phenomena significant for Polish dialogs, a few additional rules were defined. The most important Polish additions are:1

• It was agreed to transcribe spellings (and spelled acronyms) with capital letters tagged with a symbol pron=SPELLED, as in [pron=SPELLED-] PKO [-pron=SPELLED]. However, it is typical of Polish to syllabify words, especially proper names. Therefore the symbol pron=SYL was introduced, e.g. Bank [pron=SYL-] Narodowy [-pron=SYL] (National Bank).
• Acronyms pronounced as words are written in capitals, e.g. PEKAES. Some acronyms in Polish undergo inflection, e.g. ZUS, ZUS-u, ZUS-em. In these cases, an inflection suffix is added to the basis in small letters, e.g. ZUSu, ZUSem, etc.
• Foreign words or acronyms are transcribed in their original orthographic form and tagged with a symbol lang= and the label of the language, e.g. [lang=English-] Blue City [-lang=English]. In case they are inflected by a Polish speaker, an inflection suffix appears directly after the closing tag lang=, e.g. Plac [lang=English-] Wilson [-lang=English]a (Wilson's Square).
• A tag lex=FIL represents pause fillers, hesitations and articulatory noises such as breath, laugh, cough, etc. In order to capture significant non-verbal answers such as confirmation, which could be helpful for dialog act annotation, it was decided to distinguish here a subtype marked with a tag lex=FIL+.

An example of a transcribed utterance is presented in Fig. 1.2

user: [lex=FIL] chciałam się dowiedzieć jak długo jedzie autobus [silence] linii sto pięćdziesiąt siedem z ulicy Grójeckiej przy Bitwy Warszawskiej na Plac [lang=English-]Wilson[-lang=English]a
operator: [lex=FIL] do Placu [lang=English-]Wilson[-lang=English]a jedzie od dwudziestu sześciu do trzydziestu dwóch minut

Figure 1: Example of the text transcription

1 All examples come from the dialog corpus.
2 Translation: U: I wanted to know how long it takes the bus 157 to go from Grójecka Str at Warsaw Battle Str to Wilson's Square; O: To Wilson's Square it goes 26-32 minutes.
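The bracketed transcription conventions above are regular enough to be machine-processed. As a rough illustration (not a tool from the project), the sketch below strips the paired pron/lang tags and the lex=FIL markers from a transcribed turn, keeping inflection suffixes such as the a in Wilsona attached to the preceding word.

import re

# Paired region tags: [pron=SPELLED-] ... [-pron=SPELLED],
# [lang=English-] ... [-lang=English]; a Polish inflection suffix may
# directly follow the closing tag, e.g. "[lang=English-]Wilson[-lang=English]a".
PAIRED = re.compile(r"\[(pron|lang)=([A-Za-z]+)-\](.*?)\[-\1=\2\]", re.DOTALL)
FILLER = re.compile(r"\[lex=FIL\+?\]\s*")

def plain_text(turn: str) -> str:
    """Reduce a transcribed turn to plain words for downstream tools."""
    turn = FILLER.sub("", turn)                    # drop pause fillers etc.
    turn = PAIRED.sub(lambda m: m.group(3), turn)  # keep the tagged words
    return re.sub(r"\s+", " ", turn).strip()

print(plain_text("[lex=FIL] do Placu [lang=English-]Wilson[-lang=English]a "
                 "jedzie od dwudziestu sześciu do trzydziestu dwóch minut"))
# do Placu Wilsona jedzie od dwudziestu sześciu do trzydziestu dwóch minut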
4. Morphosyntactic annotation of dialogs
After transcription, the set of dialogs was annotated morphologically with POS tags and inflectional characteristics. As the project concerns three different languages, the partners have adopted the recommendations of EAGLES (Leech and Wilson, 1996) for the morphosyntactic annotation and have defined for each language a core set of tags consistent with international standards. There are several inflectional analyzers available for Polish (Hajnicz and Kupść, 2001), from which we have chosen AMOR (Rabiega-Wiśniewska and Rudolf, 2003). It was easy to extend it with the domain vocabulary and proper names, and to adapt it to the project annotation guidelines. The most important changes made to the analyzer's lexicon are described below:

• In the dialogs there are a lot of proper names; however, they split into a few POS classes. Originally, AMOR contained only nominal proper names; now there are also proper adjectives, e.g. Afrykańska (African), Centralna (Central), proper prepositions, e.g. Przy (At), and proper numerals, e.g. Siedem (Seven). At present, the set of proper names in the corpus consists of 6500 words that belong to 820 lemmas.
• Sometimes a caller is not sure what the name of the street (a building, etc.) is or how to pronounce it. Names that were not recognized by an operator or heavily distorted are transcribed according to their real pronunciation. At the morphological level such words get the POS tag 'PropName' but no additional characteristics. The aim is to be able to represent every proper name in the corpus, correct or mistaken. Compare the examples below:
  (1) <w id='37' word='Bliżna' lemma='-' POS='PropName' morph='-' />
  (2) <w id='58' word='Wólkę' lemma='Wólka' POS='Np' morph='acc.sg.fem' />
• The spoken language is rich in colloquial expressions (se instead of sobie) and ungrammatical (poszłem instead of poszedłem) word forms. Those appearing regularly and frequently in the collected texts were added to the lexicon.

The automatic morphological analysis gives approximately three different interpretations per word. As there is no Polish tagger which accounts for proper names and which was tested on speech data, disambiguation of morphological tags is done manually. However, it is planned to train a tagger on a sample of the annotated corpus in the next stage of the project. The morphological analysis results in obtaining for every word a set of tags: id, word, lemma, pos and morph. They are stored in XML files in a format presented in Fig. 2.3

3 Translation: I wanted to ask (about) bus 143 from the direction of Ursynów.

chciałam zapytać autobus sto czterdzieści trzy z kierunku Ursynowa

<words>
<w id='10' word='chciałam' lemma='chcieć' POS='VV' morph='1.sg.fem.past.ind.imperf' />
<w id='11' word='zapytać' lemma='zapytać' POS='VV' morph='inf.perf' />
<w id='12' word='autobus' lemma='autobus' POS='Nc' morph='nom.sg.m3' />
<w id='13' word='sto' lemma='sto' POS='NUM' morph='nom.nm1' />
<w id='14' word='czterdzieści' lemma='czterdzieści' POS='NUM' morph='nom.nm1' />
<w id='15' word='trzy' lemma='trzy' POS='NUM' morph='nom.nm1' />
<w id='16' word='z' lemma='z' POS='PreP' morph='-' />
<w id='17' word='kierunku' lemma='kierunek' POS='Nc' morph='gen.sg.m3' />
<w id='18' word='Ursynowa' lemma='Ursynów' POS='Np' morph='gen.sg.m3' />
</words>

Figure 2: Example of the morphological annotation

Morphologically annotated texts of dialogs are next segmented into elementary syntactic chunks. The aim of the syntactic description is to group the words into basic nominal phrases and verbal groups. As there exists no chunker suitable for the analysis of Polish spoken texts, a program used in the project was designed especially for the purpose. In order to find phrases within an utterance of one speaker, information about turns is used. The parser also uses some domain knowledge, which helps, for example, to recognize transportation line numbers. The phrase autobusy pięćset dwanaście sto siedemdziesiąt cztery 'buses five hundred twelve one hundred seventy four' can theoretically be divided in many ways, but we know that all buses in Warsaw have three-digit numbers, so we can divide the phrase properly into two numbers: pięćset dwanaście 'five hundred twelve' and sto siedemdziesiąt cztery 'one hundred seventy four'. The chunker also recognizes compound verbal phrases, będzie jechać 'will go', and nominal phrases (without prepositional modifiers), następny przystanek autobusowy 'the next bus stop'. For these phrases it indicates the main word, i.e. the word semantically most significant. In the previous examples it is jechać 'go' and przystanek 'stop', respectively. In the case of a nominal phrase it coincides with the head of the phrase. The syntactic segmentation of the previously morphologically annotated example is shown in Fig. 3.

<chunks>
<chunk id='8' span='word_10' cat='VP' main='word_10' />
<chunk id='9' span='word_11' cat='VP_INF' />
<chunk id='10' span='word_12' cat='NP' main='word_12' />
<chunk id='11' span='word_13..word_15' cat='NUM' />
<chunk id='12' span='word_16' cat='PP' />
<chunk id='13' span='word_17' cat='NP' main='word_17' />
<chunk id='14' span='word_18' cat='PN' />
</chunks>

Figure 3: Example of the syntactic annotation
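The three-digit constraint on Warsaw bus lines gives the chunker a simple way to segment otherwise ambiguous numeral sequences. A minimal sketch of that idea follows, using an illustrative word-to-value table rather than the project's actual numeral grammar, which works on morphologically annotated tokens.

# Illustrative values for the Polish numerals in the example above.
VALUES = {"sto": 100, "pięćset": 500, "dwanaście": 12,
          "siedemdziesiąt": 70, "cztery": 4}

def split_line_numbers(words):
    """Greedily group numeral words so that each group sums to a
    three-digit number (100-999), the known format of Warsaw bus lines."""
    groups, current, total = [], [], 0
    for w in words:
        v = VALUES[w]
        # starting a new hundreds block closes the previous line number
        if v >= 100 and 100 <= total <= 999:
            groups.append(current)
            current, total = [], 0
        current.append(w)
        total += v
    if current:
        groups.append(current)
    return groups

print(split_line_numbers(
    ["pięćset", "dwanaście", "sto", "siedemdziesiąt", "cztery"]))
# [['pięćset', 'dwanaście'], ['sto', 'siedemdziesiąt', 'cztery']]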
5. Semantic annotation of dialogs
Semantic annotation of the dialogs consists in assigning attributes and their values to phrases. The principles of the annotation in our project are similar to the attribute annotation in the MEDIA corpus (Hardy et al., 2003). The set of attributes was defined specially for the project and contains general transportation system features, some details on Warsaw public transport and some concepts related to the different types of questions occurring in the recorded conversations. The domain model specification started with defining a general ontology of public transport in OWL (http://www.w3.org/TR/owl-ref/). On its basis, a set of attributes representing concepts was defined. It contains simple notions:

• bus, tram, metro lines, routes, their ends and stops;
• places in the city: districts, streets, squares, parks, important buildings;
• fare reduction concepts;
• basic time point specifications.

And more complex ideas like:

• trips' beginnings and endings,
• trips' durations and other time specifications,
• questions concerning different attribute values.

Within the chosen domain there are a lot of proper names (there are over 4 thousand names of streets, buildings, city districts, etc.). Their recognition is not easy, see (Mykowiecka et al., 2008b):

• A lot of names are inflected, e.g. street names: Francuska, Francuskiej, Francuską, etc. (French), building names: Teatr Dramatyczny, Teatru Dramatycznego, etc. (Dramatic Theater).
• For many names there is more than one variant, e.g. names of persons in street names are often omitted: Krasińskiego instead of Zygmunta Krasińskiego.
• Complex names are simplified: Bitwy Warszawskiej 1920 r. (Warsaw Battle in 1920) to Bitwy, Plac Powstańców Warszawy (Warsaw Uprising Square) to Plac Powstańców.
• Proper names are sometimes ambiguous, as bus stops frequently have the same names as the streets, squares or buildings where they are situated.

In Polish, nouns, adjectives and numerals undergo inflection, so the recognized proper names had to be lemmatized. As the final quality of the annotation was the primary target, we introduced proper name elements into our inflectional dictionary, which enabled us to obtain lemmas for all name elements. We also prepared a lexicon which relates sequences of basic forms of name elements to the basic forms of the entire names (Mykowiecka et al., 2008b).
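Relating lemmatized element sequences to whole-name base forms can be pictured as a dictionary keyed on lemma tuples, with variant sequences (such as an omitted first name) mapping to the same canonical entry. The sketch below only illustrates this idea; the entries and the lookup strategy are assumptions, not the project's lexicon.

# Lemma sequences of name elements -> base form of the entire name.
# Variants (e.g. the omitted first name) map to the same canonical entry.
NAME_LEXICON = {
    ("Zygmunt", "Krasiński"): "Zygmunta Krasińskiego",
    ("Krasiński",):           "Zygmunta Krasińskiego",
    ("plac", "powstaniec", "Warszawa"): "Plac Powstańców Warszawy",
    ("plac", "powstaniec"):             "Plac Powstańców Warszawy",
}

def canonical_name(lemmas):
    """Look up the canonical full name for a sequence of element lemmas,
    falling back to the longest matching prefix of the sequence."""
    for end in range(len(lemmas), 0, -1):
        hit = NAME_LEXICON.get(tuple(lemmas[:end]))
        if hit:
            return hit
    return None

print(canonical_name(["Krasiński"]))           # Zygmunta Krasińskiego
print(canonical_name(["plac", "powstaniec"]))  # Plac Powstańców Warszawy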
The next planned step is to annotate the collected dialogs with concept names. To realize this task we use a rule-based Information Extraction approach. The annotation is therefore done automatically on the basis of manually created rules that define patterns for recognizing attributes and their values. At the moment there are 950 rules that recognize 134 attributes. Most attributes can have only a few possible values, but there are a few attributes that can have many values, like: destination, bus number, time description. An example of the output is shown in Fig. 4.

chciałam zapytać autobus sto czterdzieści trzy z kierunku Ursynowa

<concept id="4" span="word_10" attribute="Action" value="Request" />
<concept id="5" span="word_12..word_15" attribute="BUS" value="sto czterdzieści trzy" />
<concept id="6" span="word_16..word_18" attribute="SOURCE_DIR_TD" value="Ursynów" />

Figure 4: Example of the concept annotation

The first evaluation of the set of rules was done on 26 dialogs and showed an overall concept error rate of 20.5% (Mykowiecka et al., 2008a).
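A rule in such a system can be pictured as a pattern over the (chunked, lemmatized) utterance that emits an attribute-value pair in the Fig. 4 format. The sketch below shows one hypothetical rule for the BUS attribute over raw text; the real system's 950 rules operate on the morphosyntactic annotation, and this pattern is an assumption for illustration only.

import re

# One illustrative rule: a bus-line numeral phrase (up to three numeral
# words) after the noun "autobus" yields a BUS concept whose value is
# the numeral phrase itself.
BUS_RULE = re.compile(r"\bautobus\w*\s+((?:\w+\s+){0,2}\w+)", re.UNICODE)

def annotate_concepts(utterance):
    """Return (attribute, value) pairs found by the (toy) rule set."""
    concepts = []
    m = BUS_RULE.search(utterance)
    if m:
        concepts.append(("BUS", m.group(1)))
    return concepts

print(annotate_concepts(
    "chciałam zapytać autobus sto czterdzieści trzy z kierunku Ursynowa"))
# [('BUS', 'sto czterdzieści trzy')]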
A. Mykowiecka, M. Marciniak, and K. Głowińska. 2008a. Automatic Semantic Annotation of Polish Dialog Corpus. In progress.
A. Mykowiecka, M. Marciniak, and J. Rabiega-Wiśniewska. 2008b. Proper Names in Polish Dialogs. In Proceedings of the IIS 2008 Workshop on Spoken Language Understanding and Dialogue Systems, Zakopane, Poland. Springer Verlag. To appear.
J. Rabiega-Wiśniewska and M. Rudolf. 2003. Towards a Bi-Modular Automatic Analyzer of Large Polish Corpora. In P. Kosta, J. Błaszczak, J. Frasek, L. Geist, and M. Żygis, editors, Investigations into Formal Slavic Linguistics: Contributions of the Fourth European Conference on Formal Description of Slavic Languages (FDSL IV), held at Potsdam University, November 28–30, 2001, pages 363–372.
Ch. Raymond, G. Riccardi, K. J. Rodriguez, and J. Wisniewska. 2007. The LUNA Corpus: an Annotation Scheme for a Multi-domain Multi-lingual Dialogue Corpus. In R. Artstein and L. Vieu, editors, Decalog 2007: Proceedings of the 11th Workshop on Semantics and Pragmatics of Dialogue, pages 185–186, Trento, Italy, 30 May – 1 June.
K. J. Rodriguez, S. Dipper, M. Götze, M. Poesio, G. Riccardi, C. Raymond, and J. Rabiega-Wiśniewska. 2007. Standoff coordination for multi-tool annotation in a dialogue corpus. In Proceedings of the Linguistic Annotation Workshop, pages 148–155, Prague, Czech Republic. Association for Computational Linguistics.

A RESTful interface to Annotations on the Web

Steve Cassidy
Centre for Language Technology, Department of Computing, Macquarie University, Sydney
steve.cassidy@mq.edu.au

Abstract
Annotation data is stored and manipulated in various formats, and there have been a number of efforts to build generalised models of annotation to support sharing of data between tools. This work has shown that it is possible to store annotations from many different tools in a single canonical format and to transform them into other formats as needed. However, moving data between formats is often a matter of importing into or exporting from one tool to another. This paper describes a web-based interface to annotation data that uses an abstract model of annotation in its internal store but is able to deliver a variety of annotation formats to clients over the web.

1. Introduction

There has been considerable work in recent years on building generalised models of annotation and defining interchange file formats so that data can be moved between tools. This work offers the hope that annotation data can be released from the project- or discipline-specific dungeons it is often locked in due only to the difficulty of understanding data from foreign tools. However, while data sits in files on a researcher's disk it remains hard to discover and access, let alone collaborate on the development of a corpus. A second problem is that annotations, even in well known and widely distributed corpora, cannot be cited in the same way that we might cite a result in a research paper. Exceptions are cases where the authors of a corpus have taken care to define reference codes for segments of the corpus (e.g. the line numbers of the Brown corpus). We propose that both of these problems can be addressed by defining a well structured interface to corpora and annotations over the web.
Such an interface would have the advantage of defining a public URI for every corpus, and for every annotation within a corpus, that could be cited in a research paper. It could also allow widespread access to data from remote locations to facilitate collaboration and the sharing of annotations. Using the infrastructure of the web allows technologies such as caching and access control to be layered on top of the basic interface. This paper describes the core of a web-based interface to corpora. At present this interface only supports reading of annotations from a central annotation store; however, the design has been built with a view to enabling read/write access to data over the web.

2. Background

A number of proposals have been made in recent years for generalised data models for linguistic annotation. These models provide an abstract representation of annotation data that subsumes practices in the majority of research areas where language data is annotated or marked up in some way. While there are some differences between the proposals, they are largely compatible with each other; this is perhaps not surprising since they are designed to support transformation to and from a similar set of end-user formats. Two examples whose design is particularly focussed on interchange of annotations between formats are Annotation Graphs (Bird and Liberman, 2001) and the Linguistic Annotation Framework (Ide and Romary, 2007). Both are structured as directed graphs with annotations as nodes; annotations are distinct objects carrying arbitrary feature structures (attribute-value pairs) and may be related to each other by many kinds of relations. Both formats make use of so-called stand-off markup, where the annotations are stored separately from the primary data itself. Locations in the primary data are indicated by pointers; for audio and video data these are time values, while for textual data they can be character offsets or XPointer references. The use of annotations that point into primary data, instead of being embedded in it, was motivated in part by the need to represent overlapping hierarchies. Since XML, a common format used for annotation, can only directly represent a single hierarchy, a solution that separated the different hierarchies into different XML files was used. A side effect of this change is that annotations can be managed separately from the primary data, paving the way for an annotation architecture that uses an abstract interface rather than an application-specific file format. The work described here builds on this idea of an abstract interface to an annotation store as an alternative to reading and writing annotation files. Instead of thinking of annotations as elements in files and of corpora as collections of such files, we abstract these ideas to make all of these things resources within an annotation store. Internally, our system stores annotations as assertions in an RDF triple store and provides an abstract interface for the creation, deletion and query of annotation data. The proposal in this paper, though, does not make any assumptions about the kind of store that is used, only that it supports the idea of annotations as separate entities. This is true of the Annotation Graph system, for example, and will generally be true of any tool that displays and manipulates annotation data. This work has been implemented in a development system that is being used as part of a larger project to support collaborative annotation of language resources.
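To make the notion of an abstract annotation store concrete, here is a minimal Python sketch of such an interface; the class and method names are illustrative assumptions and do not reflect the actual implementation described in this paper.

    # Minimal sketch of an abstract annotation store: annotations are
    # separate entities that can be created, queried and deleted,
    # independently of any file format. Names are illustrative only.
    class AnnotationStore:
        def __init__(self):
            # each annotation is a dict of feature-value pairs keyed by id
            self._annotations = {}
            self._next_id = 0

        def create(self, features):
            """Store a new annotation and return its identifier."""
            ann_id = "ann%d" % self._next_id
            self._next_id += 1
            self._annotations[ann_id] = dict(features)
            return ann_id

        def delete(self, ann_id):
            self._annotations.pop(ann_id, None)

        def query(self, **constraints):
            """Return ids of annotations whose features match the constraints."""
            return [i for i, f in self._annotations.items()
                    if all(f.get(k) == v for k, v in constraints.items())]

    store = AnnotationStore()
    store.create({"type": "word", "label": "lives", "start": 5, "end": 10})
    print(store.query(type="word"))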
A demonstration of the service may be available at http://dada1.ics.mq.edu.au/, depending on the current status of the software. This paper first highlights the capabilities of the HTTP transport layer, then develops the design of an interface to annotation data over HTTP, and finally describes some extensions to this interface that we are currently exploring.

2.1. HTTP and the Web

The Hypertext Transfer Protocol (HTTP) is the base protocol of the World Wide Web and defines the conversation that takes place between a web server and a client such as a web browser. The original web was conceived as a read/write medium, and the design of HTTP reflects this in the provision of actions for creating, updating and deleting resources as well as retrieving them. Until recently, the two-way nature of HTTP was not widely exploited, but the development of web services following the REST (Representational State Transfer) architecture (Fielding, 2000) has highlighted the power of the original design. The REST view of the web is as a means of providing access to resources that are identified by unique addresses (Uniform Resource Identifiers, URIs). Resources are accessed through a constrained set of operations for transferring state information between client and server, be it a GET request to retrieve the current state of a resource or a POST request to update it. State information can range from the content of an HTML web page to the contents of a shopping cart or a value in a data store. It is also common to differentiate the internal form of the resource from the surface form that is transferred over the network. Hence, the current temperature on a web-accessible device could be transferred as a simple text file, an XML document or an HTML web page; the form of the response is determined by the request that is sent from client to server. The most common request in HTTP is GET, which retrieves the current state of a resource. A POST request is often used to submit form data to a web service, but in general it is intended to submit data to a resource and can be interpreted as creating a subsidiary resource (e.g. a file within a folder) or updating an existing resource. Less commonly used are PUT and DELETE, which create new resources and delete them; since these generally imply creating and deleting files on a server, they are often not implemented for security reasons. HTTP supports a few other kinds of request, and there are a number of extensions to the protocol to support additional applications (for example WebDAV, which supports remote file stores). While HTTP is an inherently open protocol, it is able to support secure and authenticated access to resources. Encrypted connections using the Secure Sockets Layer (SSL) ensure that traffic over the network cannot be intercepted. Authentication can be layered on top of the basic HTTP protocol using cookies, additional headers exchanged with every transaction. In combination, these can provide secure access to resources mediated by appropriate authentication and authorisation controls. This is an important feature for working with language resources, which often need to be protected from general access; some work relating to this will be outlined later in the paper.

3. Annotations on the Web

3.1. What gets a URI?

The first question in designing an interface to annotations over the web is that of designing the URI space: the logical structure of URIs used to retrieve and modify annotations.
Closely tied to this is the question of what should have a URI of its own. Our proposal is a three-level abstraction of resources from the annotation store: corpora, annotation sets and annotations. We also include an explicit representation of an annotation end point (a start or end time, or a pointer to a document location) called an anchor. Each of these kinds of resource is identified by a unique URI, which is both a canonical name for the resource and a means of accessing a description of it over the HTTP interface.

Corpora represent collections of documents whose annotations are stored on the server. A corpus might be a traditional curated collection such as the TIMIT or BNC corpora, or an ad-hoc collection by a single researcher. A corpus has a URI of the form http://example.org/corpora/NAME, where NAME is a symbolic name for the corpus (in these examples we use a common prefix of http://example.org/corpora/ in all URIs; this is arbitrary and will depend on the server used to store the corpora). A collection of corpora housed on a given server will also have a URI (http://example.org/corpora/ here) that could be used to discover what data is available on that server.

Annotation Sets are containers for the annotations on a single document or media file. It is common to have this level of abstraction when using a tool such as ELAN (Wittenburg et al., 2006) or Transcriber (Barras et al., 1998) that stores all annotations on a media file in a single XML file. An annotation set might correspond to more than one of these XML files when multiple kinds of annotation are stored in different files. An annotation set is always part of a corpus and has the corpus URI as a prefix of its own URI, which is of the form http://example.org/corpora/NAME/ASID, where ASID is a unique identifier for the annotation set.

Annotations are the individual annotations that make up an annotation set. A single annotation might store the part of speech of a word or a phonetic label for a segment of a speech signal. The URI of an annotation has an annotation set URI as a prefix: http://example.org/corpora/NAME/ASID/ANNID, where ANNID is an annotation identifier.

Anchors are the endpoints of annotations and are represented as explicit resources to allow them to be shared between annotations. For example, one anchor may be the end point of one annotation and the start point of a second. Anchors appear in some form in many annotation formats, including Annotation Graphs (Bird and Liberman, 2001) and ELAN (Wittenburg et al., 2006), which calls them time slots. Since anchors are also contained within annotation sets, they too have a URI with an annotation set URI as a prefix: http://example.org/corpora/NAME/ASID/ANCHID, where ANCHID is an anchor identifier.

Each of these kinds of resource can be described by a feature structure (in the TEI or ISO 24610-1 sense (M. Laurent Romary and TC 37/SC 4/WG 2, 2006)) containing information about the resource. This mechanism supports attaching feature sets at any level of detail, from the corpus down to the annotation itself. Feature values can include relations between resources; these are easily expressed since each resource has a unique URI that can appear as the value of a feature. The vocabulary used in defining features is of course important; we note that the Linguistic Annotation Framework (Ide and Romary, 2007) directly addresses this need by setting up standards for a Data Category Registry that would allow mapping of feature names between resources.
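The URI space described above can be summarised in a few lines of code. The following Python sketch is illustrative only, following the arbitrary example.org prefix used in this section; the helper names are our own.

    # Minimal sketch of the proposed URI space; helper names are
    # illustrative assumptions, and the prefix is the paper's arbitrary
    # example.org base.
    PREFIX = "http://example.org/corpora"

    def corpus_uri(name):
        return "%s/%s" % (PREFIX, name)

    def annotation_set_uri(name, asid):
        return "%s/%s" % (corpus_uri(name), asid)

    def annotation_uri(name, asid, annid):
        return "%s/%s" % (annotation_set_uri(name, asid), annid)

    def anchor_uri(name, asid, anchid):
        # anchors live in the same namespace as annotations
        return "%s/%s" % (annotation_set_uri(name, asid), anchid)

    print(annotation_uri("andosl", "msdjc001", "ann0293"))
    # -> http://example.org/corpora/andosl/msdjc001/ann0293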
3.2. Responses to URIs

Having said that these resources have unique URIs that can be published and accessed to allow sharing of annotations, we still need to define what exactly will be returned if someone enters one of these URIs into a web browser. By default, the response to a request for a URI from the annotation server will be an HTML representation of the resource being referenced. This means that someone can access one of these published URIs in a web browser and see a human-readable representation of the corpus, annotation set or annotation. The actual representation that is returned is the concern of the implementer of the server and need not be uniquely defined; for example, a server that holds annotations of video data might serve a representation of an annotation set as a page with the video embedded alongside a browseable version of the annotations, similar to that developed by the EOPAS project (Thieberger and Schroeter, 2006) for ethnographic data. Our current implementation includes links to all of the subordinate resources in the HTML representation: the page generated for a corpus links to all of the annotation sets in the corpus, while the annotation set links to all of the annotations. The page for an annotation includes all of the properties associated with the annotation and links to any other associated annotations (e.g. parents, dependencies, etc.).

3.2.1. Content Negotiation

A little-used option in HTTP is the ability for the client to request certain types of content when requesting a resource. For example, a client can ask for http://example.org/data while stating that it will accept plain text or PDF; the server can then respond with whichever of these it is able to produce. This process is called content negotiation, and it is not widespread, partly because not all browsers support it. The web service described here makes use of content negotiation to serve different kinds of content to different clients. If the client is a conventional web browser, the server will generate HTML descriptions of resources; if, on the other hand, the client is an annotation tool, it can request data in, for example, ELAN eaf format. Content negotiation will allow us to serve different representations of each resource to different kinds of client. We can, for example, return a version of an annotation set in the format required by an annotation tool such as ELAN or Transcriber. In this way, the interface can transparently realise the format conversion functionality that is at the core of standards such as LAF (Ide and Romary, 2007) or AG (Bird and Liberman, 2001). The same annotation could then be accessed by an ELAN user and a Transcriber user without distributing two distinct versions of the annotation or going through any explicit conversion process. In some situations content negotiation is not possible, for example when including links in a web page or when dealing with older HTTP client software. In these cases it is possible to achieve the same end by augmenting the URI of a resource with a query string indicating the type of representation required. So, to retrieve an ELAN format representation of an annotation set one could retrieve http://example.org/corpora/andosl/foobar?format=application/xml+eaf (the exact keyword and format indicator need standardisation; this example is included to illustrate the capability).
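From a client's perspective, content negotiation and the query-string fallback might look as follows. This Python sketch uses the requests library; the media type is the paper's illustrative, non-standardised example.

    # Sketch of content negotiation from the client side; the media type
    # follows the illustrative example above and is not a standardised value.
    import requests

    uri = "http://example.org/corpora/andosl/foobar"

    # ask for an ELAN eaf representation via the Accept header ...
    r = requests.get(uri, headers={"Accept": "application/xml+eaf"})

    # ... or, for clients that cannot set headers, via the query-string fallback
    r = requests.get(uri + "?format=application/xml+eaf")

    print(r.status_code, r.headers.get("Content-Type"))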
3.2.2. Low Level Access

There is, however, a third possibility, one that offers to realise the full potential of the web-based annotation store: returning a form of each resource that can serve as the basis of a read/write interface to the store. The idea here is that instead of reading and writing annotation files in an XML format, a tool could query the server directly for information about the annotations on a document or media file. To support this, the response to a request for an annotation set could be a simple XML list of the URIs of the annotations, perhaps with a small amount of data from each, such as a label or start and end times. Using this, an annotation tool could determine which annotations are of interest and query the server for more information about each. The response to a request for an annotation could be a simple XML representation of the annotation as a feature structure. This kind of server would allow updates to be made to annotations using the same kinds of messages sent from the client to the server. To add a new annotation to an annotation set, the client would make a POST request to the annotation set URI, e.g. http://example.org/corpora/as123/, with a request body containing the feature structure for the new annotation. In response, the server creates a new annotation and returns an HTTP response confirming that it was created, together with the URI of the new annotation. Similarly, a POST request to an existing annotation URI has the effect of updating the annotation. Finally, a DELETE request to an annotation or annotation set URI removes the corresponding resource. These requests can be used by an annotation tool to manipulate the annotations stored on the server directly, rather than working through any kind of file format.
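The read/write interface just described could be exercised from a client as in the following Python sketch; the URIs and the JSON payload shape are illustrative assumptions, since the exact serialisation of feature structures is left open here.

    # Sketch of create/update/delete over the low-level interface; payload
    # shape and the use of the Location header are illustrative assumptions.
    import requests

    aset_uri = "http://example.org/corpora/as123/"

    # create: POST a feature structure for the new annotation to the
    # annotation set URI; the server replies with the new annotation's URI
    r = requests.post(aset_uri, json={"type": "word", "label": "lives"})
    ann_uri = r.headers.get("Location")

    if ann_uri:
        # update: POST revised feature values to the annotation's own URI
        requests.post(ann_uri, json={"label": "lived"})
        # delete: remove the annotation resource
        requests.delete(ann_uri)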
4. Building Upon the Interface

One of the primary advantages of defining a web-based interface built on HTTP access to resources is that the existing infrastructure of the web can be leveraged to add new functionality with little extra effort. The web is a very mature family of technologies, and many issues around efficient, secure distribution of data have been addressed in general-purpose technologies layered on top of HTTP. A few of the possibilities are outlined here.

4.1. Caching and Proxies

A significant problem with providing remote access to resources such as annotations or primary linguistic data is the time lag between a request and the response being delivered over the network. This would be an immediate barrier to adoption of this kind of technology in applications that require very fast access to data. The problem is not unique to annotation, and since we have layered our interface on top of HTTP we can take advantage of HTTP caches to speed access to frequently accessed data. An HTTP cache acts as a proxy between the client and server such that most transactions occur just as they would if no proxy were in place. The cache will, however, remember the responses to some requests and, if configured appropriately, will return a local copy of the response if it is requested again. A cache can be run on an individual machine or within an organisation, where the requests from all users within a research group would be cached together, speeding access to the resources being used by the group. While a generic HTTP proxy cache such as Squid (http://www.squid-cache.org/) can be used in this way, there is scope for writing a special-purpose proxy cache that knows about usage patterns of annotation data. Such a proxy could pre-fetch annotations that might be used in the near future. While caching files is one important function of a proxy server, a proxy can also fulfil another role in this context: it acts as a mediator between the client and one or more servers and as such can federate access to multiple annotation servers. One could imagine a departmental or institutional proxy supporting access to many servers via a common cache while also serving local resources transparently to users. A network of such proxies could effectively provide distributed, redundant storage of annotation data.

4.2. Authentication and Authorisation

As described so far, all resources are available for anyone on the internet to read and possibly update; this is clearly not what most researchers would require, and many language resources must be restricted in some way. Again, we can make use of existing web technology to layer authentication and authorisation on top of the HTTP interface described above. HTTP provides a simple authorisation scheme as part of the protocol which allows resources to be password protected. Web servers such as the Apache server allow configuration settings that protect different URIs with different user names and passwords, and this could be used to restrict access to distinct groups of users. Similarly, the operations that update an annotation (PUT, POST, DELETE) can be given different levels of password protection using standard server settings. A more sophisticated solution has been developed for applications that require more complex authorisation rules to be enforced. The XACML (XML Access Control Markup Language, http://www.oasis-open.org/committees/xacml/) standard allows complex access control rules to be written which take into account external factors such as the date, or file properties such as size or the source of the data. We are currently investigating the use of XACML in conjunction with our annotation server to provide fine-grained access control to both annotations and primary data. For example, one might want to restrict access to part of a recording based on the identity of a speaker in that recording. XACML allows rules to be written that express this restriction; we are now looking at how the server infrastructure needs to be configured to put this into practice. Rather than requiring every server to maintain passwords and user credentials for authorised users, the Shibboleth system (http://shibboleth.internet2.edu/) implements a federation of identity providers such that a user can be authenticated against their home institution. An identity federation such as this would allow groups of researchers to be granted access to resources based on, for example, their host institution or membership of some project. We are currently working with the RAMP project at Macquarie (http://www.melcoe.mq.edu.au/projects/RAMP/) on integrating our server with the Muradora data repository (http://www.muradora.org/), a version of the popular Fedora server that integrates Shibboleth and provides a web-based interface for building XACML policy documents. Our work here aims to illustrate how access to source data, meta-data and annotations can be mediated by appropriate authentication and authorisation.
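As a simple illustration of the basic HTTP authorisation scheme mentioned above, a client might access a password-protected corpus as follows; the URI and credentials are, of course, placeholders.

    # Sketch of password-protected access over an encrypted connection,
    # using HTTP Basic authentication; URI and credentials are placeholders.
    import requests

    r = requests.get("https://example.org/corpora/protected-corpus/",
                     auth=("annotator", "secret"))
    if r.status_code == 401:
        print("not authorised to read this corpus")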
4.3. Version Management

Annotations are rarely static: errors are found and corrected, and new versions of corpora are published. Especially in the context of a collaborative annotation tool, it must be possible to manage different versions of annotations and to integrate version control operations such as rolling back changes or generating patch sets to send to other users. As part of our work on the back-end RDF annotation store we have developed a version control system for RDF triple stores that is designed to support these operations on annotation data (Cassidy and Ballantine, 2007). If the URIs published for annotation sets and annotations are to be useful, they must be constant over time: one must be able to publish a reliable URI for the annotation set used in a given study, not one which points to the most recent version of that annotation. Hence we must be able to include revision information in the URI. While we have not yet integrated our version control system with the HTTP interface, there are a number of possible ways in which one could refer to historical versions of data via a URI. One simple option is to prefix the corpus name with a revision identifier: http://example.org/corpora/101029/andosl/msdjc001/ann0293, where 101029 uniquely identifies the revision of the annotation being referred to. The most recent annotation could still be referenced without the version identifier, while the longer form could be used where longevity of reference is required.

4.4. Mashups of Data and Annotations

One of the defining features of the recent boom of applications on the web has been the growth of mashups built from data provided by different sources. A common component of these is Google Maps (http://maps.google.com/), which can be used to visualise geographic data available on the web. The open nature of the web, and the fact that data is available in well defined formats through well defined interfaces, means that data can be re-purposed into applications that might not have been conceived by the original authors. In the annotation domain there are many possibilities for mashups that might combine annotation data with other widely available data sources such as WordNet, Wikipedia, etc. Annotations might also be combined with each other, for example by merging different styles of annotation or by augmenting annotations with data from lexical resources. The important point here is that this capability comes for free once we adopt an open, well defined interface using well understood technology.

5. Conclusion

This paper has given a brief overview of the design of a web-based interface to an annotation store. The design uses the REST approach to make corpora, annotation sets and annotations available as first-class resources on the web. This approach changes the way that annotation tools work with annotation data: instead of relying on local storage of data in files, tools can work with an annotation store through an abstract interface. The fact that this interface uses the HTTP protocol of the web means that the store can be remote and shared between users. By layering authentication, authorisation, caching and other standard HTTP technologies on top of the interface we can add further functionality.

6. References

Claude Barras, Edouard Geoffrois, Zhibiao Wu, and Mark Liberman. 1998. Transcriber: a Free Tool for Segmenting, Labeling and Transcribing Speech. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC), pages 1373–1376, Granada, Spain, May.
S. Bird and M. Liberman. 2001. A Formal Framework for Linguistic Annotation.
Speech Communication.
Steve Cassidy and James Ballantine. 2007. Version Control for RDF Triple Stores. In ICSOFT 2007, Barcelona, Spain, July.
Roy Thomas Fielding. 2000. Architectural Styles and the Design of Network-based Software Architectures. PhD thesis, University of California, Irvine.
N. Ide and L. Romary. 2007. Towards International Standards for Language Resources. In L. Dybkjaer, H. Hemsen, and W. Minker, editors, Evaluation of Text and Speech Systems, pages 263–284. Springer.
M. Laurent Romary and TC 37/SC 4/WG 2. 2006. Language resource management – Feature structures – Part 1: Feature structure representation. ISO 24610-1.
Nicholas Thieberger and Ronald Schroeter. 2006. EOPAS, the EthnoER online representation of interlinear text. In Linda Barwick and Nicholas Thieberger, editors, Sustainable Data from Digital Fieldwork, pages 99–124, University of Sydney, December.
Peter Wittenburg, Hennie Brugman, Albert Russel, Alex Klassmann, and Han Sloetjes. 2006. ELAN: a professional framework for multimodality research. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006).

Multiple Purpose Annotation using SLAT — Segment and Link-based Annotation Tool —

Masaki Noguchi†, Kenta Miyoshi†, Takenobu Tokunaga†, Ryu Iida‡, Mamoru Komachi‡, Kentaro Inui‡
† Department of Computer Science, Tokyo Institute of Technology, Ôokayama, Meguro, Tokyo 152-8552, Japan
{mnoguchi,kmiyoshi,take}@cl.cs.titech.ac.jp
‡ Graduate School of Information Science, Nara Institute of Science and Technology, Takayama 8916-5, Ikoma, Nara 630-0192, Japan
{ryu-i,mamoru-k,inui}@is.naist.jp

Abstract
In recent years, the use of large scale corpora in NLP applications, such as statistical parsing, has become prominent. As their use gained credibility, naturally so did the types of information they provided. There exist today many groups that create corpora, ANC and the SFB at the University of Potsdam among them, and in many cases these groups also provide specialized annotation tools for their corpora. However, these tools are just that: specialized, i.e. designed to work with one very specific annotation definition, without flexibility in mind. In the early stages of a project, the annotation specification often changes, which makes it difficult to use a tool with such rigid boundaries. In this paper, we propose a browser-based annotation tool, SLAT, which allows annotations to be easily added and customized. We also explain the steps involved in customizing SLAT to meet a user's project needs.

1. Introduction

In recent years, the use of large scale corpora in NLP applications, such as statistical parsing, has become prominent. As their use gained credibility, naturally so did the types of information they provided. There are many projects which construct corpora, such as the ANC (http://www.americannationalcorpus.org) and the Sonderforschungsbereich (SFB) on information structure at the University of Potsdam (http://www.sfb632.uni-potsdam.de/), just to name a few. The annotation of sentences by hand is not only extremely time consuming but also leads to various kinds of errors. These errors, combined with other user-introduced biases, have a large effect on the performance (and subsequent evaluation) of systems trained on these corpora. Thus, the information provided by corpora must be both accurate and consistent. To this end, annotation tools for simplifying and constraining human input have been developed in various projects, and these have decreased the costs of constructing corpora.
These tools are developed to work with a very well defined annotation specification. In the early stages of a project, the annotation specification will often change, making it difficult to use a tool with such rigid boundaries. The format for storing information also differs from tool to tool, so their data is not immediately interoperable, and a conversion from one format to another is required each time an experiment is conducted or a method is evaluated. In the next section, we briefly review some existing annotation tools and then describe our motivations for developing a new annotation tool. We introduce SLAT [sléit] (Segment and Link-based Annotation Tool), aimed at satisfying these motivations, and briefly explain its features. Lastly, we summarize this paper and describe future work.

2. Requirements for Annotation Tools

Stefanie Dipper et al. (Dipper et al., 2004) compared existing tools that use XML as their data storage format. They examined twelve individual research projects from several disciplines, with corpora that mostly consisted of five types of annotation: semantic, discourse and focus annotations, as well as diachronic data and typology. To manage these types of annotation, they formulated seven requirements for annotation tools: diversity of data, multi-level annotation, diversity of annotation, simplicity, customizability, quality assurance and convertibility. The first three relate to data annotation, while the latter four relate to the usability of the annotation tool. They compared five annotation tools to test the validity of these criteria: TASX Annotator (http://tasxforce.lili.uni-bielefeld.de/), EXMARaLDA (http://www.exmaralda.org/) (Schmidt, 2001), MMAX (http://www.eml-research.de/english/research/nlp/download/) (Müller, 2006), PALinkA (http://clg.wlv.ac.uk/projects/PALinkA/) (Orăsan, 2003) and Systemic Coder (http://www.wagsoft.com/Coder/).

3. Requirements during the Early Stages of a Project

As presented in the previous section, an annotation tool must satisfy these requirements to be successful in corpus annotation. Previously developed annotation tools have mostly focused on the usability of the system with regard to the annotation task itself, i.e. how easy or difficult it is to add and remove annotations. Usability is clearly important. In designing an annotation tool, however, it is also crucially important to take into account, as design issues, the whole demands of a corpus project, which typically involves not only annotating text but also designing the tag set and evaluating and maintaining the resultant corpus. More specifically, at least the following three issues should be addressed so that the tool can effectively support a project even during its initial, unstable stages:

1. Cost of installing an annotation tool
Creating a corpus involves a large number of hands engaging in the task of annotation. Particularly for those unfamiliar with computers, merely installing an annotation tool can become a burden.

2. Variation of data schemes across annotation tasks
Past annotation tools have been developed with a specific annotation scheme in mind, making them unsuitable for other types of annotation. A multipurpose annotation tool must use a flexible data scheme that can incorporate various types of annotation, and must have an interface adaptable to various annotation tasks.

3. Quality of the corpus
As previously mentioned, the initial phases of a project are often filled with adjustments to how a corpus will be annotated.
Since typical annotators work individually while referring to a specification, this period can result in poor consistency. These errors affect the quality of a corpus which in turn affects the performance and subsequent evaluation of a system. We introduce SLAT (Segment and Link-based Annotation Tool), in the next section. For tackling the first issue, we adopt a client/server architecture. We present annotation abstraction for resolving the second issue and discuss some already-developed annotation tools and their own implementations. Finally, we summarize our findings and briefly touch upon the third issue enumerated above. 4. SLAT SLAT is a web-based annotation tool that employs a client/server architecture. With the ubiquitousness of the internet, this means that SLAT can be accessed almost anywhere; the only prerequisite for beginning annotation is having access to the URL via a browser. This also serves to reduce the cost and time of installation on an annotator’s machine. The server-end of SLAT is composed of a computer running a database and a PHP-enabled web-server. The SLAT server stores all documents to be annotated, annotation information and customized user configurations. In this section, we first propose an abstraction of annotations using segments and links, which allows SLAT to adapt to many different annotation tasks. We then address the interface issues, detailing the components of the current SLAT interface, and finally demonstrate how SLAT can be easily customized. 4.1. Abstraction of Annotations To explore a universal data scheme applicable to various types of annotations, we discuss the abstraction of annotations using a simple POS annotation example shown in Figure 1. In this example, annotation is carried out by affixing POS and named entity tags to specific regions of text, called segments. Thus, “John” is annotated as N and N-PER and “New York” as N and N-LOC etc. Relations between segments are then identified, such as coreference or a certain semantic role. This is called linking. Using this abstraction, almost any annotation can be represented. SLAT adopts stand-off annotation, i.e. all annotated data is stored separately from the original data. John lives in N VERB-PRE PREP New York. N N-PER N-LOC He bought a book last Saturday. ProN1 VERB-P ART N ADJ N He wants to be a lawyer. ProN2 VERB-PRE TO BE ART N Figure 1: An example of POS annotation 4.1.1. Segments When annotating a text, it is important to both indicate the particulars of a region as well as its relation to other parts of the text. A segment is indicated by marking the starting and ending offsets of a region. For representing this information, tags are inserted into the text. A fragment of the text can be multiple segments such as “John” and “New York” in Figure 1. Furthermore, segments can be nested and overlap, such as ‘XXX YYY ZZZ’. 4.1.2. Links As mentioned above, segments may have several types of relations to one another, e.g. “John” and “he” (coreference), or “bought” and “a book” (semantic role). All relations have at least two properties: transitivity and directionality. By combining these two properties, we can divide relations into four general groups: 1. transitive and directed E.g. “car”→“door”→“glass”, part-of relations belong to this group. Temporal relations between events also belong to this group. 2. transitive and undirected Coordination and coreference, such as the relations between “John (N-PER)”, “He (ProN1 )” and “He (ProN2 )”in Figure 1. 3. 
3. non-transitive and directed
Semantic role labelling, e.g. the relation between "bought (VERB-P)" and "book (N)", belongs to this group.

4. non-transitive and undirected
Relations in this group represent a special case only and consist of just a pair.

4.2. Interface

SLAT's interface has been designed to allow for intuitive, visual annotation. It has two main panes in the center of the screen, as shown in Figure 2. The left pane, an editor pane, displays the text to be annotated, while the right pane displays a list of all current segments and links. Annotating a segment is as easy as marking a region of text with the mouse. The upper pane shows information about the selected and focused segments; in Figure 2, "support systems" is selected and "adopt" is focused. The notions of selected and focused segments roughly correspond to the destination and source segments of a link: a new link is annotated by taking the selected segment as the destination and the focused segment as the source of that link. The two kinds of segment also differ operationally: the system allows users to move the focus between segments using the arrow keys, while selected segments are determined by clicking on them with the mouse. This operational distinction is useful for annotations where multiple links extend from a single segment, such as predicate-argument annotation. The focusable segments are defined in the configuration, as described below.

Figure 2: Snapshot of SLAT (showing the configuration pane, the focused/selected segments pane, the list pane and the edit pane)

In the editor pane a segment is displayed as a colored and underlined string; strings that belong to more than one annotation have multiple underlines. A segment may be selected by clicking on an underlined region. When a segment is selected, links attached to that segment are presented by highlighting the counterpart segments with colors and underlines. In Figure 2, two links are displayed: one between "adopt" and "support systems", and the other between "adopt" and "abstracted annotation". The right list pane contains a table-view list of segments and links. Clicking a column header allows sorting by properties such as offsets, segment/link names and so on. By clicking on a segment within this list, the left editor pane scrolls to display the selected item; selecting a link item identifies both the destination and source segments within the editor pane.

4.2.1. Interface Design

There are essentially two ways of representing relations: one using edges and the other table-based. In an interface that displays links using edges, identifying a link can become difficult when there is a large number of annotated links. A table-based interface, however, has the obvious shortcoming of lacking a good visual representation of source and destination. SLAT's interface was designed with both these points in mind: relations with focused segments are highlighted by underlined and colored strings to avoid congestion in the editor pane, and highlighting can be toggled by a check-box to allow annotators to concentrate on specific tags during annotation. Many treebank projects represent the phrase structure of sentences using a tree representation. Phrase structures can be represented in terms of segments and links, though the interface today is less than ideal for displaying their hierarchical structure. We designed our interface to be as adaptable to various annotation tasks as possible; segments and links are more versatile than tree representations, and in particular allow for overlapping segments, which are troublesome to deal with using trees. That being said, a tree representation might be more suitable when annotating phrase structures, and we plan to incorporate another type of view pane for displaying trees, based on a user's configuration options.
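To make the segment-and-link abstraction of Section 4.1 concrete, the following minimal Python sketch shows one possible data scheme; the class and field names are illustrative assumptions, not SLAT's actual server-side implementation.

    # Minimal sketch of a segment/link data scheme. Annotation is
    # stand-off: segments hold character offsets into the unmodified
    # source text. Names are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class Segment:
        id: str
        start: int      # starting character offset in the text
        end: int        # ending character offset (segments may nest/overlap)
        tag: str        # e.g. "N", "N-PER", "VERB-PRE"

    @dataclass
    class Link:
        source: str     # id of the focused (source) segment
        target: str     # id of the selected (destination) segment
        tag: str        # e.g. "coreference", "semantic-role"
        transitive: bool
        directed: bool

    text = "John lives in New York."
    s1 = Segment("s1", 0, 4, "N-PER")      # "John"
    s2 = Segment("s2", 5, 10, "VERB-PRE")  # "lives"
    # a semantic-role link is non-transitive and directed (group 3)
    link = Link(source=s2.id, target=s1.id, tag="semantic-role",
                transitive=False, directed=True)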
4.3. Customization

SLAT allows users to customize tag-sets in two ways: (1) directly through the GUI, and (2) by uploading a file containing tag-set definitions. Figure 3 shows a snapshot of the configuration interface, through which the user can create segment and link definitions. A SLAT configuration can define several types of annotation simultaneously, e.g. coreference, predicate-argument structure and syntactic structure. Users can toggle the visibility of each tag using the configuration pane just above the edit pane.

Figure 3: Snapshot of configuration pane

4.3.1. Segments
Tag-name defines the name of the segment; key-bind is an optional keyboard shortcut for creating a new segment while annotating a text; color and background-color define display colors; focusable toggles whether or not a segment can be focused using the arrow keys; clickable and visible define whether a segment is selectable by clicking and whether it is visible, respectively. Sample definitions are shown in the upper table of Figure 3.

4.3.2. Links
Tag-name defines the name of the link; key-bind is the same as explained above, only for links; transitivity and directed define whether a link has each attribute, as defined earlier. Based on these settings, SLAT can constrain the selection and pairing of source/destination tags; to allow several source/destination combinations, they should all be defined here. Sample definitions are shown in the lower tables of Figure 3.

4.4. Other Features
When a segment is selected, the user's selection can be limited to segments with the focused/selected segment's tag name. This greatly decreases annotation errors caused by accidentally selecting the wrong segment. After annotation, a user can easily retrieve the annotated text from SLAT via the web browser. SLAT supports undo/redo as well as customization and configuration of tag-sets, and it supports any language that can be encoded in UTF-8.

5. Summary and Future Work
With the goal of covering a broad range of annotation tasks, we have proposed a data scheme that is easy to understand and to use. In addition, we have introduced the tool SLAT, which implements many features, including several requirements designated as especially important during the early stages of a project. SLAT's use of abstracted annotations, i.e. segments and links, resolves many of the challenges presented in this paper, though some issues remain to be solved. Supporting annotators in assuring the consistency and quality of a corpus is a remaining challenge. The following is our research agenda for achieving this goal:
• introduction of batch operations for keeping consistency;
• annotation help based on the workflow context;
• retrieval of cases similar to the current annotation target;
• visual methods for reporting errors;
• mining annotation data from multiple annotators to find annotation tips.

6. Acknowledgment
This work is partially supported by the Grant-in-Aid for Scientific Research in Priority Areas "Japanese Corpus" (http://www.tokuteicorpus.jp).

7. References
Stefanie Dipper, Michael Götze and Manfred Stede. 2004. Simple Annotation Tools for Complex Annotation Tasks: an Evaluation. In Proceedings of the LREC Workshop on XML-based Richly Annotated Corpora, pages 54–62, Lisbon, Portugal.
Coreference Task Definition. 1995. The sixth in a series of Message Understanding Conferences (MUC-6). http://cs.nyu.edu/cs/faculty/grishman/COtask21.book5.html
Olga Babko-Malaya. 2005. PropBank Annotation Guidelines. http://verbs.colorado.edu/~mpalmer/projects/ace/PBguidelines.pdf
Tetsuro Takahashi and Kentaro Inui. 2006. A multi-purpose corpus annotation tool: Tagrin. In Proceedings of the 12th Annual Conference on Natural Language Processing, pages 228–231, Yokohama, Japan.
Christoph Müller. 2006. Representing and Accessing Multi-Level Annotations in MMAX2. In Proceedings of the 5th Workshop on NLP and XML (NLPXML-2006): Multi-dimensional Markup in Natural Language Processing, pages 73–76, Trento, Italy.
Constantin Orăsan. 2003. PALinkA: A highly customisable tool for discourse annotation. In Proceedings of the 4th SIGdial Workshop on Discourse and Dialogue, Sapporo, Japan.
Thomas Schmidt. 2001. The transcription system EXMARaLDA: An application of the annotation graph formalism as the basis of a database of multilingual spoken discourse. In Proceedings of the IRCS Workshop on Linguistic Databases, pages 11–13, Philadelphia, USA.