Generative Lexicon Theory

Research in Generative Lexicon Theory focuses on the computational and cognitive modeling of natural language meaning: more specifically, on how words and their meanings combine to form meaningful texts. We have been developing a lexically oriented theory of semantics, grounded in the methods of formal and computational semantics. That is, we ask how word meaning in natural language can be characterized both formally and computationally, in order to account both for the subtle use of words across different sentences and for the creative use of words in novel contexts. One of the major goals of our current research, therefore, is to study polysemy, ambiguity, and sense-shifting phenomena across languages.
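To give a flavor of what a formal, computational characterization of word meaning can look like, the sketch below encodes a simplified qualia structure for a noun and uses it to resolve a sense-shifting construction such as "begin a book". The class names, fields, and coercion rule are illustrative assumptions for this example, not the project's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Qualia:
    """Simplified qualia structure for a noun (Generative Lexicon style)."""
    formal: str        # what kind of thing it is
    constitutive: str  # what it is made of
    telic: str         # its purpose (an event type)
    agentive: str      # how it comes into being (an event type)

BOOK = Qualia(
    formal="physical_object * information",
    constitutive="pages",
    telic="read",      # books are for reading
    agentive="write",  # books come about by writing
)

def coerce_to_event(verb: str, noun_qualia: Qualia) -> str:
    """Resolve 'aspectual verb + entity noun' by coercing the noun to an
    event drawn from its qualia structure (type coercion, roughly sketched)."""
    if verb in {"begin", "finish", "enjoy"}:
        # Default to the telic quale: 'begin a book' ~ 'begin reading a book'.
        return f"{verb} {noun_qualia.telic}ing the book"
    return f"{verb} the book"

print(coerce_to_event("begin", BOOK))   # begin reading the book
print(coerce_to_event("finish", BOOK))  # finish reading the book
```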

  • PI: James Pustejovsky

Communicating with Computers

The Communicating with Computers (CwC) program aims to reimagine computers not only as tools but as collaborators working toward common goals. To that end, we are exploring how ideas can be conveyed between humans and computers using various communicative modalities, such as language, gesture, visualization, and action, and how simple ideas can be composed into more complex ones and interpreted in context. The core of our CwC work centers on modeling composable object and event semantics in a multimodal, real-time simulation environment that serves as the semantic common ground between a human and a computer. This environment and its multimodal semantics facilitate a number of shared tasks, including collaborative structure building, curating biological databases, and the composition of stories and music. We use multimodal simulation as the scaffold for automatically learning object and event properties, with the goal of interacting with robots and semi-autonomous systems.
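As a rough illustration of what "composable object and event semantics" can mean, the sketch below gives objects a set of affordances and builds a composite event ("put X on Y") out of primitive actions over them. The names, data structures, and actions here are hypothetical and are not the actual CwC simulation environment.

```python
from dataclasses import dataclass, field

@dataclass
class Obj:
    name: str
    affordances: set = field(default_factory=set)  # actions the object supports
    position: tuple = (0.0, 0.0, 0.0)

def grasp(agent_state: dict, obj: Obj) -> dict:
    assert "grasp" in obj.affordances, f"{obj.name} cannot be grasped"
    return {**agent_state, "holding": obj.name}

def move_to(agent_state: dict, obj: Obj, target: tuple) -> dict:
    obj.position = target
    return agent_state

def put_on(agent_state: dict, obj: Obj, dest: Obj) -> dict:
    """Composite event: 'put obj on dest' = grasp(obj); move to top of dest; release."""
    state = grasp(agent_state, obj)
    x, y, z = dest.position
    state = move_to(state, obj, (x, y, z + 0.1))  # place just above the destination
    state["holding"] = None
    return state

block = Obj("block1", affordances={"grasp"})
table = Obj("table", position=(1.0, 0.0, 0.0))
print(put_on({"holding": None}, block, table))
```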

  • PI: James Pustejovsky

Language Application Grid

The Language Application Grid (LAPPS Grid) is an open web service platform for natural language processing (NLP) research and development. Together with researchers at Vassar College, Carnegie Mellon University, and the Linguistic Data Consortium at the University of Pennsylvania, we are working toward interoperability among language resources and NLP tools. Specifically, we are developing standards for the interchange of linguistic objects among tools with different input and output formats, including evaluation tools based on the Open Advancement framework, and an easy-to-use interface that lets users combine NLP tools from various sources into their own customized pipelines.
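The pipeline idea can be pictured as follows: each tool is wrapped to read and write one shared interchange payload, so heterogeneous services can be chained freely. The wrapper functions and payload layout below are illustrative stand-ins, not the LAPPS Grid's actual interchange format or API.

```python
# Hypothetical sketch: tools with different native formats are wrapped so that
# they all consume and produce one shared payload and can be chained freely.

def tokenizer(payload: dict) -> dict:
    tokens = payload["text"].split()
    payload["annotations"].append({"type": "Token", "items": tokens})
    return payload

def pos_tagger(payload: dict) -> dict:
    tokens = next(a["items"] for a in payload["annotations"] if a["type"] == "Token")
    tags = [(t, "NN") for t in tokens]  # trivial placeholder tagger
    payload["annotations"].append({"type": "POS", "items": tags})
    return payload

def run_pipeline(text: str, steps) -> dict:
    payload = {"text": text, "annotations": []}  # shared interchange object
    for step in steps:
        payload = step(payload)
    return payload

result = run_pipeline("The grid connects tools", [tokenizer, pos_tagger])
print(result["annotations"][-1])
```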

  • PI: James Pustejovsky

Integrating the Generative Lexicon into VerbNet

VerbNet is a comprehensive lexicon of English verbs that categorizes verbs according to the syntactic structures they allow and the semantic restrictions they place on their arguments. It also contains a basic event semantics in which events are divided into a beginning, middle, and end. We are currently augmenting the VerbNet representations of verbs with their event structures from Generative Lexicon Theory. Specifically, we propose a compositional model in which an event consists of some number of subevents, each characterized by predicates over the verb’s arguments. We believe that more detailed models of subevent structure can help distinguish the meanings of verbs in the same VerbNet class and help explain the semantics of verb polysemy in different contexts.
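For instance, a change-of-state verb like break can be decomposed into an initial state, a causing subevent, and a result state, each a bundle of predicates over the verb's arguments. The decomposition below is an illustrative sketch with invented predicate names, not the actual VerbNet or GL representation.

```python
def break_event(agent: str, patient: str) -> dict:
    """Subevents e1 < e2 < e3 of 'Agent breaks Patient', each a list of
    predicates over the verb's arguments (illustrative names only)."""
    return {
        "e1": [("has_state", patient, "intact")],
        "e2": [("do", agent, patient), ("cause", "e2", "e3")],
        "e3": [("has_state", patient, "broken")],
    }

for subevent, predicates in break_event("Mary", "the vase").items():
    print(subevent, predicates)
```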

  • PI: James Pustejovsky

ISO-TimeML: Temporal Markup Language

TimeML is a robust specification language for events and temporal expressions in natural language, designed to address four problems in event and temporal expression markup. It was developed in the context of three AQUAINT workshops and projects. The 2002 TERQAS workshop set out to enhance natural language question answering systems to answer temporally based questions about the events and entities in news articles; the first version of TimeML was defined there, and the TimeBank corpus was created as an illustration. TANGO was a follow-up workshop in which a graphical annotation tool was developed. The TARSQI project developed algorithms that tag events and time expressions in NL texts and temporally anchor and order the events.
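For illustration, the snippet below builds a tiny TimeML-style fragment (an EVENT, a TIMEX3, and a TLINK anchoring the event to the time) and reads it back with a standard XML parser. The sentence, IDs, and attribute values are invented for the example.

```python
import xml.etree.ElementTree as ET

# A tiny TimeML-style fragment: one event, one time expression, and a link
# anchoring the event to the time.  The sentence and IDs are invented.
fragment = """
<TimeML>
  The company <EVENT eid="e1" class="OCCURRENCE">announced</EVENT> the merger
  <TIMEX3 tid="t1" type="DATE" value="2004-06-12">on June 12, 2004</TIMEX3>.
  <TLINK lid="l1" eventID="e1" relatedToTime="t1" relType="IS_INCLUDED"/>
</TimeML>
"""

root = ET.fromstring(fragment)
for link in root.iter("TLINK"):
    print(link.get("eventID"), link.get("relType"), link.get("relatedToTime"))
# -> e1 IS_INCLUDED t1
```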

The TARSQI project allowed developers and analysts to sort and organize information in NL texts based on its temporal characteristics. Specifically, we developed algorithms that tag mentions of events, tag and normalize time expressions, and temporally anchor and order the events. We also developed temporal reasoning algorithms that operate on the resulting event-time graph for each document. These include a graph query capability that can, for example, find when a particular event occurs or which events occur in a given time period; a temporal closure algorithm that gives queries more complete coverage by using the transitivity of temporal precedence and inclusion relations to insert additional links into the graph; and a timelining algorithm that provides chronological views, at various granularities, of an event graph as a whole or of a region of it. We also developed a capability to compare event graphs across documents. Finally, we developed a model of the typical durations of various kinds of events.
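The closure step can be pictured as repeatedly applying transitivity over the event-time graph until no new links can be inferred. The minimal sketch below handles only BEFORE and INCLUDES links and is not the TARSQI implementation.

```python
def temporal_closure(links: set) -> set:
    """Add links implied by transitivity of BEFORE and INCLUDES until a fixed point."""
    closure = set(links)  # each link is (source, relation, target)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(closure):
            for (c, r2, d) in list(closure):
                if b == c and r1 == r2 and r1 in {"BEFORE", "INCLUDES"}:
                    new_link = (a, r1, d)
                    if new_link not in closure:
                        closure.add(new_link)
                        changed = True
    return closure

links = {("e1", "BEFORE", "e2"), ("e2", "BEFORE", "e3"), ("t1", "INCLUDES", "e1")}
print(temporal_closure(links))
# infers ("e1", "BEFORE", "e3") in addition to the input links
```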

  • PI: James Pustejovsky

ISO-Space: Spatiotemporal Reasoning

The goal of the ISO-Space research is to advance representational and algorithmic support for spatiotemporal reasoning over natural language text in the service of practical applications. One such task is tracking the movements of individuals; automated support for this task can be vital for national security. To create such support, we used lexical resources to integrate two existing annotation schemes, creating a new representation that captures, in a fine-grained manner, the movement of individuals through spatial and temporal locations. This integrated representation can be extracted automatically from natural language documents using symbolic and machine learning methods.

A second challenge we address is translating subjective verbal descriptions of spatial relations into metrically meaningful positional information, and extending this capability to spatiotemporal monitoring. Document collections, transcriptions, cables, and narratives routinely refer to objects moving through space over time. Integrating such information derived from textual sources into a geosensor data system can enhance the overall spatiotemporal representation in changing and evolving situations, such as when tracking objects through space with limited image data.
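As a toy illustration of turning a verbal spatial description into metric information, the sketch below maps a qualitative relation such as "near" onto a distance bound around a known landmark and checks an observed position against it. The thresholds and relation names are invented for the example, not the project's actual mapping.

```python
import math

# Invented thresholds (in meters) for mapping vague spatial terms to metric bounds.
RELATION_BOUNDS = {"at": 50.0, "near": 500.0, "outside": float("inf")}

def consistent(relation: str, observed: tuple, landmark: tuple) -> bool:
    """Check whether an observed position is consistent with a verbal description
    like 'near the station', given a simple distance threshold."""
    dx = observed[0] - landmark[0]
    dy = observed[1] - landmark[1]
    return math.hypot(dx, dy) <= RELATION_BOUNDS[relation]

station = (0.0, 0.0)
print(consistent("near", (120.0, 300.0), station))  # True: within 500 m
print(consistent("at", (120.0, 300.0), station))    # False: farther than 50 m
```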

  • PI: James Pustejovsky

Speech Act Modifiers, Clause Types, and Indirectness

Often, the ultimate meaning of an utterance differs from its literal, direct meaning. For instance, the utterance "Can you pass me the salt?" looks like a question, but can be used as a request. The study of indirectness allows us to investigate the contributions of linguistic structure, cultural convention, and inference to the ultimate use of utterances.

Languages often provide speakers with ways of signalling the ultimate meaning of an utterance to their hearers. These clues to the speaker’s meaning include the structure of the utterance (its clause type), its intonation, and various words or phrases that can be attached to the utterance. Because speech acts, clause types, and utterance modifiers open a window onto indirectness in language, we study clause types, tag questions, and rising intonation in English; clause types and a discourse particle in Mandarin; requests in English, Russian, and Heritage Russian; and the politeness marker please in English and Russian.

  • PI: Sophia Malamud

Formal Models of Pragmatic Computation

Early work in pragmatics often lacked the precision and falsifiability that come with a formal framework. More recent efforts have sought to integrate pragmatic factors into a more comprehensive formal theory of natural language interpretation. We extend this line of research, contributing to a principled account of the interdependence of context and the truth conditions of natural language expressions, and to the development of new formal tools for investigating the interaction of semantics and pragmatics. Specifically, we use decision theory and game theory to model underspecification in English definite plurals and in indirect speech acts.
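To give a flavor of the kind of computation such a model involves, the sketch below has a speaker choose among utterances by weighing a simple listener's interpretation probabilities against utterance cost. The utterances, probabilities, and costs are invented for the example, and this is only a generic decision-theoretic sketch, not the project's specific model.

```python
# Hypothetical utterances, the meanings each is compatible with, and utterance costs.
UTTERANCES = {
    "Pass the salt.":         ({"request"}, 1.0),
    "Can you pass the salt?": ({"request", "question"}, 1.2),
}

def listener(utterance: str) -> dict:
    """Literal listener: uniform probability over the meanings an utterance allows."""
    meanings, _ = UTTERANCES[utterance]
    return {m: 1.0 / len(meanings) for m in meanings}

def speaker_choice(intended: str) -> str:
    """Speaker picks the utterance maximizing P(intended | utterance) minus a small cost."""
    def utility(u):
        return listener(u).get(intended, 0.0) - 0.05 * UTTERANCES[u][1]
    return max(UTTERANCES, key=utility)

print(speaker_choice("request"))  # the direct form wins under these toy numbers
```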

  • PI: Sophia Malamud

Building Corpora of Spoken Language

Research on language meaning and use must be based on data that allows researchers to see linguistic structure, meaning, and context. Moreover, pragmatic phenomena such as indirectness emerge most clearly in spoken interaction. As research in other subfields of linguistics has shown, large collections of language data annotated with information about linguistic structure can bring about major advances. For instance, parsed corpora of historical English (Kroch & Taylor 1999, Taylor et al. 2003, Kroch et al. 2004) led to groundbreaking discoveries about the processes that shaped present-day English and gave linguists a greater understanding of the very nature of language change.

The need for similar resources geared towards the study of semantics and pragmatics has become urgent. We are conducting methodological studies and building a corpus of the speech of bilingual and monolingual Russian children and their families (the BiRCh corpus), as well as corpora of spoken Russian narratives, a corpus of spoken Heritage Russian, and a corpus of spoken Hindi-Urdu.

  • PI: Sophia Malamud