Parsing to Text Structure: the basis of a reversible natural language generation system

David McDonald, Brandeis University


How to get a source that is rich enough conceptually and structurally to generate from comfortably is a problem that has always vexed NLG researchers. Our answer is that you should build it yourself, a conclusion we have drawn after a decade of work on this problem from the perspective of natural language understanding and semantic representation. This is because we believe that the natural origin of the conceptual content for applications such as summarization or derived reports is human-authored text. Applying our information extraction system to these texts automatically renders their content into our preferred source representation for generation.

We go a step further than simple IE and use a system that recovers not just domain-level objects but also reconstructs the text structure that would have generated the texts. This puts us in a position (1) to mine sets of text structures with different realizations of the same object types to learn the suites of realization perspectives that human authors use in a given genre; and (2) to record the collective idiosyncrasies that govern the realization of the complex, multi-term relations found in everyday business text.

This talk will describe the architecture of our system: how it uses Tree Adjoining Grammar (TAG) as the representation of its linguistic resources for both parsing and surface realization; how it associates these resources with the type definitions in the domain model to automatically construct the semantic parsing grammar; how it uses the type definitions schematically to ensure the expressibility of individual objects and of known collections of objects; and how it uses a semantic representation of partially saturated relations to simplify the linguistic reasoning needed to produce contextually cohesive texts. Examples will be taken from the domain of corporate quarterly earnings.
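Two of these ideas can be suggested in miniature. The sketch below shows (1) a domain type definition that carries its own realization resource, so a single declaration yields both a surface pattern for the semantic parsing grammar and a template for generation, and (2) a partially saturated relation, where an argument already established in the discourse context is left unexpressed. This is not the actual grammar formalism (the real resources are TAG trees, not string templates); it is only meant to make the architecture concrete, and all names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class TypeDefinition:
    name: str
    roles: list   # argument roles, in canonical order
    schema: str   # realization schema over those roles

    def parsing_pattern(self) -> str:
        """Derive a surface pattern for the semantic parsing grammar."""
        return self.schema.format(**{r: f"<{r}>" for r in self.roles})

    def realize(self, bindings: dict) -> str:
        """Realize the relation; roles left unbound (given by context) are elided."""
        filled = {r: bindings.get(r, "") for r in self.roles}
        return " ".join(self.schema.format(**filled).split())

earnings = TypeDefinition(
    name="report-earnings",
    roles=["company", "amount", "period"],
    schema="{company} reported earnings of {amount} {period}",
)

print(earnings.parsing_pattern())
# <company> reported earnings of <amount> <period>

# Fully saturated relation:
print(earnings.realize({"company": "Acme", "amount": "$42 million",
                        "period": "for the third quarter"}))

# Partially saturated: the company is already the discourse topic,
# so that argument is not repeated in the realization.
print(earnings.realize({"amount": "$42 million",
                        "period": "for the third quarter"}))
```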