Tuesday, 26 August 2008

Context-sensitive constructions in English

Written by Martin Kleppmann on Tuesday, 26 August 2008, 13:25 GMT.
Filed under: techie notes.

Anybody who has read about basics of computational linguistics (i.e. the cross-over area between language and computing) will have come across the (very useful) definition of a context-free grammar. Many computer-readable languages such as programming languages are designed to be context-free, since this makes them more manageable for both the computers and the humans operating them.

There are also grammars which are more powerful than context-free grammars, and which allow more complex statements; the next level of complexity are context-sensitive grammars.

Ever since these mathematical definitions were made, people have wondered how our human languages fit into this framework. There has been a lot of discussion about this in the academic community, and currently the general opinion seems to be that most human languages are more or less context-free, but there are a few exceptions, such as a context-sensitive construction in Swiss German.

In my work at the moment I am trying to help improve the accuracy of computer understanding of certain statements in natural language. We get fragments of text off the web and want to automatically extract structured information from it, which can then be used in other parts of the program.

And in this context I have come across constructions in English which, annoyingly, cannot be defined using context-free grammars. Two examples are:

“The square roots of 16, 9 and 4 are 4, 3 and 2, respectively.”

“The wardrobe has dimensions of 230×120x50 cm (HxWxD).”

In both cases, the problem is that there is a linear matching-up which needs to occur: in the first example, 16 is matched with 4, 9 is matched with 3, and 4 is matched with 2. In the second example, 230 is matched with H (height), 120 with W (width) and 50 with D (depth).

I’m still in the process of figuring out how to best analyse this kind of sentence. Because it’s context-sensitive, a lot of the typical parsing libraries won’t do it out of the box. Which adds to the fun :)

2 Comments »

  1. Interesting. At first this got to the philosopher in me, but then I realised it could just a pattern matching problem.

    It looks like the first is different to the second in parsing terms. The first is recognisable as two sets of collerated data simply because in English ‘respectively’ is rarely used except to flag the construct. You could then fairly easily recognise the separators of ‘,’ and ‘and’.

    The second is possibly easier to parse once you understand that it’s two sets of collerated data, but getting there could be tricky - and way too often people assume the reader can guess the measurement units from the context.

    What are these typical parsing libraries?

    Comment by Andrew Bruce — Friday, 29 August 2008, 11:12 GMT

  2. Andrew,

    I agree that if the program can recognise the patterns which indicate one of these parallel constructions (e.g. ‘respectively’), it would be easy to write a piece of code which matches them up. The curious thing I found here is just that it can’t be done from within the context-free grammar; it has got to happen in a second post-processing step based on the syntax tree and additional rules.

    The parsing library I am using here is pyparsing (http://pyparsing.wikispaces.com/), a remarkably elegant library, but most libraries use something which is more or less context-free.

    Comment by Martin Kleppmann — Tuesday, 2 September 2008, 18:17 GMT

RSS feed for comments on this post. TrackBack URL

Leave a comment