
COBUILD: The Evolving Corpus – How corpus use has changed over the years

Size matters when it comes to corpora. At 220 million words of text, the corpus used to create the second edition of the COBUILD dictionary in 1995 was over ten times the size of the one used for the first edition, and 220 times bigger than the first electronic corpora developed in the 1960s and early 1970s. Yet it was tiny compared to those we use today, some of which amount to billions, not millions, of words.

To give an idea of the amount of information involved: suppose you are compiling an entry for a medium-frequency verb like proclaim. In the British National Corpus (BNC), which was fixed in 1993 at around 100 million words and has not been expanded since, there are just over 1,000 results (known as ‘citations’) for proclaim. It is possible, given time and the necessary expertise, to look at every one of these citations and give a good account of the word’s meanings and behaviour in a dictionary entry (or elsewhere).

But what about a high-frequency word like take (174,000 citations in the BNC) or hand (just under 50,000)? In today’s huge corpora the numbers are far greater: the most frequent words have tens of millions of citations, while some (and, the, he, have and so on) number in the hundreds of millions. Even relatively uncommon words can have tens of thousands of citations.

Concordance lines for chair, generated by the corpus


Fortunately there are software tools and other methods for efficiently extracting the information that corpora hold. Modern corpus search software gives an overall picture of a word by displaying it on the screen in a way that shows how it combines with other words. It shows the search word together with its collocates – the words it combines with most frequently – and tells you how significant these combinations are. Each collocational or grammatical chunk displayed can be expanded, allowing you to examine it in more detail if necessary.
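To make the idea concrete, here is a minimal Python sketch of the two views described above: concordance lines for a search word, and a ranked list of its collocates. It is an illustration only, not the software used at COBUILD. The function names, the window sizes and the choice of a simplified pointwise mutual information (PMI) score as the measure of significance are all assumptions made for this example; real corpus tools use a range of statistics and far larger texts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, node, width=3):
    """Return keyword-in-context lines: `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines


def collocates(tokens, node, window=2):
    """Rank words found near `node` by a simplified PMI score (an assumption)."""
    freq = Counter(tokens)
    near = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Count every token within `window` positions of the search word.
            span = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            near.update(span)
    total = len(tokens)
    scores = {}
    for word, joint in near.items():
        # Co-occurrences expected if the two words were unrelated.
        expected = freq[node] * freq[word] / total
        scores[word] = math.log2(joint / expected)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# A toy "corpus" invented for this example.
sample_text = (
    "She took the chair. He will chair the meeting. "
    "The chair of the committee spoke."
)
sample_tokens = tokenize(sample_text)

for line in concordance(sample_tokens, "chair", width=2):
    print(line)

for word, score in collocates(sample_tokens, "chair"):
    print(f"{word}\t{score:.2f}")
```

On a text this small the scores mean very little. The point is only to show the shape of the output: each search word comes back with its context and with a list of neighbours ordered by how strongly they are associated with it.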

The other essential tool in a lexicographer’s armoury is sampling. It was one of the insights of COBUILD’s founder, Professor John Sinclair, that you can tell a great deal about a word’s meanings and behaviour from a small representative sample of corpus citations: in many cases a screenful or two is enough. So, a combination of the overview of a word’s collocational and grammatical behaviour, together with a more detailed look at a small sample of lines, generally provides sufficient information to compile a new entry or revise an existing one.
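Sampling can be sketched in a few lines of Python. This is an assumption-laden illustration: the post does not say how samples were drawn, so simple random sampling is used here, and the function name, the default size of 40 lines (standing in for "a screenful or two") and the seed parameter are all hypothetical.

```python
import random


def sample_citations(citations, size=40, seed=None):
    """Return a random sample of citations, kept in their original corpus order."""
    if len(citations) <= size:
        # Few enough lines to read them all.
        return list(citations)
    rng = random.Random(seed)
    # Pick distinct positions, then sort so the sample reads in corpus order.
    picked = sorted(rng.sample(range(len(citations)), size))
    return [citations[i] for i in picked]


# Placeholder citations invented for this example.
all_lines = [f"citation {n}" for n in range(1, 1001)]
screenful = sample_citations(all_lines, size=40, seed=1)
print(len(screenful))
```

Passing a seed makes the sample repeatable, which is useful if two people need to look at the same lines.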

It is worth remembering that corpora could not have expanded to their current enormous size without fast computer connections. In my early days as a freelancer, when dial-up was the only type of connection widely available, you could literally start a corpus search for a frequent word, go away and make a cup of tea, and come back to find the search was still running. Today, with high-speed broadband, a search for even a very frequent word returns a result within a few seconds.

Corpora are used today in many different ways for different purposes on different dictionary projects. At its most basic, a corpus can provide authentic examples of how a word is used. At the other end of the scale, detailed corpus analysis continues to reveal new and surprising information about the collocational and grammatical behaviour of even the most familiar words. As new ways of using language come into being, a regularly updated corpus allows us to keep track of them. While the ways in which corpora are built and used have changed greatly over the past thirty years, it has become more or less unthinkable to compile or revise a dictionary without reference to the evidence provided by a corpus.


This blog post has been written by Liz Potter, who is a freelance lexicographer, editor and translator.

Find out more about our new editions of the Collins COBUILD dictionaries and other COBUILD materials here.

COBUILD: The early years. Part 2 – A dictionary from a corpus

By the time I arrived at COBUILD as part of the 1993 intake recruited to work on the second edition of the dictionary, the whole project had been fully computerised for several years. This meant working on screen at terminals linked to mainframe computers that hummed away in a separate room, still with the green text on a black background, as described by Andrew Delahunty in Part 1. The mainframe computers were named after Shakespeare characters – Titania was one – and would occasionally overheat and need time to recover, giving us the afternoon off.

A mainframe computer, similar to those used at the University of Birmingham in the 1990s


There was a pleasing contrast between the high-tech, cutting-edge nature of the project and the elegant Victorian building where we worked, with its large sash windows overlooking a beautiful garden where we would sometimes eat our lunch in the summer. It was also a great place for seminars and parties, both of which would bring in members of the University of Birmingham’s English department, to which COBUILD was attached, and of the wider university.

Compiling on screen using a purpose-built text editor required the acquisition of a whole new set of skills, since I had only ever worked on paper; but what really blew my mind was the corpus. Previously I had only seen concordances – the output of a corpus – on paper, since on my previous project we were able to request a printed sample of lines for particularly tricky entries. Engaging at close quarters with the corpus was a revelation. I was almost paralysed for several weeks, overwhelmed by the quantity and quality of the data I was expected to process. This corpus – soon to be rebranded as The Bank of English – was tiny by today’s standards, but the insights it provided into the behaviour of English were like nothing I had ever come across before.

Concordance lines for chair, generated by the corpus


At COBUILD we worked with the corpus in a way I have never known it to be used anywhere else. Using specially developed software, we lexicographers (and grammarians) would analyse the evidence for the word we were compiling. We would then base our revisions of existing entries from the first edition, as well as all the new entries and senses we were adding, on that evidence. We were a large team and there was always a colleague available to discuss problematic entries or tricky decisions on how to divide up senses, but the evidence provided by the corpus was the basis of everything we did. I don’t think we ever looked at another learner’s dictionary. It sounds horribly arrogant, but we had no need to; we had all the material we needed right there in front of us.

I have worked on many corpus-based dictionaries and other projects since, and I rarely work on a dictionary that does not use corpus evidence to some degree. A corpus is always my first port of call when I encounter a new word or meaning. However, I think the COBUILD dictionary remains unique in being based so directly and completely on what only a corpus can give, which is evidence of how the language actually works.


This blog post has been written by Liz Potter, who is a freelance lexicographer, editor and translator.
