COBUILD: The Evolving Corpus – How corpus use has changed over the years

Size matters when it comes to corpora. At 220 million words of text, the corpus used to create the second edition of the COBUILD dictionary in 1995 was over ten times the size of the one used for the first edition, and 220 times bigger than the first electronic corpora developed in the 1960s and early 1970s. Yet it was tiny compared to those we use today, some of which amount to billions, not millions, of words.

To give an idea of the amount of information involved: suppose you are compiling a medium-frequency verb like proclaim. In the British National Corpus (BNC), which was fixed in 1993 at around 100 million words and has not been expanded since, there are just over 1,000 results (known as ‘citations’) for proclaim. It is possible, given time and the necessary expertise, to look at every one of these citations and give a good account of the word’s meanings and behaviour in a dictionary entry (or elsewhere).

But what about a high-frequency word like take (174,000 citations in the BNC) or hand (just under 50,000)? In today’s huge corpora the numbers are far greater: the most frequent words have tens of millions of citations, while some (and, the, he, have and so on) number in the hundreds of millions. Even relatively uncommon words can have tens of thousands of citations.

Concordance lines for chair, generated by the corpus

Fortunately there are software tools and other methods for efficiently extracting the information that corpora hold. Modern corpus search software gives an overall picture of a word by displaying it on the screen in a way that shows how it combines with other words. It shows the search word together with its collocates – the words it combines with most frequently – and tells you how significant these combinations are. Each collocational or grammatical chunk displayed can be expanded, allowing you to examine it in more detail if necessary.
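
To make the idea concrete, here is a minimal sketch in Python of the two views described above: keyword-in-context (concordance) lines and a ranked list of collocates. It is not the software used at COBUILD. The function names, the window sizes and the scoring method (a mutual-information-style score that compares observed co-occurrences with the number chance would predict) are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def kwic(tokens, node, width=3):
    """Return keyword-in-context lines: `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the search word lines up in a column.
            lines.append(f"{left:>30}  [{tok}]  {right}")
    return lines


def collocates(tokens, node, window=2):
    """Rank words found within `window` tokens of `node`.

    Returns (word, (co-occurrence count, score)) pairs, highest score first.
    The score is log2(observed / expected), an illustrative choice only.
    """
    total = len(tokens)
    word_freq = Counter(tokens)
    co_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for j in range(max(0, i - window), min(total, i + window + 1)):
            if j != i:
                co_freq[tokens[j]] += 1
    scores = {}
    for word, observed in co_freq.items():
        # Co-occurrences expected by chance, given both frequencies and the span.
        expected = word_freq[node] * word_freq[word] * (2 * window) / total
        scores[word] = (observed, math.log2(observed / expected))
    return sorted(scores.items(), key=lambda kv: kv[1][1], reverse=True)
```

Calling `kwic(tokens, "chair")` on a tokenised text would give lines of the kind pictured above, while `collocates(tokens, "chair")` would list the words that appear near chair more often than chance would suggest. On a very small text the scores mean little; they become informative only with corpus-sized data.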

The other essential tool in a lexicographer’s armoury is sampling. It was one of the insights of COBUILD’s founder, Professor John Sinclair, that you can tell a great deal about a word’s meanings and behaviour from a small representative sample of corpus citations: in many cases a screenful or two is enough. So, a combination of the overview of a word’s collocational and grammatical behaviour, together with a more detailed look at a small sample of lines, generally provides sufficient information to compile a new entry or revise an existing one.
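
Sampling of this kind is simple to sketch in Python. The example below assumes plain random sampling and a "screenful" of 40 lines; both are illustrative assumptions, and `sample_citations` is a hypothetical name rather than a function from any real corpus tool.

```python
import random


def sample_citations(citations, size=40, seed=None):
    """Return a random sample of concordance lines, kept in corpus order."""
    if len(citations) <= size:
        # Few enough citations to read them all.
        return list(citations)
    rng = random.Random(seed)
    # Sample positions rather than lines, then sort to preserve corpus order.
    chosen = sorted(rng.sample(range(len(citations)), size))
    return [citations[i] for i in chosen]
```

Passing a fixed `seed` returns the same sample each time, which helps when two lexicographers need to discuss the same set of lines.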

It is worth remembering that corpora could not have expanded to their current enormous size without fast computer connections. In my early days as a freelancer, when dial-up was the only type of connection widely available, you could literally start a corpus search for a frequent word, go away and make a cup of tea, and come back to find the search was still running. Today with high-speed broadband a search even for a very frequent word returns a result within a few seconds.

Corpora are used today in many different ways for different purposes on different dictionary projects. At its most basic, a corpus can provide authentic examples of how a word is used. At the other end of the scale, detailed corpus analysis continues to reveal new and surprising information about the collocational and grammatical behaviour of even the most familiar words. As new ways of using language come into being, a regularly updated corpus allows us to keep track of them. While the ways in which corpora are built and used have changed greatly over the past thirty years, it has become more or less unthinkable to compile or revise a dictionary without reference to the evidence provided by a corpus.


This blog post has been written by Liz Potter, who is a freelance lexicographer, editor and translator.

Find out more about our new editions of the Collins COBUILD dictionaries and other COBUILD materials here.

COBUILD: Shifting senses – How the meanings of words change

In the 30 years since the publication of the first COBUILD dictionary, a whole flurry of new words has come into the language and as they’ve caught on and become part of everyday usage, they’ve been added to the dictionary.

I’m not just talking about the trendy new coinages that occasionally hit the headlines; think omnishambles, binge-watch or post-truth. Most of these never become used widely or frequently enough to make it into a learner’s dictionary like COBUILD. The occasional exception is words like selfie, which appeared and became ubiquitous with surprising speed. More interesting, perhaps, are new uses of existing words which sneak into our vocabularies almost unnoticed.

In 1987, the meanings of post were all about mail, sticks in the ground, and jobs; there was no mention back then of the kind of post many of us now put on social media every day. Clicking was still just about making a noise, not something you do to a link, which itself was still a generic connection rather than a way of switching between webpages. And a thread was a piece of cotton or the flow of your argument, rather than a series of online comments.

The entry for click from the first edition (1987)

The entry for click from the ninth edition (2018)

Not all these shifts have been in the online world, either; other technological developments have also generated new usages. Back in the 80s, wireless was just an old-fashioned word for radio. Nowadays, you might have wireless headphones, speakers, microphones, or keyboards. If you talk about a hybrid now, you’re more likely to be referring to a type of semi-electric car than a plant or animal bred from two different species.

The entry for wireless from the first edition (1987)

The entry for wireless from the ninth edition

With each new edition of a dictionary, lexicographers keep an eye on corpus data to see what new words are coming into use, and also to pick up on new senses of familiar words. Like new coinages, new uses need to reach a certain frequency and distribution threshold for inclusion, first in larger native-speaker dictionaries and then, as they become more commonplace, in learner’s dictionaries too. Which words do you think might take on new meanings in the coming years?
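
As a rough illustration of what a frequency and distribution check might look like, here is a short Python sketch. The thresholds (one occurrence per million words, five distinct sources) and the function name are invented for the example; the figures real dictionary projects use are not given here and will differ.

```python
def meets_threshold(hit_sources, corpus_size, min_per_million=1.0, min_sources=5):
    """Check a candidate word or sense against illustrative inclusion thresholds.

    hit_sources: one source identifier (e.g. a document name) per citation.
    corpus_size: total number of words in the corpus.
    """
    # Frequency: citations per million words of corpus text.
    per_million = len(hit_sources) * 1_000_000 / corpus_size
    # Distribution: how many different sources the citations come from.
    distinct_sources = len(set(hit_sources))
    return per_million >= min_per_million and distinct_sources >= min_sources
```

The distribution test matters because a word used hundreds of times by a single writer is frequent without being widespread.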

Photographs taken from shutterstock.com


This blog post has been written by Julie Moore, who is an ELT lexicographer and materials writer.

COBUILD: The early years. Part 2 – A dictionary from a corpus

By the time I arrived at COBUILD as part of the 1993 intake recruited to work on the second edition of the dictionary, the whole project had been fully computerised for several years. This meant working on screen at terminals linked to mainframe computers that hummed away in a separate room, still with the green text on a black background, as described by Andrew Delahunty in Part 1. The mainframe computers were named after Shakespeare characters – Titania was one – and would occasionally overheat and need time to recover, giving us the afternoon off.

A mainframe computer, similar to those used at the University of Birmingham in the 1990s

There was a pleasing contrast between the high-tech, cutting-edge nature of the project and the elegant Victorian building where we worked, with its large sash windows overlooking a beautiful garden where we would sometimes eat our lunch in the summer. It was also a great place for seminars and parties, both of which would bring in members of the University of Birmingham’s English department, to which COBUILD was attached, and of the wider university.

Compiling on screen using a purpose-built text editor required the acquisition of a whole new set of skills, since I had only ever worked on paper; but what really blew my mind was the corpus. Previously I had only seen concordances – the output of a corpus – on paper, since on my previous project we were able to request a printed sample of lines for particularly tricky entries. Engaging at close quarters with the corpus was a revelation. I was almost paralysed for several weeks, overwhelmed by the quantity and quality of the data I was expected to process. This corpus – soon to be rebranded as The Bank of English – was tiny by today’s standards, but the insights it provided into the behaviour of English were like nothing I had ever come across before.

Concordance lines for chair, generated by the corpus

At COBUILD we worked with the corpus in a way I have never known it to be used anywhere else. Using specially developed software, we lexicographers (and grammarians) would analyse the evidence for the word we were compiling. We would then base our revisions of existing entries from the first edition, as well as all the new entries and senses we were adding, on that evidence. We were a large team and there was always a colleague available to discuss problematic entries or tricky decisions on how to divide up senses, but the evidence provided by the corpus was the basis of everything we did. I don’t think we ever looked at another learner’s dictionary. It sounds horribly arrogant, but we had no need to; we had all the material we needed right there in front of us.

I have worked on many corpus-based dictionaries and other projects since, and I rarely work on a dictionary that does not use corpus evidence to some degree. A corpus is always my first port of call when I encounter a new word or meaning. However, I think the COBUILD dictionary remains unique in being based so directly and completely on what only a corpus can give, which is evidence of how the language actually works.


This blog post has been written by Liz Potter, who is a freelance lexicographer, editor and translator.

COBUILD: Design and Layout – Changes over the last 30 years

Where were you 30 years ago? I was in the middle of my university studies, still to embark on my ELT career, and as such, a smidgin too late to be part of the intrepid and free-spirited COBUILD dictionary team. Led by the late John Sinclair, this large young team was involved in bringing to life his vision: to create a dictionary for learners that was based on a large digital language database – or a corpus. The corpus would be used for analysing word frequencies, for identifying new uses, collocations, colligation, connotation, and typical contexts for words and phrases. Definitions would be written in full sentences in the type of everyday English a teacher might use to explain a word to a learner, with the added advantage that users would see how the word would work in a sentence.

Looking back at the pages of that first edition, you might be struck by the density of the page design. It seems that we now need our text to be broken up with white space, boxes and varying fonts and colours: our modern brains seem to need a bit of a break between lines and entries. Was my intrepid and free-spirited self really so much better at reading tiny words all squished together on a page? Well, the answer is probably yes, as I remember my first encounters with COBUILD dictionaries were ones of delight; I don’t recall thinking ‘What? How do you expect me to wade through all of that?’

A page from the first edition of COBUILD Advanced Learner’s Dictionary

The other feature that jumps out at us from the pages of the first edition is the ‘extra column’. This was a narrow column down the right-hand side of each main column of dictionary text. It provided information on parts of speech and typical syntactical patterns, such as ‘V + O’ (= verb plus object) for transitive verbs, so that students didn’t have to search through the denser dictionary text for this type of information. Parts of speech were very specific; for example, adjectives might be ADJ CLASSIF: ATTRIB (a classifying adjective that occurs in attributive position) or ADJ QUALIT (a qualitative adjective), and verbs could be V ERG (ergative verb), v-link (linking verb) or V + O (transitive verb). The user can see the examples of use in the main dictionary text next to this information.

See the ‘extra column’ information for accusation below:

An extract from the entry accusation from the first edition

The ‘extra column’ information here means:

Accusation is a countable noun. If it’s followed by a preposition, then that preposition is of or against (e.g. accusations of cheating). It can also be followed by a reporting clause, as in The accusation against us was that we were biased.

COBUILD’s ‘extra column’ was something of a showcase for the incredible amount of hard work that lexicographers and grammarians put into analysing the newly-built corpus. It told us all sorts of previously undocumented facts about how the English language works.

Sadly, though, the extra column was not to survive. Market research told us that most learners did not read or even understand the vast majority of information in the extra column, and in 2008 it was quietly put out to grass. The information in the extra column was re-worked with the modern learner in mind. The reintegration of much of the material into the main text meant that the main columns could be widened and more words and meanings could be covered in the same number of pages.

So, what does our mature 30-year-old dictionary look like now? Well, it has grown into an incredibly user-friendly go-to treasure trove of the English language, thanks to its sophisticated font design, useful information boxes, colourful images, and plenty of restful white space. It has a hugely popular online sibling, available at www.collinsdictionaries.com, and has inspired learners and lexicographers alike to use corpora to continue to learn ever-more fascinating facts about our language. Happy 30th birthday, COBUILD!

A page from the ninth edition, published in 2018


This blog post has been written by Penny Hands, who is an ELT lexicographer and materials editor.

COBUILD: The early years. Part 1 – Where it all began

I have always counted myself as incredibly fortunate to have worked as part of the COBUILD team at the time that I did, between October 1983 and the end of 1986. I was not quite 24 when I arrived in Birmingham, not knowing one end of a dictionary definition from another. By the time I left I was pretty sure lexicography was going to be my career and, over 30 years later, I’m still doing it.

My three years at COBUILD spanned the move in compiling practice from paper to computer. For the first year or so we were writing out individual dictionary entries on slips of paper, in much the same way that Samuel Johnson, James Murray, and all our other illustrious predecessors had done before us.

Pink slips, one for each sense of a headword, carried the definition together with accompanying syntactic (and other) information. White slips were for individual example sentences selected from the corpus, all 7.3 million words of it, together with any example-specific information that needed recording. I can remember laying out on the floor a mosaic of hundreds of slips for a long, complicated word like live or way, shuffling all these meanings around into various groupings in an effort to settle on the best arrangement. An academic paper could be written on the role of the floor in lexicography. Once compiled, entries were then typed up into the dictionary database.

The slip of paper used for compiling the entry veritable

The equivalent entry for veritable once typed up into the dictionary database

The corpus, the primary evidence for all our observations about the language, may have been created computationally, but initially we consulted it on paper too, in the form of printed-out concordances. In the early days of compiling we’d often highlight with coloured felt-tip pens individual concordance lines that illustrated different meanings of a word.

Concordances for veritable from the 7.3 million word corpus

Within a year or so, lexicographers were compiling and editing text directly into the dictionary database on newly-installed computer terminals, displaying green text on a black background. Compiling a dictionary on a computer was hugely innovative at the time but within a few years this would become the norm. So, I have a real sense of having been present at a moment of transition as one great lexicographical tradition was coming to an end and another was taking its first steps.

We were a fairly young team and for many of us this was our first experience of lexicography. So, I didn’t then have much to compare it with, in terms of methods and approach. But there was certainly a palpable buzz about the place. We knew we were doing something new. In some ways, it was only once the dictionary was published that I began to appreciate quite how radical and groundbreaking the COBUILD project was.

Look out for Part 2 coming soon…


This blog post has been written by Andrew Delahunty, who is a freelance lexicographer, dictionary editor and reference book author with almost 35 years’ experience.
