GLOSSARY OF CORPUS LINGUISTICS


Bank of English

The Bank of English is the name of Cobuild's collection of corpora. Detailed information can be found on the respective page on the Cobuild WWW server.

character

This is a term used in computing to mean roughly a letter of an alphabet; but a set of characters also includes punctuation marks and on computer keyboards some shapes that are not found in ordinary typing or printing.

citation

A citation is a selected example of a word or phrase in use. The term comes from lexicography, and citations form the basic evidence for most major dictionaries. A citation can be drawn from the spoken or the written language, and it consists of a word or phrase in a sufficient textual environment to show some feature of its meaning or use.

A set of citations of one particular word or phrase is similar to a concordance. However, citations are usually selected by people because of an interesting feature of the occurrence, and so they lack the objectivity of a concordance. From now on, it is likely that new dictionaries will be based on concordances.

Older original dictionaries are based on collections of citations, and the collections can become very large over the years. A collection of citations gives important evidence about a language but should not be confused with a corpus, which is a set of representative texts.

Cobuild

Cobuild is an acronym for COllins Birmingham University International Language Database. This is a joint project between industry (HarperCollins Publishers) and the University of Birmingham, which began in 1980. A large corpus of contemporary English was gathered from spoken and written sources, and each word in turn was studied for its lexical, grammatical, semantic, stylistic and pragmatic features. The information was entered into a database from which were edited the Cobuild dictionaries and other publications.


collocate

A word which occurs in close proximity to a word under investigation is called a collocate of it.

collocation

Collocation is the occurrence of two or more words within a short space of each other in a text. The usual measure of proximity is a maximum of four words intervening. Collocations can be dramatic and interesting because unexpected, or they can be important in the lexical structure of the language because of being frequently repeated.

This second kind of collocation, often related to measures of statistical significance, is the one that is usually meant in linguistic discussions. There are three useful technical terms in the description of a collocation. Each citation or concordance line exemplifies a particular word or phrase. This word or phrase is called the node. It is normally presented with other words to the left and right and these are called collocates. The collocates can be counted and this measurement is called the span.

Collocation in its purest sense, as used in this book, recognises only the lexical co-occurrence of words. This kind of patterning is often associated with grammatical choices as well, leading to the wealth of idioms and fixed phrases that are found in everyday English. Some writers on collocation (e.g. Kjellmer 1984) include matters of grammatical relations in their specification of collocation.

Attention is concentrated here on lexical co-occurrence, more or less independently of grammatical pattern or positional relationship. Collocation patterns are restricted here to pairs of words, but there is no theoretical restriction on the number of words involved. Collocation is a contributing factor to idiom.

concordance

A concordance is an index to the words in a text. The concordance is at the centre of corpus linguistics, because it gives access to many important language patterns in texts. Concordances of major works such as the Bible and Shakespeare have been available for many years; an example of a concordance is given here.

The computer has made concordances easy to compile, and for some forty years a simple and effective convention called KWIC has been widely used.

The computer-generated concordances can be very flexible; the context of a word can be selected on various criteria (for example counting the words on either side, or finding the sentence boundaries). Also, sets of examples can be ordered in various ways.

context

This key term in modern linguistics has two related meanings:

corpus

A corpus is a collection of naturally-occurring language text, chosen to characterise a state or variety of a language. In modern computational linguistics, a corpus typically contains many millions of words: this is because it is recognised that the creativity of natural language leads to such immense variety of expression that it is difficult to isolate the recurrent patterns that are the clues to the lexical structure of the language.

There are two kinds of corpora described here. The first is the `sample corpus', which is a finite collection of texts, often chosen with great care and studied in detail. Once a sample corpus is established, it is not added to or changed in any way.

The other kind of corpus is just beginning to take shape: this is the `monitor corpus', which re-uses language text that has been prepared in machine-readable form for other purposes - for typesetters of newspapers, magazines, books, and, increasingly, word-processors; and the spoken language mainly for legal or bureaucratic reasons.

co-text

The co-text of a selected word or phrase consists of the other words on either side of it. This is a more precise term than context or verbal context, but it is not much used.

discourse

Discourse means language in use - naturally-occurring spoken or written language. Here it means much the same as text, but it is not usually found as a countable noun. It is sometimes used as a very general term for language patterns above the sentence. To some people, discourse suggests the spoken form of the language, and text the written form, although no such distinction is made here.

idiom

An idiom is a group of two or more words which are chosen together in order to produce a specific meaning or effect in speech or writing. The word is used in various other ways, but here the meaning is restricted to the above.

The individual words which constitute idioms are not reliably meaningful in themselves, because the whole idiom is required to produce the meaning. Idioms overlap with collocations, because they both involve the selection of two or more words. At present, the line between them is not clear. In principle, we call co-occurrences idioms if we interpret the co-occurrence as giving a single unit of meaning. If we interpret the occurrence as the selection of two related words, each of which keeps some meaning of its own, we call it a collocation.

Hence, hold talks, hold a meeting, hold an enquiry are collocations; whereas hold sway, hold the whip hand are idioms. If either hold or sway is removed, the special meaning disappears. Most current uses of sway as a noun are with hold, and the two words must be taken together. Similarly, in hold the whip hand nothing can be changed except the inflection of the verb. Other idioms, like hold...to ransom, can be discontinuous.

idiom principle

One of the main principles of the organisation of language is that the choice of one word affects the choice of others in its vicinity. Collocation is one of the patterns of mutual choice, and idiom is another. The name given to this principle of organisation is the idiom principle.

The other main principle of organisation which contrasts with the idiom principle is the open-choice principle.

KWIC

This acronym stands for Key Word In Context. It is a popular type of computer-generated concordance, which is easy for a researcher to scan quickly.

Each line of concordance contains an instance of a selected word, and the page is aligned centrally around this word. The text before and after the selected word is printed, and then an additional space, the word in question, and more text, until the end of the line is reached. Simple versions of this layout start and stop at the line ends, in the middle of words where necessary. There is an example of a KWIC concordance here.
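The layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used for any particular corpus: the function name kwic, the width parameter and the sample sentence are all invented for the example, and the context is cut at a fixed number of characters, in the middle of words where necessary.

```python
import re


def kwic(text, keyword, width=30):
    """Return one KWIC line for every occurrence of `keyword` in `text`.

    Each line holds `width` characters of left context, an additional
    space, the keyword, and up to `width` characters of right context.
    Context is cut at a fixed character count, even in mid-word.
    """
    # Collapse line breaks and runs of spaces so each context is one line.
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-justify the left context so the keyword column lines up.
        lines.append(f"{left:>{width}} {match.group()}{right}")
    return lines


# Invented sample text, used only to show the alignment.
sample = (
    "They hold talks every year. Critics hold that the talks achieve "
    "little, but the chairman continues to hold sway."
)
for line in kwic(sample, "hold", width=20):
    print(line)
```

Because the left context is padded to a fixed width, the selected word always starts in the same column, which is what makes the page easy to scan.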

lemma

A lemma is what we normally mean by a `word'. Many words in English have several actual word-forms, so that, for example, the verb to give has the forms give, gives, given, gave, giving and to give. In other languages, the range of forms can be ten or more, and even hundreds. So `the word give' can mean either (i) the four letters g i v e or (ii) the six forms listed above.

In linguistics and lexicography we have to keep these meanings separate; otherwise it would not be possible to understand a sentence like `give occurs 50 times in this text'. For this reason, the composite set of word-forms is called the lemma.

lemmatization

Lemmatization is the process of gathering word-forms and arranging them into lemmas. So the word-forms give, gives, gave, given, giving and probably to give will conventionally be lemmatized into the lemma give. Any occurrence of any of the six forms will be regarded as an occurrence of the lemma. There are a number of problems about lemmatization, because there are so many possible ways of grouping word-forms. Most books about linguistics assume that lemmatization is a simple and obvious process, especially in a language such as English. However, a lot of research is still needed to justify lemmatization on scientific grounds.
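As a rough illustration, lemmatization by table lookup might look like the following Python sketch. The table FORM_TO_LEMMA and the function lemma_counts are hypothetical and hand-made for this example; a real lemmatizer needs a far larger table or a set of rules, and the two-word form to give is left out because the sketch treats each word-form separately.

```python
from collections import Counter

# Hypothetical hand-made table: each word-form points to its lemma.
FORM_TO_LEMMA = {
    "give": "give",
    "gives": "give",
    "gave": "give",
    "given": "give",
    "giving": "give",
}


def lemma_counts(word_forms, table=FORM_TO_LEMMA):
    """Count occurrences per lemma; unknown forms count as their own lemma."""
    counts = Counter()
    for form in word_forms:
        key = form.lower()  # assumption: case is ignored
        counts[table.get(key, key)] += 1
    return counts


# Invented sample sentence containing three forms of give.
forms = "She gave what she was given and keeps giving".split()
print(lemma_counts(forms)["give"])
```

Every decision hidden in the table (which forms belong together, whether case matters) is one of the grouping choices that the entry above describes as still needing justification.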

lexis, lexical

The lexis of a language is the set of all its word-forms. Lexical structure is the organisation of these word-forms into units such as collocations and idioms.

The origins of lexical evidence are the word-patterns in texts. Position and frequency of occurrence are important criteria, and in the first instance priority is given to the formal arrangements of word-forms, rather than to intuitive conclusions about meaning.

This distinguishes lexis from semantics, which is centrally concerned with the organisation of meaning. However, lexical study aims to associate the formal patterns with distinctions in meaning, and so lexis and semantics overlap.

Sometimes lexis is contrasted with grammar, and in this kind of usage it emphasises the individual patterns of words as against the generalisations of grammatical rules.

naturalness

Competent users of a language acquire acute sensitivity to naturally-occurring language, and are quite good at spotting contrived and artificial constructions. At present, it is beyond the competence of linguistic research to describe this facility, and we must conclude that the phraseology of language in text is more subtly organised than we can express or explain.

There is an edition of the English Language Research Journal which puts the case very clearly and explores some of the consequences of this observation. In relation to the concerns of this book, naturalness reinforces the insistence on working only from attested examples of language. Many writers and teachers have found it convenient to argue from pieces of invented or adapted language; increasingly this is becoming a hazardous occupation and before long it will be abandoned altogether.

Naturalness is to text what grammatical correctness is to sentences. The two principles come close together and indeed overlap; but whereas most educated people have been taught the conventions of grammatical correctness, they have not approached text with the same rigour. They do not have a ready vocabulary for talking about textual well-formedness, but they are aware of its influence on the success of both spoken and written language in practice.

node

The node word in a collocation is the one whose lexical behaviour is under examination.

open-choice principle

In many descriptions of language - grammars and dictionaries especially - words are treated as independent items of meaning. Each of them represents a separate choice. Collocations, idioms and other exceptions to this principle are given lower status in the descriptions. This is the open-choice principle, which contrasts with the idiom principle. A full discussion can be found in Chapter 8 of Looking Up.

running words

This term is used in measuring the length of a text. Each successive word-form is counted once, whether or not that particular form has occurred before. For example, the phrase `The wind in the willows' contains 5 running words. Contrast this with vocabulary.

span

This is the measurement, in words, of the co-text of a word selected for study. A span of -4, +4 means that four words on either side of the node word will be taken to be its relevant verbal environment. See collocation.
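A span of -4, +4 can be applied mechanically, as in the Python sketch below. It assumes the text has already been split into a list of lower-case words; the function name collocates, its parameters and the sample sentence are invented for the illustration, and no test of statistical significance is attempted.

```python
from collections import Counter


def collocates(tokens, node, left=4, right=4):
    """Count the words found within the span around each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - left)
        # Words to the left of the node plus words to the right of it.
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts


# Invented sample, already lower-cased and split on spaces.
tokens = "the ministers will hold talks in may and hold a meeting in june".split()
for word, freq in collocates(tokens, "hold").most_common(3):
    print(word, freq)
```

The left and right limits are separate parameters so that an asymmetric span can be tried as well.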

text

This word can be used as a countable noun or uncountable noun.

vocabulary

The vocabulary of a text is the set of all the different words used in the text. Vocabulary is usually counted in lemmas and contrasted with the count of running words. For example the phrase `The wind in the willows' contains 5 running words, but its vocabulary has only 4 items since the occurs twice. All the words of a text are part of its vocabulary - even the very common words such as the, of, and, which are sometimes said to be restricted to the grammar.
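The two counts for this phrase can be reproduced with a short Python sketch. It makes two simplifying assumptions: words are separated by spaces only, and capital letters are ignored so that The and the count as one item. It does not lemmatize, so it counts distinct word-forms rather than lemmas. The function names are invented for the example.

```python
def running_words(text):
    """Every successive word-form, repeated forms included."""
    return text.split()


def vocabulary(text):
    """The set of different word-forms, ignoring capital letters."""
    return {form.lower() for form in text.split()}


phrase = "The wind in the willows"
print(len(running_words(phrase)))  # 5 running words
print(len(vocabulary(phrase)))     # 4 vocabulary items
```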

word

It is best to use the word word informally to mean what it normally means. Roughly speaking, a word in the Roman alphabet is based on a string of letters - including hyphen and sometimes apostrophe - bounded on each side by a word space or another punctuation mark. It can thus signify anything from a word-form to a lemma, from a running word to a vocabulary word. This is a large range of variation, and the use of the term is not very precise. However, `word' is used for one of the basic concepts of language. The concept is sometimes expressed as `minimum free form'; simultaneously the unit of vocabulary and the smallest unit of syntax.
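The rough definition above translates fairly directly into a pattern for picking words out of a text. The Python sketch below is one possible reading of it, not a standard: the pattern WORD and the function words are invented for the example, only the unaccented letters A to Z are recognised, and a hyphen or apostrophe counts as part of a word only when it has letters on both sides.

```python
import re

# A run of letters, optionally continued by an internal hyphen or
# apostrophe followed by more letters (assumption: ASCII letters only).
WORD = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")


def words(text):
    """Return the words of `text` in order, without surrounding punctuation."""
    return WORD.findall(text)


print(words("The well-known critic didn't hold sway."))
# ['The', 'well-known', 'critic', "didn't", 'hold', 'sway']
```

Under this reading a final apostrophe, as in a plural possessive, is treated as punctuation rather than as part of the word, which shows how quickly the informal definition runs into decisions of detail.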

word-form

This term is used for any unique string of characters, bounded by spaces. Hence give, giving, gave, given are all separate word-forms. See also lemma.

