GLOSSARY OF CORPUS LINGUISTICS
The Bank of English is the name
of Cobuild's collection of corpora. Detailed information
can be found on the respective page on the Cobuild
WWW
server.
Character is a term used in computing to mean roughly a letter of an
alphabet; but a set of characters also includes punctuation marks and on
computer keyboards some shapes that are not found in ordinary
typing or
printing.
A citation is a selected example of a word or phrase in use. The term
comes from lexicography, and citations form the basic evidence
for most major dictionaries. A citation can be drawn from the
spoken
or the written language, and it consists of a word or phrase
in a sufficient textual environment to show some feature of
its meaning or use.
A set of citations of one particular word or phrase is similar
to a concordance. However, citations are usually selected by people
because of an interesting feature of the occurrence, and so they lack
the objectivity of a concordance. From now on, it is likely that new
dictionaries will be based on
concordances.
Older original dictionaries are based on collections of citations, and
the collections can become very large over the years. A collection of
citations gives important evidence about a language but should not
be confused with a corpus, which is a
set of representative texts.
Cobuild is an acronym for COllins Birmingham University
International
Language Database. This is a joint project between industry
(HarperCollins Publishers) and the University of Birmingham, which began
in 1980. A large corpus of contemporary English was gathered from
spoken and
written sources, and each word in turn was studied for its
lexical, grammatical, semantic, stylistic and pragmatic features. The
information was entered into a database from which were edited the Cobuild
dictionaries and other publications.
A word which occurs in close proximity to a word under investigation is
called a collocate of it.
Collocation is the occurrence of two or more words within a short space of
each other in a text.
The usual measure of proximity is a maximum of four words intervening.
Collocations can be dramatic and interesting
because they are unexpected,
or they can be important in the lexical structure of the language
because they are frequently repeated.
This second kind of collocation, often related to measures of statistical
significance, is the one that is usually meant in
linguistic
discussions. There are three useful technical terms
in the description of a collocation. Each citation
or concordance line exemplifies a particular word
or phrase. This word or phrase is
called the node. It
is normally presented with other words to the left and right and these are
called
collocates. The collocates can be counted
and this measurement is called the span.
Collocation in its purest sense, as used in this book, recognises
only the lexical co-occurrence of words. This kind of patterning
is often associated with grammatical choices as well, leading to
the wealth of idioms and fixed
phrases that are found in everyday
English. Some writers on collocation
(e.g. Kjellmer 1984) include matters of grammatical relations in their specification
of collocation.
Attention is concentrated here on lexical co-occurrence, more or
less independently of grammatical pattern or positional relationship.
Collocation patterns are restricted here to pairs of words, but there is no
theoretical restriction on the number of words involved.
Collocation is a contributing factor to idiom.
A concordance is an index to the words in a text. The concordance is at the
centre of corpus linguistics, because it gives access
to many important
language patterns in texts. Concordances of major works such as the Bible
and Shakespeare have been available for many years; an example of a concordance
is given here.
The
computer has made concordances easy to compile, and for some forty years
a simple and effective convention called KWIC has been
widely used.
The computer-generated concordances can be very flexible; the context
of a word can be
selected on various criteria (for example counting
the words on either side, or finding the sentence boundaries). Also, sets
of examples can be ordered in
various ways.
Context, a key term in modern linguistics, has two related meanings:
- In any continuous text, the words that come on either side of a word
or phrase selected for study constitute the context of that word or phrase.
In this
sense, the
context means the linguistic environment of any expression under scrutiny.
Sometimes, to distinguish this meaning from the other, the term
co-text is used. In lexical studies, a measured piece
of verbal context is
called a span.
- The general, non-linguistic environment of any language activity
can also be called its context. Here, it means the sociocultural
background.
Some
theories of
language use context or `context of situation' to mean a level
of language description where the limitless complexity of the nonlinguistic
environment is organised into linguistically relevant categories.
A corpus is a collection of naturally-occurring language text, chosen to
characterise a state or variety of a language. In modern computational
linguistics, a corpus typically contains many millions of words: this is
because it
is recognised that the creativity of natural language leads to
such immense variety of expression that it is difficult to isolate the
recurrent patterns that are the clues to the lexical structure of the language.
There are two kinds of corpora
described here. The first is the `sample
corpus', which is a finite collection of texts, often chosen with great care
and studied in detail.
Once a sample corpus is established, it is not added to or changed in any way.
The other kind of corpus is
just beginning to take shape: this is the `monitor
corpus', which re-uses language text that has been prepared in machine-readable
form for other purposes - for typesetters of newspapers, magazines, books,
and, increasingly, word-processors; and the
spoken language mainly for legal or
bureaucratic reasons.
The co-text of a selected word or phrase consists of the other words
on either side of it. This is a more precise term
than context or
verbal context, but it is not
much used.
Discourse means language in use - naturally-occurring spoken
or written language. Here it means much the same as
text, but it is not usually found as a
countable noun.
It is sometimes used as a very general term for language patterns above
the sentence. To some people, discourse suggests the spoken form of the
language, and text the written form, although no such
distinction is made here.
An idiom is a group of two or more words which are chosen together
in order to produce a specific meaning or effect in speech or writing. The
word is used in various other ways, but here the meaning is restricted to
the
above.
The individual words which constitute idioms are not reliably meaningful
in themselves, because the whole idiom is required to produce the meaning.
Idioms overlap with collocations, because they both involve the selection of
two or more
words. At present, the line between them is not clear. In principle,
we call co-occurrences idioms if we interpret the co-occurrence as giving a
single unit of meaning. If we interpret the occurrence as the selection of two
related words, each of
which keeps some meaning of its own,
we call it a collocation.
Hence, hold talks, hold a meeting, hold an enquiry are collocations;
whereas hold sway, hold the whip hand are idioms. If either
hold or sway is removed, the
special meaning disappears. Most
current uses of sway as a noun are with hold, and the two words
must be taken together. Similarly, in hold the whip hand nothing can
be changed except the inflection of the verb. Other idioms,
like hold...to ransom, can be discontinuous.
One of the main principles of the organisation of language is that the
choice of one word affects the choice of others in its vicinity.
Collocation is one of the patterns of mutual
choice, and idiom is another. The name given to this
principle of organisation is the idiom principle.
The other main principle of organisation which contrasts with
the idiom
principle is the
open-choice principle.
The acronym KWIC stands for Key Word In Context.
It is a popular type of computer-generated concordance, which is
easy
for a researcher to scan quickly.
Each line of concordance contains an instance of a selected word, and
the page is aligned centrally around this word. The text before and
after the selected word is printed, and then an additional space,
the
word in question, and more text, until the end of the line is
reached. Simple versions of this layout start and stop at the line ends,
in the middle of words where necessary. There is an example of a
KWIC concordance here.
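The layout described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the software used by Cobuild: the function name kwic, the width parameter and the sample sentence are all assumptions made for the example. As in the simple versions mentioned above, the context is cut at a fixed number of characters, in the middle of words where necessary.

```python
def kwic(text, keyword, width=30):
    """Return one KWIC line per occurrence of keyword in text.

    width is a hypothetical parameter: the number of characters of
    context kept on each side of the keyword.
    """
    words = text.split()
    lines = []
    for i, w in enumerate(words):
        # Compare without surrounding punctuation and without regard to case.
        if w.strip(".,;:!?\"'").lower() == keyword.lower():
            left = " ".join(words[:i])[-width:]      # may cut a word in half
            right = " ".join(words[i + 1:])[:width]  # may cut a word in half
            # Right-align the left context so the keyword falls in one column.
            lines.append(f"{left:>{width}}  {w}  {right}")
    return lines


sample = "The cat sat on the mat. The dog saw the cat run."
for line in kwic(sample, "cat", width=10):
    print(line)
```

Because the left context is padded to a fixed width, the selected word always starts in the same column, which is what makes the page easy to scan.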
A lemma is what we normally mean by a `word'. Many words in English have
several actual word-forms, so that, for example,
the verb to give
has the forms give, gives, given, gave, giving
and to give. In other languages, the range of forms can be ten or more,
and even hundreds. So `the word give' can mean either (i) the four
letters g i v e or (ii) the six forms
listed above.
In linguistics and lexicography we have to keep these meanings separate;
otherwise it would not be possible to understand a sentence like `give
occurs 50 times in this text'. For this reason, the composite set of
word-forms is
called the lemma.
Lemmatization is the process of gathering word-forms
and arranging them
into lemmas.
So the word-forms give, gives, gave, given,
giving and probably
to give will conventionally be lemmatized into the lemma give.
Any occurrence of any of the six forms will be regarded as an
occurrence of the lemma. There are a number of problems about
lemmatization,
because
there are so many possible ways of grouping word-forms. Most books about
linguistics assume that lemmatization is a simple and obvious process,
especially in a language such as English. However, a lot of research is still
needed to justify
lemmatization on scientific grounds.
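As a rough illustration of the conventional grouping, the sketch below counts occurrences of a lemma by looking each word-form up in a small table. The table LEMMA_OF and the function lemma_counts are hypothetical and hand-made for this example; a real lemmatizer would need a far larger resource. The two-word form to give is not handled, which is itself a small instance of the grouping problems mentioned above.

```python
from collections import Counter

# Hypothetical, hand-made table mapping word-forms to their lemma.
LEMMA_OF = {
    "give": "give",
    "gives": "give",
    "gave": "give",
    "given": "give",
    "giving": "give",
}


def lemma_counts(text):
    """Count lemmas in text; unknown word-forms count as their own lemma."""
    counts = Counter()
    for token in text.lower().split():
        form = token.strip(".,;:!?\"'")  # drop surrounding punctuation
        if form:
            counts[LEMMA_OF.get(form, form)] += 1
    return counts


counts = lemma_counts("She gave what she was given, and gives still.")
print(counts["give"])  # occurrences of the lemma give
```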
The lexis of a language is the set of all its word-forms.
Lexical structure is the organisation of these word-forms
into units such as collocations
and idioms.
The origins of lexical evidence are the word-patterns in texts.
Position and frequency of occurrence are important criteria, and in the first
instance priority is given to the formal
arrangements of word-forms,
rather than to intuitive conclusions about meaning.
This distinguishes lexis from semantics, which is centrally concerned with the
organisation of meaning. However, lexical study aims to associate
the formal patterns with
distinctions in meaning, and so lexis and semantics
overlap.
Sometimes lexis is contrasted with grammar, and in this kind of usage
it emphasises the individual patterns of words as against the
generalisations of grammatical rules.
Competent users of a language acquire acute sensitivity to
naturally-occurring language, and are quite good at spotting
contrived and artificial constructions. At present, it is beyond
the competence of linguistic
research to describe this
facility, and we must conclude that
the phraseology of language in text is more subtly organised than
we can express or explain.
There is an edition of the English Language Research Journal
which puts the case very clearly and explores some of the consequences
of this observation. In relation to the concerns of this book,
naturalness reinforces
the insistence on working only from attested examples
of language. Many
writers and teachers have found it convenient to argue from pieces
of invented or adapted language; increasingly this is becoming a hazardous
occupation and before long it will be abandoned altogether.
Naturalness is to text what
grammatical correctness is to sentences.
The two principles come close together and indeed overlap; but
whereas most educated people have been taught the conventions
of grammatical correctness, they have not approached text with the same
rigour. They
do not have a ready vocabulary for talking about textual
well-formedness, but they are aware of its influence on the success
of both spoken and written language in practice.
The node word in a collocation is the one
whose lexical behaviour is under examination.
In many descriptions of language - grammars and dictionaries
especially - words are treated as
independent items of
meaning. Each of them represents a separate choice.
Collocations, idioms
and other exceptions to this principle are given lower status in
the descriptions. This is the open-choice
principle, which contrasts
with the idiom principle. A
full discussion can be found in Chapter 8 of
Looking Up.
The term running words is used in measuring the length of a text. Each successive
word-form is counted once, whether or not that
particular form has occurred before. For example,
the phrase `The wind in the willows' contains 5
running words. Contrast this with
vocabulary.
The span is the measurement, in words, of the co-text of
a word selected for study. A span of -4, +4 means that four words on
either
side of the node word will be taken
to be its relevant verbal environment. See collocation.
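A span of -4, +4 can be applied mechanically, as in the following sketch. The function name collocates and the sample sentence are assumptions made for illustration; the text is assumed to be already split into a list of words, and no test of statistical significance is attempted.

```python
from collections import Counter


def collocates(words, node, span=4):
    """Count the words found within span positions of each occurrence of node."""
    counts = Counter()
    for i, w in enumerate(words):
        if w == node:
            # Up to span words to the left and to the right of the node.
            window = words[max(0, i - span):i] + words[i + 1:i + 1 + span]
            counts.update(window)
    return counts


text = "they hold talks today and they hold a meeting tomorrow".split()
for word, freq in collocates(text, "hold", span=4).most_common(3):
    print(word, freq)
```

The counts give raw frequency only. Deciding which collocates are important would need a comparison with how often each word occurs in the corpus as a whole.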
The word text can be used as a countable noun or uncountable noun.
- A text is a complete and continuous piece of spoken
or written language. Note that spoken text is included.
- Text is continuous spoken or written language. Here, it usually refers to
such language in machine-readable form. Note that there is little
if any difference in meaning between text and discourse.
The vocabulary of a text is the set of all the different words used in the text.
Vocabulary is usually counted in lemmas and contrasted
with the count of running words. For example, the
phrase `The wind in the willows' contains 5 running words, but its
vocabulary has only 4 items since the occurs twice. All the
words
of a text are part of its vocabulary - even the very common words
such as the, of, and, which are sometimes said to be restricted
to the grammar.
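The contrast between running words and vocabulary can be checked with a short sketch. The function names are assumptions, and so is the decision to ignore case so that The and the count as one item. For simplicity the vocabulary is counted here in word-forms rather than lemmas.

```python
def running_words(text):
    """Count every successive word-form, repeated or not."""
    return len(text.split())


def vocabulary(text):
    """Return the set of different words, ignoring case (an assumption)."""
    return {w.lower() for w in text.split()}


phrase = "The wind in the willows"
print(running_words(phrase))    # 5
print(len(vocabulary(phrase)))  # 4
```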
It is best to use the word word informally to mean
what
it normally means. Roughly speaking, a word in the Roman alphabet
is based on a string of letters including hyphen and sometimes apostrophe -
bounded on each side by a word space or another punctuation mark.
It can thus signify anything from a word-form to a
lemma, from a running word
to a vocabulary word. This is a large
range of variation, and the use of the term is not very precise.
However, `word' is used for one of the basic concepts of language.
The concept is sometimes expressed as `minimum free form'; simultaneously the
unit of vocabulary and the smallest unit of syntax.
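The rough definition of a word as a string of letters, including hyphen and sometimes apostrophe, can be approximated with a regular expression. This is a sketch under stated assumptions: the names WORD_RE and find_words are invented for the example, only unaccented letters A to Z are recognised, and a hyphen or apostrophe counts as part of a word only when it stands between letters.

```python
import re

# Letters, optionally joined by internal hyphens or apostrophes.
# Assumption: unaccented Roman letters only.
WORD_RE = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")


def find_words(text):
    """Return the words of text in order, leaving out punctuation."""
    return WORD_RE.findall(text)


print(find_words("Don't over-use `word'."))
# ["Don't", 'over-use', 'word']
```

A rule of this kind is only a first approximation. Abbreviations such as e.g. are split into separate letters, which shows why the informal notion of a word is hard to pin down.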
The term word-form is used for
any unique string of characters,
bounded by spaces. Hence give, giving, gave, given are all separate
word-forms. See also lemma.