As long as there has been literature there have been literary critics. Despite the popular meaning of "critic", literary criticism is not necessarily negative. Instead, the intent of a scholarly "critical" reading of a work of literature is to improve one's understanding of it - to understand better the ideas it deals with, and how it connects to other issues in society at large. Thus, Northrop Frye, in his introduction to The Great Code: The Bible and Literature, explains that his first aim in this "study of the Bible from the point of view of a literary critic" was to "provide students with enough information about the Bible to enable them to see what kind of literary influence it has had". Literary criticism as a scholarly venture tries to find new ways to look at writings, and to discover new and interesting ideas in them.
Finding new ideas in, say, Shakespeare's A Midsummer Night's Dream requires an intimate grasp of the work as a whole. If one were trying to map out the development of an author's ideas across her entire lifetime, the task could be significantly more onerous still. Thus, the best literary scholarship has almost always been the result of an individual scholar dedicating his life to an individual writer, and, as a result, bringing a very mature, but very personal, vision to this writer's works. The development of the computer, with its associated potential for accurate, reproducible results, has attracted a small but growing number of scholars who have felt that modern literary scholarship needed an antidote to the personal approach. Computer-assisted criticism has tended to de-emphasize the insight (and perhaps prejudices) of the individual scholar in exchange for the seemingly objective world of numbers and statistics. Unfortunately, this objectivity was bought at a price which is still too steep for many others to accept - methods and arguments in computer-assisted literary criticism often operate in a territory more familiar to scientists and engineers than to people in the humanities. Nonetheless, in spite of this problem, a number of interesting and useful applications have appeared which, one way or another, work around this limitation.
The first tool of computer-based analysis was the "concordance" - a tool that was transformed by the computer, but which originated in methods that go back to the Middle Ages. The concordance grew out of medieval biblical scholarship that tried to find parallels between the Old and New Testaments by finding places where the words of the Old Testament foreshadow a passage in the New. Scholars became aware that it would be useful to group words into categories, and to develop indexes that pointed to all (or at least many) occurrences of those words in the different books of the Bible. Thus began the early thematic concordance, which named the major people, places, things and ideas that appeared in the Bible. Although the first such biblical concordances were created in medieval times, they have never gone out of fashion. As a boy, for example, I was given at Sunday School a then-popular Concordance of the Bible developed by the Rev. W.M. Clow. I don't know when it was created, but from the wording in the introduction it seems to have been created this century, and involved the labour of many students as well as, perhaps, the Rev. Clow himself. It lists, for example, under the verb "go":
Go, Gen. 32: 26, let me g. Ex. 23: 23; 32: 24, angel shall g. before 33: 14, my presence shall g. with thee Deut. 31: 6, thy God, he it is that doth g. Ruth 1: 15, whither thou g., I will go. [...]
The editor of this concordance didn't attempt to list all occurrences of all the words in the Bible. Function words such as "a" or "for" are not present at all, and for words such as "go", not all occurrences are listed - only those which the editor considered significant. The editor does not tell us about the labour involved in producing such a concordance, but it was undoubtedly large, since all occurrences had to be recorded and ordered by hand. When done this way, concordance production could become the life's work of a scholar and his team of researchers!
With the work of Roberto Busa, beginning as early as the late 1940s, we have the first computer-assisted work in the humanities. Actually, Busa decided to produce a concordance by machine of the complete writings of the medieval scholar Thomas Aquinas for fundamentally scholarly (in this case, philosophical) reasons! His doctoral thesis (defended in 1946) was on the concept of "presence" as it was understood by Aquinas. After searching the available printed indexes for the Latin words praesens and praesentia, he concluded that in Aquinas the concept was actually connected with Aquinas' use of the preposition in. Indeed, he came to believe that, in general, function words provided many clues about the connection between the conceptual world of the author and the words he or she used to describe it. Unfortunately, none of the scholarly resources available to Busa, created manually in the manner mentioned earlier, could possibly provide a list of all the occurrences of such common words as in, sum, or et. As a result, in the late 1940s he planned an Index Thomisticus which would be a complete concordance of all 10.6 million words of Thomas Aquinas. Although computers were virtually unknown in Italy at the time, it was clear that the size of the undertaking required some kind of machinery.
The work was begun with punched cards and card-sorting machines in the late 1940s and was completed (33 years later) in the 1970s, using large IBM mainframe computers with computer-driven typesetting equipment. With various indexes and other associated information, the Index consists of about 70,000 typeset pages. There were two full concordances. One, produced directly by the computer, was a complete list of the occurrences of all word forms. This type of concordance, called unlemmatized, lists each word form under a separate entry. Busa's work also included a lemmatized concordance, in which the headwords are standardized as they might appear in a dictionary - the different forms of each verb or noun are gathered under a single entry. For the lemmatized concordance, the computer could not bring all related forms together on its own. Thus, the lemmatized concordance was machine-assisted, requiring significant human interaction with the machine. By Busa's estimate, the entire work consumed much more than one million person-hours - mainly to enter and verify the text, but also to perform the machine-assisted lemmatization. Most of the earliest use of computers in the humanities involved the production of unlemmatized concordances. Early on, the KWIC concordance format was developed. "KWIC" stands for "Keyword in Context", and a typical excerpt is shown below:
sceptic (11)
[1,47] abstractions. In vain would the sceptic make a distinction
[1,48] to science, even no speculative sceptic, pretends to entertain
[1,49] and philosophy, that Atheist and Sceptic are almost synonymous.
[1,49] by which the most determined sceptic must allow himself to
[2,60] of my faculties? You might cry out sceptic and railer, as much as
[3,65] profession of every reasonable sceptic is only to reject
[8,97] prepare a compleat triumph for the Sceptic; who tells them, that
[11,121] to judge on such a subject. I am Sceptic enough to allow, that
[12,130] absolutely insolvable. No Sceptic denies that we lie
[12,130] merit that name, is, that the Sceptic, from habit, caprice,
[12,139] To be a philosophical Sceptic is, in a man of

sceptical (7)
[1,43] himself, consistently with his sceptical principles. So that,
[1,44] if a man has accustomed himself to sceptical considerations on the
[1,45] on any subject, were not the sceptical reasonings so refined
[3,65] that degree of firmness: even your sceptical play and wantonness
[6,84] has no place, on any hypothesis, sceptical or religious. Every
[10,112] and design, I needed all my sceptical and metaphysical
[11,115] and obscurity, is to be sceptical, or at least
The entire vocabulary of the work (which happens to be the Scottish philosopher David Hume's Dialogues Concerning Natural Religion) is listed in alphabetical order. In the excerpt we can see the portion of the full KWIC concordance covering the word forms "sceptic" and "sceptical". Each word form, called a headword, is followed by its occurrences. Each occurrence, in turn, is given on a separate line consisting, first, of some "reference information" that helps the KWIC user locate the occurrence in the full text, and then of a brief excerpt that shows the word in its context - hence the name "Keyword in Context", or KWIC. In this example, the reference information is enclosed in square brackets and gives the "part number" in which the word occurs, and a page number from a standard edition of the text - the first occurrence of "sceptic" is in part 1, on page 47.
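The mechanics of such an unlemmatized KWIC concordance are simple enough to sketch in a few lines of code. The following Python sketch (not the program that produced the excerpt above) groups every occurrence of every surface word form and prints a short left and right context for each; the file name "dialogues.txt", the context width, and the use of a character offset in place of part and page numbers are all illustrative assumptions:

import re
from collections import defaultdict

def build_kwic(text, width=30):
    """Group every occurrence of every word form with its left/right context."""
    concordance = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        form = match.group().lower()   # headword: the surface form, folded to lower case
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        concordance[form].append((match.start(), left, match.group(), right))
    return concordance

if __name__ == "__main__":
    # "dialogues.txt" is a hypothetical plain-text file of the work being studied
    with open("dialogues.txt", encoding="utf-8") as f:
        text = " ".join(f.read().split())   # collapse line breaks so each context stays on one line
    kwic = build_kwic(text)
    for headword in ("sceptic", "sceptical"):
        occurrences = kwic.get(headword, [])
        print(f"{headword} ({len(occurrences)})")
        for pos, left, word, right in occurrences:
            # a character offset stands in for the part and page numbers of the excerpt above
            print(f"  [{pos}] {left}{word}{right}")

A lemmatized concordance, such as Busa's second one, would additionally require mapping each surface form to a dictionary headword so that the different forms of a verb or noun fall under a single entry - the step that, as Busa found, demands substantial human intervention.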
Whether for the Index or for simpler unlemmatized concordances, if any work was to be done with a computer on a text, the text itself first had to be put into "machine-readable form" - essentially an editorial task. Nowadays, the idea of producing an "electronic version of the text" is an obvious one, but in the '50s, '60s, and most of the '70s there seemed to be limited interest in publishing electronic versions - possibly because computing was outside the world of most humanists, and there was little suitable software that could work with texts in any event. Standard software packages were still not widely available, but a few started to appear - one of the first for the literary scholar was COCOA, a program developed by a consortium of British universities.
Although by the end of the '70s humanities computing was well established within a rather small community, it was clear that if it was ever to become more broadly based, it needed both texts available in electronic form and standard software that many scholars could easily use. In both these areas Oxford University became an early leader. First, Susan Hockey and her associates at Oxford established the Oxford Text Archive - a repository for electronic texts. Any scholar who had gone to the effort of creating a text in electronic form was encouraged to deposit it in the Archive and make it available to others. In parallel with the Archive, Oxford's programmers created the Oxford Concordance Program (OCP), which was destined to become the most widely used humanities batch computing program ever. Indeed, both OCP and the Archive are still important today. Nowadays, a significant part of the Archive's holdings are available virtually immediately via the Internet. OCP has been overshadowed by other computing developments, but it is software that is still run today.
In the late 1970s, computers were still central resources at academic institutions and cost millions of dollars. The computer production of a concordance was still a time-consuming task. The author recalls assisting one academic who wanted to produce a KWIC concordance of the work of a particular Latin author in the late 1970s. The scholar was able to purchase the text in electronic form, so she didn't have to enter it herself. The computer ran the concordance program COGS (written by the author, John Bradley) for several hours to process the text. The result was first routed to magnetic tape and subsequently tied up a computer printer for the entire day, printing a stack of paper about 40 feet high.
The work of Alastair McKinnon, a philosopher at McGill University, which has spanned the '60s, '70s and '80s and continues today, shows how humanities computing has changed. Prof. McKinnon began in the humanities computing mainstream in the 1960s with the goal of publishing a complete printed concordance of the published writings of the philosopher Søren Kierkegaard (1.9 million words) - based not only on the Danish text, but also on the standard German, English and French translations. From the preface to the printed edition (published in 1971) it is clear that a significant portion of the intellectual effort involved in producing the electronic version of the text for his concordance went into the same kinds of editorial issues that arise in the preparation of a conventional printed edition.
However, McKinnon's work with his hard-won electronic version was not finished once the concordance was published. Instead, he continued to experiment, and by the early 1980s it was clear that his own intellectual interest had been sparked by the benefits of having access to a large and significant text in electronic form, and of having computer tools that could answer more sophisticated enquiries than a standard concordance could. As his work became known, he took the step of publishing the electronic form of the Kierkegaard corpus, along with his collection of computer programs (called TEXTMAP) which could be used to manipulate it. After acquiring the corpus and either using McKinnon's software or developing their own, other researchers could also ask questions about the words in Kierkegaard's text that would be difficult to answer with a printed concordance: to find all the co-occurrences of two commonly occurring words, for example. Indeed, they could investigate the text in ways other than those McKinnon had envisioned. He also opened the door for the application of statistically-based methods to Kierkegaard's writings, a development to which we shall return shortly.
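The kind of co-occurrence query just mentioned is easy to express once the text is in electronic form. The Python sketch below (not McKinnon's TEXTMAP, whose workings are not described here) reports every place where two given words occur within a chosen span of one another; the file name, the span of ten words, and the two sample words are illustrative assumptions:

import re

def cooccurrences(text, word_a, word_b, span=10):
    """Yield pairs of token positions where word_a and word_b fall
    within `span` words of one another."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    positions_a = [i for i, t in enumerate(tokens) if t == word_a]
    positions_b = [i for i, t in enumerate(tokens) if t == word_b]
    for i in positions_a:
        for j in positions_b:
            if abs(i - j) <= span:
                yield i, j

if __name__ == "__main__":
    # "corpus.txt" and the two words are hypothetical examples
    with open("corpus.txt", encoding="utf-8") as f:
        text = f.read()
    for i, j in cooccurrences(text, "tro", "angst", span=10):
        print(f"co-occurrence at word positions {i} and {j}")

For very common words a merge over the two sorted position lists would scale better than the nested loops used here, but the idea is the same: a printed concordance lists each word's occurrences separately, while the electronic text lets the two lists be combined.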
The appearance of the personal computer in the early 1980s, and its gradual growth into a machine capable of handling larger texts by the late '80s, also affected the humanities. The development of word-processing software was probably the most important development, but the possibility of using the same computer to support new research struck many scholars. If this was to be possible, standard software had to be available, and it had to be interactive. The first widely known package was the Brigham Young Concordance program (BYC) - subsequently commercialized under the name WordCruncher. Another widely used package is TACT, developed originally by John Bradley and Lidio Presutti at the University of Toronto. The software demonstrated here (TACTweb) was also built by John Bradley.
The availability of electronic texts has changed the orientation of computer-assisted research projects. No longer need editorial issues and problems dominate the reports of scholars doing computer-based research; instead the scholar can focus on using the computer to understand the text itself in new ways. In addition to allowing one to search for words or phrases throughout a text, the availability of electronic texts has made possible increased use of statistical methods - often a particular collection of methods called Multivariate Analysis (MVA). Such methods promise to allow the computer to do more than just find things for you. Gerard Ledger, in his book Re-counting Plato, uses statistical methods to work out the chronology of Plato's dialogues. He ignores traditional critical methods and instead applies multivariate analysis techniques to the orthographic content of words to identify which dialogues were likely to have been written earlier or later. Ledger notes that multivariate analysis is widely used in sciences such as biology, where classification problems arise that are similar to those found in stylometric studies. His methods are still controversial, as reviews in the Bryn Mawr Review indicate.
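Ledger's actual procedure is more elaborate than can be shown here, but a minimal sketch can illustrate the general shape of such a multivariate approach: represent each text by the relative frequencies of simple orthographic features, then project the resulting feature matrix into a low-dimensional space to see which texts fall near one another. The Python sketch below uses plain ASCII letter frequencies as a stand-in for the Greek orthographic features Ledger counted, a simple principal-component projection in place of his particular techniques, and hypothetical file names:

import string
import numpy as np

def letter_profile(text):
    """Relative frequency of each letter a-z in the text (a crude orthographic profile)."""
    text = text.lower()
    counts = np.array([text.count(c) for c in string.ascii_lowercase], dtype=float)
    return counts / counts.sum()

if __name__ == "__main__":
    titles = ["apology.txt", "phaedo.txt", "laws.txt"]   # hypothetical plain-text files
    profiles = []
    for title in titles:
        with open(title, encoding="utf-8") as f:
            profiles.append(letter_profile(f.read()))
    profiles = np.vstack(profiles)
    centred = profiles - profiles.mean(axis=0)               # centre the feature matrix
    U, s, Vt = np.linalg.svd(centred, full_matrices=False)   # principal-component projection
    coords = U * s                                            # each text becomes a point
    for title, (x, y) in zip(titles, coords[:, :2]):
        print(f"{title}: ({x:+.3f}, {y:+.3f})")

The hope in this kind of study is that texts written close together in time will lie close together in the projected space.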
Alastair McKinnon has also been using multivariate analysis, but for a very different purpose. As he puts it, he is trying to show how MVA "could be used to identify and name the dimensions of a literary corpus". In other words, MVA can tell us what a text is about - what its major themes are. Using a version of correspondence analysis developed by Jean-Paul Benzécri and promoted by Michael Greenacre, he starts with the word distributions of each of Kierkegaard's 34 works and produces a mapping of the words and works into an 8-dimensional space. By examining the words that most strongly affect the statistical definition of the dimensions, he attempts to identify the themes that relate most strongly to these dimensions. Although this approach has been called "mystical" by a colleague at the University of Toronto, it does give an interesting perspective on the entire Kierkegaard corpus.
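Correspondence analysis itself can be sketched compactly. The Python code below is a generic textbook version, not McKinnon's own software: it takes a small word-by-work frequency table and places both the words (rows) and the works (columns) in a shared low-dimensional space. The toy counts are purely illustrative:

import numpy as np

def correspondence_analysis(counts):
    """Return (row coordinates, column coordinates, singular values) for a frequency table."""
    P = counts / counts.sum()                     # correspondence matrix
    r = P.sum(axis=1)                             # row (word) masses
    c = P.sum(axis=0)                             # column (work) masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)        # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * s) / np.sqrt(r)[:, None]          # principal coordinates of the words
    cols = (Vt.T * s) / np.sqrt(c)[:, None]       # principal coordinates of the works
    return rows, cols, s

if __name__ == "__main__":
    # toy table: 4 words (rows) counted across 3 works (columns)
    counts = np.array([[30.,  2.,  5.],
                       [ 3., 25.,  4.],
                       [10., 12., 40.],
                       [ 8.,  9.,  7.]])
    rows, cols, s = correspondence_analysis(counts)
    print("singular values:", np.round(s, 3))
    print("word coordinates (first 2 dims):\n", np.round(rows[:, :2], 3))
    print("work coordinates (first 2 dims):\n", np.round(cols[:, :2], 3))

Interpreting the resulting dimensions - deciding, from the words that weigh most heavily on each axis, what theme that axis represents - remains the scholar's job, which is precisely the step McKinnon describes.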
For both the literary scholar and the average reader, working with the ideas one finds in a text is much more interesting and rewarding than working with the words themselves. As you will see if you use the TACTweb environment, making the computer move from identifying individual words to identifying the ideas those words represent is difficult. For this reason there is well-founded skepticism about the use of computers in text analysis, and a corresponding need to improve on the existing computer-based text analysis tools.