Introduction




 

If you are a student of English, a teacher or a translator, or if you are writing in English, analysing English, or have any questions about how English works, concordancing can be of great benefit to you.

This mini-course introduces concordancing to students and teachers, native and non-native speakers of English, and people with a general interest in language and learner autonomy. It therefore casts its net over a wide range of people. It does not assume a great deal of linguistic knowledge: all required terminology, both in computer use and linguistics, is explained as it is introduced. The main purpose of these sessions is to introduce the techniques involved in searching for answers to questions. Just what can be asked will be revealed at every step, as we see how searches can be formed and refined.

This first Session is a short introduction. In Session 2 we will dive directly into concordancing using the Collins Cobuild Corpus Concordance Sampler. This has been chosen because it is readily accessible through the internet and because of its rich variety of functions that demonstrate many features of full concordancers. Throughout these sessions it will be shortened to CCS.

If you are a non-native speaker of English, it is likely that you will want your English to be as close as possible to the norms of English. You might even think that English is spoken and written with so much variation that the norms are too unstable to be grasped. There is, of course, a core language which represents the vast majority of English, without which it could not be called one language. It is not always easy to gain access to those norms, i.e. the most likely way a native speaker would express something, and grammar books are not always able to answer questions, especially about word use. Modern dictionaries, based on corpus research, provide more reliable information. Corpus-based grammar books now exist as well, but so far they are of more use to the language professional than the general user. One of these is the Longman Grammar of Spoken and Written English. Read a review of it in TESL-EJ.

What is Cobuild?

It is an acronym: Collins Birmingham University International Language Database. A great deal of pioneering work in corpus linguistics has been done at Birmingham University.

What is a corpus?

A collection or body of texts in electronic form. The plural is corpora. The Cobuild corpus is referred to as the Bank of English.

All this terminology!

Yes, there is quite a lot of terminology, as in any field. Most of the linguistic terms here are lexical ones, which are less familiar to language learners than grammatical terms. There are several glossaries listed at the end of this course. I have also made another one, with terms more specifically related to the ideas here.

What is a concordancer?

Software for looking into a corpus.

What is a concordance?

The lines of text that illustrate the search word in its context. The search word itself is called the node.
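To make this concrete, here is a minimal sketch in Python of the basic idea behind a concordance. It is not how the CCS or any other real concordancer is implemented: the function name concordance, its parameters (width, max_lines) and the sample sentence are all invented for illustration. It finds each whole-word occurrence of the node and prints it with some context on either side, with the node lined up in the same column.

```python
import re


def concordance(text, node, width=30, max_lines=40):
    """Return keyword-in-context lines for whole-word matches of `node`.

    Hypothetical helper for illustration only; `width` is the number of
    characters of context kept on each side of the node.
    """
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-justify the left context so every node starts in the same column.
        lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
        if len(lines) == max_lines:  # stop at the cap, as a sampler would
            break
    return lines


sample = (
    "She held out her hand. He shook the hand she offered, "
    "and on the other hand he said nothing."
)
for line in concordance(sample, "hand"):
    print(line)
```

Reading down the column of bracketed nodes is what lets patterns to the left and right of the word stand out.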

A caveat for a corpus

A corpus is assembled by collecting texts in electronic form. The texts are usually chosen to represent such things as:

genre: contract, letter of appointment, theatre program

domain: the family, at work

register: conversation, fiction, newspaper language, academic prose

mode: writing, speaking, gesturing

 

Charles Fillmore, writing in a volume edited by Jan Svartvik, observed: "Every corpus that I've had a chance to examine, however small, has taught me facts that I couldn't imagine finding out about in any other way."

 

Importantly, texts are not corrected according to any grammar or spelling rules, taboo words are not “cleaned up”, and general abuse of the language sits happily alongside general use of the language. Slips of the tongue, pen and keyboard remain intact. It is therefore a descriptive sample of the language, not a prescriptive one: this makes it rich, but it also means that you should usually look for significant patterns, not oddities. More on this later.

A caveat for a sampler

In this Sampler, a database of millions of words is searched and up to forty concordance lines are shown. A full concordancer returns far more. For example, searching Microconcord (by Tim Johns and Mike Scott), which has its own corpus of only 2 million words, and the British National Corpus (BNC) of 100 million words, you find the following numbers of occurrences:

 

Word            Microconcord    BNC
hand            459             33484
grant           81              7594
unemployment    120             6409
university      225             16316

 

 

Somewhat more than forty! Further machine intervention is required when a search returns this many hits. From an introductory point of view, the forty-line limit gives a manageable number of lines to deal with. And there are techniques for refining your search to get forty sharply focused lines, as we shall soon see.
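Raw counts from corpora of such different sizes cannot be compared directly. One common way round this is to express each count per million words. The sketch below does that arithmetic for the figures in the table above. It assumes the round corpus sizes quoted in the text (2 million and 100 million words), so the results are only approximate, and the helper name per_million is invented for illustration.

```python
# Corpus sizes as quoted in the text; these are round figures, so treat
# the resulting rates as approximate.
CORPUS_SIZE = {"Microconcord": 2_000_000, "BNC": 100_000_000}

# Raw counts copied from the table above.
RAW_COUNTS = {
    "hand": {"Microconcord": 459, "BNC": 33484},
    "grant": {"Microconcord": 81, "BNC": 7594},
    "unemployment": {"Microconcord": 120, "BNC": 6409},
    "university": {"Microconcord": 225, "BNC": 16316},
}


def per_million(count, corpus_words):
    """Occurrences per million words of corpus text (hypothetical helper)."""
    return count * 1_000_000 / corpus_words


for word, counts in RAW_COUNTS.items():
    rates = ", ".join(
        f"{name}: {per_million(n, CORPUS_SIZE[name]):.1f}"
        for name, n in counts.items()
    )
    print(f"{word:<14}{rates}")
```

On these assumptions the two corpora turn out to be much closer to each other than the raw counts suggest, since the BNC is about fifty times larger.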

When forty lines are shown, however, it is not clear how many they were selected from. For example, if you are comparing the use of “at least” and “at the least”, knowing their relative frequencies would be a useful starting point since frequency is an indicator of typicality.

 

Phrase          CCS     Microconcord
at least        40      662
at the least    34      1

 

 

Ultimately, a sampler can answer the question "Is there any evidence for…?" rather than a more decisively framed one.
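The effect of the forty-line cap on the comparison above can be worked through in a few lines of Python. This is only a sketch using the four figures from the table: the names share and may_be_capped are invented for illustration, and the code says nothing about how the Sampler itself selects its lines.

```python
SAMPLER_CAP = 40  # the Sampler shows at most forty lines

# Figures from the table above: lines shown by CCS, hits found by Microconcord.
COUNTS = {
    "CCS": {"at least": 40, "at the least": 34},
    "Microconcord": {"at least": 662, "at the least": 1},
}


def share(counts, phrase):
    """Proportion of all hits in `counts` that belong to `phrase`."""
    return counts[phrase] / sum(counts.values())


def may_be_capped(count, cap=SAMPLER_CAP):
    """A count that reaches the cap may hide any number of further hits."""
    return count >= cap


for source, counts in COUNTS.items():
    for phrase, n in counts.items():
        capped = source == "CCS" and may_be_capped(n)
        note = " (at the cap, true total unknown)" if capped else ""
        print(f"{source:<13}{phrase:<14}{n:>4} {share(counts, phrase):7.1%}{note}")
```

In the capped figures the two phrases look almost equally common, while the uncapped Microconcord counts show "at least" accounting for nearly every hit. That is why the Sampler's line counts should not be read as frequencies.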

Non-native students learning English

Data Driven Learning refers to studying English by isolating patterns that occur in real language. The student answers his or her language questions by analysing the data the concordance produces. The remainder of these sessions shows just how that is done. These procedures were pioneered largely by Tim Johns (now retired from Birmingham University) who named this pedagogical application Data Driven Learning. Links to some others involved in this form of language study appear throughout these sessions.

Charles Fillmore, writing in a volume edited by Jan Svartvik, also observed: "I don't think there can be any corpora, however large, that contain information about all of the areas of English lexicon and grammar that I want to explore; all that I have seen are inadequate."

One last caveat

Since you have got this far, let me close the introductory session with another caveat, this time for this website. You are reading its first version before it has even been tried out on anyone. Therefore, all comments and suggestions will be gratefully received, read and taken on board. My contact: thomas@fi.muni.cz.