If you are a student of English, a teacher or a translator, or if you
write in English, analyse English, or have any questions about how
English works, concordancing can be of great benefit to you.
This mini-course is designed to
introduce concordancing to students and teachers, native and non-native
speakers of English and people with a general interest in language and
learner autonomy. It therefore casts its net over a wide range of
people. It does not assume a great deal of linguistic knowledge: all
required terminology, both in computer use and linguistics, is explained
as it is introduced. The
main purpose of these sessions is to introduce the techniques
involved in searching for answers to questions. Just what can be
asked will be revealed at every step, as we see how searches can
be formed and refined.
This first Session is a short
introduction. In Session 2 we will dive directly into concordancing
using the Collins Cobuild Corpus Concordance Sampler. This has been
chosen because it is readily accessible through the internet and because
of its rich variety of functions that demonstrate many features of full
concordancers. Throughout these sessions it will be shortened to CCS.
If you are a non-native speaker
of English, it is likely that you will want your English to be as close
as possible to the norms of English. You might even think that English
is spoken and written with so much variation that the norms are too
unstable to be grasped. There is, of course, a core language which
represents the vast majority of English, without which it could not be
called one language. It is not always easy to gain access to those
norms, i.e. the most likely way a native speaker would express
something. Grammar books are not always able to answer questions,
especially about word use; modern dictionaries, based on corpus
research, provide more reliable information. Corpus-based grammar books
now exist as well, but so far they are of more use to the language
professional than the general user. One of these is the Longman
Grammar of Spoken and Written English. Read a review of it in TESL-EJ.
Cobuild is an acronym: Collins
Birmingham University International Language Database. A great deal of
pioneering work in corpus linguistics has been done at Birmingham
University.
A corpus is a collection or body of
texts in electronic form. The plural is corpora. The Cobuild corpus is
referred to as The Bank of English.
All this terminology!
Yes, there is quite a lot of terminology - as in any field. Most of
the linguistic terms are lexical ones, which are not as familiar to
language learners as grammar terms. There are several glossaries listed
at the end of this course. I have also made another one with terms more
specifically related to the ideas here.
A concordancer is software for looking into a corpus.
A concordance is the set of lines of text illustrating the search word,
known as the node.
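As a rough illustration of what such software does, the sketch below prints each occurrence of a search word with a little context on either side, lined up so that the node forms a central column. It is only a minimal sketch: the function name `kwic`, the `width` parameter and the sample sentences are invented for this example, and it is not how CCS or any other concordancer is actually implemented.

```python
import re

def kwic(text, node, width=30):
    """Return keyword-in-context lines for every whole-word match of `node`."""
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the node always starts in the same column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text, standing in for a corpus.
sample_text = (
    "She held out her hand. On the other hand, nobody had asked. "
    "He gave me a hand with the boxes."
)

for line in kwic(sample_text, "hand", width=20):
    print(line)
```

Matching on whole words means that a search for "hand" does not pick up "handed" or "handle"; a real concordancer usually lets you choose whether such forms are included.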
A corpus is assembled by
collecting texts in electronic form. The texts are usually chosen to
represent such things as:
| genre | contract, letter of appointment, theatre program |
| domain | the family, at work |
| register | conversation, fiction, newspaper language, academic prose |
| mode | writing, speaking, gesturing |
Jan Svartvik wrote:
Every corpus that I've had a chance to examine, however small,
has taught me facts that I couldn't imagine finding out about
in any other way.
Importantly,
texts are not corrected according to any grammar or spelling rules,
taboo words are not “cleaned up”, and general abuse of the
language sits happily alongside general use of the language.
Slips of the tongue, pen and keyboard remain intact. A corpus is therefore a
descriptive sample of the language, not a prescriptive one: this
makes it rich, but it also means that you should usually look for
significant patterns, not oddities. More on this later.
In this Sampler, a database of
millions of words is searched and up to forty concordance lines are
shown. A full concordancer is not limited in this way. Searching
Microconcord (by Tim Johns and Mike Scott), which has its own corpus of
only 2 million words, and the British National Corpus (BNC) of 100
million words, you find the following numbers of hits:
| | Microconcord | BNC |
| hand | 459 | 33484 |
| grant | 81 | 7594 |
| unemployment | 120 | 6409 |
| university | 225 | 16316 |
Somewhat more than forty!
Further machine intervention is required when you have large numbers of
finds. From an introductory point of view, the forty-line limit gives
a manageable number to deal with. And there are techniques for refining
your search to get forty sharply focused lines, as we
shall soon see.
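The idea of capping a long list of finds at forty lines can be sketched as follows. This is an illustration only: how the Sampler actually chooses its forty lines is not stated here, so the random sampling, the fixed seed, the function names `find_hits` and `cap_lines`, and the toy corpus are all assumptions made for the example.

```python
import random
import re

def find_hits(texts, node, span=25):
    """Collect a short context string for every whole-word match of `node`."""
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    hits = []
    for text in texts:
        for match in pattern.finditer(text):
            start = max(0, match.start() - span)
            hits.append(text[start:match.end() + span])
    return hits

def cap_lines(hits, limit=40, seed=0):
    """Return all hits if there are few, otherwise a random sample of `limit`."""
    if len(hits) <= limit:
        return list(hits)
    # A fixed seed keeps the sample repeatable; this is an illustrative choice.
    return random.Random(seed).sample(hits, limit)

# Invented toy corpus: 85 short texts, each containing the node once.
toy_corpus = ["The grant was approved."] * 60 + ["We grant that it is late."] * 25
all_hits = find_hits(toy_corpus, "grant")
shown = cap_lines(all_hits)
print(f"{len(shown)} lines shown out of {len(all_hits)} hits")
```

Sampling at random, rather than taking the first forty finds, avoids showing lines that all come from the same few texts at the start of the corpus.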
When forty lines are shown,
however, it is not clear how many they were selected from. For example,
if you are comparing the use of “at least” and “at the least”,
knowing their relative frequencies would be a useful starting point
since frequency is an indicator of typicality.
| | CCS | Microconcord |
| at least | 40 | 662 |
| at the least | 34 | 1 |
Ultimately, a sampler can answer
the question: is there any evidence for…?, rather than a more
decisively-framed question.
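When the whole text is available rather than a forty-line sample, relative frequencies of two competing phrases can be counted directly. The sketch below shows one way to do so; the function name `phrase_count` and the toy text are invented for the example, and the counts it produces say nothing about real English.

```python
import re

def phrase_count(text, phrase):
    """Count case-insensitive, whole-word occurrences of `phrase` in `text`."""
    words = [re.escape(w) for w in phrase.split()]
    # Allow any run of whitespace between the words of the phrase.
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    return len(pattern.findall(text))

# Invented toy text, standing in for a corpus.
toy_text = (
    "At least three people came. It will take at least a week. "
    "At the least, you could have phoned. She stayed for at least an hour."
)

counts = {p: phrase_count(toy_text, p) for p in ("at least", "at the least")}
total = sum(counts.values())
for phrase, n in counts.items():
    share = n / total if total else 0.0
    print(f"{phrase!r}: {n} hits ({share:.0%} of the two together)")
```

Counts like these are only comparable within one corpus; to compare across corpora of different sizes, such as Microconcord and the BNC, they would first have to be scaled, for example to hits per million words.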
Non-native students learning English
Data Driven Learning refers to
studying English by isolating patterns that occur in real language. The
student answers his or her language questions by analysing the data the
concordancer produces. The remainder of these sessions shows just how
that is done. These procedures were pioneered largely by Tim Johns (now
retired from Birmingham University) who named this
pedagogical application Data Driven Learning. Links to some others
involved in this form of language study appear throughout these
sessions.
Jan Svartvik also
wrote: I don't think there can be any corpora, however large,
that contain information about all of the areas of English
lexicon and grammar that I want to explore; all that I have
seen are inadequate.
One last caveat
Since you have got this far, let
me close the introductory session with another caveat, this time
for this website. You are reading its first version before it has even
been tried out on anyone. Therefore, all comments and suggestions will
be gratefully received, read and taken on board. My contact: thomas@fi.muni.cz.