LECTURE 2. The Method of Corpus Analysis. The British National Corpus

1. General notes: benefits, purpose, definitions.

2. Design of the corpus.

3. Selection features.

4. Design of the spoken component.

5. The searching procedure.

Quantitative methods have long been, and still are, widely applied in linguistic research. The statistical data they yield make it possible to draw well-founded conclusions. Today, computer access to large amounts of linguistic evidence spares the researcher time-consuming, formula-based calculations done by hand. The method in question is the corpus method.

A corpus is a large collection of computer-readable texts.

Corpus linguistics is the study of language through the processing, use and analysis of machine-readable corpora of written or spoken language. The term is relatively modern and refers to a methodology based on examples of "real-life" language use. At present, the effectiveness and usefulness of corpus linguistics are closely tied to the development of computer science. Existing corpora include:

The Bank of English – 524 million words (the COBUILD dictionaries are based on it).

The Corpus of Contemporary American English (COCA) – 450 million words (1990-2012).

The Longman Written American Corpus – a dynamic corpus of 100 million words comprising running text from newspapers, journals, magazines, best-selling novels, technical and scientific writing, and coffee-table books.

The Longman Spoken American Corpus – a unique resource of 5 million words of everyday American speech.

The British National Corpus – 100 million words.

The Czech Corpus – over 100 million words, mainly written Czech.

The International Netherlands Language Corpus – 38 million words.

The International Netherlands Language Newspaper Corpus – 27 million words.

The Portuguese Corpus – 45 million words.

The Oslo Corpus of Bosnian Texts – 1.5 million words.

The British National Corpus: World Edition October 2000

General Notes

An initial report appeared in 1991, and a substantially revised and expanded version in early 1994.

Lead partner in the consortium [kən'sɔːtɪəm] (an association of companies, especially one formed for a particular purpose): Oxford University Press.

The general benefits of the corpus method:

– The material collected in large computerized corpora represents authentic rather than invented language.

– Computers can process enormous amounts of data.

– The method of retrieving the data is objective rather than intuitive, which means that studies can be replicated by other researchers using the same or different corpora.

– Specific corpora selected from particular types of texts allow for comparisons of the use and frequency of certain features in different text-types, provided that the corpora are large enough.

Purpose

The uses originally envisaged for the British National Corpus were set out in a working document, Planned Uses of the British National Corpus (BNCW02, 11 April 1991). This document identified the following as likely application areas for the corpus:

• reference book publishing

• academic linguistic research

• language teaching

• artificial intelligence

• natural language processing

• speech processing

• information retrieval

In particular, the database provided by the Corpus may be used:

1) as a source of examples of “real life” language usage in teaching English;

2) for finding new tendencies in language development;

3) for the investigation of a speaker’s role in language production;

4) for determining peculiarities of different registers;

5) for contrastive analysis of English as a Native Language and English as a Foreign Language;

6) in the theory and practice of translation, using so-called “translation and parallel corpora”.

The same document identified the following categories of linguistic information derivable from the corpus:

• lexical

• semantic/pragmatic

• syntactic

• morphological

• graphological/written form/orthographical

An example of contrastive analysis: research on the use of infinitive and gerundial constructions has shown that students learning English as a second language overuse the infinitive construction after the word “possibility” (“the possibility to do”), whereas speakers for whom English is the mother tongue use only the gerundial construction (“the possibility of doing”).
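A comparison of this kind can be sketched in a few lines of Python. The sentences below are invented stand-ins for a learner corpus and a native-speaker corpus, and the two regular expressions are only crude approximations of the infinitive and gerundial constructions (any word ending in -ing after "possibility of" is counted as a gerund). None of this comes from the study mentioned above; it only illustrates the counting step.

```python
import re

# Invented sentences standing in for a learner corpus and a native-speaker corpus.
learner_texts = [
    "We had the possibility to visit the museum.",
    "There is a possibility of meeting him, and a possibility to call him.",
]
native_texts = [
    "There is a possibility of visiting the museum.",
    "She mentioned the possibility of moving abroad.",
]

# "possibility to <word>": the infinitive construction.
INFINITIVE = re.compile(r"\bpossibility\s+to\s+\w+", re.IGNORECASE)
# "possibility of <word ending in -ing>": a rough proxy for the gerundial construction.
GERUND = re.compile(r"\bpossibility\s+of\s+\w+ing\b", re.IGNORECASE)


def count_constructions(texts):
    """Count both constructions across a list of texts."""
    joined = " ".join(texts)
    return {
        "infinitive": len(INFINITIVE.findall(joined)),
        "gerund": len(GERUND.findall(joined)),
    }


learner_counts = count_constructions(learner_texts)
native_counts = count_constructions(native_texts)
print("learners:", learner_counts)
print("natives: ", native_counts)
```

On real data the raw counts would normally be converted to frequencies per million words before the two corpora are compared, since the corpora are rarely the same size.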

General definitions

The British National Corpus is:

• a sample corpus: composed of text samples generally no longer than 45,000 words.

• a synchronic corpus: the corpus includes imaginative texts from 1960, informative texts from 1975.

• a general corpus: not specifically restricted to any particular subject field, register or genre.

• a monolingual British English corpus: it comprises text samples which are substantially the product of speakers of British English.

• a mixed corpus: it contains examples of both spoken and written language.

Design of the corpus

There is a broad consensus among the participants in the project and among corpus linguists that a general-purpose corpus of the English language would ideally contain a high proportion of spoken language in relation to written texts. However, it is significantly more expensive to record and transcribe natural speech than to acquire written text in computer-readable form. Consequently the spoken component of the BNC constitutes approximately 10 per cent (10 million words) of the total and the written component 90 per cent (90 million words). These were agreed to be realistic targets, given the constraints of time and budget, yet large enough to yield valuable empirical statistical data about spoken English.

The BNC World Edition contains 4054 texts and occupies 1,508,392 Kbytes, or about 1.5 GB. In total, it comprises just over 100 million orthographic words (specifically, 100,467,090), but the number of w-units is slightly lower: 97,619,934. The total number of s-units is just over 6 million (6,053,093).

• S-units (segment-units): number of <s> elements – more or less equivalent to sentences

• W-units: number of <w> elements – more or less equivalent to words.
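To make the two units concrete, here is a minimal sketch that counts <s> and <w> elements in a tiny hand-made fragment. It assumes the markup parses as well-formed XML and that punctuation sits in a separate <c> element; the fragment is invented for illustration and the actual distribution files may be structured differently.

```python
import xml.etree.ElementTree as ET

# Invented fragment: two s-units, seven w-units, punctuation kept in <c>.
sample = """<text>
  <s n="1"><w>The</w> <w>corpus</w> <w>is</w> <w>large</w><c>.</c></s>
  <s n="2"><w>It</w> <w>contains</w> <w>speech</w><c>.</c></s>
</text>"""


def count_units(xml_string):
    """Return (number of <s> elements, number of <w> elements)."""
    root = ET.fromstring(xml_string)
    s_units = sum(1 for _ in root.iter("s"))
    w_units = sum(1 for _ in root.iter("w"))
    return s_units, w_units


s_units, w_units = count_units(sample)
print(f"s-units: {s_units}, w-units: {w_units}")
```

Because punctuation is not wrapped in <w>, it does not contribute to the w-unit count in this sketch.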

The percentages in each table are calculated with reference to the relevant portion of the corpus; in the table for "written text domain", for example, they refer to the total for written texts. These reference totals are given in the first table below.

Table 1. Composition of the BNC World Edition

Text type	texts	w-units	%	s-units	%
Spoken demographic	153	4206058	4.30	610563	10.08
Spoken context-governed	757	6135671	6.28	428558	7.07
All Spoken	910	10341729	10.58	1039121	17.78
Written books and periodicals	2688	78580018	80.49	4403803	72.75
Written-to-be-spoken	35	1324480	1.35	120153	1.98
Written miscellaneous	421	7373707	7.55	490016	8.09
All Written	3144	87278205	89.39	5013972	82.82
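The percentage columns for the individual text types can be recomputed from the counts and the corpus totals quoted earlier (97,619,934 w-units and 6,053,093 s-units). The sketch below does so for the five component rows. The printed figures appear to be truncated to two decimal places rather than rounded; that is an inference from the numbers, not something stated in the source, so the helper truncates.

```python
# Corpus-wide totals quoted in the text.
TOTAL_W_UNITS = 97_619_934
TOTAL_S_UNITS = 6_053_093

# (w-units, s-units) for the component rows of Table 1.
rows = {
    "Spoken demographic": (4_206_058, 610_563),
    "Spoken context-governed": (6_135_671, 428_558),
    "Written books and periodicals": (78_580_018, 4_403_803),
    "Written-to-be-spoken": (1_324_480, 120_153),
    "Written miscellaneous": (7_373_707, 490_016),
}


def share(part, whole):
    """Percentage of `whole`, truncated (not rounded) to two decimals.

    Truncation is an assumption made to match the printed table.
    """
    return (part * 10_000 // whole) / 100


shares = {
    name: (share(w, TOTAL_W_UNITS), share(s, TOTAL_S_UNITS))
    for name, (w, s) in rows.items()
}

for name, (w_pct, s_pct) in shares.items():
    print(f"{name:32s} {w_pct:6.2f} {s_pct:6.2f}")
```

Using integer division before the final scaling avoids floating-point surprises at the truncation boundary.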

All texts are also classified according to their date of production. For spoken texts, the date was that of the recording. For written texts, the date used for classification was the date of production of the material actually transcribed, for the most part; in the case of imaginative works, however, the date of first publication was used. Informative texts were selected only from 1975 onwards, imaginative ones from 1960, reflecting their longer “shelf-life”, though most (75 per cent) of the latter were published no earlier than 1975.

Table 2. Date of production

Creation date	texts	w-units	%	s-units	%
Unknown	162	1814051	1.85	127132	2.10
Before 1974	47	1741624	1.78	121323	2.00
1974 to 1983	156	4621950	4.73	255057	4.21
1984 to 1994	3689	89442309	91.62	5549581	91.68

Selection features

Texts were chosen for inclusion according to three selection features: domain (subject field), time (within certain dates) and medium (book, periodical, etc.). The purpose of these selection features was to ensure that the corpus contained a broad range of different language styles, for two reasons. The first was so that the corpus could be regarded as a microcosm of current British English in its entirety, not just of particular types. The second was so that different types of text could be compared and contrasted with each other.

3.1. Sample size and method

For books, a target sample size of 40,000 words was chosen. No extract included in the corpus exceeds 47,000 words. Text samples normally consist of a continuous stretch of discourse from within the whole. Only one sample was taken from any one text. Samples were taken randomly from the beginning, middle or end of longer texts. (In a few cases, where a publication included essays or articles by a variety of authors of different nationalities, the work of non-UK authors was omitted.) As far as possible, the individual stories in one issue of a newspaper were grouped according to domain, for example as “Business” articles, “Leisure” articles, etc.
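The sampling rule for longer texts (one continuous stretch, taken at random from the beginning, middle or end) might be sketched as follows. The function name and its parameters are hypothetical, and the cut is made at exactly the target length; the gap between the 40,000-word target and the 47,000-word maximum suggests that real samples were allowed to run on past the target, which this sketch does not model.

```python
import random


def take_sample(words, target=40_000, rng=None):
    """Return one continuous stretch of at most `target` words.

    Texts no longer than the target are returned whole; longer texts are
    sampled from the beginning, the middle or the end, chosen at random.
    """
    rng = rng or random.Random()
    if len(words) <= target:
        return list(words)
    position = rng.choice(["beginning", "middle", "end"])
    if position == "beginning":
        start = 0
    elif position == "middle":
        start = (len(words) - target) // 2
    else:
        start = len(words) - target
    return words[start:start + target]


# Toy "book" of 100 numbered words, sampled with a target of 10 words.
book = [str(i) for i in range(100)]
sample = take_sample(book, target=10, rng=random.Random(0))
print(sample)
```

Passing a seeded random.Random instance makes the choice of position reproducible, which is useful when the selection has to be documented.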

The following subsections discuss each selection criterion, and indicate the actual numbers of words in each category included.

Domain

Classification according to subject field seems hardly appropriate to texts which are fictional or which are generally perceived to be literary or creative. Consequently, these texts are all labelled imaginative and are not assigned to particular subject areas. All other texts are treated as informative and are assigned to one of the eight domains listed in Table 3.

Table 3. Written domain

Domain	texts	w-units	%	s-units	%
Applied science	370	7104635	8.14	357067	7.12
Arts	261	6520634	7.47	321442	6.41
Belief and thought	146	3007244	3.44	151418	3.01
Commerce and finance	295	7257542	8.31	382717	7.63
Imaginative	477	16377726	18.76	1356458	27.05
Leisure	438	12187946	13.96	760722	15.17
Natural and pure science	146	3784273	4.33	183466	3.65
Social science	527	13906182	15.93	700122	13.96
World affairs	484	17132023	19.62	800560	15.96

The labels we have adopted represent the highest levels of a fuller taxonomy of text medium.

Table 4. Written medium

Medium	texts	w-units	%	s-units	%
Book	1414	49891770	57.16	2895652	57.75
Periodical	1208	28356005	32.48	1487725	29.67
Published miscellanea	238	4197450	4.80	288004	5.74
Unpublished miscellanea	249	3508500	4.01	222438	4.43
To-be-spoken	35	1324480	1.51	120153	2.39

The ‘Published miscellanea’ category includes brochures, leaflets, manuals and advertisements. The ‘Unpublished miscellanea’ category includes letters, memos, reports, minutes and essays. The ‘To-be-spoken’ category includes scripted television material, play scripts, etc.

3. Selection procedures employed – Books

Roughly half the titles were randomly selected from available candidates identified in Whitaker’s Books in Print (BIP), 1992, by students of Library and Information Studies at Leeds City University. Each text randomly chosen was accepted only if it fulfilled certain criteria: it had to be published by a British publisher, contain sufficient pages of text to make its incorporation worthwhile, consist mainly of written text, fall within the designated time limits, and cost less than a set price. The final selection weeded out texts by non-UK authors. Half of the books having been selected by this method, the remaining half were selected systematically.
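The random half of the procedure amounts to drawing candidates in random order and keeping only those that pass every criterion. A minimal sketch is given below. The Candidate fields and all the thresholds (minimum pages, year limits, maximum price) are hypothetical placeholders, since the source does not give the actual values.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    """One title from a books-in-print list (all fields are hypothetical)."""
    title: str
    british_publisher: bool
    pages: int
    mainly_text: bool
    year: int
    price: float
    uk_author: bool


def acceptable(book, min_pages=50, first_year=1975, last_year=1993, max_price=30.0):
    """Apply the acceptance criteria; every threshold here is an invented placeholder."""
    return (
        book.british_publisher
        and book.pages >= min_pages
        and book.mainly_text
        and first_year <= book.year <= last_year
        and book.price < max_price
        and book.uk_author
    )


def select_random(candidates, wanted, rng):
    """Visit candidates in random order and keep the first `wanted` acceptable ones."""
    pool = list(candidates)
    rng.shuffle(pool)
    chosen = []
    for book in pool:
        if len(chosen) == wanted:
            break
        if acceptable(book):
            chosen.append(book)
    return chosen


candidates = [
    Candidate("A", True, 200, True, 1988, 12.0, True),
    Candidate("B", False, 300, True, 1990, 10.0, True),   # non-British publisher
    Candidate("C", True, 20, True, 1991, 5.0, True),      # too few pages
    Candidate("D", True, 250, True, 1985, 15.0, False),   # non-UK author
    Candidate("E", True, 180, True, 1979, 9.0, True),
]
chosen = select_random(candidates, wanted=2, rng=random.Random(1))
print(sorted(book.title for book in chosen))
```

In the procedure described above the check on the author's nationality came last, as a final weeding-out step; the sketch folds it into the same test for brevity.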
