Corazones de Alcachofa

Corpus Review

In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, such as counting occurrences or validating linguistic rules within a specific universe of texts.

A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora.

To make corpora more useful for linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word’s part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
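As a rough illustration of what annotation adds, the sketch below stores a part-of-speech tag and a lemma next to each word. The Token class, the simplified tag names (DET, NOUN, VERB) and the toy sentence are assumptions made for this example, not the tag set or format of any particular corpus.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str   # the word as it appears in the text
    pos: str    # part-of-speech tag (hypothetical, simplified tag set)
    lemma: str  # base form of the word


# Hand-annotated toy sentence, invented for this example.
sentence = [
    Token("The", "DET", "the"),
    Token("dogs", "NOUN", "dog"),
    Token("were", "VERB", "be"),
    Token("barking", "VERB", "bark"),
]


def to_tagged_text(tokens):
    """Render tokens in an inline word/TAG format."""
    return " ".join(f"{t.form}/{t.pos}" for t in tokens)


def forms_of(tokens, lemma):
    """Return every surface form annotated with the given lemma."""
    return [t.form for t in tokens if t.lemma == lemma]


print(to_tagged_text(sentence))
print(forms_of(sentence, "be"))
```

Once the lemma is stored with each token, a search for "be" can find "were" even though the two strings share no letters, which is the practical point of lemma annotation.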

Some corpora have further structured levels of analysis applied. In particular, a number of smaller corpora may be fully parsed. Such corpora are usually called treebanks or parsed corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around 1 to 3 million words. Other levels of structured linguistic analysis are possible, including annotations for morphology, semantics and pragmatics.

Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for part-of-speech tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching.
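A frequency list of the kind mentioned above can be sketched in a few lines. The tokenizer here (lowercase letters and apostrophes only) and the function name frequency_list are simplifying assumptions for this example; real corpus tools tokenize far more carefully.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    # Naive tokenizer: lowercase the text and keep runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


sample = "The cat saw the dog. The dog ran."
for word, count in frequency_list(sample):
    print(word, count)
```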

COCA: The Corpus of Contemporary American English
The Corpus of Contemporary American English (COCA) is the largest freely available corpus of English, and the only large and balanced corpus of American English. It was created by Mark Davies of Brigham Young University in 2008, and it is now used by tens of thousands of users every month (linguists, teachers, translators, and other researchers). COCA is also related to other large corpora that the same team has created or modified, including the British National Corpus (offered through the same architecture and interface), the 100 million word TIME Corpus (1920s-2000s), and the new 400 million word Corpus of Historical American English (COHA; 1810-2009).

The corpus contains more than 410 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words for each year from 1990 to 2010, and it is updated once or twice a year (the most recent texts are from Summer 2010). Because of its design, it is perhaps the only corpus of English that is suitable for looking at current, ongoing changes in the language.

The interface allows you to search for exact words or phrases, wildcards, lemmas, part of speech, or any combinations of these. You can search for surrounding words (collocates) within a ten-word window (e.g. all nouns somewhere near faint, all adjectives near woman, or all verbs near feelings), which often gives you good insight into the meaning and use of a word.
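The idea of collocates within a window can be sketched as follows. This is a minimal illustration of the general technique, not how the COCA interface is implemented; the function name collocates, the window parameter and the toy sentence are assumptions for this example.

```python
from collections import Counter


def collocates(tokens, node, window=4):
    """Count the words found within `window` positions of each `node` hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts


tokens = "she felt faint and a faint smile appeared".split()
print(collocates(tokens, "faint", window=2).most_common())
```

Restricting the result to one part of speech (for example, only nouns near faint) would additionally require POS-tagged tokens.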

The corpus also allows you to easily limit searches by frequency and compare the frequency of words, phrases, and grammatical constructions, in at least two main ways:

By genre: comparisons between spoken, fiction, popular magazines, newspapers, and academic, or even between sub-genres (or domains), such as movie scripts, sports magazines, newspaper editorial, or scientific journals

Over time: compare different years from 1990 to the present time
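Comparing frequencies across genres or years only makes sense if the counts are normalized by the size of each section. A common convention is frequency per million words, sketched below. The function names and all the numbers are invented for illustration and are not COCA data.

```python
def per_million(raw_count, section_size):
    """Normalize a raw count to occurrences per million words."""
    return raw_count * 1_000_000 / section_size


def compare_sections(raw_counts, section_sizes):
    """Return the per-million frequency for each section."""
    return {
        section: per_million(raw_counts[section], section_sizes[section])
        for section in raw_counts
    }


# Made-up counts and section sizes, for illustration only.
raw_counts = {"spoken": 200, "academic": 50}
section_sizes = {"spoken": 2_000_000, "academic": 1_000_000}
print(compare_sections(raw_counts, section_sizes))
```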

You can also easily carry out semantically-based queries of the corpus. For example, you can contrast and compare the collocates of two related words (little/small, democrats/republicans, men/women), to determine the difference in meaning or use between these words. You can find the frequency and distribution of synonyms for nearly 60,000 words, compare their frequency in different genres, and use these word lists as part of other queries. Finally, you can easily create your own lists of semantically-related words, and then use them directly as part of the query.

Using the web interface, you can search by words (mysterious), phrases (nooks and crannies or faint + noun), lemmas (all forms of words, like sing or tall), wildcards (un*ly or r?n*), and more complex searches such as un-X-ed adjectives or verb + any word + a form of ground. Notice that from the “frequency results” window you can click on the word or phrase to see it in context in the lower window.
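To get a feel for what wildcard patterns such as un*ly or r?n* select, the sketch below matches them against a small word list using Python's standard fnmatch module, where * stands for any run of characters and ? for exactly one character. This is an approximation for illustration only: the assumption that these two symbols behave the same way in the COCA interface is mine, and the word lists are invented.

```python
from fnmatch import fnmatchcase


def wildcard_search(pattern, vocabulary):
    """Return the words in `vocabulary` that match a shell-style pattern."""
    return [word for word in vocabulary if fnmatchcase(word, pattern)]


print(wildcard_search("un*ly", ["unlikely", "unhappy", "usually", "unfriendly"]))
print(wildcard_search("r?n*", ["run", "running", "rain", "rang", "ran"]))
```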

The first option in the search form allows you to see either a list of all matching strings or a chart display that shows the frequency in the five “macro” registers (spoken, fiction, popular magazines, newspapers, and academic journals).

Look for the frequency of funky, whom, incredibly + adjective, or forms of need + to + VERB. Via the chart display, you can also see the frequency of the word or phrase in subregisters, such as movie scripts, children’s fiction, women’s magazines, or medical journals. With the list display, you can also see the frequency of each matching string in each of the major sections of the corpus.

You can also search for collocates (words near a given word), which often provides insight into the meaning of that word.

You can also include information about genre or a specific time period directly as part of the query. This allows you to see how words and phrases vary across speech and many different types of written texts. You can easily find which words and phrases occur much more frequently in one register than another, such as good + [noun] in fiction, or verbs in the slot [we * that] in academic writing.

Comparison with other corpora:

COCA offers a balance of availability, size, genres, and currency that is not found in other corpora, including the ANC, the BNC, the BOE, or the OEC.

The chart below provides a summary of the features of the different corpora:

Nevertheless, this American corpus can also be compared with the Oxford English Corpus.

The Oxford English Corpus is a text corpus of the English language used by the makers of the Oxford English Dictionary and by Oxford University Press’s language research programme. It is the largest corpus of its kind, containing over two billion words.[1] The sources for these words are writings of all sorts, from “literary novels and specialist journals to everyday newspapers and magazines and from Hansard to the language of chatrooms, emails, and weblogs”[2]. This may be contrasted with similar databases that sample only a specific kind of writing.

The digital version of the Oxford English Corpus is formatted in XML and usually analysed with Sketch Engine software.[3]

Each document in the OE Corpus is accompanied by metadata naming:

author (if known; many websites make this difficult to determine reliably)
author gender (if known)
language type (e.g. British English, American English)
source website
year (+ date, if known)
date of collection
domain + subdomain
document statistics (number of tokens, sentences, etc.)[3]
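The metadata fields listed above could be modelled as a simple record, as in the sketch below. The class name, field names, types and sample values are all assumptions made for this example; they do not reproduce the actual XML schema of the Oxford English Corpus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentMetadata:
    language_type: str            # e.g. "British English"
    source_website: str
    year: int
    collection_date: str
    domain: str
    subdomain: str
    token_count: int
    sentence_count: int
    author: Optional[str] = None         # not always known
    author_gender: Optional[str] = None  # not always known


def tokens_by_domain(documents):
    """Sum the token counts of the documents in each domain."""
    totals = {}
    for doc in documents:
        totals[doc.domain] = totals.get(doc.domain, 0) + doc.token_count
    return totals


# Invented sample records, for illustration only.
docs = [
    DocumentMetadata("British English", "example.org", 2005, "2006-01-10",
                     "news", "politics", 1200, 60),
    DocumentMetadata("American English", "example.com", 2004, "2006-02-03",
                     "news", "sport", 800, 45, author="A. Writer"),
    DocumentMetadata("British English", "example.net", 2003, "2006-02-15",
                     "fiction", "novel", 5000, 310),
]
print(tokens_by_domain(docs))
```

Keeping the optional fields at the end with a default of None reflects the "if known" qualifier on author and author gender.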

Unlocking secrets
Using a corpus in modern lexicography is not only about tracking change. It’s worth remembering that corpus lexicography is still a relatively new art. For hundreds of years, including most of the 20th century, lexicographers worked without enough evidence: certainly nothing comparable with corpus data and sometimes with no evidence at all except their own intuition. Even when evidence of usage was available, dictionary editors had no means of filtering or sorting large amounts of data efficiently and reliably. That was only possible once technological advances in the late 20th century allowed computers to manipulate and process very large texts.

A huge part of the benefit of corpus lexicography, therefore, is in uncovering facts about the language which are not new, but which have simply not been noticed before. Take a look at the following examples for the verb cause and the adjective vivacious.

The verb cause is common in English (the 99th most common verb in the Oxford English Corpus as a whole, with 192,899 occurrences) and it’s likely to be part of every native speaker’s active vocabulary.

Try this exercise: first, think of a few sentences containing the verb cause.

You might come up with examples like these, which were made up by people when they were asked to do the same exercise:

The car went out of control and caused an accident.
The interruption in service was caused by unexpected shutdown.
The virus caused an epidemic.

The meaning seems to be quite clear. Here is a typical definition, found in many dictionaries:

cause v. be the cause of; make happen.
However, looking at the corpus evidence reveals something else about cause, which is not mentioned in the definition.


Text Corpus. Retrieved 10.28.2010

COCA. Corpus of Contemporary American English. Retrieved 10.27.2010

Corpus Oxford Dictionary. Retrieved 10.27.2010

American Corpus. Retrieved 10.28.2010

