Copyright note: most material on this page is taken from the help file in WordSmith Tools. Thanks to Mike Scott.
This tool generates word lists based on one or more ASCII or ANSI text files. The word lists can be generated in both alphabetical and frequency order, and optionally you can generate a word index list too.
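As a rough illustration of what such a word list contains, the following Python sketch counts words in a string and sorts them both ways. It is not how WordSmith itself works: the function name build_word_list, the tokenising rule (runs of letters and apostrophes, lower-cased) and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

def build_word_list(text):
    """Count lower-cased word tokens in `text` (assumed tokenising rule)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Invented sample text, for illustration only.
sample = "the cat sat on the mat and the dog sat too"
counts = build_word_list(sample)

# Frequency order: most frequent first, ties broken alphabetically.
by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Alphabetical order.
alphabetical = sorted(counts.items())

print(by_frequency[:3])
print(alphabetical[:3])
```

A real word list would be built from files rather than a string, but the two orderings are the same idea.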
These lists can be used
These word-lists may also be used as input to a KeyWords program, which analyses the words in a given text and compares frequencies with a reference corpus, in order to generate lists of "key-words" and "key-key-words" (see below).
If a text is 1,000 words long, it is said to have 1,000 tokens. But a lot of these words will be repeated, and there may be only, say, 400 different words in the text. Types, therefore, are the different words.
The ratio between types and tokens in this example would be 40%. But this ratio varies very widely in accordance with the length of the text (or corpus of texts) which is being studied. A 1,000-word article might have a type/token ratio of 40%; a shorter one might reach 70%; 4 million words will probably give a type/token ratio of about 2%, and so on. Because it depends so heavily on length, such type/token information is rather meaningless in most cases.
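The arithmetic can be sketched in a few lines of Python. The function name type_token_ratio and the artificial token list are assumptions for the example; the list is built so that it has exactly 1,000 tokens and 400 types, as in the case described above.

```python
def type_token_ratio(tokens):
    """Return the type/token ratio as a percentage (0.0 for an empty text)."""
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)

# Artificial text: 1,000 tokens drawn from 400 different "words".
tokens = [f"w{i % 400}" for i in range(1000)]

ratio = type_token_ratio(tokens)
print(len(tokens), len(set(tokens)), ratio)
```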
The standardised type/token ratio is computed every n words as the program goes through each text file. In other words, if n=1,000, then the ratio is calculated for the first 1,000 running words, then calculated afresh for the next 1,000, and so on to the end of your text or corpus. A running average is computed, which means that you get an average type/token ratio based on consecutive 1,000-word chunks of text. (Texts with fewer than 1,000 words (or whatever n is set to) will get a standardised type/token ratio of 0.)
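The procedure just described might be sketched as follows. This is a minimal sketch, not the program's own code: the name standardised_ttr is invented, and it assumes that a final chunk shorter than n words is simply left out of the average, which the description above does not state explicitly.

```python
def standardised_ttr(tokens, n=1000):
    """Average the type/token ratio (as a percentage) over consecutive
    n-word chunks. Assumption: a trailing chunk shorter than n is ignored.
    Texts shorter than n words get 0.0, as described in the text."""
    ratios = []
    for start in range(0, len(tokens) - n + 1, n):
        chunk = tokens[start:start + n]
        ratios.append(100.0 * len(set(chunk)) / n)
    if not ratios:
        return 0.0
    return sum(ratios) / len(ratios)

# Tiny invented example with n=3 so the chunks are easy to check by hand:
# chunk 1 = a b a (2 types), chunk 2 = a c d (3 types), "e" is left over.
tokens = ["a", "b", "a", "a", "c", "d", "e"]
print(standardised_ttr(tokens, n=3))
print(standardised_ttr(["a", "b"], n=3))
```

With n=3 the two full chunks have ratios of about 66.7% and 100%, so the average is about 83.3%; the two-word text is too short and gets 0.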
The purpose of this tool is to locate and identify key words in a given text. To do so, it compares the words in the text with a reference set of words usually taken from a large corpus of text. Any word which is found to be outstanding in its frequency in the text is considered "key". The key words are presented in order of outstandingness.
The "key words" are calculated by comparing the frequency of each word in the smaller of the two wordlists with the frequency of the same word in the reference wordlist. All words which appear in the smaller list are considered, unless they are in a stop list.
If "the" occurs, say, 5% of the time in the small wordlist and 6% of the time in the reference corpus, it will not turn out to be "key", though it may well be the most frequent word. If the text concerns the anatomy of spiders, it may well turn out that the names of the researchers, and the items spider, leg, eight, etc. are more frequent than they would otherwise be in your reference corpus (unless your reference corpus only concerns spiders!).
To compute the "key-ness" of an item, the program therefore computes
and cross-tabulates these.
Statistical tests include:
A word will get into the listing here if it is unusually frequent (or unusually infrequent) in comparison with what one would expect on the basis of the larger wordlist.
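One common way of scoring a cross-tabulation like this is a log-likelihood statistic computed on a 2x2 table (word versus all other words, small wordlist versus reference wordlist). The sketch below uses that statistic purely as an illustration; whether it matches the tests the program actually applies is an assumption, and the function name and all the frequencies are invented.

```python
import math

def log_likelihood(freq_text, size_text, freq_ref, size_ref):
    """Log-likelihood (G2) for one word from a 2x2 cross-tabulation.
    Rows: small wordlist, reference wordlist.
    Columns: this word, all other running words."""
    observed = [
        [freq_text, size_text - freq_text],
        [freq_ref, size_ref - freq_ref],
    ]
    total = size_text + size_ref
    row_totals = [size_text, size_ref]
    col_totals = [freq_text + freq_ref, total - freq_text - freq_ref]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2

# Invented figures: a 1,000-word text against a 100,000-word reference.
score_the = log_likelihood(50, 1000, 6000, 100000)    # 5% vs 6%
score_spider = log_likelihood(30, 1000, 10, 100000)   # 3% vs 0.01%
print(score_the, score_spider)
```

On these made-up figures the very frequent word gets a low score because its proportions are similar in both lists, while the much rarer word scores far higher because it is so over-represented in the small text. The statistic is high for unusually infrequent words too, so the direction has to be checked separately by comparing the two proportions.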
A "key key-word" is one which is "key" in more than one of a number of related texts. The more texts it is "key" in, the more "key key" it is. This will depend a lot on the topic homogeneity of the corpus being investigated. In a corpus of City news texts, items like bank, profit, companies are key key-words, while computer will not be, though computer might be a key word in a few City news stories about IBM or Microsoft share dealings.
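Counting "key key-ness" amounts to counting, for each word, the number of texts in which it is key. The sketch below assumes the key words of each text have already been found; the function name key_key_words, the min_texts threshold and the three tiny key-word sets are all invented for illustration.

```python
from collections import Counter

def key_key_words(key_word_sets, min_texts=2):
    """Return (word, number_of_texts) pairs for words that are key in at
    least `min_texts` texts, most widely key first."""
    counts = Counter()
    for keys in key_word_sets:
        counts.update(set(keys))  # each text counts a word at most once
    selected = [(w, c) for w, c in counts.items() if c >= min_texts]
    return sorted(selected, key=lambda wc: (-wc[1], wc[0]))

# Invented key-word sets for three City news stories.
texts = [
    {"bank", "profit", "computer"},
    {"bank", "profit", "companies"},
    {"bank", "companies"},
]
result = key_key_words(texts)
print(result)
```

Here computer is key in only one story, so it does not appear in the result.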
A Mutual Information (MI) score relates one word to another. For example, if problem is often found with solve, they may have a high mutual information score. Usually, "the" will be found much more often beside problem than solve, so the procedure for calculating Mutual Information takes into account not just the most frequent words found near the word in question, but also whether each word is often found elsewhere, well away from the word in question. Since "the" is found very often indeed far away from problem, it will not tend to be related, that is, it will get a low MI score.
This relationship is bi-lateral: in the case of kith and kin, it doesn't distinguish between the virtual certainty of finding kin near kith, and the much lower likelihood of finding kith near kin.
The MI score expresses the extent to which the observed frequency of co-occurrence differs from what we would expect (statistically speaking). It does not work very well with very low frequencies. For instance, sour occurs 472 times and puss 31 times in the CobuildDirect corpus. Since sour and puss co-occur 4 times, this particular collocation gets a very high MI score. The t-score provides a way of getting away from this problem since it also takes frequency into account. To sum up, MI is more likely to give high scores to totally fixed phrases whereas t-score will yield significant collocates that occur relatively frequently. In most cases, t-score is the most reliable measurement.
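The contrast can be illustrated with the usual textbook formulas: MI as the base-2 logarithm of observed over expected co-occurrence, and t-score as observed minus expected, divided by the square root of observed. These are assumed here and may differ in detail (for example in how the span is handled) from what any particular program computes. The corpus size is not given above, so the figure of 50 million words is a placeholder assumption, and the second, frequent pair is entirely invented.

```python
import math

def mi_score(freq_node, freq_collocate, freq_together, corpus_size):
    """Mutual information: log2(observed / expected co-occurrence)."""
    expected = freq_node * freq_collocate / corpus_size
    return math.log2(freq_together / expected)

def t_score(freq_node, freq_collocate, freq_together, corpus_size):
    """t-score: (observed - expected) / sqrt(observed)."""
    expected = freq_node * freq_collocate / corpus_size
    return (freq_together - expected) / math.sqrt(freq_together)

CORPUS_SIZE = 50_000_000  # placeholder assumption, not taken from the text

# Frequencies from the text: sour 472, puss 31, together 4.
mi_rare = mi_score(472, 31, 4, CORPUS_SIZE)
t_rare = t_score(472, 31, 4, CORPUS_SIZE)

# Invented frequent pair, for comparison only.
mi_common = mi_score(50_000, 20_000, 2_000, CORPUS_SIZE)
t_common = t_score(50_000, 20_000, 2_000, CORPUS_SIZE)

print(mi_rare, t_rare)
print(mi_common, t_common)
```

Under these assumptions the rare pair gets the higher MI score while the frequent pair gets the higher t-score, which is the pattern described above.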
A KWIC (KeyWord In Context) concordance is a set of examples of a given word or phrase. A line of text is shown for each occurrence found in the corpus. The search word is usually aligned in the centre of each line for easier analysis.
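A very small concordancer can be sketched as below. The function names kwic and format_kwic, the context width counted in words, and the sample sentence are assumptions for the example; a real concordancer also handles phrases, punctuation and files.

```python
def kwic(tokens, search_word, width=4):
    """Return (left context, match, right context) for each occurrence of
    `search_word`, with up to `width` words of context on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == search_word.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

def format_kwic(lines, left_width=30):
    """Right-align the left context so the search word lines up in a column."""
    return [
        f"{left[-left_width:]:>{left_width}}  {word}  {right}"
        for left, word, right in lines
    ]

# Invented sample text.
tokens = "this paper shows that the paper can describe results".split()
hits = kwic(tokens, "paper", width=2)
for line in format_kwic(hits):
    print(line)
```

Because the left context is padded to a fixed width, every occurrence of the search word starts in the same column, which is what makes patterns to its left and right easy to scan.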
The point of a concordance is to be able to see lots of examples of a word or phrase, in their contexts. You get a much better idea of the use of a word by seeing lots of examples of it, and it's by seeing or hearing new words in context lots of times that you come to grasp the meaning of most of the words in your native language. It's by seeing the contexts that you get a better idea about how to use the new word yourself. A dictionary can tell you the meanings but it's not much good at showing you how to use the word.
Language students can use a concordancer to find out how to use a word or phrase, or to find out which other words belong with a word they want to use. For example, it's through using a concordancer that you could find out that in academic writing, a paper can describe, claim, or show, though it doesn't believe or want (*this paper wants to prove that ...).
Language teachers can use the concordancer to find similar patterns so as to help their students. They can also use this tool to help produce vocabulary exercises, by choosing two or three search-words, blanking them out, then printing.
Researchers can use a concordancer, for example when searching through a database of hospital accident records, to see whether fracture is associated with fall, grease, ladder, or when examining historical documents to find all the references to land ownership.
Collocates are the words which occur in the neighbourhood of your search word. Collocates of letter might include post, stamp, envelope, etc. However, very common words like "the" will also collocate with letter.
By examining the collocates you can find out more about "the company the word keeps", which helps to show its meaning and its usage.
You may compute a concordance with or without collocates: without is slightly quicker and will take up less room on your hard disk. The number of collocates stored will depend on the collocation horizons.
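Collecting collocates within given horizons can be sketched as follows. The function name collocates, the separate left_horizon and right_horizon parameters (counted in words) and the sample sentence are assumptions made for this example, not the program's own settings.

```python
from collections import Counter

def collocates(tokens, search_word, left_horizon=4, right_horizon=4):
    """Count the words found within the given horizons (in words) to the
    left and right of each occurrence of `search_word`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == search_word:
            start = max(0, i - left_horizon)
            window = tokens[start:i] + tokens[i + 1:i + 1 + right_horizon]
            counts.update(window)
    return counts

# Invented sample text.
tokens = "i put a stamp on the letter and post the letter today".split()
narrow = collocates(tokens, "letter", left_horizon=2, right_horizon=2)
wide = collocates(tokens, "letter", left_horizon=4, right_horizon=4)
print(narrow.most_common())
print(wide.most_common())
```

With horizons of two words, stamp is not picked up at all; widening the horizons to four words brings it in, along with more of the very common words.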
The literature on collocation has never distinguished very satisfactorily between collocates which we think of as "associated" with a word (letter - stamp) on the one hand, and on the other, the words which do actually co-occur with the word (letter - my, this, a, etc.). We could call the first type coherence collocates and the second neighbourhood collocates or horizon collocates. It has been suggested that detecting coherence collocates is very tricky, as once we start looking beyond a horizon of about 4 or 5 words on either side, we get so many words that there is more noise than signal in the system.