{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Language Models\n",
"\n",
"Text is an important example of sequence data. In fact, we will use natural language models as the basis for many of the examples in this chapter. Given that, it's worth while discussing some things in a bit more detail. In the following we will view words (or sequences of characters) as a time series of discrete observations.\n",
"Assuming the words in a text of length $T$ are in turn $w_1, w_2, \\ldots, w_T$, then, in the discrete time series, $w_t$($1 \\leq t \\leq T$) can be considered as the output or label of time step $t$. Given such a sequence, the goal of a language model is to estimate the probability\n",
"\n",
"$$p(w_1, w_2, \\ldots, w_T).$$\n",
"\n",
"Language models are incredibly useful. For instance, an ideal language model would be able to generate natural text just on its own, simply by drawing one word at a time $w_t \\sim p(w_t|w_{t-1}, \\ldots w_1)$. Quite unlike the monkey using a typewriter, all text emerging from such a model would pass as natural language, e.g. English text. Furthermore, it would be sufficient for generating a meaningful dialog, simply by conditioning the text on previous dialog fragments. Clearly we are still very far from designing such a system, since it would need to *understand* the text rather than just generate grammatically sensible content.\n",
"\n",
"Nonetheless language models are of great service even in their limited form. For instance, the phrases *'to recognize speech'* and *'to wreck a nice beach'* sound very similar. This can cause ambiguity in speech recognition, ambiguity that is easily resolved through a language model which rejects the second translation as outlandish. Likewise, in a document summarization algorithm it's worth while knowing that *'dog bites man'* is much more frequent than *'man bites dog'*, or that *'let's eat grandma'* is a rather disturbing statement, whereas *'let's eat, grandma'* is much more benign.\n",
"\n",
"## Estimating a language model\n",
"\n",
"The obvious question is how we should model a document, or even a sequence of words. We can take recourse to the analysis we applied to sequence models in the previous section. Let's start by applying basic probability rules:\n",
"\n",
"$$p(w_1, w_2, \\ldots, w_T) = \\prod_{t=1}^T p(w_t | w_1, \\ldots, w_{t-1}).$$\n",
"\n",
"For example, the probability of a text sequence containing four tokens consisting of words and punctuation would be given as:\n",
"\n",
"$$p(\\mathrm{Statistics}, \\mathrm{is}, \\mathrm{fun}, \\mathrm{.}) = p(\\mathrm{Statistics}) p(\\mathrm{is} | \\mathrm{Statistics}) p(\\mathrm{fun} | \\mathrm{Statistics}, \\mathrm{is}) p(\\mathrm{.} | \\mathrm{Statistics}, \\mathrm{is}, \\mathrm{fun}).$$\n",
"\n",
"In order to compute the language model, we need to calculate the\n",
"probability of words and the conditional probability of a word given\n",
"the previous few words, i.e. language model parameters. Here, we\n",
"assume that the training data set is a large text corpus, such as all\n",
"Wikipedia entries, Project Gutenberg, or all text posted online on the\n",
"web. The probability of words can be calculated from the relative word\n",
"frequency of a given word in the training data set.\n",
"\n",
"For example, $p(\\mathrm{Statistics})$ can be calculated as the\n",
"probability of any sentence starting with the word 'statistics'. A\n",
"slightly less accurate approach would be to count all occurrences of\n",
"the word 'statistics' and divide it by the total number of words in\n",
"the corpus. This works fairly well, particularly for frequent\n",
"words. Moving on, we could attempt to estimate\n",
"\n",
"$$\\hat{p}(\\mathrm{is}|\\mathrm{Statistics}) = \\frac{n(\\mathrm{Statistics~is})}{n(\\mathrm{Statistics})}.$$\n",
"\n",
"Here $n(w)$ and $n(w, w')$ are the number of occurrences of singletons\n",
"and pairs of words respectively. Unfortunately, estimating the\n",
"probability of a word pair is somewhat more difficult, since the\n",
"occurrences of *'Statistics is'* are a lot less frequent. In\n",
"particular, for some unusual word combinations it may be tricky to\n",
"find enough occurrences to get accurate estimates. Things take a turn for the worse for 3 word combinations and beyond. There will be many plausible 3-word combinations that we likely won't see in our dataset. Unless we provide some solution to give such word combinations nonzero weight we will not be able to use these as a language model. If the dataset is small or if the words are very rare, we might not find even a single one of them.\n",
"\n",
"A common strategy is to perform some form of Laplace smoothing. We already encountered this in our discussion of [Naive Bayes](../chapter_crashcourse/naive-bayes.md) where the solution was to add a small constant to all counts. This helps with singletons, e.g. via\n",
"\n",
"$$\\begin{aligned}\n",
"\t\\hat{p}(w) & = \\frac{n(w) + \\epsilon_1/m}{n + \\epsilon_1} \\\\\n",
"\t\\hat{p}(w'|w) & = \\frac{n(w,w') + \\epsilon_2 \\hat{p}(w')}{n(w) + \\epsilon_2} \\\\\n",
"\t\\hat{p}(w''|w',w) & = \\frac{n(w,w',w'') + \\epsilon_3 \\hat{p}(w',w'')}{n(w,w') + \\epsilon_3}\n",
"\\end{aligned}$$\n",
"\n",
"Here the coefficients $\\epsilon_i > 0$ determine how much we use the\n",
"estimate for a shorter sequence as a fill-in for longer\n",
"ones. Moreover, $m$ is the total number of words we encounter. The\n",
"above is a rather primitive variant of what is Kneser-Ney smoothing\n",
"and Bayesian Nonparametrics can accomplish. See e.g. the Sequence\n",
"Memoizer of Wood et al., 2012 for more details of how to accomplish\n",
"this. Unfortunately models like this get unwieldy rather quickly:\n",
"first off, we need to store all counts and secondly, this entirely\n",
"ignores the meaning of the words. For instance, *'cat'* and *'feline'*\n",
"should occur in related contexts. Deep learning based language models\n",
"are well suited to take this into account. This, it is quite difficult\n",
"to adjust such models to additional context. Lastly, long word\n",
"sequences are almost certain to be novel, hence a model that simply\n",
"counts the frequency of previously seen word sequences is bound to\n",
"perform poorly there.\n",
"\n",
"\n",
"## Markov Models and $n$-grams\n",
"\n",
"Before we discuss solutions involving deep learning we need some more terminology and concepts. Recall our discussion of Markov Models in the previous section. Let's apply this to language modeling. A distribution over sequences satisfies the Markov property of first order if $p(w_{t+1}|w_t, \\ldots w_1) = p(w_{t+1}|w_t)$. Higher orders correspond to longer dependencies. This leads to a number of approximations that we could apply to model a sequence:\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"p(w_1, w_2, w_3, w_4) &= p(w_1) p(w_2) p(w_3) p(w_4)\\\\\n",
"p(w_1, w_2, w_3, w_4) &= p(w_1) p(w_2 | w_1) p(w_3 | w_2) p(w_4 | w_3)\\\\\n",
"p(w_1, w_2, w_3, w_4) &= p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) p(w_4 | w_2, w_3)\n",
"\\end{aligned}\n",
"$$\n",
"\n",
"Since they involve one, two or three terms, these are typically referred to as unigram, bigram and trigram models. In the following we will learn how to design better models.\n",
"\n",
"## Natural Language Statistics\n",
"\n",
"Let's see how this works on real data. To get started we load text from H.G. Wells' [Time Machine](http://www.gutenberg.org/ebooks/35). This is a fairly small corpus of just over 30,000 words but for the purpose of what we want to illustrate this is just fine. More realistic document collections contain many billions of words. To begin, we split the document up into words and ignore punctuation and capitalization. While this discards some relevant information, it is useful for computing count statistics in general. Let's see what the first few lines look like."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# tokens: 13 ['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him']\n",
"# tokens: 12 ['was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', 'his', 'grey', 'eyes', 'shone', 'and']\n"
]
}
],
"source": [
"import sys\n",
"sys.path.insert(0, '..')\n",
"\n",
"import collections\n",
"import re\n",
"with open('../data/timemachine.txt', 'r') as f:\n",
" lines = f.readlines()\n",
" raw_dataset = [re.sub('[^A-Za-z]+', ' ', st).lower().split()\n",
" for st in lines]\n",
"\n",
"# Let's read the first 10 lines of the text\n",
"for st in raw_dataset[8:10]:\n",
" print('# tokens:', len(st), st)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we need to insert this into a word counter. This is where the `collections` datastructure comes in handy. It takes care of all the accounting for us."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"frequency of 'traveller': 61\n",
"[('the', 2261), ('i', 1267), ('and', 1245), ('of', 1155), ('a', 816), ('to', 695), ('was', 552), ('in', 541), ('that', 443), ('my', 440)]\n"
]
}
],
"source": [
"counter = collections.Counter([tk for st in raw_dataset for tk in st])\n",
"print(\"frequency of 'traveller':\", counter['traveller'])\n",
"# Print the 10 most frequent words with word frequency count\n",
"print(counter.most_common(10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, the most popular words are actually quite boring to look at. In traditional NLP they're often referred to as [stopwords](https://en.wikipedia.org/wiki/Stop_words) and thus filtered out. That said, they still carry meaning and we will use them nonetheless. However, one thing that is quite clear is that the word frequency decays rather rapidly. The 10th word is less than $\\frac{1}{5}$ as common as the most popular one. To get a better idea we plot the graph of word frequencies."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"