Text classification is a core problem in many applications, such as spam detection, sentiment analysis, or smart replies. In this tutorial, we describe how to build a text classifier with the fastText tool.
The goal of text classification is to assign documents (such as emails, posts, text messages, product reviews, etc.) to one or multiple categories. Such categories can be review scores, spam vs. non-spam, or the language in which the document was written. Nowadays, the dominant approach to building such classifiers is machine learning, that is, learning classification rules from examples. In order to build such classifiers, we need labeled data, which consists of documents and their corresponding categories (or tags, or labels).
As an example, we build a classifier which automatically classifies Stack Exchange questions about cooking into one of several possible tags, such as `pot`, `bowl` or `baking`.
The commands supported by fastText are:

```
  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  nn                      query for nearest neighbors
  analogies               query for analogies
```
In this tutorial, we mainly use the `supervised`, `test` and `predict` subcommands, which correspond to learning (and using) a text classifier. For an introduction to the other functionalities of fastText, please see the [tutorial about learning word vectors](https://fasttext.cc/docs/en/unsupervised-tutorial.html).
As mentioned in the introduction, we need labeled data to train our supervised classifier. In this tutorial, we are interested in building a classifier to automatically recognize the topic of a Stack Exchange question about cooking. Let's download examples of questions from [the cooking section of Stack Exchange](http://cooking.stackexchange.com/), and their associated tags:
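For example (the URL below is the dataset location published on fasttext.cc; substitute the current location if it has moved):

```bash
>> wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz
>> head cooking.stackexchange.txt
```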
Each line of the text file contains a list of labels, followed by the corresponding document. All the labels start with the `__label__` prefix, which is how fastText recognizes what is a label and what is a word. The model is then trained to predict the labels given the words in the document.
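For instance, a line of the file looks like this (an illustrative example of the format):

```
__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
```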
Before training our first classifier, we need to split the data into train and validation. We will use the validation set to evaluate how good the learned classifier is on new data.
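A sketch of this split with standard shell tools (the exact split point is an assumption; the tutorial later refers to roughly 12k training examples and 3,000 validation examples, and the file names below are the ones used in the rest of this tutorial):

```bash
# keep the first ~12k lines for training and the last 3,000 for validation
>> head -n 12404 cooking.stackexchange.txt > cooking.train
>> tail -n 3000 cooking.stackexchange.txt > cooking.valid
```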
The `-input` command line option indicates the file containing the training examples, while the `-output` option indicates where to save the model. At the end of training, a file `model_cooking.bin`, containing the trained classifier, is created in the current directory.
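A minimal training invocation matching this description might look like the following (a sketch; the `-output model_cooking` prefix yields the `model_cooking.bin` file mentioned above):

```bash
>> ./fasttext supervised -input cooking.train -output model_cooking
```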
It is possible to directly test our classifier interactively, by running the command:
```bash
>> ./fasttext predict model_cooking.bin -
```
and then typing a sentence. Let's first try the sentence:
*Which baking dish is best to bake a banana bread ?*
The predicted tag is `baking`, which fits this question well. Let us now try a second example:
*Why not put knives in the dishwasher?*
The label predicted by the model is `food-safety`, which is not relevant. Somehow, the model seems to fail on simple examples. To get a better sense of its quality, let's test it on the validation data by running:
```bash
>> ./fasttext test model_cooking.bin cooking.valid
P@1     0.124
R@1     0.0541
Number of examples: 3000
```
The output of fastText shows the precision at one (`P@1`) and the recall at one (`R@1`). We can also compute the precision at five and recall at five with:
```bash
>> ./fasttext test model_cooking.bin cooking.valid 5
...
R@5     0.146
Number of examples: 3000
```
## Advanced readers: precision and recall
The precision is the number of correct labels among the labels predicted by fastText. The recall is the number of labels that were successfully predicted, among all the real labels. Let's take an example to make this clearer:

*Why not put knives in the dishwasher?*

On Stack Exchange, this sentence is labeled with three tags: `equipment`, `cleaning` and `knives`. The top five labels predicted by the model can be obtained with:
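(A sketch of the command: the same interactive `predict` call as before, with a trailing `5` requesting the top five labels.)

```bash
>> ./fasttext predict model_cooking.bin - 5
```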
The five labels returned for this sentence are `food-safety`, `baking`, `equipment`, `substitutions` and `bread`.
Thus, one out of five labels predicted by the model is correct, giving a precision of 0.20. Out of the three real labels, only one is predicted by the model, giving a recall of 0.33.
## Making the model better
The model obtained by running fastText with the default arguments is pretty bad at classifying new questions. Let's try to improve the performance by changing the default parameters.
### preprocessing the data
Looking at the data, we observe that some words contain uppercase letters or punctuation. One of the first steps to improve the performance of our model is to apply some simple pre-processing. A crude normalization can be obtained using command line tools such as `sed` and `tr`:
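For instance, a possible normalization pipeline followed by a re-split of the normalized file (a sketch; the exact punctuation handling is an illustration, and the split reuses the same line counts as before):

```bash
# pad punctuation with spaces and lowercase everything
>> cat cooking.stackexchange.txt | sed -e "s|\([.\!?,'/()]\)| \1 |g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
# re-split the pre-processed data and retrain
>> head -n 12404 cooking.preprocessed.txt > cooking.train
>> tail -n 3000 cooking.preprocessed.txt > cooking.valid
>> ./fasttext supervised -input cooking.train -output model_cooking
```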
We observe that thanks to the pre-processing, the vocabulary is smaller (from 14k words to 9k). The precision is also starting to go up by 4%!
### more epochs and larger learning rate
By default, fastText sees each training example only five times during training, which is pretty small, given that our training set only has 12k training examples. The number of times each example is seen (also known as the number of epochs) can be increased using the `-epoch` option:
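For example (the epoch value shown is just an illustration within the range suggested later in this tutorial):

```bash
>> ./fasttext supervised -input cooking.train -output model_cooking -epoch 25
```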
```bash
>> ./fasttext test model_cooking.bin cooking.valid
...
R@1     0.218
Number of examples: 3000
```
This is much better! Another way to change the learning speed of our model is to increase (or decrease) the learning rate of the algorithm. This corresponds to how much the model changes after processing each example. A learning rate of 0 would mean that the model does not change at all, and thus, does not learn anything. Good values of the learning rate are in the range `0.1 - 1.0`.
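For example, a run with a higher learning rate might look like this (again, the value is an illustration within the suggested range):

```bash
>> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0
```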
Let us now add a few more features to improve our performance even further!
### word n-grams
Finally, we can improve the performance of a model by using word bigrams, instead of just unigrams. This is especially important for classification problems where word order is important, such as sentiment analysis.
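For example, combining the previous settings with bigrams (the specific hyper-parameter values are illustrative):

```bash
>> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0 -epoch 25 -wordNgrams 2
```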
With a few steps, we were able to go from a precision at one of 12.4% to 59.9%. Important steps included:
* preprocessing the data ;
* changing the number of epochs (using the option `-epoch`, standard range `[5 - 50]`) ;
* changing the learning rate (using the option `-lr`, standard range `[0.1 - 1.0]`) ;
* using word n-grams (using the option `-wordNgrams`, standard range `[1 - 5]`).
## Advanced readers: What is a Bigram?
A 'unigram' refers to a single undivided unit, or token, usually used as an input to a model. For example, a unigram can be a word or a letter depending on the model. In fastText, we work at the word level and thus unigrams are words.
Similarly, we denote by 'bigram' the concatenation of 2 consecutive tokens or words. More generally, we talk about n-grams to refer to the concatenation of any n consecutive tokens.
For example, in the sentence, 'Last donut of the night', the unigrams are 'last', 'donut', 'of', 'the' and 'night'. The bigrams are: 'Last donut', 'donut of', 'of the' and 'the night'.
Bigrams are particularly interesting because, for most sentences, you can reconstruct the order of the words just by looking at a bag of n-grams.
Let us illustrate this with a simple exercise: given the following bigrams, try to reconstruct the original sentence: 'all out', 'I am', 'of bubblegum', 'out of' and 'am all'.
## Scaling things up
Since we are training our model on a few thousand examples, the training only takes a few seconds. But training models on larger datasets, with more labels, can start to be too slow. A potential solution to make the training faster is to use the hierarchical softmax, instead of the regular softmax. The hierarchical softmax organizes the labels as the leaves of a binary tree and obtains the probability of a label from the binary decisions along the path from the root to that leaf, which reduces the cost of computing a label's probability from linear to logarithmic in the number of labels, at a small cost in accuracy. This can be done with the option `-loss hs`:
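For example, carrying over the hyper-parameters from the previous runs (the values are illustrative):

```bash
>> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0 -epoch 25 -wordNgrams 2 -loss hs
```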