Commit 1d7a0f9d, authored by tink2123

update doc for rec and training

Parent ef9101b1
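## Text Recognition
# Text Recognition
This document is an end-to-end guide to the PaddleOCR text recognition task, covering data preparation, model training, tuning, evaluation, and prediction, with details on each stage: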
......@@ -19,7 +19,7 @@
<a name="数据准备"></a>
### 1. Data Preparation
## 1. Data Preparation
PaddleOCR supports two data formats:
......@@ -36,7 +36,7 @@ mklink /d <path/to/paddle_ocr>/train_data/dataset <path/to/dataset>
```
<a name="准备数据集"></a>
#### 1.1 Custom Dataset
### 1.1 Custom Dataset
The following uses a general dataset as an example to show how to prepare a dataset:
* Training set
......@@ -82,7 +82,7 @@ train_data/rec/train/word_002.jpg 用科技让复杂的世界更简单
<a name="数据下载"></a>
#### 1.2 Dataset Download
### 1.2 Dataset Download
- ICDAR2015
......@@ -114,7 +114,7 @@ python gen_label.py --mode="rec" --input_path="{path/of/origin/label}" --output_
<a name="字典"></a>
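#### 1.3 Dictionary
### 1.3 Dictionary
Finally, a dictionary ({word_dict_name}.txt) needs to be provided so that, during training, every character that appears can be mapped to its index in the dictionary.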
......@@ -161,16 +161,16 @@ PaddleOCR内置了一部分字典,可以按需使用。
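and set `character_type` to `ch`
<a name="支持空格"></a>
#### 1.4 Add Space Category
### 1.4 Add Space Category
If you want to support recognition of the "space" category, set the `use_space_char` field in the yml file to `True`
<a name="启动训练"></a>
### 2. Training
## 2. Training
<a name="数据增强"></a>
#### 2.1 Data Augmentation
### 2.1 Data Augmentation
PaddleOCR provides a variety of data augmentation methods; the default configuration files already include data augmentation.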
......@@ -181,7 +181,7 @@ PaddleOCR提供了多种数据增强方式,默认配置文件中已经添加
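*Due to OpenCV compatibility issues, the perturbation operations currently support Linux only*
<a name="通用模型训练"></a>
#### 2.2 General Model Training
### 2.2 General Model Training
PaddleOCR provides training, evaluation, and prediction scripts. This section uses the CRNN recognition model as an example: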
......@@ -300,7 +300,7 @@ Eval:
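**Note: the configuration file used for prediction/evaluation must be the same as the one used for training.**
<a name="多语言模型训练"></a>
#### 2.3 Multi-language Model Training
### 2.3 Multi-language Model Training
PaddleOCR currently supports recognition of 80 languages (in addition to Chinese). A multi-language configuration template is provided under `configs/rec/multi_languages`: [rec_multi_language_lite_train.yml](../../configs/rec/multi_language/rec_multi_language_lite_train.yml)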
......@@ -356,7 +356,7 @@ Eval:
...
```
<a name="评估"></a>
### 3 Evaluation
## 3 Evaluation
The evaluation dataset can be set by modifying `label_file_path` under Eval in `configs/rec/rec_icdar15_train.yml`.
......@@ -366,7 +366,7 @@ python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c configs/rec/rec
```
<a name="预测"></a>
### 4 Prediction
## 4 Prediction
With a model trained by PaddleOCR, you can run a quick prediction with the following script.
......
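## Model Training
# Model Training
This article introduces the basic concepts you need to understand for model training, along with methods for tuning during training.
It also briefly describes the composition of the PaddleOCR training data and how to prepare data to fine-tune a model for a vertical (domain-specific) scenario.
### 1. Basic Concepts
- [1. Basic Concepts](#基本概念)
* [1.1 Learning Rate](#学习率)
* [1.2 Regularization](#正则化)
* [1.3 Evaluation Metrics](#评估指标)
- [2. FAQ](#常见问题)
- [3. Data and Vertical Scenarios](#数据与垂类场景)
* [3.1 Training Data](#训练数据)
* [3.2 Vertical Scenarios](#垂类场景)
* [3.3 Building Your Own Dataset](#自己构建数据集)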
<a name="基本概念"></a>
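## 1. Basic Concepts
OCR (Optical Character Recognition) is the process of analyzing and recognizing images to obtain text and layout information. It is a typical computer vision task
and usually consists of two subtasks: text detection and text recognition.
Pay attention to the following parameters when tuning the model:
#### 1.1 Learning Rate
<a name="学习率"></a>
### 1.1 Learning Rate
The learning rate is one of the most important hyperparameters for training neural networks. It sets the step size with which the parameters move along the gradient toward the optimum of the loss function at each iteration.
PaddleOCR provides several learning rate update strategies, which can be changed in the configuration file, for example: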
......@@ -29,8 +41,8 @@ Optimizer:
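Piecewise stands for piecewise constant decay: a different learning rate is specified for each stage of training, and the learning rate is constant within each stage.
warmup_epoch means that during the first 5 epochs the learning rate gradually increases from 0 to base_lr. For all strategies, see the code in [learning_rate.py](../../ppocr/optimizer/learning_rate.py)
#### 1.2 Regularization
<a name="正则化"></a>
### 1.2 Regularization
Regularization helps prevent overfitting. PaddleOCR provides L1 and L2 regularization, the two most commonly used methods. L1 regularization adds a regularization term to the objective function to reduce the sum of the absolute values of the parameters, while L2 regularization adds a term to reduce the sum of the squares of the parameters. It is configured as follows: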
......@@ -42,8 +54,8 @@ Optimizer:
factor: 2.0e-05
```
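#### 1.3 Evaluation Metrics:
<a name="评估指标"></a>
### 1.3 Evaluation Metrics
(1) Detection stage: detections are first evaluated by the IOU between each detection box and the labeled box; a detection is counted as correct if the IOU exceeds a given threshold. Unlike boxes in general object detection, the detection and labeled boxes here are represented as polygons. Detection accuracy: the proportion of correct detection boxes among all detection boxes, mainly an indicator of false detections. Detection recall: the proportion of correct detection boxes among all labeled boxes, mainly an indicator of missed detections.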
......@@ -51,35 +63,34 @@ Optimizer:
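(3) End-to-end statistics: End-to-end recall: the proportion of text lines that are accurately detected and correctly recognized among all labeled text lines. End-to-end accuracy: the proportion of text lines that are accurately detected and correctly recognized among all detected text lines. A detection is accurate when the IOU between the detection box and the labeled box exceeds a given threshold; recognition is correct when the text in the detection box matches the labeled text.
### 2. FAQ
**Q**: What are the text detection methods based on deep learning? What are the advantages and disadvantages of each?
<a name="常见问题"></a>
## 2. FAQ
A: Commonly used deep learning-based text detection methods can generally be divided into two categories, regression-based and segmentation-based, and there are also methods that combine the two.
(1) Regression-based methods are divided into box regression and pixel value regression. a. Box regression methods mainly include CTPN, the Textbox series, and EAST. These algorithms work well on regularly shaped text but cannot accurately detect irregularly shaped text. b. Pixel value regression methods mainly include CRAFT and SA-Text. These algorithms can detect curved text and perform very well on small text, but their real-time performance is insufficient.
**Q**: How do I choose a suitable network input shape when training a CRNN recognition model?
(2) Segmentation-based algorithms, such as PSENet, are not limited by text shape and achieve good results on text of all shapes, but their post-processing is often complex and therefore time-consuming. Some algorithms target this problem specifically, such as DB, which approximates binarization to make it differentiable and integrates it into training, yielding more accurate boundaries and greatly reducing post-processing time.
A: The height is generally set to 32. There are two ways to choose the maximum width:
**Q**: For Chinese text line recognition, which is better, CTC or Attention?
(1) Compute the distribution of aspect ratios of the training sample images. Choose the maximum aspect ratio so that it covers 80% of the training samples.
A:
(1) In terms of accuracy, CTC outperforms Attention in general OCR scenarios because the dictionary of characters to be recognized is large (more than 3,000 commonly used Chinese characters). With insufficient training samples, it is difficult to learn the sequence relationships among these characters, so the advantages of the Attention model do not show in Chinese scenarios. Moreover, Attention is suited to short text and performs relatively poorly on long sentences.
(2) Count the number of characters in the training samples. Choose the maximum number of characters so that it covers 80% of the training samples. Then, taking the aspect ratio of Chinese characters as approximately 1 and that of English characters as 3:1, estimate the maximum width.
(2) In terms of training and prediction speed, Attention's serial decoding structure limits prediction speed, while the CTC network structure is more efficient and has an advantage in prediction speed.
**Q**: During recognition training, the training set accuracy has reached 90, but the validation set accuracy stays at 70. What should I do?
**Q**: How do I choose a suitable network input shape when training a CRNN recognition model?
A: A training set accuracy of 90 with a test set accuracy a little over 70 indicates overfitting. There are two methods to try:
A: The height is generally set to 32. There are two ways to choose the maximum width:
(1) Add more augmentation methods or increase the augmentation [probability](https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppocr/data/imaug/rec_img_aug.py#L341), which defaults to 0.4.
(1) Compute the distribution of aspect ratios of the training sample images. Choose the maximum aspect ratio so that it covers 80% of the training samples.
(2) Increase the [L2 decay value](https://github.com/PaddlePaddle/PaddleOCR/blob/a501603d54ff5513fc4fc760319472e59da25424/configs/rec/ch_ppocr_v1.1/rec_chinese_lite_train_v1.1.yml#L47)
(2) Count the number of characters in the training samples. Choose the maximum number of characters so that it covers 80% of the training samples. Then, taking the aspect ratio of Chinese characters as approximately 1 and that of English characters as 3:1, estimate the maximum width.
**Q**: When training a recognition model, the loss decreases normally, but acc stays at 0
A: It is normal for acc to be 0 early in recognition model training; the metric will rise after training for a while longer.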
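### 3. Data and Vertical Scenarios
<a name="数据与垂类场景"></a>
## 3. Data and Vertical Scenarios
#### 3.1 Training Data:
<a name="训练数据"></a>
### 3.1 Training Data
The datasets used for the current open-source models, and their sizes, are as follows:
- Detection: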
......@@ -93,13 +104,14 @@ Optimizer:
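Among them, the public datasets are all open source; users can search for and download them, or refer to [Chinese datasets](./datasets.md). The synthetic data is not open source for now; users can generate their own with open-source synthesis tools such as [text_renderer](https://github.com/Sanster/text_renderer), [SynthText](https://github.com/ankush-me/SynthText), and [TextRecognitionDataGenerator](https://github.com/Belval/TextRecognitionDataGenerator).
#### 3.2 Vertical Scenarios
<a name="垂类场景"></a>
### 3.2 Vertical Scenarios
PaddleOCR mainly focuses on general OCR. If you have vertical (domain-specific) requirements, you can train your own model with PaddleOCR plus vertical data;
if you lack labeled data or do not want to invest in R&D, it is recommended to call an open API directly; open APIs cover some of the more common vertical categories.
#### 3.3 Building Your Own Dataset
<a name="自己构建数据集"></a>
### 3.3 Building Your Own Dataset
Some lessons learned that may help when building a dataset: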
......@@ -113,4 +125,4 @@ PaddleOCR主要聚焦通用OCR,如果有垂类需求,您可以用PaddleOCR+
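a. Collect more training data manually; this is the most direct and most effective approach.
b. Apply basic image processing or transformations with PIL and opencv. For example, use PIL's ImageFont, Image, and ImageDraw modules to draw text onto a background, or opencv's rotation and affine transformations, Gaussian filtering, and so on.
c. Synthesize data with data generation algorithms such as pix2pix.
c. Synthesize data with data generation algorithms such as pix2pix or StyleText.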
## TEXT RECOGNITION
# TEXT RECOGNITION
- [1 DATA PREPARATION](#DATA_PREPARATION)
- [1.1 Custom Dataset](#Costom_Dataset)
......@@ -17,7 +17,7 @@
- [4.1 Training engine prediction](#Training_engine_prediction)
<a name="DATA_PREPARATION"></a>
### DATA PREPARATION
## 1 DATA PREPARATION
PaddleOCR supports two data formats:
......@@ -36,7 +36,7 @@ mklink /d <path/to/paddle_ocr>/train_data/dataset <path/to/dataset>
```
<a name="Costom_Dataset"></a>
#### 1.1 Custom dataset
### 1.1 Custom dataset
If you want to use your own data for training, organize it as described below.
......@@ -84,7 +84,7 @@ Similar to the training set, the test set also needs to be provided a folder con
```
<a name="Dataset_download"></a>
#### 1.2 Dataset download
### 1.2 Dataset download
- ICDAR2015
......@@ -121,7 +121,7 @@ The multi-language model training method is the same as the Chinese model. The t
<a name="Dictionary"></a>
#### 1.3 Dictionary
### 1.3 Dictionary
Finally, a dictionary ({word_dict_name}.txt) needs to be provided so that when the model is trained, all the characters that appear can be mapped to the dictionary index.
......@@ -166,17 +166,17 @@ To customize the dict file, please modify the `character_dict_path` field in `co
To use a custom dictionary file, add the `character_dict_path` field to `configs/rec/rec_icdar15_train.yml`, point it to your dictionary path, and set `character_type` to `ch`.
<a name="Add_space_category"></a>
#### 1.4 Add space category
### 1.4 Add space category
If you want to support the recognition of the `space` category, please set the `use_space_char` field in the yml file to `True`.
**Note: use_space_char only takes effect when character_type=ch**
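As a rough illustration of what "mapping characters to the dictionary index" means, the sketch below builds such a mapping in plain Python. It assumes the dictionary file holds one character per line; `build_char_index` and `encode` are hypothetical helper names, not PaddleOCR APIs, and a real label encoder usually reserves extra indices (for example a CTC blank), which is omitted here.

```python
def build_char_index(dict_lines, use_space_char=False):
    """Map each dictionary character to its position in the dictionary.

    dict_lines: iterable of lines, assumed to hold one character per line.
    use_space_char: if True, append the space character as an extra class.
    """
    chars = [line.rstrip("\n") for line in dict_lines]
    chars = [c for c in chars if c != ""]  # skip empty lines
    if use_space_char:
        chars.append(" ")
    return {c: i for i, c in enumerate(chars)}


def encode(text, char_to_index):
    """Convert a label string to indices, skipping characters not in the dictionary."""
    return [char_to_index[c] for c in text if c in char_to_index]


# Hypothetical four-character dictionary, for illustration only.
mapping = build_char_index(["l\n", "d\n", "a\n", "r\n"], use_space_char=True)
indices = encode("lard la", mapping)
```

With `use_space_char=False`, the space in `"lard la"` would simply be dropped from the encoded label, which is why the flag matters for text that contains spaces.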
<a name="TRAINING"></a>
### 2 TRAINING
## 2 TRAINING
<a name="Data_Augmentation"></a>
#### 2.1 Data Augmentation
### 2.1 Data Augmentation
PaddleOCR provides a variety of data augmentation methods. All the augmentation methods are enabled by default.
......@@ -185,7 +185,7 @@ The default perturbation methods are: cvtColor, blur, jitter, Gasuss noise, rand
Each perturbation method is applied with a probability of 40% during training. For the implementation, see: [rec_img_aug.py](../../ppocr/data/imaug/rec_img_aug.py)
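The idea of applying each perturbation independently with a fixed probability can be sketched as follows. This is a simplified illustration, not the code in `rec_img_aug.py`: `apply_augmentations` is a hypothetical helper, and the "augmentations" in the usage lines are toy functions on numbers rather than image operations.

```python
import random


def apply_augmentations(sample, augmentations, prob=0.4, rng=None):
    """Apply each (name, function) pair to the sample with probability `prob`.

    Returns the transformed sample and the names of the steps that were applied.
    """
    rng = rng or random.Random()
    applied = []
    for name, fn in augmentations:
        # Each perturbation is drawn independently of the others.
        if rng.random() < prob:
            sample = fn(sample)
            applied.append(name)
    return sample, applied


# Toy stand-ins for image perturbations (assumption: real ones operate on arrays).
toy_augmentations = [("double", lambda x: x * 2), ("increment", lambda x: x + 1)]
result, applied_steps = apply_augmentations(3, toy_augmentations, prob=0.4,
                                            rng=random.Random(0))
```

Raising `prob` makes each training sample more likely to be perturbed, which is the knob referred to when stronger augmentation is suggested as a remedy for overfitting.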
<a name="Training"></a>
#### 2.2 General Training
### 2.2 General Training
PaddleOCR provides training scripts, evaluation scripts, and prediction scripts. In this section, the CRNN recognition model will be used as an example:
......@@ -304,7 +304,7 @@ Eval:
**Note that the configuration file for prediction/evaluation must be consistent with the training.**
<a name="Multi_language"></a>
#### 2.3 Multi-language Training
### 2.3 Multi-language Training
Currently, the multi-language algorithms supported by PaddleOCR are:
......@@ -361,7 +361,7 @@ Eval:
```
<a name="EVALUATION"></a>
### 3 EVALUATION
## 3 EVALUATION
The evaluation dataset can be set by modifying the `Eval.dataset.label_file_list` field in the `configs/rec/rec_icdar15_train.yml` file.
......@@ -371,7 +371,7 @@ python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c configs/rec/rec
```
<a name="PREDICTION"></a>
### 4 PREDICTION
## 4 PREDICTION
With a model trained by PaddleOCR, you can run a quick prediction with the following script.
......
## MODEL TRAINING
# MODEL TRAINING
- [1. Basic concepts](#1-basic-concepts)
* [1.1 Learning rate](#11-learning-rate)
* [1.2 Regularization](#12-regularization)
* [1.3 Evaluation indicators](#13-evaluation-indicators-)
- [2. FAQ](#2-faq)
- [3. Data and vertical scenes](#3-data-and-vertical-scenes)
* [3.1 Training data](#31-training-data)
* [3.2 Vertical scene](#32-vertical-scene)
* [3.3 Build your own data set](#33-build-your-own-data-set)
This article introduces the basic concepts you need to understand for model training, along with methods for tuning during training.
It also briefly describes the composition of the PaddleOCR training data and how to prepare data to fine-tune a model for a vertical (domain-specific) scenario.
### 1. Basic concepts
<a name="1-basic-concepts"></a>
# 1. Basic concepts
OCR (Optical Character Recognition) refers to the process of analyzing and recognizing images to obtain text and layout information. It is a typical computer vision task.
It usually consists of two subtasks: text detection and text recognition.
Pay attention to the following parameters when tuning the model:
#### 1.1 Learning rate
<a name="11-learning-rate"></a>
## 1.1 Learning rate
The learning rate is one of the most important hyperparameters for training neural networks. It sets the step size with which the parameters move along the gradient toward the optimum of the loss function at each iteration.
PaddleOCR provides several learning rate update strategies, which can be changed in the configuration file, for example:
......@@ -31,7 +44,8 @@ and the learning rate is the same in each stage.
warmup_epoch means that in the first 5 epochs, the learning rate will gradually increase from 0 to base_lr. For all strategies, please refer to the code [learning_rate.py](../../ppocr/optimizer/learning_rate.py).
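To make the combination of warmup and piecewise constant decay concrete, here is a minimal sketch of the schedule as a pure function of the epoch. It assumes a linear warmup (the text only says the rate "gradually" increases) and uses hypothetical boundary and value settings; `piecewise_lr` is an illustrative name, not a function from `learning_rate.py`.

```python
from bisect import bisect_right


def piecewise_lr(epoch, base_lr, boundaries, values, warmup_epoch=5):
    """Learning rate at a given epoch.

    boundaries: sorted epochs at which the rate changes, e.g. [30, 60].
    values: one rate per stage, so len(values) == len(boundaries) + 1.
    During the first `warmup_epoch` epochs the rate rises linearly
    from 0 to base_lr (assumed shape of the warmup).
    """
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have exactly one more entry than boundaries")
    if epoch < warmup_epoch:
        return base_lr * epoch / warmup_epoch
    # bisect_right picks the stage whose interval contains this epoch.
    return values[bisect_right(boundaries, epoch)]


# Hypothetical settings, for illustration only.
schedule = [piecewise_lr(e, 0.001, [30, 60], [0.001, 0.0001, 0.00001])
            for e in range(0, 90, 10)]
```

Within each stage the returned value is constant, and it only changes when the epoch crosses one of the boundaries.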
#### 1.2 Regularization
<a name="12-regularization"></a>
## 1.2 Regularization
Regularization helps prevent overfitting. PaddleOCR provides L1 and L2 regularization,
the two most commonly used regularization methods.
......@@ -46,8 +60,8 @@ Optimizer:
name: L2
factor: 2.0e-05
```
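For intuition about what the `factor` setting scales, the sketch below computes the two penalty terms for a flat list of weights: L1 penalizes the sum of absolute values and L2 the sum of squares. The function names are hypothetical and not PaddleOCR APIs, and frameworks differ in constant factors (some scale the L2 term by 1/2), so the numbers are illustrative only.

```python
def l1_penalty(weights, factor):
    """L1 term: factor times the sum of absolute parameter values."""
    return factor * sum(abs(w) for w in weights)


def l2_penalty(weights, factor):
    """L2 term: factor times the sum of squared parameter values."""
    return factor * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, factor, kind="L2"):
    """Add the chosen penalty to the data loss (assumed additive objective)."""
    penalty = l2_penalty(weights, factor) if kind == "L2" else l1_penalty(weights, factor)
    return data_loss + penalty


# Toy weights; 2.0e-05 matches the factor shown in the config above.
total = regularized_loss(1.25, [1.0, -2.0, 3.0], factor=2.0e-05, kind="L2")
```

A larger `factor` pushes the optimizer harder toward small weights, which is why increasing it is one of the usual responses to overfitting.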
#### 1.3 Evaluation indicators:
<a name="13-evaluation-indicators-"></a>
## 1.3 Evaluation indicators
(1) Detection stage: detections are first evaluated by the IOU between each detection box and the labeled box; a detection is counted as correct if the IOU exceeds a given threshold. Unlike boxes in general object detection, the detection and labeled boxes here are represented as polygons. Detection accuracy: the proportion of correct detection boxes among all detection boxes, mainly an indicator of false detections. Detection recall: the proportion of correct detection boxes among all labeled boxes, mainly an indicator of missed detections.
......@@ -55,25 +69,8 @@ Optimizer:
(3) End-to-end statistics: End-to-end recall: the proportion of text lines that are accurately detected and correctly recognized among all labeled text lines. End-to-end accuracy: the proportion of text lines that are accurately detected and correctly recognized among all detected text lines. A detection is accurate when the IOU between the detection box and the labeled box exceeds a given threshold; recognition is correct when the text in the detection box matches the labeled text.
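The detection metrics can be sketched from a precomputed IOU table. This is a simplified illustration under stated assumptions: the polygon IOU values are taken as given, the threshold of 0.5 is a placeholder for "a certain threshold", and detections are matched to labeled boxes greedily, one labeled box per detection. `detection_metrics` is a hypothetical name, not the PaddleOCR evaluation code.

```python
def detection_metrics(iou, num_gt, threshold=0.5):
    """Return (accuracy, recall) for one image.

    iou[i][j] is the IOU between detection box i and labeled box j.
    num_gt is the number of labeled boxes (needed when there are no detections).
    """
    matched_gt = set()
    true_positives = 0
    for row in iou:
        best_j, best_iou = None, threshold
        for j, value in enumerate(row):
            # A labeled box can be claimed by at most one detection.
            if j not in matched_gt and value > best_iou:
                best_j, best_iou = j, value
        if best_j is not None:
            matched_gt.add(best_j)
            true_positives += 1
    accuracy = true_positives / len(iou) if iou else 0.0
    recall = true_positives / num_gt if num_gt else 0.0
    return accuracy, recall


# Three detections against two labeled boxes (made-up IOU values).
accuracy, recall = detection_metrics(
    [[0.8, 0.1], [0.2, 0.3], [0.0, 0.6]], num_gt=2)
```

For the end-to-end variant, a matched pair would additionally need its recognized text to equal the labeled text before it counts as a true positive.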
### 2. FAQ
**Q**: What are the text detection methods based on deep learning? What are the advantages and disadvantages of each?
A: Commonly used deep learning-based text detection methods can generally be divided into two categories: regression-based and segmentation-based, and of course there are some methods that combine the two.
(1) Regression-based methods are divided into box regression and pixel value regression. a. Box regression methods mainly include CTPN, the Textbox series, and EAST. These algorithms work well on regularly shaped text but cannot accurately detect irregularly shaped text. b. Pixel value regression methods mainly include CRAFT and SA-Text. These algorithms can detect curved text and perform very well on small text, but their real-time performance is insufficient.
(2) Segmentation-based algorithms, such as PSENet, are not limited by text shape and achieve good results on text of all shapes, but their post-processing is often complex and therefore time-consuming. Some algorithms target this problem specifically, such as DB, which approximates binarization to make it differentiable and integrates it into training, yielding more accurate boundaries and greatly reducing post-processing time.
**Q**: For Chinese line text recognition, which is better, CTC or Attention?
A:
(1) In terms of accuracy, CTC outperforms Attention in general OCR scenarios because the dictionary of characters to be recognized is large (more than 3,000 commonly used Chinese characters). With insufficient training samples, it is difficult to learn the sequence relationships among these characters, so the advantages of the Attention model do not show in Chinese scenarios. Moreover, Attention is suited to short text and performs relatively poorly on long sentences.
(2) In terms of training and prediction speed, Attention's serial decoding structure limits the prediction speed, while the CTC network structure is more efficient and has an advantage in prediction speed.
<a name="2-faq"></a>
# 2. FAQ
**Q**: How to choose a suitable network input shape when training CRNN recognition?
......@@ -83,11 +80,23 @@ Optimizer:
(2) Count the number of characters in the training samples. Choose the maximum number of characters so that it covers 80% of the training samples. Then, taking the aspect ratio of Chinese characters as approximately 1 and that of English characters as 3:1, estimate the maximum width.
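Both width-selection methods reduce to taking a value that covers 80% of the training samples and scaling it by the input height. The sketch below shows one way to do that with a nearest-rank percentile; the helper names are hypothetical, and `char_aspect` (width of one character relative to the image height) is an assumed parameter that you would set per language.

```python
import math


def covering_value(values, coverage_percent=80):
    """Smallest sample value that is >= at least coverage_percent of the samples."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, (coverage_percent * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def max_width_from_sizes(sizes, height=32, coverage_percent=80):
    """Method 1: sizes is a list of (width, height) pairs of training images."""
    ratios = [w / h for w, h in sizes]
    return math.ceil(height * covering_value(ratios, coverage_percent))


def max_width_from_texts(texts, height=32, coverage_percent=80, char_aspect=1.0):
    """Method 2: estimate the width from label lengths and a per-character aspect."""
    lengths = [len(t) for t in texts]
    return math.ceil(height * char_aspect * covering_value(lengths, coverage_percent))


# Made-up sample statistics, for illustration only.
width_a = max_width_from_sizes([(100, 32), (200, 32), (320, 32), (640, 32), (64, 32)])
width_b = max_width_from_texts(["ab", "abcd", "abcdefgh", "a", "abc"])
```

Samples wider than the chosen maximum would still need to be resized or filtered, so the 80% figure is a trade-off between covering long text and wasting computation on padding.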
**Q**: During recognition training, the training set accuracy has reached 90, but the validation set accuracy stays at 70. What should I do?
A: A training set accuracy of 90 with a test set accuracy a little over 70 indicates overfitting. There are two methods to try:
(1) Add more augmentation methods or increase the augmentation [probability](https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppocr/data/imaug/rec_img_aug.py#L341), which defaults to 0.4.
(2) Increase the [L2 decay value](https://github.com/PaddlePaddle/PaddleOCR/blob/a501603d54ff5513fc4fc760319472e59da25424/configs/rec/ch_ppocr_v1.1/rec_chinese_lite_train_v1.1.yml#L47)
### 3. Data and vertical scenes
**Q**: When training a recognition model, the loss decreases normally, but acc stays at 0
#### 3.1 Training data
A: It is normal for acc to be 0 early in recognition model training; the metric will rise after training for a while longer.
<a name="3-data-and-vertical-scenes"></a>
# 3. Data and vertical scenes
<a name="31-training-data"></a>
## 3.1 Training data
The datasets used for the current open-source models, and their sizes, are as follows:
......@@ -102,14 +111,14 @@ The current open source models, data sets and magnitudes are as follows:
Among them, the public datasets are all open source; users can search for and download them, or refer to [Chinese data set](./datasets.md). The synthetic data is not open source for now; users can generate their own with open-source synthesis tools such as [text_renderer](https://github.com/Sanster/text_renderer), [SynthText](https://github.com/ankush-me/SynthText), and [TextRecognitionDataGenerator](https://github.com/Belval/TextRecognitionDataGenerator).
#### 3.2 Vertical scene
<a name="32-vertical-scene"></a>
## 3.2 Vertical scene
PaddleOCR mainly focuses on general OCR. If you have vertical (domain-specific) requirements, you can train your own model with PaddleOCR plus vertical data;
if you lack labeled data or do not want to invest in R&D, it is recommended to call an open API directly; open APIs cover some of the more common vertical categories.
#### 3.3 Build your own data set
<a name="33-build-your-own-data-set"></a>
## 3.3 Build your own data set
Some lessons learned that may help when building a dataset:
......