data-quality.md 45.2 KB
Newer Older
1
# Data Quality
2

3 4 5
## Introduction

The data quality task is used to check the data accuracy during the integration and processing of data. Data quality tasks in this release include single-table checking, single-table custom SQL checking, multi-table accuracy, and two-table value comparisons. The running environment of the data quality task is Spark 2.4.0, and other versions have not been verified, and users can verify by themselves.
6 7

The execution logic of the data quality task is as follows:
8

9 10
- The user defines the task in the interface, and the user input value is stored in `TaskParam`.
- When running a task, `Master` will parse `TaskParam`, encapsulate the parameters required by `DataQualityTask` and send it to `Worker`.
11
- Worker runs the data quality task. After the data quality task finishes running, it writes the statistical results to the specified storage engine.
12
- The current data quality task result is stored in the `t_ds_dq_execute_result` table of `dolphinscheduler`
13
  `Worker` sends the task result to `Master`, after `Master` receives `TaskResponse`, it will judge whether the task type is `DataQualityTask`, if so, it will read the corresponding result from `t_ds_dq_execute_result` according to `taskInstanceId`, and then The result is judged according to the check mode, operator and threshold configured by the user.
14 15
- If the result is a failure, the corresponding operation, alarm or interruption will be performed according to the failure policy configured by the user.
- Add config : `<server-name>/conf/common.properties`
16 17

```properties
18
# Change to specific version if you not use dev branch
19 20 21
data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
```

22 23 24 25 26 27 28 29 30
- Please fill in `data-quality.jar.name` according to the actual package name.
- If you package `data-quality` separately, remember to modify the package name to be consistent with `data-quality.jar.name`.
- If the old version is upgraded and used, you need to execute the `sql` update script to initialize the database before running.
- If you want to use `MySQL` data, you need to comment out the `scope` of `MySQL` in `pom.xml`.
- Currently only `MySQL`, `PostgreSQL` and `HIVE` data sources have been tested, other data sources have not been tested yet.
- `Spark` needs to be configured to read `Hive` metadata, `Spark` does not use `jdbc` to read `Hive`.

## Detailed Inspection Logic

31 32 33 34 35
| **Parameter** |                                                                                 **Description**                                                                                  |
|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| CheckMethod   | [CheckFormula][Operator][Threshold], if the result is true, it indicates that the data does not meet expectations, and the failure strategy is executed.                         |
| CheckFormula  | <ul><li>Expected-Actual</li><li>Actual-Expected</li><li>(Actual/Expected)x100%</li><li>(Expected-Actual)/Expected x100%</li></ul>                                                |
| Operator      | =, >, >=, <, <=, !=                                                                                                                                                              |
36
| ExpectedValue | <ul><li>FixValue</li><li>DailyAvg</li><li>WeeklyAvg</li><li>MonthlyAvg</li><li>Last7DayAvg</li><li>Last30DayAvg</li><li>SrcTableTotalRows</li><li>TargetTableTotalRows</li></ul> |
37 38
| Example       | <ul><li>CheckFormula:Expected-Actual</li><li>Operator:></li><li>Threshold:0</li><li>ExpectedValue:FixValue=9</li></ul>                                                           |

39 40 41 42 43
In the example, assuming that the actual value is 10, the operator is >, and the expected value is 9, then the result 10 -9 > 0 is true, which means that the row data in the empty column has exceeded the threshold, and the task is judged to fail.

# Task Operation Guide

## Null Value Check for Single Table Check
44

45
### Inspection Introduction
46

47
The goal of the null value check is to check the number of empty rows in the specified column. The number of empty rows can be compared with the total number of rows or a specified threshold. If it is greater than a certain threshold, it will be judged as failure.
48 49

- The SQL statement that calculates the null of the specified column is as follows:
50 51 52 53 54 55 56 57 58 59

  ```sql
  SELECT COUNT(*) AS miss FROM ${src_table} WHERE (${src_field} is null or ${src_field} = '') AND (${src_filter})
  ```
- The SQL to calculate the total number of rows in the table is as follows:

  ```sql
  SELECT COUNT(*) AS total FROM ${src_table} WHERE (${src_filter})
  ```

60 61
### Interface Operation Guide

62
![dataquality_null_check](../../../img/tasks/demo/null_check.png)
63

64 65 66 67 68 69 70 71 72 73 74 75
|     **Parameter**      |                                                                                                                **Description**                                                                                                                |
|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Source data type       | Select MySQL, PostgreSQL, etc.                                                                                                                                                                                                                |
| Source data source     | The corresponding data source under the source data type.                                                                                                                                                                                     |
| Source data table      | Drop-down to select the table where the validation data is located.                                                                                                                                                                           |
| Src filter conditions  | Such as the title, it will also be used when counting the total number of rows in the table, optional.                                                                                                                                        |
| Src table check column | Drop-down to select the check column name.                                                                                                                                                                                                    |
| Check method           | <ul><li>[Expected-Actual]</li><li>[Actual-Expected]</li><li>[Actual/Expected]x100%</li><li>[(Expected-Actual)/Expected]x100%</li></ul>                                                                                                        |
| Check operators        | =, >, >=, <, <=, ! =                                                                                                                                                                                                                          |
| Threshold              | The value used in the formula for comparison.                                                                                                                                                                                                 |
| Failure strategy       | <ul><li>Alert: The data quality task failed, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alarm is sent.</li></ul> |
| Expected value type    | Select the desired type from the drop-down menu.                                                                                                                                                                                              |
76 77

## Timeliness Check of Single Table Check
78

79
### Inspection Introduction
80

81 82 83 84
The timeliness check is used to check whether the data is processed within the expected time. The start time and end time can be specified to define the time range. If the amount of data within the time range does not reach the set threshold, the check task will be judged as fail.

### Interface Operation Guide

85
![dataquality_timeliness_check](../../../img/tasks/demo/timeliness_check.png)
86

87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
|     **Parameter**      |                                                                                                                **Description**                                                                                                                |
|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Source data type       | Select MySQL, PostgreSQL, etc.                                                                                                                                                                                                                |
| Source data source     | The corresponding data source under the source data type.                                                                                                                                                                                     |
| Source data table      | Drop-down to select the table where the validation data is located.                                                                                                                                                                           |
| Src filter conditions  | Such as the title, it will also be used when counting the total number of rows in the table, optional.                                                                                                                                        |
| Src table check column | Drop-down to select check column name.                                                                                                                                                                                                        |
| Start time             | The start time of a time range.                                                                                                                                                                                                               |
| end time               | The end time of a time range.                                                                                                                                                                                                                 |
| Time Format            | Set the corresponding time format.                                                                                                                                                                                                            |
| Check method           | <ul><li>[Expected-Actual]</li><li>[Actual-Expected]</li><li>[Actual/Expected]x100%</li><li>[(Expected-Actual)/Expected]x100%</li></ul>                                                                                                        |
| Check operators        | =, >, >=, <, <=, ! =                                                                                                                                                                                                                          |
| Threshold              | The value used in the formula for comparison.                                                                                                                                                                                                 |
| Failure strategy       | <ul><li>Alert: The data quality task failed, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alarm is sent.</li></ul> |
| Expected value type    | Select the desired type from the drop-down menu.                                                                                                                                                                                              |
102 103 104 105

## Field Length Check for Single Table Check

### Inspection Introduction
106

107 108 109 110
The goal of field length verification is to check whether the length of the selected field meets the expectations. If there is data that does not meet the requirements, and the number of rows exceeds the threshold, the task will be judged to fail.

### Interface Operation Guide

111
![dataquality_length_check](../../../img/tasks/demo/field_length_check.png)
112

113 114 115 116 117 118 119 120 121 122 123 124 125 126
|     **Parameter**      |                                                                                                                **Description**                                                                                                                |
|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Source data type       | Select MySQL, PostgreSQL, etc.                                                                                                                                                                                                                |
| Source data source     | The corresponding data source under the source data type.                                                                                                                                                                                     |
| Source data table      | Drop-down to select the table where the validation data is located.                                                                                                                                                                           |
| Src filter conditions  | Such as the title, it will also be used when counting the total number of rows in the table, optional.                                                                                                                                        |
| Src table check column | Drop-down to select the check column name.                                                                                                                                                                                                    |
| Logical operators      | =, >, >=, <, <=, ! =                                                                                                                                                                                                                          |
| Field length limit     | Like the title.                                                                                                                                                                                                                               |
| Check method           | <ul><li>[Expected-Actual]</li><li>[Actual-Expected]</li><li>[Actual/Expected]x100%</li><li>[(Expected-Actual)/Expected]x100%</li></ul>                                                                                                        |
| Check operators        | =, >, >=, <, <=, ! =                                                                                                                                                                                                                          |
| Threshold              | The value used in the formula for comparison.                                                                                                                                                                                                 |
| Failure strategy       | <ul><li>Alert: The data quality task failed, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alarm is sent.</li></ul> |
| Expected value type    | Select the desired type from the drop-down menu.                                                                                                                                                                                              |
127 128 129 130

## Uniqueness Check for Single Table Check

### Inspection Introduction
131

132
The goal of the uniqueness check is to check whether the fields are duplicated. It is generally used to check whether the primary key is duplicated. If there are duplicates and the threshold is reached, the check task will be judged to be failed.
133 134 135

### Interface Operation Guide

136
![dataquality_uniqueness_check](../../../img/tasks/demo/uniqueness_check.png)
137

138 139 140 141 142 143 144 145 146 147 148 149
|     **Parameter**      |                                                                                                                **Description**                                                                                                                |
|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Source data type       | Select MySQL, PostgreSQL, etc.                                                                                                                                                                                                                |
| Source data source     | The corresponding data source under the source data type.                                                                                                                                                                                     |
| Source data table      | Drop-down to select the table where the validation data is located.                                                                                                                                                                           |
| Src filter conditions  | Such as the title, it will also be used when counting the total number of rows in the table, optional.                                                                                                                                        |
| Src table check column | Drop-down to select the check column name.                                                                                                                                                                                                    |
| Check method           | <ul><li>[Expected-Actual]</li><li>[Actual-Expected]</li><li>[Actual/Expected]x100%</li><li>[(Expected-Actual)/Expected]x100%</li></ul>                                                                                                        |
| Check operators        | =, >, >=, <, <=, ! =                                                                                                                                                                                                                          |
| Threshold              | The value used in the formula for comparison.                                                                                                                                                                                                 |
| Failure strategy       | <ul><li>Alert: The data quality task failed, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alarm is sent.</li></ul> |
| Expected value type    | Select the desired type from the drop-down menu.                                                                                                                                                                                              |
150 151 152 153

## Regular Expression Check for Single Table Check

### Inspection Introduction
154

155
The goal of regular expression verification is to check whether the format of the value of a field meets the requirements, such as time format, email format, ID card format, etc. If there is data that does not meet the format and exceeds the threshold, the task will be judged as failed.
156 157 158

### Interface Operation Guide

159
![dataquality_regex_check](../../../img/tasks/demo/regexp_check.png)
160

161 162 163 164 165 166 167 168 169 170 171 172 173
|     **Parameter**      |                                                                                                                **Description**                                                                                                                |
|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Source data type       | Select MySQL, PostgreSQL, etc.                                                                                                                                                                                                                |
| Source data source     | The corresponding data source under the source data type.                                                                                                                                                                                     |
| Source data table      | Drop-down to select the table where the validation data is located.                                                                                                                                                                           |
| Src filter conditions  | Such as the title, it will also be used when counting the total number of rows in the table, optional.                                                                                                                                        |
| Src table check column | Drop-down to select check column name.                                                                                                                                                                                                        |
| Regular expression     | As title.                                                                                                                                                                                                                                     |
| Check method           | <ul><li>[Expected-Actual]</li><li>[Actual-Expected]</li><li>[Actual/Expected]x100%</li><li>[(Expected-Actual)/Expected]x100%</li></ul>                                                                                                        |
| Check operators        | =, >, >=, <, <=, ! =                                                                                                                                                                                                                          |
| Threshold              | The value used in the formula for comparison.                                                                                                                                                                                                 |
| Failure strategy       | <ul><li>Alert: The data quality task failed, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alarm is sent.</li></ul> |
| Expected value type    | Select the desired type from the drop-down menu.                                                                                                                                                                                              |
174 175

## Enumeration Value Validation for Single Table Check
176

177
### Inspection Introduction
178

179
The goal of enumeration value verification is to check whether the value of a field is within the range of the enumeration value. If there is data that is not in the range of the enumeration value and exceeds the threshold, the task will be judged to fail.
180 181 182

### Interface Operation Guide

183
![dataquality_enum_check](../../../img/tasks/demo/enumeration_check.png)
184

185 186 187 188 189 190 191 192 193 194 195 196 197
|        **Parameter**        |                                                                                                                **Description**                                                                                                                |
|-----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Source data type            | Select MySQL, PostgreSQL, etc.                                                                                                                                                                                                                |
| Source data source          | The corresponding data source under the source data type.                                                                                                                                                                                     |
| Source data table           | Drop-down to select the table where the validation data is located.                                                                                                                                                                           |
| Src table filter conditions | Such as title, also used when counting the total number of rows in the table, optional.                                                                                                                                                       |
| Src table check column      | Drop-down to select the check column name.                                                                                                                                                                                                    |
| List of enumeration values  | Separated by commas.                                                                                                                                                                                                                          |
| Check method                | <ul><li>[Expected-Actual]</li><li>[Actual-Expected]</li><li>[Actual/Expected]x100%</li><li>[(Expected-Actual)/Expected]x100%</li></ul>                                                                                                        |
| Check operators             | =, >, >=, <, <=, ! =                                                                                                                                                                                                                          |
| Threshold                   | The value used in the formula for comparison.                                                                                                                                                                                                 |
| Failure strategy            | <ul><li>Alert: The data quality task failed, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alarm is sent.</li></ul> |
| Expected value type         | Select the desired type from the drop-down menu.                                                                                                                                                                                              |
198 199 200 201

## Table Row Number Verification for Single Table Check

### Inspection Introduction
202

203
The goal of table row number verification is to check whether the number of rows in the table reaches the expected value. If the number of rows does not meet the standard, the task will be judged as failed.
204 205 206

### Interface Operation Guide

207
![dataquality_count_check](../../../img/tasks/demo/table_count_check.png)
208

209 210 211 212 213 214 215 216 217 218 219 220
|     **Parameter**      |                                                                                                                **Description**                                                                                                                |
|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Source data type       | Select MySQL, PostgreSQL, etc.                                                                                                                                                                                                                |
| Source data source     | The corresponding data source under the source data type.                                                                                                                                                                                     |
| Source data table      | Drop-down to select the table where the validation data is located.                                                                                                                                                                           |
| Src filter conditions  | Such as the title, it will also be used when counting the total number of rows in the table, optional.                                                                                                                                        |
| Src table check column | Drop-down to select the check column name.                                                                                                                                                                                                    |
| Check method           | <ul><li>[Expected-Actual]</li><li>[Actual-Expected]</li><li>[Actual/Expected]x100%</li><li>[(Expected-Actual)/Expected]x100%</li></ul>                                                                                                        |
| Check operators        | =, >, >=, <, <=, ! =                                                                                                                                                                                                                          |
| Threshold              | The value used in the formula for comparison.                                                                                                                                                                                                 |
| Failure strategy       | <ul><li>Alert: The data quality task failed, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alarm is sent.</li></ul> |
| Expected value type    | Select the desired type from the drop-down menu.                                                                                                                                                                                              |
221 222 223 224 225

## Custom SQL Check for Single Table Check

### Interface Operation Guide

226
![dataquality_custom_sql_check](../../../img/tasks/demo/custom_sql_check.png)
227

228 229 230 231 232 233 234 235 236 237 238 239 240
|        **Parameter**         |                                                                                                                                **Description**                                                                                                                                 |
|------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Source data type             | Select MySQL, PostgreSQL, etc.                                                                                                                                                                                                                                                 |
| Source data source           | The corresponding data source under the source data type.                                                                                                                                                                                                                      |
| Source data table            | Drop-down to select the table where the data to be verified is located.                                                                                                                                                                                                        |
| Actual value name            | Alias in SQL for statistical value calculation, such as max_num.                                                                                                                                                                                                               |
| Actual value calculation SQL | SQL for outputting actual values. Note:<ul><li>The SQL must be statistical SQL, such as counting the number of rows, calculating the maximum value, minimum value, etc.</li><li>Select max(a) as max_num from ${src_table}, the table name must be filled like this.</li></ul> |
| Src filter conditions        | Such as the title, it will also be used when counting the total number of rows in the table, optional.                                                                                                                                                                         |
| Check method                 | <ul><li>[Expected-Actual]</li><li>[Actual-Expected]</li><li>[Actual/Expected]x100%</li><li>[(Expected-Actual)/Expected]x100%</li></ul>                                                                                                                                         |
| Check operators              | =, >, >=, <, <=, ! =                                                                                                                                                                                                                                                           |
| Threshold                    | The value used in the formula for comparison.                                                                                                                                                                                                                                  |
| Failure strategy             | <ul><li>Alert: The data quality task failed, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alarm is sent.</li></ul>                                  |
| Expected value type          | Select the desired type from the drop-down menu.                                                                                                                                                                                                                               |
241 242

## Accuracy Check of Multi-table
243

244
### Inspection Introduction
245

246 247 248 249
Accuracy checks are performed by comparing the accuracy differences of data records for selected fields between two tables, examples are as follows
- table test1

| c1 | c2 |
250 251 252
|:--:|:--:|
| a  | 1  |
| b  | 2  |
253 254 255 256

- table test2

| c21 | c22 |
257 258 259
|:---:|:---:|
|  a  |  1  |
|  b  |  3  |
260

261
If you compare the data in c1 and c21, the tables test1 and test2 are exactly the same. If you compare c2 and c22, the data in table test1 and table test2 are inconsistent.
262 263 264

### Interface Operation Guide

265
![dataquality_multi_table_accuracy_check](../../../img/tasks/demo/multi_table_accuracy_check.png)
266

267 268 269 270 271 272 273 274 275 276 277 278 279 280 281
|      **Parameter**       |                                                                                                               **Description**                                                                                                                |
|--------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Source data type         | Select MySQL, PostgreSQL, etc.                                                                                                                                                                                                               |
| Source data source       | The corresponding data source under the source data type.                                                                                                                                                                                    |
| Source data table        | Drop-down to select the table where the data to be verified is located.                                                                                                                                                                      |
| Src filter conditions    | Such as the title, it will also be used when counting the total number of rows in the table, optional.                                                                                                                                       |
| Target data type         | Choose MySQL, PostgreSQL, etc.                                                                                                                                                                                                               |
| Target data source       | The corresponding data source under the source data type.                                                                                                                                                                                    |
| Target data table        | Drop-down to select the table where the data to be verified is located.                                                                                                                                                                      |
| Target filter conditions | Such as the title, it will also be used when counting the total number of rows in the table, optional.                                                                                                                                       |
| Check column             | Fill in the source data column, operator and target data column respectively.                                                                                                                                                                |
| Verification method      | Select the desired verification method.                                                                                                                                                                                                      |
| Operators                | =, >, >=, <, <=, ! =                                                                                                                                                                                                                         |
| Failure strategy         | <ul><li>Alert: The data quality task failed, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alarm is sent.</li><ul> |
| Expected value type      | Select the desired type in the drop-down menu, only `SrcTableTotalRow`, `TargetTableTotalRow` and fixed value are suitable for selection here.                                                                                               |
282

283
## Comparison of the values checked by the two tables
284

285
### Inspection Introduction
286

287 288 289 290
Two-table value comparison allows users to customize different SQL statistics for two tables and compare the corresponding values. For example, for the source table A, the total amount of a certain column is calculated, and for the target table, the total amount of a certain column is calculated. value sum2, compare sum1 and sum2 to determine the check result.

### Interface Operation Guide

291
![dataquality_multi_table_comparison_check](../../../img/tasks/demo/multi_table_comparison_check.png)
292

293 294 295 296 297 298 299 300 301 302 303
|         **Parameter**          |                                                                                                                                    **Description**                                                                                                                                    |
|--------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Source data type               | Select MySQL, PostgreSQL, etc.                                                                                                                                                                                                                                                        |
| Source data source             | The corresponding data source under the source data type.                                                                                                                                                                                                                             |
| Source data table              | The table where the data is to be verified.                                                                                                                                                                                                                                           |
| Actual value name              | Calculate the alias in SQL for the actual value, such as max_age1.                                                                                                                                                                                                                    |
| Actual value calculation SQL   | SQL for outputting actual values. Note: <ul><li>The SQL must be statistical SQL, such as counting the number of rows, calculating the maximum value, minimum value, etc.</li><li>Select max(age) as max_age1 from ${src_table} The table name must be filled like this.</li></ul>     |
| Target data type               | Choose MySQL, PostgreSQL, etc.                                                                                                                                                                                                                                                        |
| Target data source             | The corresponding data source under the source data type.                                                                                                                                                                                                                             |
| Target data table              | The table where the data is to be verified.                                                                                                                                                                                                                                           |
| Expected value name            | Calculate the alias in SQL for the expected value, such as max_age2.                                                                                                                                                                                                                  |
304
| Expected value calculation SQL | SQL for outputting expected value. Note: <ul><li>The SQL must be statistical SQL, such as counting the number of rows, calculating the maximum value, minimum value, etc.</li><li>Select max(age) as max_age2 from ${target_table} The table name must be filled like this.</li></ul> |
305 306 307
| Verification method            | Select the desired verification method.                                                                                                                                                                                                                                               |
| Operators                      | =, >, >=, <, <=, ! =                                                                                                                                                                                                                                                                  |
| Failure strategy               | <ul><li>Alert: The data quality task failed, the DolphinScheduler task result is successful, and an alert is sent.</li><li>Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alarm is sent.</li></ul>                                         |
308 309

## Task result view
310

311
![dataquality_result](../../../img/tasks/demo/result.png)
312

313
## Rule View
314

315
### List of rules
316

317
![dataquality_rule_list](../../../img/tasks/demo/rule_list.png)
318

319
### Rules Details
320

321
![dataquality_rule_detail](../../../img/tasks/demo/rule_detail.png)