Unverified commit e9b3350e, authored by 片刻小哥哥, committed by GitHub

Merge pull request #39 from Islotus/master

Translated 61~70 by Islotus
# Time Attributes
Flink is able to process streaming data based on different notions of _time_.
* _Processing time_ refers to the system time of the machine (also known as “wall-clock time”) that is executing the respective operation.
* _Event time_ refers to the processing of streaming data based on timestamps which are attached to each row. The timestamps can encode when an event happened.
* _Ingestion time_ is the time that events enter Flink; internally, it is treated similarly to event time.
For more information about time handling in Flink, see the introduction about [Event Time and Watermarks](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_time.html).
This page explains how time attributes can be defined for time-based operations in Flink’s Table API & SQL.
## Introduction to Time Attributes
Time-based operations such as windows in both the [Table API](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/table/tableApi.html#group-windows) and [SQL](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/table/sql.html#group-windows) require information about the notion of time and its origin. Therefore, tables can offer _logical time attributes_ for indicating time and accessing corresponding timestamps in table programs.
Time attributes can be part of every table schema. They are defined when creating a table from a `DataStream` or are pre-defined when using a `TableSource`. Once a time attribute has been defined at the beginning, it can be referenced as a field and can be used in time-based operations.
As long as a time attribute is not modified and is simply forwarded from one part of the query to another, it remains a valid time attribute. Time attributes behave like regular timestamps and can be accessed for calculations. If a time attribute is used in a calculation, it will be materialized and becomes a regular timestamp. Regular timestamps do not cooperate with Flink’s time and watermarking system and thus can not be used for time-based operations anymore.
Table programs require that the corresponding time characteristic has been specified for the streaming environment:
```
val env = StreamExecutionEnvironment.getExecutionEnvironment

env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime) // default

// alternatively:
// env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
// env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
```
## Processing time
Processing time allows a table program to produce results based on the time of the local machine. It is the simplest notion of time but does not provide determinism. It neither requires timestamp extraction nor watermark generation.
There are two ways to define a processing time attribute.
### During DataStream-to-Table Conversion
The processing time attribute is defined with the `.proctime` property during schema definition. The time attribute must only extend the physical schema by an additional logical field. Thus, it can only be defined at the end of the schema definition.
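A minimal Scala sketch of the conversion, assuming a `tEnv` table environment and an illustrative two-field input stream (field and window names are placeholders):

```
val stream: DataStream[(String, String)] = ...

// declare an additional logical field as a processing time attribute
val table = tEnv.fromDataStream(stream, 'Username, 'Data, 'UserActionTime.proctime)

// the attribute can then be used in time-based operations
val windowedTable = table.window(Tumble over 10.minutes on 'UserActionTime as 'userActionWindow)
```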
### Using a TableSource
The processing time attribute is defined by a `TableSource` that implements the `DefinedProctimeAttribute` interface. The logical time attribute is appended to the physical schema defined by the return type of the `TableSource`.
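A condensed Scala sketch of such a `TableSource`; class, field, and table names are illustrative:

```
// define a table source with a processing time attribute
class UserActionSource extends StreamTableSource[Row] with DefinedProctimeAttribute {

  override def getReturnType = {
    val names = Array[String]("Username", "Data")
    val types = Array[TypeInformation[_]](Types.STRING, Types.STRING)
    Types.ROW(names, types)
  }

  override def getDataStream(execEnv: StreamExecutionEnvironment): DataStream[Row] = {
    // create the stream
    val stream = ...
    stream
  }

  override def getProctimeAttribute = {
    // a field with this name will be appended to the physical schema
    "UserActionTime"
  }
}

// register the table source and window on the appended attribute
tEnv.registerTableSource("UserActions", new UserActionSource)

val windowedTable = tEnv
  .scan("UserActions")
  .window(Tumble over 10.minutes on 'UserActionTime as 'userActionWindow)
```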
## Event time
Event time allows a table program to produce results based on the time that is contained in every record. This allows for consistent results even in case of out-of-order events or late events. It also ensures replayable results of the table program when reading records from persistent storage.
Additionally, event time allows for unified syntax for table programs in both batch and streaming environments. A time attribute in a streaming environment can be a regular field of a record in a batch environment.
In order to handle out-of-order events and distinguish between on-time and late events in streaming, Flink needs to extract timestamps from events and make some kind of progress in time (so-called [watermarks](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_time.html)).
An event time attribute can be defined either during DataStream-to-Table conversion or by using a TableSource.
### During DataStream-to-Table Conversion
The event time attribute is defined with the `.rowtime` property during schema definition. [Timestamps and watermarks](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_time.html) must have been assigned in the `DataStream` that is converted.
There are two ways of defining the time attribute when converting a `DataStream` into a `Table`. Depending on whether the specified `.rowtime` field name exists in the schema of the `DataStream` or not, the timestamp field is either
* appended as a new field to the schema or
* replaces an existing field.
In either case the event time timestamp field will hold the value of the `DataStream` event time timestamp.
```
// Option 1:

// extract timestamp and assign watermarks based on knowledge of the stream
val stream: DataStream[(String, String)] = inputStream.assignTimestampsAndWatermarks(...)

// declare an additional logical field as an event time attribute
val table = tEnv.fromDataStream(stream, 'Username, 'Data, 'UserActionTime.rowtime)

// Option 2:

// extract timestamp from first field, and assign watermarks based on knowledge of the stream
val stream: DataStream[(Long, String, String)] = inputStream.assignTimestampsAndWatermarks(...)

// the first field has been used for timestamp extraction, and is no longer necessary
// replace first field with a logical event time attribute
val table = tEnv.fromDataStream(stream, 'UserActionTime.rowtime, 'Username, 'Data)

// Usage:
val windowedTable = table.window(Tumble over 10.minutes on 'UserActionTime as 'userActionWindow)
```
### Using a TableSource
The event time attribute is defined by a `TableSource` that implements the `DefinedRowtimeAttributes` interface. The `getRowtimeAttributeDescriptors()` method returns a list of `RowtimeAttributeDescriptor` for describing the final name of a time attribute, a timestamp extractor to derive the values of the attribute, and the watermark strategy associated with the attribute.
Please make sure that the `DataStream` returned by the `getDataStream()` method is aligned with the defined time attribute. The timestamps of the `DataStream` (the ones which are assigned by a `TimestampAssigner`) are only considered if a `StreamRecordTimestamp` timestamp extractor is defined. Watermarks of a `DataStream` are only preserved if a `PreserveWatermarks` watermark strategy is defined. Otherwise, only the values of the `TableSource`’s rowtime attribute are relevant.
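A condensed Scala sketch of such a `TableSource`, assuming the timestamp lives in an existing `Long` field and watermarks are ascending (names and strategies are illustrative):

```
// define a table source with a rowtime attribute
class UserActionSource extends StreamTableSource[Row] with DefinedRowtimeAttributes {

  override def getReturnType = {
    val names = Array[String]("Username", "Data", "UserActionTime")
    val types = Array[TypeInformation[_]](Types.STRING, Types.STRING, Types.LONG)
    Types.ROW(names, types)
  }

  override def getDataStream(execEnv: StreamExecutionEnvironment): DataStream[Row] = {
    // create the stream
    val stream = ...
    stream
  }

  override def getRowtimeAttributeDescriptors: java.util.List[RowtimeAttributeDescriptor] = {
    // mark "UserActionTime" as the event time attribute: derive its values from
    // the existing field and generate watermarks for ascending timestamps
    val descriptor = new RowtimeAttributeDescriptor(
      "UserActionTime",
      new ExistingField("UserActionTime"),
      new AscendingTimestamps)
    java.util.Collections.singletonList(descriptor)
  }
}

// register the table source
tEnv.registerTableSource("UserActions", new UserActionSource)
```

The defined attribute can then be used in time-based operations: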
```
val windowedTable = tEnv
  .scan("UserActions")
  .window(Tumble over 10.minutes on 'UserActionTime as 'userActionWindow)
```
# Joins in Continuous Queries
Joins are a common and well-understood operation in batch data processing to connect the rows of two relations. However, the semantics of joins on [dynamic tables](dynamic_tables.html) are much less obvious or even confusing.
Because of that, there are a couple of ways to actually perform a join using either Table API or SQL.
For more information regarding the syntax, please check the join sections in [Table API](../tableApi.html#joins) and [SQL](../sql.html#joins).
## Regular Joins
Regular joins are the most generic type of join in which any new records or changes to either side of the join input are visible and are affecting the whole join result. For example, if there is a new record on the left side, it will be joined with all of the previous and future records on the right side.
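For example, a sketch of such a regular join, assuming `Orders` and `Product` tables joined on the product id:

```
SELECT * FROM Orders
INNER JOIN Product
ON Orders.productId = Product.id
```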
These semantics allow for any kind of updates (insert, update, delete) to the input tables.
However, this operation has an important implication: it requires keeping both sides of the join input in Flink’s state forever. Thus, the resource usage will grow indefinitely as well if one or both input tables are continuously growing.
## Time-windowed Joins
A time-windowed join is defined by a join predicate that checks whether the [time attributes](time_attributes.html) of the input records are within certain time constraints, i.e., a time window.
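For example, a sketch of such a time-windowed join, assuming `Orders` and `Shipments` tables and an illustrative four-hour window:

```
SELECT *
FROM Orders o, Shipments s
WHERE o.id = s.orderId AND
      o.ordertime BETWEEN s.shiptime - INTERVAL '4' HOUR AND s.shiptime
```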
Compared to a regular join operation, this kind of join only supports append-only tables with time attributes. Since time attributes increase quasi-monotonically, Flink can remove old values from its state without affecting the correctness of the result.
## Join with a Temporal Table
A join with a temporal table joins an append-only table (left input/probe side) with a temporal table (right input/build side), i.e., a table that changes over time and tracks its changes. Please check the corresponding page for more information about [temporal tables](temporal_tables.html).
The following example shows an append-only table `Orders` that should be joined with the continuously changing currency rates table `RatesHistory`.
`Orders` is an append-only table that represents payments for the given `amount` and the given `currency`. For example at `10:15` there was an order for an amount of `2 Euro`.
`RatesHistory` represents an ever changing append-only table of currency exchange rates with respect to `Yen` (which has a rate of `1`). For example, the exchange rate for the period from `09:00` to `10:45` of `Euro` to `Yen` was `114`. From `10:45` to `11:15` it was `116`.
Suppose that we would like to calculate the amount of all `Orders` converted to a common currency (`Yen`).
For example, we would like to convert the following order using the appropriate conversion rate for the given `rowtime` (`114`).
Without using the concept of [temporal tables](temporal_tables.html), one would need to write a query like:
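A sketch of such a query, using a correlated subquery to pick, per order, the latest rate that is not younger than the order (column names follow the tables described above):

```
SELECT
  SUM(o.amount * r.rate) AS amount
FROM Orders AS o,
  RatesHistory AS r
WHERE r.currency = o.currency
AND r.rowtime = (
  SELECT MAX(rowtime)
  FROM RatesHistory AS r2
  WHERE r2.currency = o.currency
  AND r2.rowtime <= o.rowtime)
```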
With the help of a temporal table function `Rates` over `RatesHistory`, we can express such a query in SQL as:
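A sketch of the temporal table function version, joining each order against the version of `Rates` that is valid at the order's `rowtime`:

```
SELECT
  o.amount * r.rate AS amount
FROM
  Orders AS o,
  LATERAL TABLE (Rates(o.rowtime)) AS r
WHERE r.currency = o.currency
```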
Each record from the probe side will be joined with the version of the build side table at the time of the correlated time attribute of the probe side record. In order to support updates (overwrites) of previous values on the build side table, the table must define a primary key.
In our example, each record from `Orders` will be joined with the version of `Rates` at time `o.rowtime`. The `currency` field has been defined as the primary key of `Rates` before and is used to connect both tables in our example. If the query were using a processing-time notion, a newly appended order would always be joined with the most recent version of `Rates` when executing the operation.
In contrast to [regular joins](#regular-joins), this means that if there is a new record on the build side, it will not affect the previous results of the join. This again allows Flink to limit the number of elements that must be kept in the state.
[常规连接](#regular-joins)相反,这意味着如果构建端有新记录,则不会影响以前的连接结果。这再次允许 Flink 限制必须保持在该状态的元素的数量。
Compared to [time-windowed joins](#time-windowed-joins), temporal table joins do not define a time window within which bounds the records will be joined. Records from the probe side are always joined with the build side’s version at the time specified by the time attribute. Thus, records on the build side might be arbitrarily old. As time passes, the previous and no longer needed versions of the record (for the given primary key) will be removed from the state.
[time-windowed join](#time-windowed-joins)相比,时态表连接不定义时间窗口(时间窗口内的数据将会被join)。探针端的记录始终与 time 属性指定的构建端版本连接。因此,构建方面的记录可能是任意旧的。随着时间的推移,将从状态中删除先前和不再需要的记录版本(对于给定的主键)。
Such behaviour makes a temporal table join a good candidate to express stream enrichment in relational terms.
### Usage
After [defining temporal table function](temporal_tables.html#defining-temporal-table-function), we can start using it. Temporal table functions can be used in the same way as normal table functions would be used.
[定义时态表函数](temporal_tables.html#defining-temporal-table-function)后,我们可以开始使用它。时态表函数的使用方式与使用普通表函数的方式相同。
The following code snippet solves our motivating problem of converting currencies from the `Orders` table:
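A minimal Table API sketch in Scala, assuming a `rates` temporal table function created as described in [defining temporal table function](temporal_tables.html#defining-temporal-table-function) and illustrative column names on the `orders` table:

```
val result = orders
  .join(rates('o_rowtime), 'r_currency === 'o_currency)
  .select('o_amount * 'r_rate)
```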
**Note**: State retention defined in a [query configuration](query_configuration.html) is not yet implemented for temporal joins. This means that the required state to compute the query result might grow infinitely depending on the number of distinct primary keys for the history table.
### Processing-time Temporal Joins
With a processing-time time attribute, it is impossible to pass _past_ time attributes as an argument to the temporal table function. By definition, it is always the current timestamp. Thus, invocations of a processing-time temporal table function will always return the latest known versions of the underlying table and any updates in the underlying history table will also immediately overwrite the current values.
Only the latest versions (with respect to the defined primary key) of the build side records are kept in the state. Updates of the build side will have no effect on previously emitted join results.
One can think about a processing-time temporal join as a simple `HashMap<K, V>` that stores all of the records from the build side. When a new record from the build side has the same key as some previous record, the old value is just simply overwritten. Every record from the probe side is always evaluated against the most recent/current state of the `HashMap`.
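To make the analogy concrete, a toy Scala model of this behavior (purely illustrative, not Flink API):

```
import scala.collection.mutable

// build side: the latest value per key; an older version is simply overwritten
val buildSide = mutable.HashMap.empty[String, Double]

def onBuildRecord(currency: String, rate: Double): Unit =
  buildSide(currency) = rate

// probe side: every record is evaluated against the current state of the map
def onProbeRecord(currency: String, amount: Double): Option[Double] =
  buildSide.get(currency).map(rate => amount * rate)
```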
### Event-time Temporal Joins
With an event-time time attribute (i.e., a rowtime attribute), it is possible to pass _past_ time attributes to the temporal table function. This allows for joining the two tables at a common point in time.
Compared to processing-time temporal joins, the temporal table does not only keep the latest version (with respect to the defined primary key) of the build side records in the state but stores all versions (identified by time) since the last watermark.
For example, an incoming row with an event-time timestamp of `12:30:00` that is appended to the probe side table is joined with the version of the build side table at time `12:30:00` according to the [concept of temporal tables](temporal_tables.html). Thus, the incoming row is only joined with rows that have a timestamp lower or equal to `12:30:00` with applied updates according to the primary key until this point in time.
By definition of event time, [watermarks](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_time.html) allow the join operation to move forward in time and discard versions of the build table that are no longer necessary because no incoming row with lower or equal timestamp is expected.
# Temporal Tables
Temporal Tables represent a concept of a (parameterized) view on a changing history table that returns the content of a table at a specific point in time.
Flink can keep track of the changes applied to an underlying append-only table and allows for accessing the table’s content at a certain point in time within a query.
## Motivation
Let’s assume that we have the following table `RatesHistory`.
`RatesHistory` represents an ever growing append-only table of currency exchange rates with respect to `Yen` (which has a rate of `1`). For example, the exchange rate for the period from `09:00` to `10:45` of `Euro` to `Yen` was `114`. From `10:45` to `11:15` it was `116`.
Given that we would like to output all current rates at the time `10:58`, we would need the following SQL query to compute a result table:
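A sketch of such a query, selecting per currency the latest rate at or before `10:58` via a correlated subquery (the time literal syntax is illustrative):

```
SELECT *
FROM RatesHistory AS r
WHERE r.rowtime = (
  SELECT MAX(rowtime)
  FROM RatesHistory AS r2
  WHERE r2.currency = r.currency
  AND r2.rowtime <= TIME '10:58')
```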
The correlated subquery determines the maximum time for the corresponding currency that is lower than or equal to the desired time. The outer query lists the rates that have a maximum timestamp.
The following table shows the result of such a computation. In our example, the update to `Euro` at `10:45` is taken into account, however, the update to `Euro` at `11:15` and the new entry of `Pounds` are not considered in the table’s version at time `10:58`.
The concept of _Temporal Tables_ aims to simplify such queries, speed up their execution, and reduce Flink’s state usage. A _Temporal Table_ is a parameterized view on an append-only table that interprets the rows of the append-only table as the changelog of a table and provides the version of that table at a specific point in time. Interpreting the append-only table as a changelog requires the specification of a primary key attribute and a timestamp attribute. The primary key determines which rows are overwritten and the timestamp determines the time during which a row is valid.
In the above example `currency` would be a primary key for `RatesHistory` table and `rowtime` would be the timestamp attribute.
In Flink, a temporal table is represented by a _Temporal Table Function_.
## Temporal Table Functions
In order to access the data in a temporal table, one must pass a [time attribute](time_attributes.html) that determines the version of the table that will be returned. Flink uses the SQL syntax of [table functions](../udfs.html#table-functions) to provide a way to express it.
Once defined, a _Temporal Table Function_ takes a single time argument `timeAttribute` and returns a set of rows. This set contains the latest versions of the rows for all of the existing primary keys with respect to the given time attribute.
Assuming that we defined a temporal table function `Rates(timeAttribute)` based on `RatesHistory` table, we could query such a function in the following way:
Each query to `Rates(timeAttribute)` would return the state of the `Rates` for the given `timeAttribute`.
**Note**: Currently, Flink doesn’t support directly querying the temporal table functions with a constant time attribute parameter. At the moment, temporal table functions can only be used in joins. The example above was used to provide an intuition about what the function `Rates(timeAttribute)` returns.
See also the page about [joins for continuous queries](joins.html) for more information about how to join with a temporal table.
### Defining Temporal Table Function
The following code snippet illustrates how to create a temporal table function from an append-only table.
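A Scala sketch consistent with the lines `(1)` and `(2)` referenced below; the sample data and column names are illustrative:

```
// get the stream and table environments
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)

// create and register an example append-only table
val ratesHistoryData = Seq(
  ("US Dollar", 102L), ("Euro", 114L), ("Yen", 1L),
  ("Euro", 116L), ("Euro", 119L))

val ratesHistory = env
  .fromCollection(ratesHistoryData)
  .toTable(tEnv, 'r_currency, 'r_rate, 'r_proctime.proctime)
tEnv.registerTable("RatesHistory", ratesHistory)

// define "r_proctime" as the time attribute and "r_currency" as the primary key
val rates = ratesHistory.createTemporalTableFunction('r_proctime, 'r_currency) // <== (1)

// register the function under the name "Rates"
tEnv.registerFunction("Rates", rates)                                          // <== (2)
```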
Line `(1)` creates a `rates` [temporal table function](#temporal-table-functions), which allows us to use the function `rates` in the [Table API](../tableApi.html#joins).
Line `(2)` registers this function under the name `Rates` in our table environment, which allows us to use the `Rates` function in [SQL](../sql.html#joins).
# Query Configuration
Table API and SQL queries have the same semantics regardless of whether their input is bounded batch input or unbounded stream input. In many cases, continuous queries on streaming input are capable of computing accurate results that are identical to offline computed results. However, this is not possible in the general case because continuous queries have to restrict the size of the state they are maintaining in order to avoid running out of storage and to be able to process unbounded streaming data over a long period of time. As a result, a continuous query might only be able to provide approximated results, depending on the characteristics of the input data and the query itself.
Flink’s Table API and SQL interface provide parameters to tune the accuracy and resource consumption of continuous queries. The parameters are specified via a `QueryConfig` object. The `QueryConfig` can be obtained from the `TableEnvironment` and is passed back when a `Table` is translated, i.e., when it is [transformed into a DataStream](../common.html#convert-a-table-into-a-datastream-or-dataset) or [emitted via a TableSink](../common.html#emit-a-table).
```
val tableEnv = TableEnvironment.getTableEnvironment(env)

// register a TableSink
tableEnv.registerTableSink(
  "outputTable",                   // table name
  Array[String](...),              // field names
  Array[TypeInformation[_]](...),  // field types
  sink)                            // table sink

// emit result Table via a TableSink
result.insertInto("outputTable", qConfig)

// convert result Table into a DataStream[Row]
val stream: DataStream[Row] = result.toAppendStream[Row](qConfig)
```
In the following we describe the parameters of the `QueryConfig` and how they affect the accuracy and resource consumption of a query.
## Idle State Retention Time
Many queries aggregate or join records on one or more key attributes. When such a query is executed on a stream, the continuous query needs to collect records or maintain partial results per key. If the key domain of the input stream is evolving, i.e., the active key values are changing over time, the continuous query accumulates more and more state as more and more distinct keys are observed. However, often keys become inactive after some time and their corresponding state becomes stale and useless.
For example the following query computes the number of clicks per session.
```
SELECT sessionId, COUNT(*) FROM clicks GROUP BY sessionId;
```
The `sessionId` attribute is used as a grouping key and the continuous query maintains a count for each `sessionId` it observes. The `sessionId` attribute is evolving over time and `sessionId` values are only active until the session ends, i.e., for a limited period of time. However, the continuous query cannot know about this property of `sessionId` and expects that every `sessionId` value can occur at any point of time. It maintains a count for each observed `sessionId` value. Consequently, the total state size of the query is continuously growing as more and more `sessionId` values are observed.
The _Idle State Retention Time_ parameters define for how long the state of a key is retained without being updated before it is removed. For the previous example query, the count of a `sessionId` would be removed as soon as it has not been updated for the configured period of time.
By removing the state of a key, the continuous query completely forgets that it has seen this key before. If a record with a key, whose state has been removed before, is processed, the record will be treated as if it was the first record with the respective key. For the example above this means that the count of a `sessionId` would start again at `0`.
There are two parameters to configure the idle state retention time:
* The _minimum idle state retention time_ defines how long the state of an inactive key is at least kept before it is removed.
* The _maximum idle state retention time_ defines how long the state of an inactive key is at most kept before it is removed.
The parameters are specified as follows:
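A minimal Scala sketch, assuming a `StreamQueryConfig` obtained from the `TableEnvironment` (the 12/24 hour values are illustrative):

```
val qConfig: StreamQueryConfig = ???

// set idle state retention time: min = 12 hours, max = 24 hours
qConfig.withIdleStateRetentionTime(Time.hours(12), Time.hours(24))
```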
Cleaning up state requires additional bookkeeping which becomes less expensive for larger differences of `minTime` and `maxTime`. The difference between `minTime` and `maxTime` must be at least 5 minutes.