Unverified commit 0ec3e6d4, authored by 片刻小哥哥, committed by GitHub

Merge pull request #59 from apachecn/feature/flink_1.7_doc_zh_21

21 complete
# The Broadcast State Pattern
[Working with State](state.html) describes operator state, which upon restore is either evenly distributed among the parallel tasks of an operator, or unioned, with the whole state being used to initialize the restored parallel tasks.
A third type of supported _operator state_ is the _Broadcast State_. Broadcast state was introduced to support use cases where some data coming from one stream must be broadcast to all downstream tasks, where it is stored locally and used to process all incoming elements on the other stream. A natural fit for broadcast state is a low-throughput stream containing a set of rules that we want to evaluate against all elements coming from another stream. With these use cases in mind, broadcast state differs from the other operator states in that:
1. it has a map format,
2. it is only available to specific operators that have as inputs a _broadcasted_ stream and a _non-broadcasted_ one, and
3. such an operator can have _multiple broadcast states_ with different names.
## Provided APIs
To show the provided APIs, we will start with an example before presenting their full functionality. As our running example, we have a stream of objects of different colors and shapes, and we want to find pairs of objects of the same color that follow a certain pattern, _e.g._ a rectangle followed by a triangle. We assume that the set of interesting patterns evolves over time.
In this example, the first stream will contain elements of type `Item` with a `Color` and a `Shape` property. The other stream will contain the `Rules`.
Starting from the stream of `Items`, we just need to _key it_ by `Color`, as we want pairs of the same color. This makes sure that elements of the same color are routed to the same parallel task, and therefore end up on the same physical machine.
......@@ -28,7 +28,7 @@ KeyedStream<Item, Color> colorPartitionedStream = shapeStream
Moving on to the `Rules`, the stream containing them should be broadcast to all downstream tasks, and these tasks should store them locally so that they can evaluate them against all incoming `Items`. The snippet below will i) broadcast the stream of rules and ii) use the provided `MapStateDescriptor` to create the broadcast state where the rules will be stored.
......@@ -46,19 +46,19 @@ BroadcastStream<Rule> ruleBroadcastStream = ruleStream
Finally, in order to evaluate the `Rules` against the incoming elements from the `Item` stream, we need to:
1. connect the two streams, and
2. specify our match detecting logic.
Connecting a stream (keyed or non-keyed) with a `BroadcastStream` is done by calling `connect()` on the non-broadcasted stream, with the `BroadcastStream` as an argument. This returns a `BroadcastConnectedStream`, on which we can call `process()` with a special type of `CoProcessFunction`. The function will contain our matching logic. The exact type of the function depends on the type of the non-broadcasted stream:
* if it is **keyed**, the function is a `KeyedBroadcastProcessFunction`.
* if it is **non-keyed**, the function is a `BroadcastProcessFunction`.
Given that our non-broadcasted stream is keyed, the following snippet includes the above calls:
**Attention:** `connect()` should be called on the non-broadcasted stream, with the `BroadcastStream` as an argument.
......@@ -81,9 +81,9 @@ DataStream<Match> output = colorPartitionedStream
### BroadcastProcessFunction and KeyedBroadcastProcessFunction
As in the case of a `CoProcessFunction`, these functions have two process methods to implement: `processBroadcastElement()`, which is responsible for processing incoming elements in the broadcasted stream, and `processElement()`, which is used for the non-broadcasted one. The full signatures of the methods are presented below:
......@@ -113,34 +113,34 @@ public abstract class KeyedBroadcastProcessFunction<KS, IN1, IN2, OUT> {
The first thing to notice is that both functions require the implementation of the `processBroadcastElement()` method for processing elements on the broadcast side and the `processElement()` method for elements on the non-broadcasted side.
The two methods differ in the context they are provided. The non-broadcast side has a `ReadOnlyContext`, while the broadcasted side has a `Context`.
Both of these contexts (`ctx` in the following enumeration):
1. give access to the broadcast state: `ctx.getBroadcastState(MapStateDescriptor<K, V> stateDescriptor)`,
2. allow querying the timestamp of the element: `ctx.timestamp()`,
3. get the current watermark: `ctx.currentWatermark()`,
4. get the current processing time: `ctx.currentProcessingTime()`, and
5. emit elements to side-outputs: `ctx.output(OutputTag<X> outputTag, X value)`.
The `stateDescriptor` passed to `getBroadcastState()` should be identical to the one in the `.broadcast(ruleStateDescriptor)` call above.
The difference lies in the type of access each one gives to the broadcast state. The broadcasted side has **read-write access** to it, while the non-broadcast side has **read-only access** (thus the names). The reason for this is that in Flink there is no cross-task communication. So, to guarantee that the contents of the Broadcast State are the same across all parallel instances of our operator, we give read-write access only to the broadcast side, which sees the same elements across all tasks, and we require the computation on each incoming element on that side to be identical across all tasks. Ignoring this rule would break the consistency guarantees of the state, leading to inconsistent results that are often difficult to debug.
**Attention:** The logic implemented in `processBroadcastElement()` must have the same deterministic behavior across all parallel instances!
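The consistency requirement can be illustrated with a minimal plain-Java model. This is a sketch and not Flink API: the class `BroadcastStateSketch`, the type `RuleUpdate`, and the helper `runTask` are hypothetical names introduced only for illustration, and each simulated task simply keeps its own map. Because the update depends only on the element itself, two tasks that see the same elements end up with identical state, even when the elements arrive in a different order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model (not Flink API): each "task" keeps its own copy of the broadcast state.
class BroadcastStateSketch {

    // Hypothetical broadcast element: a rule name and its definition.
    static final class RuleUpdate {
        final String name;
        final String definition;

        RuleUpdate(String name, String definition) {
            this.name = name;
            this.definition = definition;
        }
    }

    // Models the write side: a deterministic update that depends only on the element.
    static void processBroadcastElement(RuleUpdate update, Map<String, String> state) {
        state.put(update.name, update.definition);
    }

    // Applies the updates in the given order and returns the task's local state.
    static Map<String, String> runTask(List<RuleUpdate> updates) {
        Map<String, String> state = new TreeMap<>();
        for (RuleUpdate update : updates) {
            processBroadcastElement(update, state);
        }
        return state;
    }

    public static void main(String[] args) {
        RuleUpdate r1 = new RuleUpdate("rule-1", "RECTANGLE -> TRIANGLE");
        RuleUpdate r2 = new RuleUpdate("rule-2", "CIRCLE -> CIRCLE");
        Map<String, String> task0 = runTask(Arrays.asList(r1, r2));
        Map<String, String> task1 = runTask(Arrays.asList(r2, r1));
        System.out.println(task0.equals(task1)); // prints true
    }
}
```

An update that depended on task-local information (for example a random number or the task's index) would make the two maps diverge, which is exactly what the rule above forbids.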
Finally, because the `KeyedBroadcastProcessFunction` operates on a keyed stream, it exposes some functionality which is not available to the `BroadcastProcessFunction`. That is:
1. the `ReadOnlyContext` in the `processElement()` method gives access to Flink’s underlying timer service, which allows registering event and/or processing time timers. When a timer fires, `onTimer()` (shown above) is invoked with an `OnTimerContext` which exposes the same functionality as the `ReadOnlyContext` plus
    * the ability to ask whether the timer that fired was an event or processing time one and
    * the ability to query the key associated with the timer.
2. the `Context` in the `processBroadcastElement()` method contains the method `applyToKeyedState(StateDescriptor<S, VS> stateDescriptor, KeyedStateFunction<KS, S> function)`. This allows registering a `KeyedStateFunction` to be **applied to all states of all keys** associated with the provided `stateDescriptor`.
**Attention:** Registering timers is only possible in `processElement()` of the `KeyedBroadcastProcessFunction`. It is not possible in the `processBroadcastElement()` method, as there is no key associated with the broadcasted elements.
Coming back to our original example, our `KeyedBroadcastProcessFunction` could look like the following:
......@@ -211,15 +211,15 @@ new KeyedBroadcastProcessFunction<Color, Item, Rule, String>() {
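Since the diff elides the body of the function, the matching logic of the running example can be approximated with a plain-Java model for a single key (one color). This is a sketch under stated assumptions and not the code from the commit: `PatternMatchSketch`, its `Rule` and `Shape` types, and the two maps standing in for broadcast state and keyed state are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model (not Flink API) of the matching logic for ONE key (one color).
class PatternMatchSketch {

    enum Shape { RECTANGLE, TRIANGLE, CIRCLE }

    // Hypothetical rule: a `first` shape followed by a `second` shape.
    static final class Rule {
        final Shape first;
        final Shape second;

        Rule(Shape first, Shape second) {
            this.first = first;
            this.second = second;
        }
    }

    // Stands in for the broadcast state: rule name -> rule.
    final Map<String, Rule> rules = new HashMap<>();
    // Stands in for keyed state: rule name -> items waiting for the second shape.
    final Map<String, List<Shape>> pending = new HashMap<>();

    // Models the broadcast side: store or replace a rule.
    void processBroadcastElement(String ruleName, Rule rule) {
        rules.put(ruleName, rule);
    }

    // Models the non-broadcast side: returns the matches completed by `shape`.
    List<String> processElement(Shape shape) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Rule> entry : rules.entrySet()) {
            Rule rule = entry.getValue();
            List<Shape> stored = pending.computeIfAbsent(entry.getKey(), k -> new ArrayList<>());
            if (shape == rule.second && !stored.isEmpty()) {
                for (Shape earlier : stored) {
                    matches.add(earlier + " - " + shape);
                }
                stored.clear();
            }
            // No else branch: a rule whose first and second shapes are equal must re-arm.
            if (shape == rule.first) {
                stored.add(shape);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        PatternMatchSketch task = new PatternMatchSketch();
        task.processBroadcastElement("rule-1", new Rule(Shape.RECTANGLE, Shape.TRIANGLE));
        task.processElement(Shape.RECTANGLE);
        System.out.println(task.processElement(Shape.TRIANGLE)); // prints [RECTANGLE - TRIANGLE]
    }
}
```

In a real job the `rules` map would be read through the read-only context in `processElement()` and written only in `processBroadcastElement()`, while `pending` would live in keyed state scoped to the current color.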
## Important Considerations
After describing the offered APIs, this section focuses on the important things to keep in mind when using broadcast state. These are:
* **There is no cross-task communication:** As stated earlier, this is the reason why only the broadcast side of a `(Keyed)-BroadcastProcessFunction` can modify the contents of the broadcast state. In addition, the user has to make sure that all tasks modify the contents of the broadcast state in the same way for each incoming element. Otherwise, different tasks might have different contents, leading to inconsistent results.
* **Order of events in Broadcast State may differ across tasks:** Although broadcasting the elements of a stream guarantees that all elements will (eventually) go to all downstream tasks, elements may arrive at each task in a different order. So the state updates for each incoming element _MUST NOT depend on the ordering_ of the incoming events.
* **All tasks checkpoint their broadcast state:** Although all tasks have the same elements in their broadcast state when a checkpoint takes place (checkpoint barriers do not overtake elements), all tasks checkpoint their broadcast state, and not just one of them. This is a design decision to avoid having all tasks read from the same file during a restore (thus avoiding hotspots), although it comes at the expense of increasing the size of the checkpointed state by a factor of p (= parallelism). Flink guarantees that upon restoring/rescaling there will be **no duplicates** and **no missing data**. In case of recovery with the same or smaller parallelism, each task reads its checkpointed state. Upon scaling up, each task reads its own state, and the remaining tasks (`p_new - p_old`) read checkpoints of previous tasks in a round-robin manner.
* **No RocksDB state backend:** Broadcast state is kept in-memory at runtime and memory provisioning should be done accordingly. This holds for all operator states.
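The restore behavior described in the checkpointing item can be pictured with a small plain-Java model. The text only states that extra tasks read previous checkpoints "in a round-robin manner", so the modulo rule below is an assumption about what that looks like, and `BroadcastRestoreSketch`, `sourceTask`, and `assignment` are hypothetical names, not Flink API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model (not Flink API) of which checkpointed copy each restored task reads.
class BroadcastRestoreSketch {

    // Index of the old task whose checkpointed broadcast state the new task reads.
    // Tasks below oldParallelism read their own copy; extra tasks wrap around round-robin.
    static int sourceTask(int newTask, int oldParallelism) {
        return newTask % oldParallelism;
    }

    // Source index for every task of the restored job, in task order.
    static List<Integer> assignment(int oldParallelism, int newParallelism) {
        List<Integer> sources = new ArrayList<>();
        for (int task = 0; task < newParallelism; task++) {
            sources.add(sourceTask(task, oldParallelism));
        }
        return sources;
    }

    public static void main(String[] args) {
        System.out.println(assignment(2, 5)); // scaling up: prints [0, 1, 0, 1, 0]
        System.out.println(assignment(4, 2)); // scaling down: prints [0, 1]
    }
}
```

Because every old task checkpointed a full copy of the broadcast state, any of these sources yields the complete state, which is why no data is missing or duplicated after rescaling.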