提交 561eb3f9 编写于 作者: 噼里啪啦嘣

31 完成

上级 35c5a238
# Asynchronous I/O for External Data Access
# 用于外部数据访问的异步I/O访问
This page explains the use of Flink’s API for asynchronous I/O with external data stores. For users not familiar with asynchronous or event-driven programming, an article about Futures and event-driven programming may be useful preparation.
本页说明了Flink API用于具有外部数据存储的异步I / O的用法。对于不熟悉异步或事件驱动的编程的用户,有关期货和事件驱动的编程的文章可能是有用的准备。
Note: Details about the design and implementation of the asynchronous I/O utility can be found in the proposal and design document [FLIP-12: Asynchronous I/O Design and Implementation](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65870673).
注意:有关异步I / O实用程序的设计和实现的详细信息,可以在投标和设计文档FLIP-12:异步I / O设计和实现中找到。
## The need for Asynchronous I/O Operations
异步I / O操作的需求
When interacting with external systems (for example when enriching stream events with data stored in a database), one needs to take care that communication delay with the external system does not dominate the streaming application’s total work.
与外部系统进行交互时(例如,使用存储在数据库中的数据丰富流事件),需要注意与外部系统的通信延迟不会影响流应用程序的全部工作。
Naively accessing data in the external database, for example in a `MapFunction`, typically means **synchronous** interaction: A request is sent to the database and the `MapFunction` waits until the response has been received. In many cases, this waiting makes up the vast majority of the function’s time.
幼稚地访问外部数据库(例如)中的数据MapFunction通常意味着同步交互:将请求发送到数据库,并MapFunction等待直到收到响应。在许多情况下,这种等待占据了该功能的绝大部分时间。
Asynchronous interaction with the database means that a single parallel function instance can handle many requests concurrently and receive the responses concurrently. That way, the waiting time can be overlayed with sending other requests and receiving responses. At the very least, the waiting time is amortized over multiple requests. This leads in most cased to much higher streaming throughput.
与数据库的异步交互意味着单个并行函数实例可以同时处理许多请求并同时接收响应。这样,等待时间可以与发送其他请求和接收响应重叠。至少,等待时间将分摊到多个请求中。在大多数情况下,这导致更高的流吞吐量。
![](../img/async_io.svg)
_Note:_ Improving throughput by just scaling the `MapFunction` to a very high parallelism is in some cases possible as well, but usually comes at a very high resource cost: Having many more parallel MapFunction instances means more tasks, threads, Flink-internal network connections, network connections to the database, buffers, and general internal bookkeeping overhead.
注意:MapFunction在某些情况下,也可以通过仅将scaling扩展到很高的并行度来提高吞吐量,但这通常会付出很高的资源成本:拥有更多的并行MapFunction实例意味着更多的任务,线程,Flink内部网络连接,网络与数据库的连接,缓冲区和一般内部簿记开销。
## Prerequisites
## Prerequisites 先决条件
As illustrated in the section above, implementing proper asynchronous I/O to a database (or key/value store) requires a client to that database that supports asynchronous requests. Many popular databases offer such a client.
如上一节所述,对数据库(或键/值存储)实现适当的异步I / O要求该数据库的客户端支持异步请求。许多流行的数据库都提供了这样的客户端。
In the absence of such a client, one can try and turn a synchronous client into a limited concurrent client by creating multiple clients and handling the synchronous calls with a thread pool. However, this approach is usually less efficient than a proper asynchronous client.
在没有这样的客户端的情况下,可以尝试通过创建多个客户端并使用线程池处理同步调用,将同步客户端转变为有限的并发客户端。但是,这种方法通常不如适当的异步客户端有效。
## Async I/O API
## Async I/O API 异步I / O API
Flink’s Async I/O API allows users to use asynchronous request clients with data streams. The API handles the integration with data streams, well as handling order, event time, fault tolerance, etc.
Flink的异步I / O API允许用户将异步请求客户端与数据流一起使用。API处理与数据流的集成,以及处理顺序,事件时间,容错等。
Assuming one has an asynchronous client for the target database, three parts are needed to implement a stream transformation with asynchronous I/O against the database:
假设其中一个具有目标数据库的异步客户端,则需要三个部分来对数据库执行具有异步I / O的流转换:
* An implementation of `AsyncFunction` that dispatches the requests
* A _callback_ that takes the result of the operation and hands it to the `ResultFuture`
* Applying the async I/O operation on a DataStream as a transformation
* An implementation of `AsyncFunction` that dispatches the requests 的实现AsyncFunction调度请求
* A _callback_ that takes the result of the operation and hands it to the `ResultFuture` 一个回调,是以操作并把它的结果ResultFuture
* Applying the async I/O operation on a DataStream as a transformation 在数据流上应用异步I / O操作作为转换
The following code example illustrates the basic pattern:
以下代码示例说明了基本模式:
```
......@@ -131,56 +143,73 @@ class AsyncDatabaseRequest extends AsyncFunction[String, (String, String)] {
**Important note**: The `ResultFuture` is completed with the first call of `ResultFuture.complete`. All subsequent `complete` calls will be ignored.
重要说明:ResultFuture的第一次调用完成ResultFuture.complete。所有后续complete呼叫将被忽略。
The following two parameters control the asynchronous operations:
以下两个参数控制异步操作​​:
* **Timeout**: The timeout defines how long an asynchronous request may take before it is considered failed. This parameter guards against dead/failed requests.
* 超时:超时定义异步请求在被视为失败之前可能需要花费多长时间。此参数防止无效/失败的请求
* **Capacity**: This parameter defines how many asynchronous requests may be in progress at the same time. Even though the async I/O approach leads typically to much better throughput, the operator can still be the bottleneck in the streaming application. Limiting the number of concurrent requests ensures that the operator will not accumulate an ever-growing backlog of pending requests, but that it will trigger backpressure once the capacity is exhausted.
* 容量:此参数定义同时可以处理多少个异步请求。即使异步I / O方法通常可以带来更好的吞吐量,但操作员仍然可能成为流式应用程序的瓶颈。限制并发请求的数量可确保操作员不会积累未决请求的不断增长的积压,但是一旦容量用尽,它将触发背压。
### Timeout Handling
### Timeout Handling 超时处理
When an async I/O request times out, by default an exception is thrown and job is restarted. If you want to handle timeouts, you can override the `AsyncFunction#timeout` method.
当异步I / O请求超时时,默认情况下会引发异常并重新启动作业。如果要处理超时,则可以覆盖该AsyncFunction#timeout方法。
### Order of Results
### Order of Results 结果顺序
The concurrent requests issued by the `AsyncFunction` frequently complete in some undefined order, based on which request finished first. To control in which order the resulting records are emitted, Flink offers two modes:
AsyncFunction频繁发出的并发请求以某些未定义的顺序完成,基于哪个请求首先完成。为了控制结果记录的发出顺序,Flink提供了两种模式:
* **Unordered**: Result records are emitted as soon as the asynchronous request finishes. The order of the records in the stream is different after the async I/O operator than before. This mode has the lowest latency and lowest overhead, when used with _processing time_ as the basic time characteristic. Use `AsyncDataStream.unorderedWait(...)` for this mode.
* 无序:。在异步I / O运算符之后,流中记录的顺序与之前不同。当将处理时间用作基本时间特征时,此模式具有最低的延迟和最低的开销。使用AsyncDataStream.unorderedWait(...)此模式。
* **Ordered**: In that case, the stream order is preserved. Result records are emitted in the same order as the asynchronous requests are triggered (the order of the operators input records). To achieve that, the operator buffers a result record until all its preceding records are emitted (or timed out). This usually introduces some amount of extra latency and some overhead in checkpointing, because records or results are maintained in the checkpointed state for a longer time, compared to the unordered mode. Use `AsyncDataStream.orderedWait(...)` for this mode.
* 已排序:在这种情况下,将保留流顺序。结果记录的发射顺序与异步请求的触发顺序相同(操作员输入记录的顺序)。为此,操作员对结果记录进行缓冲,直到发出其所有先前的记录(或将其超时)为止。这通常会带来一些额外的延迟和检查点的开销,因为与无序模式相比,记录或结果在检查点状态下的保存时间更长。使用AsyncDataStream.orderedWait(...)此模式。
### Event Time
### Event Time 活动时间
When the streaming application works with [event time](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_time.html), watermarks will be handled correctly by the asynchronous I/O operator. That means concretely the following for the two order modes:
当流应用程序处理事件时间时,异步I / O运算符将正确处理水印。对于两种订购模式,这具体意味着以下几点:
* **Unordered**: Watermarks do not overtake records and vice versa, meaning watermarks establish an _order boundary_. Records are emitted unordered only between watermarks. A record occurring after a certain watermark will be emitted only after that watermark was emitted. The watermark in turn will be emitted only after all result records from inputs before that watermark were emitted.
* 无序:水印不会超过记录,反之亦然,这意味着水印会建立顺序边界。记录仅在水印之间无序发出。在某个水印之后发生的记录仅在该水印被发出之后才被发出。反过来,仅在发出水印之前输入的所有结果记录之后,才发出水印。
That means that in the presence of watermarks, the _unordered_ mode introduces some of the same latency and management overhead as the _ordered_ mode does. The amount of that overhead depends on the watermark frequency.
这意味着,在水印的存在,将无序的方式介绍了一些相同的延迟和管理开销的订购模式一样。该开销的量取决于水印频率。
* **Ordered**: Order of watermarks an records is preserved, just like order between records is preserved. There is no significant change in overhead, compared to working with _processing time_.
* 有序:保留记录的水印顺序,就像保留记录之间的顺序一样。与处理时间相比,开销没有明显变化。
Please recall that _Ingestion Time_ is a special case of _event time_ with automatically generated watermarks that are based on the sources processing time.
请回想一下,“ 提取时间”是事件时间的特例,它具有基于源处理时间自动生成的水印。
### Fault Tolerance Guarantees
### Fault Tolerance Guarantees 容错保证
The asynchronous I/O operator offers full exactly-once fault tolerance guarantees. It stores the records for in-flight asynchronous requests in checkpoints and restores/re-triggers the requests when recovering from a failure.
异步I / O操作员提供完全的一次容错保证。它将正在进行的异步请求的记录存储在检查点中,并在从故障中恢复时恢复/重新触发请求。
### Implementation Tips
### Implementation Tips 实施技巧
For implementations with _Futures_ that have an _Executor_ (or _ExecutionContext_ in Scala) for callbacks, we suggests to use a `DirectExecutor`, because the callback typically does minimal work, and a `DirectExecutor` avoids an additional thread-to-thread handover overhead. The callback typically only hands the result to the `ResultFuture`, which adds it to the output buffer. From there, the heavy logic that includes record emission and interaction with the checkpoint bookkeeping happens in a dedicated thread-pool anyways.
对于使用具有Executor(或Scala中的ExecutionContext)进行回调的Futures实现,我们建议使用,因为回调通常完成的工作最少,并且避免了额外的线程间切换开销。回调通常仅将结果传递给,这会将其添加到输出缓冲区。从那里开始,繁重的逻辑(包括记录发出以及与检查点簿记的交互)始终在专用线程池中进行。
A `DirectExecutor` can be obtained via `org.apache.flink.runtime.concurrent.Executors.directExecutor()` or `com.google.common.util.concurrent.MoreExecutors.directExecutor()`.
### Caveat
### Caveat 警告
**The AsyncFunction is not called Multi-Threaded**
**The AsyncFunction is not called Multi-Threaded AsyncFunction不称为多线程**
A common confusion that we want to explicitly point out here is that the `AsyncFunction` is not called in a multi-threaded fashion. There exists only one instance of the `AsyncFunction` and it is called sequentially for each record in the respective partition of the stream. Unless the `asyncInvoke(...)` method returns fast and relies on a callback (by the client), it will not result in proper asynchronous I/O.
我们要在这里明确指出的一个常见混淆是,AsyncFunction它不是以多线程方式调用的。仅存在的一个实例,AsyncFunction并且在流的各个分区中为每个记录依次调用它。除非该asyncInvoke(...)方法快速返回并依赖于(由客户端)回调,否则它将不会导致正确的异步I / O。
For example, the following patterns result in a blocking `asyncInvoke(...)` functions and thus void the asynchronous behavior:
例如,以下模式导致阻塞asyncInvoke(...)功能,从而使异步行为无效:
* Using a database client whose lookup/query method call blocks until the result has been received back
* 使用数据库客户端,其查询/查询方法调用将阻塞,直到收到结果为止
* Blocking/waiting on the future-type objects returned by an asynchronous client inside the `asyncInvoke(...)` method
* 阻止/等待asyncInvoke(...)方法内部的异步客户端返回的将来类型的对象
\ No newline at end of file
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册