Commit 63db0c42 (unverified) · Author: 片刻小哥哥 · Committer: GitHub

Merge pull request #58 from apachecn/feature/flink_1.7_doc_zh_22

22 done
# Checkpointing
Every function and operator in Flink can be **stateful** (see [working with state](state.html) for details). Stateful functions store data across the processing of individual elements/events, making state a critical building block for any type of more elaborate operation.
In order to make state fault tolerant, Flink needs to **checkpoint** the state. Checkpoints allow Flink to recover state and positions in the streams to give the application the same semantics as a failure-free execution.
The [documentation on streaming fault tolerance](//ci.apache.org/projects/flink/flink-docs-release-1.7/internals/stream_checkpointing.html) describes in detail the technique behind Flink’s streaming fault tolerance mechanism.
## Prerequisites
Flink’s checkpointing mechanism interacts with durable storage for streams and state. In general, it requires:
* A _persistent_ (or _durable_) data source that can replay records for a certain amount of time. Examples of such sources are persistent message queues (e.g., Apache Kafka, RabbitMQ, Amazon Kinesis, Google PubSub) or file systems (e.g., HDFS, S3, GFS, NFS, Ceph, …).
* A persistent storage for state, typically a distributed filesystem (e.g., HDFS, S3, GFS, NFS, Ceph, …).
## Enabling and Configuring Checkpointing
By default, checkpointing is disabled. To enable checkpointing, call `enableCheckpointing(n)` on the `StreamExecutionEnvironment`, where _n_ is the checkpoint interval in milliseconds.
Other parameters for checkpointing include:
* _exactly-once vs. at-least-once_: You can optionally pass a mode to the `enableCheckpointing(n)` method to choose between the two guarantee levels. Exactly-once is preferable for most applications. At-least-once may be relevant for certain super-low-latency applications (consistently a few milliseconds).
* _checkpoint timeout_: The time after which a checkpoint-in-progress is aborted, if it did not complete by then.
* _minimum time between checkpoints_: To make sure that the streaming application makes a certain amount of progress between checkpoints, one can define how much time needs to pass between checkpoints. If this value is set, for example, to _5000_, the next checkpoint will be started no sooner than 5 seconds after the previous checkpoint completed, regardless of the checkpoint duration and the checkpoint interval. Note that this implies that the checkpoint interval will never be smaller than this parameter.
    It is often easier to configure applications by defining the “time between checkpoints” than the checkpoint interval, because the “time between checkpoints” is not affected when checkpoints sometimes take longer than average (for example, if the target storage system is temporarily slow).
    Note that this value also implies that the number of concurrent checkpoints is _one_.
* _number of concurrent checkpoints_: By default, the system will not trigger another checkpoint while one is still in progress. This ensures that the topology does not spend so much time on checkpoints that it fails to make progress processing the streams. It is possible to allow multiple overlapping checkpoints, which is useful for pipelines that have a certain processing delay (for example, because the functions call external services that need some time to respond) but that still want to checkpoint very frequently (every few hundred milliseconds) so that very little has to be re-processed after a failure.
    This option cannot be used when a minimum time between checkpoints is defined.
* _externalized checkpoints_: You can configure periodic checkpoints to be persisted externally. Externalized checkpoints write their metadata out to persistent storage and are _not_ automatically cleaned up when the job fails. This way, you will have a checkpoint around to resume from if your job fails. There are more details in the [deployment notes on externalized checkpoints](//ci.apache.org/projects/flink/flink-docs-release-1.7/ops/state/checkpoints.html#externalized-checkpoints).
* _fail/continue task on checkpoint errors_: This determines whether a task fails if an error occurs while executing the task’s checkpoint procedure. Failing the task is the default behaviour. Alternatively, when this is disabled, the task simply declines the checkpoint to the checkpoint coordinator and continues running.
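The parameters listed above map onto setters of the environment's `CheckpointConfig`. The following is a minimal sketch, assuming the Flink 1.7 Scala streaming API is on the classpath; the object name `CheckpointingSketch` and all numeric values are illustrative assumptions, not recommendations, and the job definition itself is omitted.

```scala
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

// Hypothetical wrapper object; only the configuration calls matter here.
object CheckpointingSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Enable checkpointing with an interval of 1000 ms (illustrative value).
    env.enableCheckpointing(1000)

    val config = env.getCheckpointConfig

    // Guarantee level: exactly-once (the alternative is AT_LEAST_ONCE).
    config.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)

    // Minimum time between checkpoints: at least 500 ms after the previous one completed.
    config.setMinPauseBetweenCheckpoints(500)

    // Checkpoint timeout: abort a checkpoint that has not completed within one minute.
    config.setCheckpointTimeout(60000)

    // Number of concurrent checkpoints: only one in progress at a time.
    config.setMaxConcurrentCheckpoints(1)

    // Externalized checkpoints: keep the checkpoint when the job is cancelled.
    config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)

    // Fail the task on checkpoint errors (pass false to decline the checkpoint instead).
    config.setFailOnCheckpointingErrors(true)

    // Sources, transformations, sinks and env.execute(...) would follow here.
  }
}
```

Keeping the maximum number of concurrent checkpoints at one is consistent with the note above that a minimum pause and overlapping checkpoints cannot be combined.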
### Related Config Options
Some more parameters and/or defaults may be set via `conf/flink-conf.yaml` (see [configuration](//ci.apache.org/projects/flink/flink-docs-release-1.7/ops/config.html) for a full guide):
| Key | Default | Description |
| --- | --- | --- |
| `state.backend` | (none) | The state backend to be used to store and checkpoint state. |
| `state.backend.async` | true | Whether the state backend should use an asynchronous snapshot method where possible and configurable. Some state backends may not support asynchronous snapshots, or only support asynchronous snapshots, and ignore this option. |
| `state.backend.fs.memory-threshold` | 1024 | The minimum size of state data files. All state chunks smaller than that are stored inline in the root checkpoint metadata file. |
| `state.backend.incremental` | false | Whether the state backend should create incremental checkpoints, if possible. For an incremental checkpoint, only a diff from the previous checkpoint is stored, rather than the complete checkpoint state. Some state backends may not support incremental checkpoints and ignore this option. |
| `state.backend.local-recovery` | false | Configures local recovery for this state backend. By default, local recovery is deactivated. Local recovery currently only covers keyed state backends. Currently, MemoryStateBackend does not support local recovery and ignores this option. |
| `state.checkpoints.dir` | (none) | The default directory used for storing the data files and metadata of checkpoints in a Flink-supported filesystem. The storage path must be accessible from all participating processes/nodes (i.e., all TaskManagers and JobManagers). |
| `state.checkpoints.num-retained` | 1 | The maximum number of completed checkpoints to retain. |
| `state.savepoints.dir` | (none) | The default directory for savepoints. Used by the state backends that write savepoints to file systems (MemoryStateBackend, FsStateBackend, RocksDBStateBackend). |
| `taskmanager.state.local.root-dirs` | (none) | The root directories for storing file-based state for local recovery. Local recovery currently only covers keyed state backends. Currently, MemoryStateBackend does not support local recovery and ignores this option. |
## Selecting a State Backend
Flink’s [checkpointing mechanism](//ci.apache.org/projects/flink/flink-docs-release-1.7/internals/stream_checkpointing.html) stores consistent snapshots of all the state in timers and stateful operators, including connectors, windows, and any [user-defined state](state.html). Where the checkpoints are stored (e.g., JobManager memory, file system, database) depends on the configured **State Backend**.
By default, state is kept in memory in the TaskManagers and checkpoints are stored in memory in the JobManager. For proper persistence of large state, Flink supports various approaches for storing and checkpointing state in other state backends. The choice of state backend can be configured via `StreamExecutionEnvironment.setStateBackend(…)`.
See [state backends](//ci.apache.org/projects/flink/flink-docs-release-1.7/ops/state/state_backends.html) for more details on the available state backends and options for job-wide and cluster-wide configuration.
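As an illustration of a job-wide choice, the sketch below selects a file-system state backend through `setStateBackend`. It assumes the Flink 1.7 Scala API; the object name `StateBackendSketch`, the HDFS URI and the checkpoint interval are hypothetical placeholders.

```scala
import org.apache.flink.runtime.state.StateBackend
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

// Hypothetical wrapper object; only the backend selection matters here.
object StateBackendSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Placeholder URI: point this at a directory reachable from all
    // TaskManagers and JobManagers.
    val backend: StateBackend =
      new FsStateBackend("hdfs://namenode:40010/flink/checkpoints")

    // Checkpoints are now written to the file system instead of JobManager memory.
    env.setStateBackend(backend)

    // Checkpointing still has to be enabled separately (interval in ms, illustrative).
    env.enableCheckpointing(10000)
  }
}
```

Typing the value as `StateBackend` is a small design choice: it selects the general `setStateBackend(StateBackend)` overload explicitly.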
## State Checkpoints in Iterative Jobs
Flink currently only provides processing guarantees for jobs without iterations. Enabling checkpointing on an iterative job causes an exception. In order to force checkpointing on an iterative program the user needs to set a special flag when enabling checkpointing: `env.enableCheckpointing(interval, force = true)`.
Note that records in flight on the loop edges (and the state changes associated with them) will be lost on failure.
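A minimal sketch of forcing checkpointing for an iterative job, assuming the three-argument `enableCheckpointing(interval, mode, force)` overload of the Flink 1.7 Scala API (marked deprecated there); the object name and the interval are illustrative assumptions.

```scala
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

// Hypothetical wrapper object; the iterative job definition is omitted.
object IterativeCheckpointingSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // The third argument is the force flag: it enables checkpointing even though
    // the job contains iterations. Records in flight on the loop edges are still
    // not covered by the checkpoint.
    env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE, true)
  }
}
```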
## Restart Strategies
Flink supports different restart strategies which control how the jobs are restarted in case of a failure. For more information, see [Restart Strategies](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/restart_strategies.html).