Continue explaining the following database operations: When a vertex execution fails for any reason the job manager is informed. If the vertex reported an error cleanly the process forwards it via the daemon before exiting; if the process crashes the daemon notifies the job manager; and if the daemon fails for any reason the job manager receives a heartbeat timeout. If the failure was due to a read error on an input channel (which will be reported cleanly) the default policy also marks the execution record that generated that version of the channel as failed and terminates its process if it is running. This will cause the vertex that created the failed input channel to be re-executed, and will lead in the end to the offending channel being re-created. Though a newly-failed execution record may have non-failed successor records, errors need not be propagated forwards: since vertices are deterministic two successors may safely compute using the outputs of different execution versions. Note however that under this policy an entire connected component of vertices connected by pipes or shared-memory FIFOs will fail as a unit since killing a running vertex will cause it to close its pipes, propagating errors in both directions along those edges. Any vertex whose execution record is set to failed is immediately considered for re-execution.
Time: 2024-02-14 18:19:45 Views: 28
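This passage describes how the database system in question handles a failed vertex execution. Whenever a vertex execution fails, the job manager is informed. If the vertex reports the error cleanly, the process forwards the error to the job manager via the daemon before exiting. If the process crashes, the daemon notifies the job manager. If the daemon itself fails for any reason, the job manager detects this through a heartbeat timeout. If the failure was caused by a read error on an input channel (which is reported cleanly), the default policy also marks the execution record that generated that version of the channel as failed and terminates its process if it is still running. As a result, the vertex that created the failed input channel is re-executed, and the offending channel is eventually re-created.
Although a newly failed execution record may have successor records that have not failed, errors need not be propagated forwards: because vertices are deterministic, two successors can safely compute using the outputs of different execution versions. Note, however, that under this policy an entire connected component of vertices joined by pipes or shared-memory FIFOs fails as a unit, because killing a running vertex causes it to close its pipes, which propagates errors in both directions along those edges. Any vertex whose execution record is set to failed is immediately considered for re-execution.
In summary, the passage describes how a failed vertex execution is handled, covering error notification, error forwarding, and re-execution. It also relies on vertices being deterministic, and notes that components connected by pipes or shared-memory FIFOs fail as a unit.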
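The read-error policy described above can be illustrated with a minimal sketch. All names here (`JobManager`, `ExecutionRecord`, `on_failure`, `retry_queue`) are hypothetical and are not taken from the system being discussed; the sketch only models the bookkeeping, and assumes that terminating a process and the pipe-connected-component behaviour are handled elsewhere.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class ExecutionRecord:
    vertex: str
    version: int
    state: State = State.RUNNING
    # Maps each input channel name to the execution record that produced it.
    inputs: dict = field(default_factory=dict)


class JobManager:
    """Hypothetical bookkeeping for the default failure policy."""

    def __init__(self):
        self.records = []      # every execution record, in creation order
        self.retry_queue = []  # vertices waiting to be considered for re-execution

    def start(self, vertex, inputs=None):
        # Version numbers count executions of the same vertex, starting at 1.
        version = 1 + sum(r.vertex == vertex for r in self.records)
        record = ExecutionRecord(vertex, version, inputs=dict(inputs or {}))
        self.records.append(record)
        return record

    def mark_failed(self, record):
        if record.state is State.FAILED:
            return
        # Assumption: a real system would also terminate the process here.
        record.state = State.FAILED
        self.retry_queue.append(record.vertex)

    def on_failure(self, record, bad_channel=None):
        # The failing execution itself is always marked failed.
        self.mark_failed(record)
        # On a read error, the producer of that channel version is failed too,
        # so the channel is eventually re-created. Other successors of the
        # producer are left alone: errors are not propagated forwards.
        if bad_channel is not None:
            self.mark_failed(record.inputs[bad_channel])


jm = JobManager()
producer = jm.start("A")
producer.state = State.COMPLETED
consumer = jm.start("B", inputs={"A.out": producer})
sibling = jm.start("C", inputs={"A.out": producer})

# B reports a read error on the channel produced by A.
jm.on_failure(consumer, bad_channel="A.out")
print(producer.state, consumer.state, sibling.state, jm.retry_queue)
```

In this sketch the sibling vertex `C` keeps running after `A` is marked failed, which mirrors the point that successors may safely use outputs from different execution versions of a deterministic vertex.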
Related questions
Failures are to be expected during the execution of any distributed application. Our default failure policy is suitable for the common case that all vertex programs are deterministic. Because our communication graph is acyclic, it is relatively straightforward to ensure that every terminating execution of a job with immutable inputs will compute the same result, regardless of the sequence of computer or disk failures over the course of the execution. Please explain this in the context of databases.
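This passage explains that failures are to be expected while a distributed application runs, and that the default failure policy is designed for the common case in which all vertex programs are deterministic. Because the application's communication graph is acyclic, it is relatively easy to ensure that every terminating execution of a job with immutable inputs computes the same result, regardless of the sequence of computer or disk failures that occur during the execution.
Databases face fault-tolerance problems as well. Common failure types include hardware failures, network failures, and software failures. To protect the integrity and reliability of data, databases typically adopt fault-tolerance strategies such as backup and recovery, redundant storage, and failover. Applying these strategies requires weighing data consistency and availability against the probability and impact of failures. Database design and implementation therefore need to take all of these factors into account in order to provide a suitable fault-tolerance strategy for each application scenario.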
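The claim that a job with deterministic vertices, an acyclic graph, and immutable inputs computes the same result regardless of failures can be illustrated with a small simulation. This is a sketch under stated assumptions: the graph, the vertex functions, and `run_job` are hypothetical, and a "failure" is modelled simply as an attempt whose output is discarded before the vertex is re-executed.

```python
import random

# Hypothetical acyclic job: each vertex is a pure function of its inputs.
GRAPH = {
    "read": ([], lambda: [3, 1, 2]),
    "sort": (["read"], lambda xs: sorted(xs)),
    "total": (["read"], lambda xs: sum(xs)),
    "report": (["sort", "total"], lambda s, t: (s, t)),
}
ORDER = ["read", "sort", "total", "report"]  # one topological order of GRAPH


def run_job(failure_rate, seed):
    """Run every vertex in order, re-executing any attempt that 'fails'.

    failure_rate must be below 1.0, otherwise no attempt can ever succeed.
    """
    rng = random.Random(seed)
    outputs = {}
    attempts = 0
    for name in ORDER:
        deps, fn = GRAPH[name]
        while True:
            attempts += 1
            if rng.random() < failure_rate:
                continue  # simulated crash: discard the attempt and retry
            outputs[name] = fn(*(outputs[d] for d in deps))
            break
    return outputs["report"], attempts


clean_result, clean_attempts = run_job(failure_rate=0.0, seed=0)
noisy_results = [run_job(failure_rate=0.5, seed=s)[0] for s in range(20)]
print(clean_result, all(r == clean_result for r in noisy_results))
```

The number of attempts varies with the failure pattern, but the final result does not, because each re-execution recomputes the same function over the same inputs.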
The scheduler inside the job manager keeps track of the state and history of each vertex in the graph. At present if the job manager’s computer fails the job is terminated, though the vertex scheduler could employ checkpointing or replication to avoid this. A vertex may be executed multiple times over the length of the job due to failures, and more than one instance of a given vertex may be executing at any given time. Each execution of the vertex has a version number and a corresponding “execution record” that contains the state of that execution and the versions of the predecessor vertices from which its inputs are derived. Each execution names its file-based output channels uniquely using its version number to avoid conflicts among versions. If the entire job completes successfully then each vertex selects a successful execution and renames its output files to their correct final forms. Please explain this passage.
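This passage describes how the scheduler inside the job manager manages and tracks the state and history of each vertex in the graph. At present, the job is terminated if the job manager's computer fails, although the scheduler could use checkpointing or replication to avoid this.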
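Over the course of a job, a vertex may be executed several times because of failures, and more than one instance of the same vertex may be running at once. Each execution of a vertex has a version number and a corresponding "execution record" that holds the state of that execution and the versions of the predecessor vertices from which its inputs are derived. To avoid conflicts among versions, each execution uses its version number to give its file-based output channels unique names. If the entire job completes successfully, each vertex selects one successful execution and renames its output files to their correct final forms.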
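In summary, the passage describes how the job manager's scheduler works and introduces related concepts such as checkpointing, replication, version numbers, and execution records. These mechanisms let the job manager tolerate failures of the computers running vertices so that the job can still complete.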
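The version-based naming and final rename step can be sketched as follows. The file-name scheme (`<vertex>.v<version>.tmp` and `<vertex>.out`), the record layout, and the function names are assumptions made for illustration only; the passage does not specify them.

```python
import os
import tempfile


def versioned_name(vertex, version):
    # Hypothetical naming scheme: the version number keeps outputs of
    # different executions of the same vertex from colliding.
    return f"{vertex}.v{version}.tmp"


def run_execution(out_dir, vertex, version, data, fail=False):
    """Write one execution's output and return its execution record."""
    path = os.path.join(out_dir, versioned_name(vertex, version))
    with open(path, "w") as f:
        f.write(data)
    return {"vertex": vertex, "version": version, "path": path,
            "state": "failed" if fail else "completed"}


def finalize(out_dir, records):
    """After the job succeeds, pick one completed execution per vertex
    and rename its output to the final name."""
    final_paths = {}
    for rec in records:
        if rec["state"] == "completed" and rec["vertex"] not in final_paths:
            final = os.path.join(out_dir, rec["vertex"] + ".out")
            os.replace(rec["path"], final)
            final_paths[rec["vertex"]] = final
    return final_paths


out_dir = tempfile.mkdtemp()
records = [
    run_execution(out_dir, "sort", 1, "partial", fail=True),  # failed attempt
    run_execution(out_dir, "sort", 2, "1 2 3"),               # successful re-execution
    run_execution(out_dir, "sort", 3, "1 2 3"),               # concurrent duplicate
]
final_paths = finalize(out_dir, records)
print(sorted(os.listdir(out_dir)))
```

Here the first completed execution is chosen; because the vertices are assumed to be deterministic, any completed execution would give the same final output. Cleaning up the leftover versioned files is omitted from the sketch.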