The scheduler inside the job manager keeps track of the state and history of each vertex in the graph. At present, if the job manager’s computer fails, the job is terminated, though the vertex scheduler could employ checkpointing or replication to avoid this. A vertex may be executed multiple times over the length of the job due to failures, and more than one instance of a given vertex may be executing at any given time. Each execution of the vertex has a version number and a corresponding “execution record” that contains the state of that execution and the versions of the predecessor vertices from which its inputs are derived. Each execution names its file-based output channels uniquely using its version number to avoid conflicts among versions. If the entire job completes successfully, then each vertex selects a successful execution and renames its output files to their correct final forms.
Please help me understand this passage.
Time: 2024-02-14 09:19:50
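This passage describes how the scheduler inside a job manager manages and tracks the state and history of each vertex in the graph. At present, if the job manager’s computer fails, the whole job is terminated. The passage mentions checkpointing and replication only as ways the scheduler could avoid this in the future, not as mechanisms that are currently in place.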
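To make this bookkeeping concrete, here is a minimal Python sketch of the per-vertex history a scheduler might keep. It is not the actual implementation described in the passage; the class names (`ExecState`, `ExecutionRecord`, `VertexHistory`), the three states, and the choice to number versions 1, 2, 3, … in start order are all hypothetical. It shows that one vertex can accumulate several execution records, more than one of which may be running at the same time.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class ExecState(Enum):
    # Hypothetical set of states; the passage only says a record holds "the state".
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class ExecutionRecord:
    version: int                    # hypothetical scheme: 1, 2, 3, ... in start order
    state: ExecState
    input_versions: Dict[str, int]  # predecessor vertex -> version its input came from


@dataclass
class VertexHistory:
    name: str
    records: List[ExecutionRecord] = field(default_factory=list)

    def start_execution(self, input_versions: Dict[str, int]) -> ExecutionRecord:
        """Create a new execution record with the next version number."""
        record = ExecutionRecord(
            version=len(self.records) + 1,
            state=ExecState.RUNNING,
            input_versions=dict(input_versions),  # copy so later caller changes don't leak in
        )
        self.records.append(record)
        return record

    def running(self) -> List[ExecutionRecord]:
        """All executions of this vertex that are still in progress."""
        return [r for r in self.records if r.state is ExecState.RUNNING]

    def first_success(self) -> Optional[ExecutionRecord]:
        """Pick a successful execution (here simply the earliest one), if any."""
        for r in self.records:
            if r.state is ExecState.SUCCEEDED:
                return r
        return None


if __name__ == "__main__":
    history = VertexHistory("sort")
    first = history.start_execution({"read": 1})
    second = history.start_execution({"read": 1})  # a second, concurrent instance
    first.state = ExecState.FAILED
    second.state = ExecState.SUCCEEDED
    print(history.first_success().version)  # 2
```

Choosing the earliest successful execution is only one possible policy; the passage says just that each vertex "selects a successful execution".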
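During a job, a vertex may be executed multiple times because of failures, and several instances of the same vertex may be running at the same time. Each execution of a vertex has a version number and a corresponding “execution record” that holds the state of that execution and the versions of the predecessor vertices its inputs come from. To avoid conflicts between versions, each execution uses its version number to give its file-based output channels unique names. If the entire job completes successfully, each vertex selects one successful execution and renames that execution’s output files to their final names.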
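The naming and commit step can be sketched with ordinary file operations. This is a minimal illustration assuming a local directory and a hypothetical naming scheme `<vertex>.<channel>.v<version>`; the function names are made up for this example. Each execution writes to its own versioned file, so repeated or concurrent executions never overwrite each other, and only after the job succeeds are the chosen version’s files renamed to their final names with `os.replace`.

```python
import os
import tempfile
from typing import Iterable


def versioned_output_path(out_dir: str, vertex: str, channel: str, version: int) -> str:
    # Hypothetical naming scheme: the version number makes each execution's file unique.
    return os.path.join(out_dir, f"{vertex}.{channel}.v{version}")


def final_output_path(out_dir: str, vertex: str, channel: str) -> str:
    # Final name used once a successful execution has been chosen.
    return os.path.join(out_dir, f"{vertex}.{channel}")


def write_output(out_dir: str, vertex: str, channel: str, version: int, data: bytes) -> str:
    """One execution writes its output channel under its own versioned name."""
    path = versioned_output_path(out_dir, vertex, channel, version)
    with open(path, "wb") as f:
        f.write(data)
    return path


def commit_outputs(out_dir: str, vertex: str, channels: Iterable[str], chosen_version: int) -> None:
    """After the whole job succeeds, rename the chosen version's files to their final names."""
    for channel in channels:
        os.replace(
            versioned_output_path(out_dir, vertex, channel, chosen_version),
            final_output_path(out_dir, vertex, channel),
        )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        write_output(d, "sort", "out0", 1, b"partial")      # version 1, later failed
        write_output(d, "sort", "out0", 2, b"sorted data")  # version 2, succeeded
        commit_outputs(d, "sort", ["out0"], 2)
        print(sorted(os.listdir(d)))  # ['sort.out0', 'sort.out0.v1']
```

Files from versions that were not chosen (here `sort.out0.v1`) are left in place; this sketch does not delete them.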
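In short, the passage describes how the job manager’s scheduler works and introduces several related concepts: version numbers, execution records, and version-specific output names, with checkpointing and replication mentioned only as possible protections for the job manager itself. Together, these mechanisms let the job tolerate failures of individual vertex executions by re-running them without conflicting outputs. A failure of the job manager’s own computer, however, still terminates the job.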