Published in: IEEE Transactions on Parallel and Distributed Systems

Diagnosis of Performance Faults in Large-Scale MPI Applications via Probabilistic Progress-Dependence Inference


Abstract

Debugging large-scale parallel applications is challenging. Most existing techniques provide little information about the root causes of failures. Further, most debuggers significantly slow down program execution and run sluggishly with massively parallel applications. This paper presents a novel technique that scalably infers the tasks in a parallel program on which a failure occurred, as well as the code in which it originated. Our technique combines scalable runtime analysis with static analysis to determine the least-progressed task(s) and to identify the code lines at which the failure arose. We present a novel algorithm that probabilistically infers progress dependence among MPI tasks using a globally constructed Markov model that represents the tasks' control-flow behavior. Compared with previous work, our algorithm infers the least-progressed task more precisely. We combine this technique with static backward slicing analysis to further isolate the code responsible for the current state. A blind study demonstrates that our technique isolates the root cause of a concurrency bug in a molecular dynamics simulation, a bug that manifests only at 7,996 or more tasks. We extensively evaluate the fault coverage of our technique via fault injections in 10 HPC benchmarks and show that our analysis takes less than a few seconds on thousands of parallel tasks.
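
The idea of inferring progress dependence from a Markov model of control flow can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the state labels, and the dependence rule (task A is treated as waiting on task B when B's state is more likely to lead to A's state than the reverse) are all assumptions made for illustration only.

```python
from collections import defaultdict

def transition_probs(traces):
    """Estimate Markov transition probabilities from per-task state traces."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }

def reach_prob(probs, start, target, sweeps=200):
    """Approximate the probability of eventually reaching `target` from `start`."""
    states = set(probs) | {d for ds in probs.values() for d in ds} | {start, target}
    hit = {s: 0.0 for s in states}
    hit[target] = 1.0
    for _ in range(sweeps):  # fixed-point iteration, starting from zero
        for s in states:
            if s != target:
                hit[s] = sum(p * hit[d] for d, p in probs.get(s, {}).items())
    return hit[start]

def least_progressed(probs, task_states):
    """Return tasks that do not appear to wait on any other task (assumed rule)."""
    waiting = set()
    for a, state_a in task_states.items():
        for b, state_b in task_states.items():
            if state_a == state_b:
                continue
            # Assumption: a depends on b if b's state tends to lead to a's state.
            if reach_prob(probs, state_b, state_a) > reach_prob(probs, state_a, state_b):
                waiting.add(a)
    return sorted(t for t in task_states if t not in waiting)

# Hypothetical traces: three tasks that follow the same control-flow path.
traces = [["init", "compute", "send", "recv", "finalize"]] * 3
model = transition_probs(traces)
# Tasks 0 and 1 are stuck in "recv"; task 2 is still in "compute".
print(least_progressed(model, {0: "recv", 1: "recv", 2: "compute"}))  # [2]
```

In this toy model, task 2 is reported as least progressed because its state precedes the state in which the other tasks are blocked. A real tool would build the model from instrumented MPI calls across all tasks and would need to handle loops and ambiguous orderings, which this sketch does not attempt.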
