How does HP/Tandem NonStop achieve single-failure FT without spares?

As far as I can gather from Wikipedia and the mind-boggling HPE website, the NonStop system architecture's claim to fame is that it achieves single-failure fault tolerance (FT) without allocating excessive spare capacity (in a lockstepped architecture, by contrast, you would typically need to overprovision by 3x).

This seems a desirable property, yet I couldn't find more details about the approach they use or its caveats. Specifically: what assumptions do they make about the network, what kinds of failures do they tolerate, what client behavior do they assume, what recovery time is acceptable, what workloads do they run, and so on?

Could anybody briefly describe how the NonStop system solves the typical problems of failure detection and failure correction? Is it a generic, magical solution at the system level, or does it require applications to be written to use certain transaction facilities and to checkpoint their data and communications?

Thanks a lot!

Best answer

This paper from HP covers your questions at a conceptual level:

http://www.hpl.hp.com/techreports/tandem/TR-86.2.pdf
