This is a preliminary study of how CAPI accelerates workloads: it looks at cache coherence, compares the strengths and weaknesses of CAPI against ordinary PCIe accelerator devices, gives a partial summary of how CAPI 1.0 is used, and briefly surveys the current state of CAPI, the relevant web sites, and how it compares with CAPI 2.0. It also gives a short introduction to the three new open high-speed coherent interfaces (CCIX, Gen-Z, OpenCAPI).

The principles of CAPI are covered, including the CAPI 2.0 bus interface, the overall flow, and the simulation steps (along with some history and the detours I took myself).

To serve accelerators, the industry has been defining open standards for high-performance coherent CPU interfaces; three such open standards appeared in 2016 (OpenCAPI, Gen-Z, and CCIX), and this article touches on them briefly.

I call this a preliminary study because it lacks a software-level analysis of CAPI, for example how exactly the I/O overhead is reduced, and there is no performance comparison against plain I/O-attached acceleration. In particular, I have no measurements of my own for the benefit brought by cache coherence, only the figures published by the POWER team.

CAPI stands for Coherent Accelerator Processor Interface. As an important acceleration feature of the POWER processor architecture, it offers users a customizable, efficient, easy-to-use hardware acceleration solution that offloads work from the CPU, with the FPGA as its implementation vehicle. In the POWER8 generation the CAPI PSL (an encrypted IP core) was implemented on Altera FPGAs; after Altera was acquired by Intel it became an IP core on Xilinx devices. PSL resource usage has to be looked up for the target device; the figures I have cover CAPI 1.0 on Altera. Since CAPI 2.0 and 1.0 share the same basic principles, and I have mainly worked with 1.0, "CAPI" in this article refers to CAPI 1.0 unless otherwise stated.

Limited by time and resources, I have not studied OpenCAPI in depth either.

Given the limits of my ability and time, the article is bound to contain mistakes; corrections are welcome and discussion is encouraged. Most of the content is original, and even copied material is attributed, so please cite the source when reposting:

http://www.cnblogs.com/e-shannon/p/7495618.html

1)

2)

3) <OpenPOWER and the Roadmap Ahead.pdf>

4) Source URL:

www-304.ibm.com/webapp/set2/sas/f/capi/CAPI_POWER8.pdf

5)

https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/61ad9cf2-c6a3-4d2c-b779-61ff0266d32a/page/1cb956e8-4160-4bea-a956-e51490c2b920/attachment/56cea2a9-a574-4fbb-8b2c-675432367250/media/POWER9-VUG.pdf

CAPI:   Coherent Accelerator Processor Interface

POWER:  Performance Optimization With Enhanced RISC

HDK:   Hardware development kit

SDK:   Software development kit

CCIX:   Cache Coherent Interconnect for Accelerators.   www.ccixconsortium.com

OpenCAPI: Open Coherent Accelerator Processor Interface   www.opencapi.org

Gen-Z:   genzconsortium.org

LRU:   least recently used

HPC:   High Performance Computing

DMI:   Durable Memory Interface (OpenPOWER and the Roadmap Ahead.pdf)

QPI:   The Intel QuickPath Interconnect (QPI) is a point-to-point processor interconnect developed by Intel which replaced the front-side bus (FSB) in Xeon, Itanium, and certain desktop platforms starting in 2008 (wiki); it competes with AMD's HyperTransport (HT).

SMP:   Symmetric Multi-Processor, a UMA design in which the CPU cores share all resources; the POWER architecture uses SMP.

NUMA:  Non-Uniform Memory Access. In contrast to SMP, the CPUs are split into groups and local memory is accessed faster than remote memory, hence "non-uniform". The trend in hardware has been towards more than one system bus, each serving a small set of processors. Each group of processors has its own memory and possibly its own I/O channels. However, each CPU can access memory associated with the other groups in a coherent way. Each group is called a NUMA node. The number of CPUs within a NUMA node depends on the hardware vendor. It is faster to access local memory than the memory associated with other NUMA nodes. This is the reason for the name, non-uniform memory access architecture.

https://technet.microsoft.com/en-us/library/ms178144(v=sql.105).aspx

MPP:   Massive Parallel Processing, several groups of SMP CPUs interconnected through network nodes; memory cannot be accessed across groups, which lets the system scale almost without limit.

Differences between NUMA and MPP

http://www.cnblogs.com/yubo/archive/2010/04/23/1718810.html

Architecturally, NUMA and MPP have a lot in common: both consist of multiple nodes, each node has its own CPUs, memory, and I/O, and the nodes exchange information through an interconnect. So where do they differ? The differences become clear by examining the internal architecture and operation of NUMA and MPP servers.

First, the node interconnect is different. In NUMA the interconnect is implemented inside a single physical server, and when a CPU needs to access remote memory it has to wait; this is the main reason NUMA servers cannot scale performance linearly as CPUs are added. In MPP the interconnect is implemented externally, over I/O, between separate SMP servers: each node accesses only its own local memory and storage, and inter-node communication runs in parallel with the nodes' own processing, so MPP performance scales roughly linearly as nodes are added.

Second, the memory access model is different. Inside a NUMA server any CPU can access the whole system's memory, but remote accesses perform far worse than local ones, so applications should avoid remote memory access as much as possible. In an MPP server each node accesses only its local memory, so the problem of remote memory access does not exist.
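
To make the local-versus-remote distinction concrete, below is a minimal C sketch of NUMA-aware allocation. It is my own example, not from the article above, and it assumes the libnuma development package (numactl/libnuma) and a NUMA-capable machine; it places one buffer on the calling CPU's node and one on the highest-numbered node, which is exactly the situation where the remote-access penalty described above appears.

```c
#define _GNU_SOURCE           /* for sched_getcpu() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sched.h>
#include <numa.h>             /* libnuma; build with: gcc numa_demo.c -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "this machine/kernel is not NUMA-aware\n");
        return EXIT_FAILURE;
    }

    size_t len = 64UL * 1024 * 1024;      /* 64 MiB test buffers   */
    int last   = numa_max_node();         /* highest NUMA node id  */

    void *local  = numa_alloc_local(len);         /* memory on the caller's node  */
    void *remote = numa_alloc_onnode(len, last);  /* memory pinned to node `last` */
    if (!local || !remote) {
        fprintf(stderr, "NUMA allocation failed\n");
        return EXIT_FAILURE;
    }

    /* Touch the pages so they are actually backed on their nodes. */
    memset(local, 0, len);
    memset(remote, 0, len);

    printf("running on CPU %d; second buffer pinned to node %d\n",
           sched_getcpu(), last);

    numa_free(local, len);
    numa_free(remote, len);
    return EXIT_SUCCESS;
}
```

Timing a simple read or memset loop over the two buffers on a multi-socket server shows the local buffer being served noticeably faster, which is the effect the NUMA discussion above describes.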

NUCA:  Non-Uniform Cache Architecture, used for the POWER9 L3 cache.

ISA:   instruction set architecture

CAIA:  Coherent Accelerator Interface Architecture. Defines a coherent accelerator interface structure for coherently attaching accelerators to POWER systems using a standard PCIe bus. The intent is to allow implementation of a wide range of accelerators in order to optimally address many different market segments.

CAPP:  Coherent Accelerator Processor Proxy. A design unit that snoops the PowerBus commands and provides coherency responses reflecting the state of the caches in the PSL. It issues commands to the PSL so that the PSL can provide data responses.

PSL:   Power Service Layer. The PSL provides the address translation and system memory cache for the AFUs. In addition, the PSL provides miscellaneous facilities for the host processor to manage the virtualization of the AFUs, interrupts, and memory management.

AFU:   Accelerator Function Unit

Effective Address (EA) / Real Address (RA): see Power ISA Book III. The AFU uses Effective Addressing, which is the process's own address space (the industry calls this "virtual"). The PSL translates the Effective Address into a Real Address (the industry calls this "physical") for accessing memory within the PowerPC system.

MMIO:  Memory-mapped input/output.

WED:   Work Element Descriptor. When an application requests use of an AFU, a process element is added to the process-element linked list that describes the application's process state. The process element also contains a work element descriptor (WED) provided by the application. The WED can contain the full description of the job to be performed, or a pointer to other main memory structures in the application's memory space. Several programming models are described, allowing an AFU to be used by any application or to be dedicated to a single application.
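
As a rough illustration of the WED and effective-address flow described above, below is a hedged C sketch built on libcxl, the user-space library IBM provides for CAPI on Linux. The device path, the MMIO register offset, and the struct my_wed layout are illustrative assumptions of mine (every real AFU defines its own WED format and register map), and the exact libcxl function names and signatures should be checked against the libcxl.h installed on your system.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <inttypes.h>
#include <libcxl.h>           /* CAPI user-space library; link with -lcxl */

/* Hypothetical WED layout: a job descriptor the AFU reads using the
 * process's effective (virtual) addresses.                            */
struct my_wed {
    uint64_t src;             /* effective address of the source buffer */
    uint64_t dst;             /* effective address of the result buffer */
    uint64_t size;            /* number of bytes to process             */
} __attribute__((aligned(128)));  /* keep the WED cache-line aligned    */

int main(void)
{
    struct my_wed wed = { 0 };
    uint64_t status = 0;

    /* Open the AFU; the device name is an assumption, see /dev/cxl. */
    struct cxl_afu_h *afu = cxl_afu_open_dev("/dev/cxl/afu0.0d");
    if (!afu) {
        perror("cxl_afu_open_dev");
        return EXIT_FAILURE;
    }

    /* Attach this process and hand the AFU a pointer to the WED.
     * From this point the PSL translates our effective addresses,
     * so the WED may carry ordinary user-space pointers.            */
    if (cxl_afu_attach(afu, (uint64_t)(uintptr_t)&wed)) {
        perror("cxl_afu_attach");
        return EXIT_FAILURE;
    }

    /* Map the AFU's MMIO space and poll a (hypothetical) status
     * register at offset 0x0.                                       */
    if (cxl_mmio_map(afu, CXL_MMIO_BIG_ENDIAN)) {
        perror("cxl_mmio_map");
        return EXIT_FAILURE;
    }
    cxl_mmio_read64(afu, 0x0, &status);
    printf("AFU status register: 0x%016" PRIx64 "\n", status);

    cxl_mmio_unmap(afu);
    cxl_afu_free(afu);
    return EXIT_SUCCESS;
}
```

The important step is cxl_afu_attach(): once the process is attached, the AFU works inside the application's address space, which is what distinguishes CAPI from a conventional PCIe device that needs pinned, physically addressed DMA buffers.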
