An ECCV 2020 paper (Learning Joint Spatial-Temporal Transformations for Video Inpainting) from Sun Yat-sen University and Microsoft Research Asia.
It applies a transformer to video inpainting to fix the temporal inconsistency caused by frame-by-frame completion (temporal inconsistency: large local pixel changes between adjacent frames cause abnormal jitter during playback and blurry artifacts in the video).
paper: https://arxiv.org/abs/2007.10247
github: https://github.com/researchmm/STTN

Overview

SOTA methods adopt attention modules to search reference frames for the missing contents and fill the video frame by frame, but such methods can be troubled by inconsistent attention results along the spatial and temporal dimensions, leading to blurriness and temporal artifacts in the video. This paper proposes STTN (Spatial-Temporal Transformer Network), which fills the missing regions of all input frames simultaneously.
Motivation: methods based on 3D convolutions are limited by the size of the temporal window (they directly stack the frames along the channel dimension into one large tensor for convolution, an approach that needs some refinement to avoid the hidden costs of high-dimensional convolution, such as wasted computation and heavy data dependence). Methods based on attention avoid these problems and can gather useful information from frames that are far apart. However, they are built on the assumption of simple global transformations or homogeneous motion, so complex motion causes inconsistent matching from frame to frame and step to step; moreover, completion is performed frame by frame, with no optimization specifically designed to guarantee temporal consistency. Post-processing can stabilize the generated video, but it fails when ghosting artifacts are too heavy.
This paper formulates video completion as a "multi-to-multi" problem: it takes both neighboring frames and distant frames as input and simultaneously fills the missing regions of all input frames.

Network

The overall network consists of three parts (a minimal sketch follows this list):
1. frame-level encoder: extracts deep features from each frame
2. spatial-temporal transformer: the core component; it jointly learns the spatial-temporal transformations of the missing regions in the deep feature space
3. frame-level decoder: decodes the features back into frames
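
Below is a minimal PyTorch sketch of this three-part layout. All class names, layer sizes, and the choice to feed the mask as a fourth input channel are illustrative assumptions rather than the authors' exact configuration, and the transformer layers are left as identity stand-ins (a patch-attention head is sketched in the transformer section below).

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Strided 2D convolutions: each (masked) frame -> a downsampled deep feature map."""
    def __init__(self, c=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, c, 3, stride=2, padding=1), nn.LeakyReLU(0.2),      # RGB + mask channel (assumption)
            nn.Conv2d(c, 2 * c, 3, stride=2, padding=1), nn.LeakyReLU(0.2))
    def forward(self, x):                        # x: (B*T, 4, H, W)
        return self.net(x)                       # (B*T, 2c, H/4, W/4)

class FrameDecoder(nn.Module):
    """Mirror of the encoder: deep features -> RGB frames."""
    def __init__(self, c=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(2 * c, c, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(c, 3, 4, stride=2, padding=1), nn.Tanh())
    def forward(self, x):
        return self.net(x)

class STTNSketch(nn.Module):
    """Frame-level encoder -> stacked spatial-temporal transformer layers -> frame-level decoder."""
    def __init__(self, num_layers=4, c=64):
        super().__init__()
        self.encoder = FrameEncoder(c)
        # Identity stand-ins for the stacked spatial-temporal transformer layers.
        self.transformers = nn.ModuleList([nn.Identity() for _ in range(num_layers)])
        self.decoder = FrameDecoder(c)
    def forward(self, frames, masks):            # frames: (B,T,3,H,W); masks: (B,T,1,H,W), 1 = hole
        b, t, _, h, w = frames.shape
        x = torch.cat([frames * (1 - masks), masks], dim=2)     # zero out holes, append the mask
        feat = self.encoder(x.flatten(0, 1))
        for layer in self.transformers:
            feat = layer(feat)                                  # joint completion in the feature space
        return self.decoder(feat).view(b, t, 3, h, w)

frames, masks = torch.rand(1, 5, 3, 64, 64), torch.zeros(1, 5, 1, 64, 64)
print(STTNSketch()(frames, masks).shape)         # torch.Size([1, 5, 3, 64, 64])
```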

spatial-temporal transformer

A patch-based multi-head attention module searches for coherent contents along both the spatial and temporal dimensions. Different heads of the transformer compute attention over spatial patches at different scales, which makes it possible to handle appearance changes caused by complex motion.
For example, on one hand, attentions for patches of large sizes (e.g., frame size H × W) aim at completing stationary backgrounds. On the other hand, attentions for patches of small sizes (e.g., H/10 × W/10 ) encourage capturing deep correspondences in any locations of videos for moving foregrounds.
The multi-head transformer runs the "Embedding-Matching-Attending" steps for patches of different sizes in parallel (see the sketch after this list).
1. Embedding: the features of each frame are mapped into a query and a memory (a key-value pair)
2. Matching: after splitting into patches, standard attention is applied (matrix multiplication computes the query-key similarity between patches, and a softmax turns the similarities into weights)
3. Attending: the output for each query patch is obtained as the weighted sum of the values of the relevant patches. Once the outputs of all patches are obtained, they are pieced back together and reshaped into T frames with the original spatial size. The features output by the different heads are concatenated and passed through a 2D residual block. Stacking multiple such layers strengthens the transformer.
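
A hedged PyTorch sketch of a single patch-based attention head running the three steps above. The class name, patch size, the use of unfold/fold for patch extraction, and the omission of the visibility mask (the paper only borrows values from visible regions) are simplifications of my own, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchAttentionHead(nn.Module):
    """One head: Embedding (1x1 convs) -> Matching (patch similarities) -> Attending (weighted sum)."""
    def __init__(self, channels, patch=(8, 8)):
        super().__init__()
        self.patch = patch
        # Embedding: 1x1 convolutions map features into query / key / value spaces.
        self.to_q = nn.Conv2d(channels, channels, 1)
        self.to_k = nn.Conv2d(channels, channels, 1)
        self.to_v = nn.Conv2d(channels, channels, 1)

    def forward(self, feat, t):
        # feat: (B*T, C, h, w) deep features of T frames per video in the batch.
        bt, c, h, w = feat.shape
        b, (r1, r2) = bt // t, self.patch
        q, k, v = self.to_q(feat), self.to_k(feat), self.to_v(feat)

        def to_patches(x):
            # (B*T, C, h, w) -> (B, N, r1*r2*C), with N = T * h/r1 * w/r2 patches per video.
            x = F.unfold(x, kernel_size=(r1, r2), stride=(r1, r2))   # (B*T, C*r1*r2, h/r1*w/r2)
            return x.view(b, t, c * r1 * r2, -1).permute(0, 1, 3, 2).reshape(b, -1, c * r1 * r2)

        q, k, v = to_patches(q), to_patches(k), to_patches(v)
        # Matching: patch-wise similarities via matrix multiplication, scaled by the vector dimension.
        attn = torch.softmax(q @ k.transpose(1, 2) / (c * r1 * r2) ** 0.5, dim=-1)   # (B, N, N)
        # Attending: each output patch is the attention-weighted sum of all value patches.
        out = attn @ v                                                                # (B, N, r1*r2*C)
        # Piece the patches back together into T frames of the original spatial size.
        out = out.view(b, t, -1, c * r1 * r2).permute(0, 1, 3, 2).reshape(bt, c * r1 * r2, -1)
        return F.fold(out, output_size=(h, w), kernel_size=(r1, r2), stride=(r1, r2))

feat = torch.rand(5, 32, 16, 16)                     # T=5 frames of 32-channel deep features
print(PatchAttentionHead(32)(feat, t=5).shape)       # torch.Size([5, 32, 16, 16])
```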

Optimization objectives

Per-pixel reconstruction loss: a plain L1 loss (computed separately for the missing regions and the valid regions, with different weights)
Perceptual quality and spatial-temporal consistency: a T-PatchGAN is trained as the discriminator

=================================================================================================

Paper translation:

In summary, our main contribution is to learn joint spatial and temporal transformations for video inpainting, by a deep generative model with adversarial training along spatial-temporal dimensions.
Furthermore, the proposed multi-scale patch-based video frame representations can enable fast training and inference.

3 Spatial-Temporal Transformer Networks

3.1 Overall design

Problem formulation:

Let $X_1^T := \{X_1, X_2, \ldots, X_T\}$ be a corrupted video sequence of height $H$, width $W$ and frame length $T$. 【$X_1^T$: the corrupted video】
$M_1^T := \{M_1, M_2, \ldots, M_T\}$ denotes the corresponding frame-wise masks. 【$M_1^T$: the frame-wise masks of the corrupted video】
For each mask $M_i$, value "0" indicates known pixels, and value "1" indicates missing regions. We formulate deep video inpainting as a self-supervised task that randomly creates $(X_1^T, M_1^T)$ pairs as input and reconstructs the original video frames $Y_1^T = \{Y_1, Y_2, \ldots, Y_T\}$. 【$Y_1^T$: the original video behind the corrupted one; $\hat{Y}_1^T$: the video restored by the model】
Specifically, we propose to learn a mapping function from the masked video $X_1^T$ to the output $\hat{Y}_1^T := \{\hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_T\}$, such that the conditional distribution of the real data $p(Y_1^T \mid X_1^T)$ can be approximated by that of the generated data $p(\hat{Y}_1^T \mid X_1^T)$.

The intuition is that an occluded region in a current frame would probably be revealed in a region from a distant frame, especially when a mask is large or moving slowly. To fill missing regions in a target frame, it is more effective to borrow useful contents from the whole video by taking both neighboring frames and distant frames as conditions. To simultaneously complete all the input frames in a single feed-forward process, we formulate the video inpainting task as a “multi-to-multi” problem. Based on the Markov assumption [11], we simplify the “multi-to-multi” problem and denote it as:
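A sketch of this factorization, assuming the completion of each temporal window is conditioned on its neighboring clip and the sampled distant frames (the paper's exact grouping may differ):
$$p\left(\hat{Y}_1^T \mid X_1^T\right) = \prod_{t=1}^{T} p\left(\hat{Y}_{t-n}^{t+n} \mid X_{t-n}^{t+n},\, X_{1,s}^{T}\right)$$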

where $X_{t-n}^{t+n}$ denotes a short clip of neighboring frames with a center moment $t$ and a temporal radius $n$. 【$X_{t-n}^{t+n}$: the clip of frames centered at time $t$ with temporal radius $n$】
$X_{1,s}^{T}$ denotes distant frames that are uniformly sampled from the video $X_1^T$ with a sampling rate of $s$. 【$X_{1,s}^{T}$: the distant frames, sampled at rate $s$】
Since $X_{1,s}^{T}$ can usually cover most key frames of the video, it is able to describe "the whole story" of the video. Under this formulation, video inpainting models are required to not only preserve temporal consistency in neighboring frames, but also make the completed frames coherent with "the whole story" of the video. (So the model now sees both the neighboring frames and the distant frames of the frames to be inpainted.)

Network design:


Fig. 2. Overview of the STTN.
STTN consists of:
(1) a frame-level encoder, built by stacking several 2D convolution layers with strides, which aims at encoding deep features from low-level pixels for each frame;
(2) multi-layer multi-head spatial-temporal transformers, which aim at learning joint spatial-temporal transformations for all missing regions in the deep encoding space;
(3) a frame-level decoder, designed to decode the features back into frames.
The transformers are designed to simultaneously fill holes in all input frames with coherent contents. Specifically, a transformer matches the queries (Q) and keys (K) on spatial patches across different scales in multiple heads, thus the values (V) of relevant regions can be detected and transformed for the holes. Moreover, the transformers can be fully exploited by stacking multiple layers to improve attention results based on updated region features.


3.2 Spatial-temporal transformer

To fill missing regions in each frame, spatial-temporal transformers are designed to search for coherent contents from all the input frames. Specifically, we propose to search by a multi-head patch-based attention module along both spatial and temporal dimensions. Different heads of a transformer calculate attentions on spatial patches across different scales.
Such a design allows us to handle appearance changes caused by complex motions. For example,
①、On one hand, attentions for patches of large sizes (e.g., frame size $H \times W$) aim at completing stationary backgrounds.
②、On the other hand, attentions for patches of small sizes (e.g., $\frac{H}{10} \times \frac{W}{10}$) encourage capturing deep correspondences at any location of the video for moving foregrounds.

A multi-head transformer runs multiple “Embedding-Matching-Attending” steps for different patch sizes in parallel.
①、In the Embedding step, features of each frame are mapped into query and memory (i.e., a key-value pair) for further retrieval.
②、In the Matching step, region affinities are calculated by matching queries and keys among spatial patches that are extracted from all the frames.
③、Finally, in the Attending step, relevant regions are detected and transformed for the missing regions in each frame.

①、Embedding:

We use $f_1^T = \{f_1, f_2, \ldots, f_T\}$, where $f_i \in \mathbb{R}^{h \times w \times c}$, to denote the features encoded by the frame-level encoder or by former transformers, which are the input of the transformers in Fig. 2. 【$f_1^T$ is the input of the transformer】 Similar to many sequence modeling models, mapping features into key and memory embeddings is an important step in transformers [9,28]. Such a step enables modeling deep correspondences for each region in different semantic spaces:
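A sketch of the embedding step, writing the query, key and value of frame $i$ with $q_i$, $k_i$, $v_i$ as my own shorthand:
$$q_i = M_q(f_i), \quad k_i = M_k(f_i), \quad v_i = M_v(f_i)$$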

where $1 \leq i \leq T$, and $M_q(\cdot)$, $M_k(\cdot)$ and $M_v(\cdot)$ denote the $1 \times 1$ 2D convolutions that embed input features into the query and memory (i.e., key-value pair) feature spaces while maintaining the spatial size of the features.

②、Matching:

We conduct patch-based matching in each head. In practice, we first extract spatial patches of shape $r_1 \times r_2 \times c$ from the query feature of each frame, and we obtain $N = T \times h/r_1 \times w/r_2$ patches. Similar operations are conducted to extract patches in the memory (i.e., the key-value pair in the transformer).
Such an effective multi-scale patch-based video frame representation can avoid redundant patch matching and enable fast training and inference.
Specifically, we reshape the query patches and key patches into 1-dimensional vectors separately, so that patch-wise similarities can be calculated by matrix multiplication. The similarity between the i-th patch and the j-th patch is denoted as:
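A sketch of the similarity, assuming a dot product scaled by the patch vector dimension $r_1 \times r_2 \times c$:
$$s_{i,j} = \frac{\left(p_i^q\right)^{\top} p_j^k}{\sqrt{r_1 \times r_2 \times c}}$$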

where $1 \leq i, j \leq N$, $p_i^q$ denotes the i-th query patch, and $p_j^k$ denotes the j-th key patch.
The similarity value is normalized by the dimension of each vector to avoid a small gradient caused by subsequent softmax function [28]. Corresponding attention weights for all patches are calculated by a softmax function:
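A sketch of the masked softmax, assuming the weights are normalized over patches from visible regions only and set to zero for patches inside the holes:
$$\alpha_{i,j} = \begin{cases} \dfrac{\exp\left(s_{i,j}\right)}{\sum_{m \in \Omega} \exp\left(s_{i,m}\right)}, & p_j \in \Omega \\ 0, & p_j \in \bar{\Omega} \end{cases}$$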

where $\Omega$ denotes the visible regions outside the masks, and $\bar{\Omega}$ denotes the missing regions.
Naturally, we only borrow features from visible regions to fill the holes.

③、Attending:

After modeling the deep correspondences for all spatial patches, the output for the query of each patch can be obtained by a weighted summation of the values from relevant patches:
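A sketch of the weighted summation, with $o_i$ as my shorthand for the output of the i-th query patch and $\alpha_{i,j}$ the attention weights from the Matching step:
$$o_i = \sum_{j=1}^{N} \alpha_{i,j}\, p_j^v$$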

where $p_j^v$ denotes the j-th value patch.
After receiving the outputs for all patches, we piece all the patches together and reshape them into $T$ frames with the original spatial size $h \times w \times c$.
The resultant features from different heads are concatenated and further passed through a subsequent 2D residual block [12].
This subsequent processing is used to enhance the attention results by looking at the context within the frame itself.

Fig. 3. Illustration of the attention maps for missing regions learned by STTN.
For completing the dog corrupted by a random mask in a target frame (e.g., t=10), our model is able to “track” the moving dog over the video in both spatial and temporal dimensions. Attention regions are highlighted in bright yellow.

The power of the proposed transformer can be fully exploited by stacking multiple layers, so that attention results for missing regions can be improved based on updated region features in a single feed-forward process. Such a multi-layer design promotes learning coherent spatial-temporal transformations for filling in missing regions. As shown in Fig. 3, we highlight the attention maps learned by STTN in the last layer in bright yellow. For the dog partially occluded by a random mask in a target frame, spatial-temporal transformers are able to “track” the moving dog over the video in both spatial and temporal dimensions and fill missing regions in the dog with coherent contents.
The multi-layer design works very well.

3.3 Optimization objectives

As outlined in Section 3.1, we optimize the proposed STTN in an end-to-end manner by taking the original video frames as ground truths without any other labels. The principle of choosing optimization objectives is to ensure per-pixel reconstruction accuracy, perceptual rationality and spatial-temporal coherence in generated videos. To this end, we select a pixel-wise reconstruction loss and a spatial-temporal adversarial loss as our optimization objectives.

①、In particular, we include L1 losses calculated between the generated frames and the original frames to ensure per-pixel reconstruction accuracy in the results. The L1 loss for hole regions is denoted as:
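A sketch of the hole term, assuming an L1 distance restricted to the missing regions and normalized by their size:
$$L_{hole} = \frac{\left\| M_1^T \odot \left( Y_1^T - \hat{Y}_1^T \right) \right\|_1}{\left\| M_1^T \right\|_1}$$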

②、and corresponding L1 losses for valid regions are denoted as:
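A sketch along the same lines, restricted to the valid regions:
$$L_{valid} = \frac{\left\| \left(1 - M_1^T\right) \odot \left( Y_1^T - \hat{Y}_1^T \right) \right\|_1}{\left\| 1 - M_1^T \right\|_1}$$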

where $\odot$ indicates element-wise multiplication, and the values are normalized by the size of the corresponding regions.

Inspired by recent studies showing that adversarial training can help to ensure high-quality content generation results, we propose to use a Temporal PatchGAN (T-PatchGAN) as our discriminator. Such an adversarial loss has shown promising results in enhancing both perceptual quality and spatial-temporal coherence in video inpainting [5,6]. In particular, the T-PatchGAN is composed of six 3D convolution layers. The T-PatchGAN learns to distinguish each spatial-temporal feature as real or fake, so that the spatial-temporal coherence and local-global perceptual details of real data can be modeled by STTN. The detailed optimization function for the T-PatchGAN discriminator is shown as follows:
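A sketch assuming the hinge-loss formulation commonly used with T-PatchGAN discriminators:
$$L_D = \mathbb{E}_{x \sim P_{Y_1^T}(x)}\left[\operatorname{ReLU}\left(1 - D(x)\right)\right] + \mathbb{E}_{z \sim P_{\hat{Y}_1^T}(z)}\left[\operatorname{ReLU}\left(1 + D(z)\right)\right]$$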

③、and the adversarial loss for STTN is denoted as:
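Under the same hinge-loss assumption, a sketch of the generator term:
$$L_{adv} = -\,\mathbb{E}_{z \sim P_{\hat{Y}_1^T}(z)}\left[D(z)\right]$$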

The overall optimization objectives are concluded as below:
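A sketch of the weighted combination, using the weights reported below:
$$L = \lambda_{hole} \cdot L_{hole} + \lambda_{valid} \cdot L_{valid} + \lambda_{adv} \cdot L_{adv}$$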

We empirically set the weights for the different losses as: $\lambda_{hole} = 1$, $\lambda_{valid} = 1$, $\lambda_{adv} = 0.01$. Since our model simultaneously completes all the input frames in a single feed-forward process, it runs at 24.3 fps on a single NVIDIA V100 GPU. More details are provided in Section D of our supplementary material.
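
A small sketch of how the reconstruction terms and the weights above could be combined in code; the function name and the placeholder adversarial term are illustrative, not taken from the authors' code.

```python
import torch

def reconstruction_losses(pred, target, mask, eps=1e-6):
    # mask: 1 marks missing regions, 0 marks valid regions; each term is
    # normalized by the size of its corresponding region.
    mask = mask.expand_as(pred)
    diff = (pred - target).abs()
    l_hole = (mask * diff).sum() / (mask.sum() + eps)
    l_valid = ((1 - mask) * diff).sum() / ((1 - mask).sum() + eps)
    return l_hole, l_valid

pred, target = torch.rand(1, 5, 3, 64, 64), torch.rand(1, 5, 3, 64, 64)
mask = (torch.rand(1, 5, 1, 64, 64) > 0.5).float()
l_hole, l_valid = reconstruction_losses(pred, target, mask)
l_adv = torch.tensor(0.0)                        # placeholder for the T-PatchGAN adversarial term
total = 1.0 * l_hole + 1.0 * l_valid + 0.01 * l_adv
print(total.item())
```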
