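Within scene understanding, scene graph generation (SGG) has attracted considerable attention for its semantic expressiveness and its applications. SGG automatically extracts a semantic graph-structured representation of the objects in an image or video, which usually requires labeling the objects correctly and extracting the relationships between them.
The focus is on global perception and an effective representation of information, which leads to the scene graph: nodes correspond to objects, and edges correspond to the relationships between objects.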

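Common SGG methods: CRF, TransE, CNN, RNN/LSTM, and GNN; additional prior knowledge may be required.
The paper mainly surveys feature representation and refinement for 2D SGG (incorporating prior knowledge, combined with message passing, attention, VTransE, etc.) and the performance of the various methods on several kinds of datasets, including 2D, spatio-temporal, and 3D scenes.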
A scene graph is a structural representation, which can capture detailed semantics by explicitly modeling objects, attributes of objects, and relations between paired objects.
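A scene graph is a set of visual relationship triplets. A triplet can be <subject, relationship, object> or <object, is, attribute*>, so a scene graph is a directed graph.
(The latter is really a relationship between a node and itself, i.e. an attribute of the node, so in essence the former already covers the latter.)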
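For a static scene, the set of triplets can be written as $R_S\subseteq O_S\times P_S \times (O_S\cup A_S)$, where $O_S$, $P_S$, and $A_S$ are the object set, the predicate (relationship) set, and the attribute set, respectively. Each object has two properties: a label and a bbox.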
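The triplet definition above can be made concrete with a small data structure. This is only a minimal sketch: the class names `SceneObject` and `SceneGraph`, the `(x_min, y_min, x_max, y_max)` box convention, and the toy values are assumptions chosen for illustration, not something specified in the surveyed paper.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


@dataclass
class SceneObject:
    """An element of O_S: every object carries a label and a bbox."""
    label: str
    bbox: BBox


@dataclass
class SceneGraph:
    objects: List[SceneObject] = field(default_factory=list)
    # Directed edges <subject, relationship, object>, stored by object index.
    relations: Set[Tuple[int, str, int]] = field(default_factory=set)
    # Attribute triplets <object, is, attribute>, stored as (index, attribute).
    attributes: Set[Tuple[int, str]] = field(default_factory=set)

    def add_object(self, label: str, bbox: BBox) -> int:
        """Append an object and return its node index."""
        self.objects.append(SceneObject(label, bbox))
        return len(self.objects) - 1

    def triplets(self) -> List[Tuple[str, str, str]]:
        """Return all triplets as readable strings, relations first."""
        out = [(self.objects[s].label, p, self.objects[o].label)
               for s, p, o in sorted(self.relations)]
        out += [(self.objects[i].label, "is", a)
                for i, a in sorted(self.attributes)]
        return out


# Toy scene with made-up boxes.
graph = SceneGraph()
man = graph.add_object("man", (10.0, 20.0, 110.0, 220.0))
horse = graph.add_object("horse", (60.0, 90.0, 260.0, 300.0))
graph.relations.add((man, "riding", horse))
graph.attributes.add((horse, "brown"))
```

Storing relations by node index rather than by label keeps two objects with the same label (for example two horses) distinct, which matters because the graph is image-specific.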
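Comparison with knowledge graphs: the visual relationships in a scene graph differ in nature from the relationships between entities in a knowledge graph. The former are specific relationships in a specific scene (image-specific, situation-specific).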
Visual phrases integrate linguistic representations of relationship triplets and encode the interactions between objects and scenes.
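The relationship among 2D graphs, 3D graphs, and spatio-temporal graphs.
SGG is usually a bottom-up process. Given a scene $S$, the probability of its scene graph $T_S$ can be factorized into the following components: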
$$p(T_S \mid S)=p(B_S \mid S)\,p(O_S \mid B_S,S)\,p(A_S \mid O_S,B_S,S)\,p(R_S \mid O_S,B_S,S)$$
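As a small numeric illustration of this factorization, the four factors can be combined in log space, which is how such products are usually evaluated in practice to avoid underflow. The function name and the probability values are hypothetical, and $B_S$ is taken here to be the set of detected bounding boxes, which the notes do not state explicitly.

```python
import math


def scene_graph_log_prob(p_boxes: float, p_objects: float,
                         p_attributes: float, p_relations: float) -> float:
    """Return log p(T_S | S) as the sum of the four component log-probabilities.

    Arguments stand for p(B_S|S), p(O_S|B_S,S), p(A_S|O_S,B_S,S) and
    p(R_S|O_S,B_S,S); each must lie in (0, 1].
    """
    factors = [p_boxes, p_objects, p_attributes, p_relations]
    if any(not 0.0 < p <= 1.0 for p in factors):
        raise ValueError("each factor must be a probability in (0, 1]")
    return sum(math.log(p) for p in factors)


# Made-up component probabilities for one candidate scene graph.
log_p = scene_graph_log_prob(0.9, 0.8, 0.7, 0.5)
joint = math.exp(log_p)  # equals the plain product of the four factors
```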
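Mainstream approach: detect the objects, then classify the relationship between every pair of objects (two-stage). Alternatively, object detection and relationship prediction can be performed jointly, directly on object regions. The union area of a subject and an object (whose features are called relation features) usually serves as the basic representation for predicate inference. The main steps are object detection, feature representation, feature refinement, and relationship prediction. This part mainly covers: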
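Features of interest: appearance features (extracted by a CNN), semantic features (semantic features of the objects or relationships), spatial features (the image coordinates of the objects), and contextual features (context information).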
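The union area and the spatial features described above can be sketched as follows. This is a minimal illustration: the helper names, the `(x_min, y_min, x_max, y_max)` box convention, and the particular spatial encoding (normalized corners plus relative area) are assumptions, and individual SGG methods define their spatial features differently.

```python
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


def union_box(a: BBox, b: BBox) -> BBox:
    """Smallest box enclosing both the subject box and the object box."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def spatial_feature(box: BBox, width: float, height: float) -> Tuple[float, ...]:
    """Encode a box as normalized corners plus its area relative to the image."""
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0) / (width * height)
    return (x0 / width, y0 / height, x1 / width, y1 / height, area)


# Toy subject and object boxes in a 400 x 400 image.
subject_box = (10.0, 20.0, 110.0, 220.0)
object_box = (60.0, 90.0, 260.0, 300.0)
relation_box = union_box(subject_box, object_box)
relation_spatial = spatial_feature(relation_box, 400.0, 400.0)
```

In a two-stage pipeline the appearance feature of `relation_box` would be pooled from the CNN feature map; that step is omitted because it depends on the detector being used.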

The elements of (s, p, o) are semantically correlated, which can be considered from the following two aspects:
Knowledge beyond the training data: commonsense knowledge includes information about events that occur in time, about the effects of actions, about physical objects and how they are perceived, and about their properties and relationships with one another. The following three items usually need to be considered:

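Individual predictions of objects and relationships are influenced by their surrounding context. Context can be understood from the following three perspectives:
Message passing between objects and between triplets is therefore valuable for visual relationship detection.
Prior layout structures: triplet set, chain (-> RNN/LSTM/GRU), tree (-> TreeLSTM), and fully connected graph (-> GNN/CRF, Conditional Random Fields).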
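To make the idea of message passing on a fully connected layout concrete, here is a deliberately simplified sketch. It uses plain mean aggregation with a fixed blending weight `mix`, which is a hypothetical stand-in: the surveyed models learn the update (for example with a GRU or a GNN layer), and nothing below reproduces a specific method.

```python
from typing import List

Vector = List[float]


def message_passing_step(features: List[Vector], mix: float = 0.5) -> List[Vector]:
    """One round of mean-aggregation message passing on a fully connected graph.

    Each node receives the mean feature of all other nodes and blends it with
    its own feature. `mix` is an assumed fixed weight, not a learned parameter.
    """
    n = len(features)
    if n < 2:
        return [list(f) for f in features]  # nothing to exchange
    dim = len(features[0])
    totals = [sum(f[d] for f in features) for d in range(dim)]
    updated = []
    for f in features:
        # Mean over the other n - 1 nodes: subtract the node's own contribution.
        message = [(totals[d] - f[d]) / (n - 1) for d in range(dim)]
        updated.append([(1 - mix) * f[d] + mix * message[d] for d in range(dim)])
    return updated


# Three object nodes with toy 2-D features.
node_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
refined = message_passing_step(node_features)
```

Chain and tree layouts differ only in which neighbors contribute to `message`; the fully connected case shown here lets every object see every other object in a single step.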

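An attention mechanism focuses on the most important parts of the input: it refines local features and fuses global features. Attention can be introduced both in the feature representation stage (attention or self-attention applied to multiple features) and in the feature refinement stage (context-aware attention, which uses global information to update the feature representations of objects and relationships).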
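The sketch below illustrates self-attention over a set of object features in its simplest form: scaled dot-product attention with identity projections. Real models apply learned query, key, and value projections; these are left out here as a simplifying assumption, and the feature values are made up.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(features[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in features:
        # Similarity of this object to every object, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in features]
        weights = softmax(scores)
        # Each output is a weighted average of all object features.
        outputs.append([sum(w * value[d] for w, value in zip(weights, features))
                        for d in range(dim)])
    return outputs


# Three object features; each output mixes in information from the others.
object_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(object_features)
```

Because every output row is a convex combination of the inputs, each object's refined feature stays within the range spanned by the original features while absorbing global context.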

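Open problems: the data is heavily skewed (many co-occurring relationships are low-frequency), and this long-tailed distribution limits model generalization. Intra-class variance is also large (the subjects and objects paired with the same predicate can differ considerably), so the large differences in appearance must be represented correctly. Data sparsity motivates visual embedding methods. Visual embedding approaches aim at learning a compositional representation for subject, object and predicate by learning separate visual-language embedding spaces, where each of these entities is mapped close to the language embedding of its associated annotation.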
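One well-known way to score a triplet in an embedding space is the translation idea behind TransE and VTransE, which were listed among the methods earlier: a valid triplet should satisfy subject + predicate ≈ object. The sketch below ranks candidate predicates by that distance. The 2-D embeddings and predicate names are toy values invented for illustration; in a real model the embeddings are learned from visual and language features.

```python
import math
from typing import Dict, List

Vector = List[float]


def translation_distance(subject: Vector, predicate: Vector, obj: Vector) -> float:
    """Euclidean norm of (subject + predicate - object); smaller means more plausible."""
    return math.sqrt(sum((s + p - o) ** 2
                         for s, p, o in zip(subject, predicate, obj)))


def rank_predicates(subject: Vector, obj: Vector,
                    predicates: Dict[str, Vector]) -> List[str]:
    """Sort candidate predicates from most to least plausible for this pair."""
    return sorted(predicates,
                  key=lambda name: translation_distance(subject, predicates[name], obj))


# Toy, hand-picked embeddings (assumptions, not learned values).
person = [0.0, 0.0]
horse = [1.0, 1.0]
predicate_embeddings = {
    "riding": [1.0, 0.9],
    "under": [-1.0, -1.0],
    "next to": [0.5, 0.0],
}
ranking = rank_predicates(person, horse, predicate_embeddings)
```

Because the predicate is a single vector shared across all subject and object pairs, rare triplets can borrow strength from frequent ones, which is why this family of methods is attractive under data sparsity.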

Spatio-temporal scene graphs are dynamic and carry temporal dependencies; methods usually integrate image SGG with multi-object tracking techniques.
Object detection in videos still suffers from low accuracy because blur, camera motion, and occlusion hamper accurate object localization with bounding box trajectories.
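To show how per-frame detections can be linked into bounding box trajectories, the sketch below uses a greedy IoU matcher between two consecutive frames. It is a simplified stand-in for a real multi-object tracker, not the method of any surveyed paper; the function names, the 0.5 threshold, and the toy boxes are assumptions.

```python
from typing import Dict, List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def link_frames(prev: List[BBox], curr: List[BBox],
                threshold: float = 0.5) -> Dict[int, int]:
    """Greedily match boxes of the previous frame to boxes of the current frame.

    Pairs are visited in order of decreasing IoU; each box is used at most
    once, and pairs below `threshold` are left unmatched.
    """
    candidates = sorted(((iou(p, c), i, j)
                         for i, p in enumerate(prev)
                         for j, c in enumerate(curr)), reverse=True)
    matches: Dict[int, int] = {}
    used = set()
    for score, i, j in candidates:
        if score < threshold:
            break
        if i in matches or j in used:
            continue
        matches[i] = j
        used.add(j)
    return matches


# Two objects that each shift by about one pixel between frames.
frame_t = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
frame_t1 = [(21.0, 21.0, 31.0, 31.0), (1.0, 0.0, 11.0, 10.0)]
links = link_frames(frame_t, frame_t1)
```

Blur or occlusion shows up here as a missed or low-IoU detection, which breaks the link and fragments the trajectory. This is the failure mode described in the sentence above.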
basic pipeline:
The ultimate goal is to be able to accurately obtain the shapes, positions and attributes of the objects in the three dimensional space, so as to realize detection, recognition, tracking and interaction of objects in the real world.
Different representations of 3D information: multiple views, point clouds, polygon meshes, wireframe meshes…
Omitted. This part introduces several 2D datasets, video datasets, and 3D datasets.
Omitted.
Challenges:
Outlook: