What is: Non-Local Operation?

Source: Non-local Neural Networks
Data source: CC BY-SA

A Non-Local Operation is a component for capturing long-range dependencies in deep neural networks. It is a generalization of the classical non-local means operation in computer vision. Intuitively, a non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps. The set of positions can be in space, time, or spacetime, which means these operations are applicable to image, sequence, and video problems.

Following the non-local mean operation, a generic non-local operation for deep neural networks is defined as:

$$\mathbf{y}_i = \frac{1}{\mathcal{C}(\mathbf{x})} \sum_{\forall j} f(\mathbf{x}_i, \mathbf{x}_j)\, g(\mathbf{x}_j)$$

Here $i$ is the index of an output position (in space, time, or spacetime) whose response is to be computed, and $j$ is the index that enumerates all possible positions. $\mathbf{x}$ is the input signal (image, sequence, video; often their features) and $\mathbf{y}$ is the output signal of the same size as $\mathbf{x}$. A pairwise function $f$ computes a scalar (representing a relationship such as affinity) between $i$ and all $j$. The unary function $g$ computes a representation of the input signal at the position $j$. The response is normalized by a factor $\mathcal{C}(\mathbf{x})$.
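The definition above can be sketched directly as a double loop over positions. This is a minimal pure-Python illustration, not a reference implementation: the Gaussian-style affinity exp(x_i · x_j) for f, the identity for g, and the choice of C(x) as the sum of f over all j are assumptions made here for concreteness, and all function names are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def gaussian_affinity(x_i, x_j):
    """Assumed pairwise function f: exp(x_i . x_j), a positive scalar."""
    return math.exp(dot(x_i, x_j))

def identity(x_j):
    """Assumed unary function g: returns the feature vector unchanged."""
    return list(x_j)

def non_local(x, f=gaussian_affinity, g=identity):
    """Compute y_i = (1 / C(x)) * sum_j f(x_i, x_j) * g(x_j) for every i.

    x is a list of feature vectors, one per position. C(x) is taken to be
    sum_j f(x_i, x_j), which is an assumption of this sketch.
    """
    values = [g(x_j) for x_j in x]            # g(x_j) for all positions j
    out = []
    for x_i in x:
        weights = [f(x_i, x_j) for x_j in x]  # one scalar per position j
        c = sum(weights)                      # normalization factor
        y_i = [
            sum(w * v[d] for w, v in zip(weights, values)) / c
            for d in range(len(values[0]))
        ]
        out.append(y_i)
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = non_local(x)  # same number of positions as x
```

With a constant f, every weight is equal and each output reduces to the plain average of g over all positions, which is a quick sanity check on the normalization.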

The non-local behavior is due to the fact that all positions ($\forall j$) are considered in the operation. As a comparison, a convolutional operation sums up the weighted input in a local neighborhood (e.g., $i - 1 \leq j \leq i + 1$ in a 1D case with kernel size 3), and a recurrent operation at time $i$ is often based only on the current and the latest time steps (e.g., $j = i$ or $i - 1$).
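To make the comparison concrete, the sketch below lists which positions j contribute to the output at position i under each kind of operation. The helper names are hypothetical, and clipping the convolution window at the sequence borders (rather than padding) is an assumption of this example.

```python
def conv_positions(i, n, kernel_size=3):
    """Positions j read by a 1D convolution at output i (clipped at borders)."""
    half = kernel_size // 2
    return [j for j in range(i - half, i + half + 1) if 0 <= j < n]

def recurrent_positions(i):
    """Positions j a simple recurrent step at time i reads directly."""
    return [j for j in (i - 1, i) if j >= 0]

def non_local_positions(n):
    """A non-local operation reads every position, whatever i is."""
    return list(range(n))

n = 6
for i in (0, 3):
    print(i, conv_positions(i, n), recurrent_positions(i), non_local_positions(n))
```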

The non-local operation is also different from a fully-connected (fc) layer. The equation above computes responses based on relationships between different locations, whereas fc uses learned weights. In other words, the relationship between $\mathbf{x}_j$ and $\mathbf{x}_i$ is not a function of the input data in fc, unlike in non-local layers. Furthermore, the formulation in the equation above supports inputs of variable sizes, and maintains the corresponding size in the output. On the contrary, an fc layer requires a fixed-size input/output and loses positional correspondence (e.g., that from $\mathbf{x}_i$ to $\mathbf{y}_i$ at the position $i$).
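The contrast with an fc layer can be illustrated on a scalar sequence. In this hedged sketch the fc weights `W` are hypothetical fixed numbers tied to inputs of length 3, while the non-local weights are recomputed from the data, assuming f(x_i, x_j) = exp(x_i x_j), g(x_j) = x_j, and C(x) equal to the sum of f over j.

```python
import math

def fc_layer(x, weights):
    """Fully-connected layer on a scalar sequence: y_i = sum_j W[i][j] * x[j].

    The weights are fixed after training and do not depend on x.
    """
    if any(len(row) != len(x) for row in weights):
        raise ValueError("fc layer expects input of length %d" % len(weights[0]))
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def non_local_1d(x):
    """Non-local operation on a scalar sequence with assumed choices
    f(x_i, x_j) = exp(x_i * x_j), g(x_j) = x_j, C(x) = sum_j f(x_i, x_j)."""
    out = []
    for x_i in x:
        f_vals = [math.exp(x_i * x_j) for x_j in x]  # recomputed from the data
        c = sum(f_vals)
        out.append(sum(f * x_j for f, x_j in zip(f_vals, x)) / c)
    return out

W = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]  # hypothetical learned weights for length-3 inputs

short = [1.0, 2.0, 3.0]
longer = [1.0, 2.0, 3.0, 4.0, 5.0]

fc_out = fc_layer(short, W)     # works: length matches W
nl_short = non_local_1d(short)  # output has 3 positions
nl_long = non_local_1d(longer)  # output has 5 positions, no new parameters

try:
    fc_layer(longer, W)
    fc_accepts_longer = True
except ValueError:
    fc_accepts_longer = False   # the fixed-size fc layer rejects length 5
```

The non-local output keeps one value per input position, so the correspondence between x_i and y_i survives a change in sequence length.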

A non-local operation is a flexible building block and can be easily used together with convolutional/recurrent layers. It can be added into the earlier part of deep neural networks, unlike fc layers, which are often used at the end. This allows us to build a richer hierarchy that combines both non-local and local information.

In terms of parameterisation, we usually parameterise $g$ as a linear embedding of the form $g(\mathbf{x}_j) = W_g \mathbf{x}_j$, where $W_g$ is a weight matrix to be learned. This is implemented as, e.g., 1×1 convolution in space or 1×1×1 convolution in spacetime. For $f$ we use an affinity function, a list of which can be found here.
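The link between the linear embedding and a 1×1 convolution can be seen in a small sketch: the same matrix is applied independently at every spatial position, mixing channels but not neighbouring positions. The weights and feature map below are made-up values for illustration, the function names are hypothetical, and the convolution is assumed to have no bias term.

```python
def linear_embedding(x_j, w_g):
    """g(x_j) = W_g x_j: a matrix-vector product at a single position."""
    return [sum(w * v for w, v in zip(row, x_j)) for row in w_g]

def conv1x1(feature_map, w_g):
    """Apply the same W_g at every position of an H x W x C_in feature map.

    This matches what a bias-free 1x1 convolution computes: channels are
    mixed, neighbouring positions are not.
    """
    return [[linear_embedding(pixel, w_g) for pixel in row] for row in feature_map]

# Hypothetical weights mapping 3 input channels to 2 output channels.
W_g = [[1.0, 0.0, -1.0],
       [0.5, 0.5, 0.5]]

# A 2 x 2 feature map with 3 channels per position.
fmap = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
        [[2.0, 2.0, 2.0], [4.0, 0.0, 1.0]]]

embedded = conv1x1(fmap, W_g)  # 2 x 2 map with 2 channels per position
```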