What is: Momentumized, adaptive, dual averaged gradient?
| Source | Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization |
| Year | 2021 |
| Data Source | CC BY-SA - https://paperswithcode.com |
The MADGRAD method applies a series of modifications to the AdaGrad-DA method to improve its performance on deep learning optimization problems. It achieves state-of-the-art generalization performance across a diverse set of problems, including those on which Adam typically underperforms.
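
The sketch below is a minimal NumPy illustration of the kind of update MADGRAD describes: dual averaging of gradients anchored at the initial point, an AdaGrad-style squared-gradient accumulator with cube-root (rather than square-root) scaling, and momentum applied as averaging of iterates. Function and parameter names (`madgrad_step`, `lr`, `momentum`, `eps`) are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def madgrad_step(x, x0, s, v, grad, k, lr=1e-2, momentum=0.9, eps=1e-6):
    """One illustrative MADGRAD-style update on a parameter vector.

    x    : current iterate
    x0   : initial iterate (the dual-averaging anchor)
    s    : running sum of scaled gradients (dual average)
    v    : running sum of scaled squared gradients
    grad : stochastic gradient at x
    k    : step counter starting at 0
    Returns the updated (x, s, v).
    """
    lam = lr * np.sqrt(k + 1)               # increasing step-size sequence
    s = s + lam * grad                      # dual-averaged gradient sum
    v = v + lam * grad * grad               # AdaGrad-style accumulator
    z = x0 - s / (np.cbrt(v) + eps)         # dual-averaging step, cube-root scaling
    x = momentum * x + (1 - momentum) * z   # momentum as iterate averaging
    return x, s, v


# Usage sketch: minimize f(x) = ||x - 1||^2 with noisy gradients.
rng = np.random.default_rng(0)
x = np.zeros(3)
x0, s, v = x.copy(), np.zeros_like(x), np.zeros_like(x)
for k in range(500):
    grad = 2 * (x - 1.0) + 0.01 * rng.standard_normal(3)
    x, s, v = madgrad_step(x, x0, s, v, grad, k)
print(x)  # should approach [1, 1, 1]
```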
