What is: TSDAE?

Source: TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning
Data source: CC BY-SA

TSDAE is an unsupervised sentence embedding method. During training, TSDAE encodes corrupted sentences into fixed-size vectors and requires the decoder to reconstruct the original sentences from these sentence embeddings alone. To reconstruct well, the encoder must capture the semantics of the input in the sentence embedding. At inference time, only the encoder is used to create sentence embeddings.
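The corruption step can be illustrated with a small sketch. It assumes the noise is random token deletion on whitespace-separated tokens; the helper names `delete_tokens` and `make_training_pair`, the whitespace tokenization, and the example sentence are hypothetical choices for illustration, not the authors' implementation.

```python
import random

def delete_tokens(sentence, ratio=0.6, rng=None):
    """Drop each whitespace token with probability `ratio`; keep at least one."""
    rng = rng or random.Random()
    tokens = sentence.split()
    if not tokens:
        return sentence
    kept = [tok for tok in tokens if rng.random() > ratio]
    if not kept:
        # Avoid an empty encoder input: fall back to one random token.
        kept = [rng.choice(tokens)]
    return " ".join(kept)

def make_training_pair(sentence, ratio=0.6, rng=None):
    """Return (corrupted encoder input, original decoder target)."""
    return delete_tokens(sentence, ratio, rng), sentence

rng = random.Random(0)  # fixed seed so the sketch is repeatable
original = "TSDAE learns sentence embeddings by reconstructing corrupted sentences"
corrupted, target = make_training_pair(original, ratio=0.6, rng=rng)
print(corrupted)
print(target)
```

The encoder only ever sees `corrupted`, while the reconstruction loss is computed against `target`, which is why the embedding has to carry enough meaning to recover the deleted words.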

The model architecture of TSDAE is a modified encoder-decoder Transformer in which the key and value of the cross-attention are confined to the sentence embedding. Formally, the modified cross-attention is:

$$H^{(k)} = \operatorname{Attention}\left(H^{(k-1)}, \left[s^{T}\right], \left[s^{T}\right]\right)$$

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V$$

where $H^{(k)} \in \mathbb{R}^{t \times d}$ are the decoder hidden states within $t$ decoding steps at the $k$-th layer, $d$ is the size of the sentence embedding, $\left[s^{T}\right] \in \mathbb{R}^{1 \times d}$ is a one-row matrix containing the sentence embedding vector, and $Q$, $K$ and $V$ are the query, key and value, respectively. By exploring different configurations on the STS benchmark dataset, the authors find that the best combination is: (1) adopting deletion as the input noise and setting the deletion ratio to 0.6, (2) using the output of the [CLS] token as the fixed-size sentence representation, and (3) tying the encoder and decoder parameters during training.
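The modified cross-attention formula can be checked numerically with a minimal sketch. It assumes plain Python lists and applies the formula exactly as written, leaving out the learned projection matrices, multi-head splitting and residual connections of a real Transformer layer; the vectors `s` and `H_prev` are made-up values. Because there is only one key, the softmax over a single score is always 1, so every decoder position receives the sentence embedding itself.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        # One score per key; here K has a single row, so one score per query.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

s = [0.5, -1.0, 2.0, 0.25]        # hypothetical sentence embedding, d = 4
H_prev = [                        # hypothetical decoder states, t = 3
    [1.0, 0.0, 0.0, 0.0],
    [0.2, 0.3, -0.1, 0.4],
    [-1.0, 2.0, 0.5, 0.0],
]

# Key and value are both the one-row matrix [s^T].
H_next = attention(H_prev, [s], [s])
for row in H_next:
    print(row)
```

Under these simplifications the cross-attention output does not depend on the query at all, which makes the bottleneck explicit: the decoder can reach the input sentence only through `s`, never through per-token encoder states.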