Block-Recurrent Transformer
Today will be a bit more technical. Some of you may have noticed that in March, Google and the Swiss AI Lab IDSIA released a research paper titled Block-Recurrent Transformers. In brief, I will explain why I think it is worth knowing about.
For a while, the AI community treated transformers as a near-universal architecture for almost every deep learning task. Over time, however, it became apparent that transformers have weaknesses of their own, which led Google to develop a hybrid model combining the advantages of the good old LSTM with the new transformers. Thus the Transformer-LSTM was born, at the time the state of the art for time-series prediction. This sparked a series of research efforts combining the power of transformers with tried-and-true architectures such as CNNs (vision transformers), RNNs (RWKV-v2-RNN), and others.
The main advantages of transformers:
– Parallelism – unlike traditional RNNs and LSTMs, which process tokens one after another, transformers need far fewer sequential steps and can exploit GPU hardware acceleration much more efficiently.
– Long-term memory – traditional RNNs suffered from the "vanishing gradient" problem, and even the improved LSTM can still run into "exploding gradients"; transformers, in contrast, can attend directly to every input token, however far away it is.
– Better attention mechanism – the concept of attention predates transformers (it was first used with encoder-decoder RNNs built on bidirectional LSTMs), but self-attention, which lets each input token refer to every other token, was a significant improvement, enabling much better context retention over long distances.
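To make the self-attention idea concrete, here is a minimal sketch in numpy: a single head, no learned projections, just the scaled dot-product step in which every token scores every other token. This is a toy illustration of the mechanism, not the full transformer layer from the paper.

```python
import numpy as np

def self_attention(x):
    """Toy scaled dot-product self-attention: one head, no learned
    query/key/value projections. Each row of the weight matrix is a
    softmax over *all* tokens, so every token can attend to every other."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax per row
    return weights @ x                               # mix of all token vectors

x = np.random.default_rng(0).standard_normal((6, 4))  # 6 tokens, dim 4
out = self_attention(x)
print(out.shape)  # each of the 6 outputs is a weighted sum over all 6 inputs
```

Note that the score matrix is n × n: this all-pairs structure is exactly what gives transformers their long-range memory, and also what makes them expensive, as the next section discusses.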
The main disadvantage of transformers:
– High attention cost, O(n²) – typical transformers process roughly 512 to 4096 tokens, and the cost of full attention grows quadratically with sequence length, which complicates scaling to longer texts. Fortunately, newer variants such as Longformer and Transformer-XL reduce the cost of full attention with techniques like "sliding window" attention and segment-level recurrence.
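The sliding-window idea can be sketched in a few lines: instead of scoring all n tokens against each other, each token attends only to the w neighbours on each side, so the cost grows as O(n·w) rather than O(n²). This is a schematic of the general technique used by Longformer-style models, not their actual implementation.

```python
import numpy as np

def sliding_window_attention(x, w=2):
    """Toy sliding-window attention: token i attends only to positions
    [i-w, i+w], so each token computes at most 2w+1 scores instead of n.
    Total cost is O(n*w) -- linear in sequence length for fixed w."""
    n, d = x.shape
    out = np.zeros_like(x)
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        scores = x[lo:hi] @ x[i] / np.sqrt(d)  # at most 2w+1 scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ x[lo:hi]
    return out

x = np.random.default_rng(0).standard_normal((8, 4))
print(sliding_window_attention(x).shape)
```

The trade-off is visible in the slice `x[lo:hi]`: tokens outside the window simply cannot be attended to directly, which is the gap that recurrence-based designs try to close.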
So, what new features does the Block-Recurrent Transformer bring?
Primarily, it introduces the "recurrent cell". In short, it preserves parallelism at the block level, handles large attention windows (4096 tokens), and thanks to sliding attention it has only linear complexity, O(n). According to the paper, this architecture significantly outperforms existing models like Transformer-XL in both complexity and speed.
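A rough sketch of the data flow behind the recurrent cell, under my own simplifying assumptions: the sequence is split into blocks, attention runs within each block (cheap and parallel), and a small recurrent state is carried from block to block so information can flow beyond the block boundary. The real model uses learned cross-attention and LSTM-style gates to update the state; the crude `out[-state_size:]` update below is only a placeholder for that machinery.

```python
import numpy as np

def block_recurrent_pass(tokens, block_size=4, state_size=2):
    """Schematic of the block-recurrent idea: attend within each block,
    carry a small state between blocks. Cost per block is
    O(block_size * (block_size + state_size)), i.e. linear in n overall."""
    d = tokens.shape[-1]
    state = np.zeros((state_size, d))            # recurrent state, LSTM-like role
    outputs = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        ctx = np.concatenate([state, block])     # block attends to state + itself
        scores = block @ ctx.T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out = w @ ctx
        outputs.append(out)
        state = out[-state_size:]                # placeholder for the learned gated update
    return np.concatenate(outputs)

x = np.random.default_rng(1).standard_normal((12, 4))
print(block_recurrent_pass(x).shape)
```

Because each block only ever attends to itself plus a fixed-size state, doubling the sequence length doubles the work instead of quadrupling it, which is where the O(n) claim comes from.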
Sources:
Inspiration: https://towardsdatascience.com/block-recurrent-transformer-lstm-and-transformer-combined-ec3e64af9
Attention Is All You Need: https://arxiv.org/abs/1706.03762
Paper: https://arxiv.org/pdf/2203.07852.pdf
Original source: WordPress