Block-Recurrent Transformer

Today's post will be a bit more technical. Some of you may have noticed that back in March, Google and the Swiss AI Lab IDSIA published a research paper titled Block-Recurrent Transformers. In short, I will explain why I think it is worth knowing about.
Previously, the AI community treated transformers as an almost universally capable architecture for every deep learning task. Over time, however, it became apparent that transformers have weaknesses of their own, which led Google to develop a hybrid model combining the advantages of the good old LSTM with the newer transformer. Thus the Transformer-LSTM was born, at the time the state of the art for time-series prediction. This sparked a series of research efforts combining the power of transformers with tried-and-true models such as CNNs (Vision Transformers), RNNs (RWKV-v2-RNN), and others.
The main advantages of transformers:
- Parallelism – unlike traditional RNNs and LSTMs, which must process tokens one at a time, transformers process a whole sequence at once and can therefore exploit GPU hardware acceleration far more efficiently.
- Long-term memory – traditional RNNs suffered from the "vanishing gradient" problem, and even the improved LSTMs could still run into "exploding gradients"; transformers, in contrast, can attend directly to every input token, no matter how far away it is.
- Better attention mechanism – attention itself predates transformers (it was first used on top of bidirectional LSTM encoders), but self-attention, which lets each input token refer to every other token, was a significant improvement, enabling much better retention of context over long distances.
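To make the self-attention idea above concrete, here is a minimal sketch in NumPy: a single attention head with no learned projections, just the raw "every token attends to every token" mechanism. The function name and shapes are illustrative, not from the paper.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention (single head, no learned
    projections): every position attends to every other position."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                   # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x                              # weighted mix of all tokens

x = np.random.default_rng(0).normal(size=(6, 4))   # 6 tokens, embedding dim 4
out = self_attention(x)
print(out.shape)  # (6, 4): each token now carries global context
```

Note that the `scores` matrix is n x n: this is exactly where the quadratic cost discussed below comes from.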
The main disadvantage of transformers:
- High attention cost, O(n²) – transformers typically process roughly 512 to 4096 tokens, and the attention cost grows quadratically with sequence length, which severely limits scalability to longer texts. Fortunately, newer models mitigate the cost of full attention: Longformer with sliding-window attention, Transformer-XL with segment-level recurrence.
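The sliding-window idea can be illustrated with a simple attention mask, a sketch assuming a symmetric window of w tokens on each side (the helper name is mine, not Longformer's API):

```python
import numpy as np

def sliding_window_mask(n, w):
    """Each token may only attend to the w tokens on either side, so the
    number of attention scores grows as O(n*w) instead of O(n^2)."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

mask = sliding_window_mask(8, 2)
print(mask.sum(), "of", mask.size, "attention pairs computed")  # 34 of 64
```

With a fixed window w, doubling the sequence length only doubles the work, instead of quadrupling it as with full attention.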
So, what does the Block-Recurrent Transformer bring that is new? Primarily a "recurrent cell". In short, it keeps parallelism at the block level, handles large attention windows (4096 tokens), and, thanks to sliding attention over blocks, has only linear complexity O(n). According to the paper, this architecture significantly outperforms existing models like Transformer-XL in both quality and speed.
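The core idea can be sketched as follows. This is a toy illustration of the block-plus-recurrent-state pattern, not the paper's exact cell (which uses learned projections, gating, and multiple heads); all function names and sizes here are my own assumptions.

```python
import numpy as np

def attend(q, kv):
    """Plain scaled dot-product attention of queries q over keys/values kv."""
    d = q.shape[-1]
    s = q @ kv.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

def block_recurrent(x, block, n_state=2):
    """Toy sketch of the block-recurrent idea: the sequence is cut into
    blocks; within each block, tokens attend to the block itself plus a
    small carried state, and the state is updated by attending back to
    the block. Total cost is linear in sequence length."""
    state = np.zeros((n_state, x.shape[-1]))
    out = []
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        ctx = np.concatenate([blk, state])  # tokens see block + carried state
        out.append(attend(blk, ctx))        # "vertical" direction: outputs
        state = attend(state, blk)          # "horizontal" direction: next state
    return np.concatenate(out)

x = np.random.default_rng(1).normal(size=(12, 4))  # 12 tokens, dim 4
y = block_recurrent(x, block=4)
print(y.shape)  # (12, 4)
```

Within each block the work is fully parallel, as in a transformer, while the state carried between blocks plays the role of an LSTM-style memory, which is exactly the hybrid character the post describes.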
Sources:
- Inspiration: https://towardsdatascience.com/block-recurrent-transformer-lstm-and-transformer-combined-ec3e64af9
- Attention Is All You Need: https://arxiv.org/abs/1706.03762
- Paper: https://arxiv.org/pdf/2203.07852.pdf
Originally published on Facebook.