In deep learning, the transformer is an architecture based on the multi-head attention mechanism. Text is first split into units called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with the other (unmasked) tokens via a parallel multi-head attention mechanism, which amplifies the signal from key tokens and diminishes that from less important ones.
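A minimal sketch of this token-to-embedding-to-attention flow is shown below, assuming a toy vocabulary, randomly initialized weights, and illustrative dimensions; it is not the implementation from the cited paper, only an instance of embedding lookup followed by scaled dot-product multi-head attention.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"the": 0, "cat": 1, "sat": 2}   # hypothetical toy vocabulary
d_model, num_heads = 8, 2                # illustrative embedding width and head count
d_head = d_model // num_heads

# Word embedding table: each token id maps to a d_model-dimensional vector.
embedding_table = rng.normal(size=(len(vocab), d_model))

# Per-head projection matrices for queries, keys, and values, plus an output projection.
W_q = rng.normal(size=(num_heads, d_model, d_head))
W_k = rng.normal(size=(num_heads, d_model, d_head))
W_v = rng.normal(size=(num_heads, d_model, d_head))
W_o = rng.normal(size=(num_heads * d_head, d_model))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x):
    """x: (seq_len, d_model) token embeddings -> contextualized (seq_len, d_model)."""
    heads = []
    for h in range(num_heads):
        q, k, v = x @ W_q[h], x @ W_k[h], x @ W_v[h]
        # Scaled dot-product attention: each token attends to every token in the window.
        scores = q @ k.T / np.sqrt(d_head)
        weights = softmax(scores, axis=-1)       # each row is an attention distribution
        heads.append(weights @ v)                # weighted sum of value vectors
    return np.concatenate(heads, axis=-1) @ W_o  # concatenate heads, project back

token_ids = [vocab[w] for w in ["the", "cat", "sat"]]  # text -> tokens
x = embedding_table[token_ids]                         # tokens -> vectors via table lookup
print(multi_head_attention(x).shape)                   # (3, 8): one contextualized vector per token
```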
Transformers have no recurrent units, so they require less training time than earlier recurrent neural network (RNN) architectures such as long short-term memory (LSTM). Later variations have been widely adopted for training large language models (LLMs) on large language datasets.
The modern version of the transformer was proposed in the 2017 paper "Attention Is All You Need" by researchers at Google. Transformers were first developed as an improvement over previous architectures for machine translation, but have since found many other applications. They are used in large-scale natural language processing, computer vision (vision transformers), reinforcement learning, audio, multimodal learning, robotics, and even playing chess. The architecture has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs) and BERT (bidirectional encoder representations from transformers).