Bidirectional Encoder Representations from Transformers (BERT) is a language model introduced in October 2018 by researchers at Google. It learns to represent text as a sequence of vectors using self-supervised learning, and it uses the encoder-only transformer architecture. BERT dramatically improved the state of the art for large language models. As of 2020, BERT is a ubiquitous baseline in natural language processing (NLP) experiments.
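The "sequence of vectors" view can be made concrete with a short sketch. The example below is illustrative rather than part of BERT's original release: it assumes the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint, and simply prints the shape of the encoder's output.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative only: load a pretrained BERT encoder and its tokenizer
# (assumes the Hugging Face "transformers" library and the
# "bert-base-uncased" checkpoint).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT represents text as a sequence of vectors.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The encoder emits one 768-dimensional vector per input token,
# including the special [CLS] and [SEP] tokens.
print(outputs.last_hidden_state.shape)  # torch.Size([1, num_tokens, 768])
```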
BERT is trained by masked token prediction and next sentence prediction. As a result of this training process, BERT learns contextual, latent representations of tokens, similar to ELMo and GPT-2. It found applications in many natural language processing tasks, such as coreference resolution and polysemy resolution. It is an evolutionary step over ELMo, and it spawned the study of "BERTology", which attempts to interpret what is learned by BERT.
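The masked token prediction objective can be probed directly with a pretrained checkpoint. The sketch below is a minimal illustration, again assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; any other masked language model checkpoint would work the same way.

```python
from transformers import pipeline

# The fill-mask pipeline wraps tokenization, the encoder forward pass,
# and a softmax over the vocabulary at the [MASK] position.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT fills in the blank using context from both the left and the right,
# which is what "bidirectional" refers to.
predictions = unmasker("The capital of France is [MASK].")

for p in predictions:
    # Each candidate carries the predicted token and its probability.
    print(f"{p['token_str']:>10}  {p['score']:.3f}")
```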
BERT was originally implemented in the English language at two model sizes, BERT-Base (110 million parameters) and BERT-Large (340 million parameters). Both were trained on the Toronto BookCorpus (800M words) and English Wikipedia (2,500M words). The weights were released on GitHub. On March 11, 2020, 24 smaller models were released, the smallest being BERT-Tiny with just 4 million parameters.