The main idea of XLNet is to model language autoregressively, as the GPT models do, but over all possible permutations of the factorization order of a sentence.[2] Concretely, consider the following sentence:
My dog is cute.
In standard autoregressive language modeling, the model is tasked with predicting each word conditioned on the preceding words as its context:
We factorize the joint probability of a sequence of words \(x_1, \ldots, x_T\) using the chain rule:
\[
\Pr(x_1, \ldots, x_T) = \Pr(x_1)\,\Pr(x_2 \mid x_1)\,\Pr(x_3 \mid x_1, x_2) \cdots \Pr(x_T \mid x_1, \ldots, x_{T-1}).
\]
For example, the sentence "My dog is cute" is factorized as:
\[
\Pr(\text{My}, \text{dog}, \text{is}, \text{cute}) = \Pr(\text{My})\,\Pr(\text{dog} \mid \text{My})\,\Pr(\text{is} \mid \text{My}, \text{dog})\,\Pr(\text{cute} \mid \text{My}, \text{dog}, \text{is}).
\]
Schematically, we can write it as
\[
\texttt{<MASK>}\;\texttt{<MASK>}\;\texttt{<MASK>}\;\texttt{<MASK>}
\to \text{My}\;\texttt{<MASK>}\;\texttt{<MASK>}\;\texttt{<MASK>}
\to \text{My dog}\;\texttt{<MASK>}\;\texttt{<MASK>}
\to \text{My dog is}\;\texttt{<MASK>}
\to \text{My dog is cute}.
\]
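The chain-rule factorization translates directly into a sum of per-word log-probabilities. The following is a minimal Python sketch; the `cond_log_prob` callable standing in for a trained next-word model is a hypothetical placeholder, not part of XLNet.

```python
import math

def sentence_log_prob(words, cond_log_prob):
    """Sum log Pr(x_t | x_1, ..., x_{t-1}) over all positions t."""
    total = 0.0
    for t, word in enumerate(words):
        total += cond_log_prob(words[:t], word)  # condition on the left context only
    return total

# Toy stand-in for a trained model: uniform over a 4-word vocabulary.
toy_model = lambda prefix, word: math.log(0.25)
print(sentence_log_prob(["My", "dog", "is", "cute"], toy_model))  # = 4 * log(0.25)
```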
However, XLNet requires the model to predict the words in a randomly sampled order. Suppose the sampled order is 3241, that is, the model predicts the third word first, then the second, then the fourth, then the first. Schematically, the model is required to perform the following prediction task:
\[
\texttt{<MASK>}\;\texttt{<MASK>}\;\texttt{<MASK>}\;\texttt{<MASK>}
\to \texttt{<MASK>}\;\texttt{<MASK>}\;\text{is}\;\texttt{<MASK>}
\to \texttt{<MASK>}\;\text{dog is}\;\texttt{<MASK>}
\to \texttt{<MASK>}\;\text{dog is cute}
\to \text{My dog is cute}.
\]
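A minimal Python sketch of this prediction schedule, assuming the same sampled order 3241 (written with 0-based indices). It only enumerates which positions are visible at each step and is not XLNet's actual data pipeline.

```python
words = ["My", "dog", "is", "cute"]
order = [2, 1, 3, 0]   # the order 3-2-4-1 from the text, 0-based

revealed = ["<MASK>"] * len(words)
for step, pos in enumerate(order):
    visible = sorted(order[:step])   # positions already predicted, i.e. the context
    print(f"step {step + 1}: predict position {pos + 1} ({words[pos]!r}) "
          f"given positions {[p + 1 for p in visible]}")
    revealed[pos] = words[pos]
    print("  ", " ".join(revealed))
```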
By training over all such permutations, XLNet sees each word conditioned on context from both sides, allowing it to capture longer-range dependencies and model the bidirectional context of words while remaining autoregressive.
To implement permutation language modeling, XLNet uses a two-stream self-attention mechanism. The two streams are:
The content stream uses the causal mask
\[
M_{\text{causal}} =
\begin{bmatrix}
0 & -\infty & -\infty & \cdots & -\infty \\
0 & 0 & -\infty & \cdots & -\infty \\
0 & 0 & 0 & \cdots & -\infty \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 0
\end{bmatrix},
\]
permuted by a random permutation matrix \(P\) to \(P M_{\text{causal}} P^{-1}\).
The query stream uses the cross-attention mask \(P(M_{\text{causal}} - \infty I)P^{-1}\), where the diagonal is subtracted away specifically to avoid the model "cheating" by looking at the content stream for what the current masked token is.
Like the causal masking for GPT models, this two-stream masked architecture allows the model to train on all tokens in one forward pass.
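A minimal numpy sketch of the two masks, assuming the additive convention used above (0 means "may attend", \(-\infty\) means "blocked"). The four-token example and variable names are illustrative; the conjugation by \(P\) is applied by re-indexing, which is equivalent for a permutation matrix. This illustrates only the masks, not the full two-stream attention computation.

```python
import numpy as np

T = 4
perm = np.array([2, 1, 3, 0])        # the sampled order 3-2-4-1, 0-based

# Causal mask: 0 on and below the diagonal (may attend), -inf strictly above (blocked).
M_causal = np.triu(np.full((T, T), -np.inf), k=1)

# Conjugation P @ M @ P^{-1} is the same as re-indexing rows and columns by the
# permutation; indexing avoids the 0 * (-inf) = nan a literal matrix product would hit.
content_mask = M_causal[np.ix_(perm, perm)]

# Query stream: additionally put -inf on the diagonal before permuting, so a
# position can never read its own content while it is being predicted.
M_query = M_causal.copy()
np.fill_diagonal(M_query, -np.inf)
query_mask = M_query[np.ix_(perm, perm)]

print(content_mask)
print(query_mask)
```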
Two models were released, XLNet-Base (cased) and XLNet-Large (cased).[3][4]
XLNet-Large was trained on a dataset that amounted to 32.89 billion tokens after tokenization with SentencePiece. The dataset was composed of BooksCorpus, English Wikipedia, Giga5, ClueWeb 2012-B, and Common Crawl.
It was trained on 512 TPU v3 chips for 5.5 days. At the end of training, it still under-fitted the data, meaning it could have achieved a lower loss with more training. Training took 0.5 million steps with the Adam optimizer, linear learning rate decay, and a batch size of 8192.[5]
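For illustration, a linear learning-rate decay over the 0.5 million steps mentioned above can be sketched as follows; the peak learning rate of 1e-4 is an assumed placeholder, not a value reported for XLNet.

```python
def linear_decay_lr(step, peak_lr=1e-4, total_steps=500_000):
    """Learning rate falls linearly from peak_lr at step 0 to 0 at total_steps."""
    return peak_lr * max(total_steps - step, 0) / total_steps

for s in (0, 250_000, 500_000):
    print(s, linear_decay_lr(s))  # 1e-4, 5e-5, 0.0
```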
"xlnet". GitHub. Retrieved 2 January 2024. https://github.com/zihangdai/xlnet/ ↩
"Pretrained models — transformers 2.0.0 documentation". huggingface.co. Retrieved 2024-08-05. https://huggingface.co/transformers/v2.0.0/pretrained_models.html ↩
Yang, Zhilin; Dai, Zihang; Yang, Yiming; Carbonell, Jaime; Salakhutdinov, Ruslan; Le, Quoc V. (2 January 2020). "XLNet: Generalized Autoregressive Pretraining for Language Understanding". arXiv:1906.08237 [cs.CL]. /wiki/ArXiv_(identifier) ↩