Understanding XLNet Usage

mustafac
Feb 6, 2021

In this post, I will show simple XLNet usage and try to explain it with simple visualizations. There are already pretty nice tutorials out there, so I assume you have basic knowledge of XLNet. Like my other posts (attention, BERT), I try to show meaningful examples with the simplest dataset possible.

The code is on GitHub. ( Link )

The need for XLNet
Very briefly:
RNN -> unidirectional text sequence
ELMo -> combines left-to-right and right-to-left training
Transformer encoder -> bidirectional context

With these architectures we had either AE (autoencoding, e.g. BERT) or AR (autoregressive, e.g. GPT-2) models.
There were two problems with BERT:
1) BERT corrupts its input with [MASK] tokens that appear only during pretraining; this is called the pretrain-finetune discrepancy.
2) BERT predicts the masked tokens independently of each other.

And the problem with AR models (GPT-2) was that they are trained on a unidirectional context and cannot model deep bidirectional contexts.

XLNet came with the idea of the Permutation Language Model (PLM). PLM means training the network with different permutations of the words' factorization order. (Search for tutorials if you do not know it.) XLNet achieved AR-style pretraining with AE-style power: it can see context from both sides.
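To get a feeling for why only a subset of permutations can be used, here is a tiny illustration (my own, not from the original post) of how fast the number of factorization orders grows:

```python
import itertools
import math

# For a 3-token sentence like "i eat apple" there are 3! = 6 possible orders:
print(list(itertools.permutations(range(3))))
# [(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]

# For a realistic 20-token context the count explodes to 20!,
# which is why XLNet samples only a subset of orders during pretraining.
print(math.factorial(20))  # 2432902008176640000
```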

XLNet uses the SentencePiece tokenizer, while BERT uses WordPiece. You can check the details from Link.

Simple Usage

With a pretrained XLNet, we can create word embeddings, sentence embeddings, or use task-specific models (XLNetForSequenceClassification) for special purposes. So let's try a simple usage first. Below, at line 5, we get the token representation of our input sentence, then give it to the model and get the XLNet output.

Sample Code
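The gist itself is embedded in the original post; a minimal sketch of the same steps, assuming the xlnet-base-cased checkpoint from the Hugging Face transformers library, looks roughly like this:

```python
import torch
from transformers import XLNetTokenizer, XLNetModel

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetModel.from_pretrained("xlnet-base-cased")

# Tokenize the input sentence; XLNet appends <sep> and <cls> at the END
inputs = tokenizer("i eat apple", return_tensors="pt")
with torch.no_grad():
    output = model(**inputs)

print(inputs)                          # input_ids, token_type_ids, attention_mask
print(output.last_hidden_state.shape)  # torch.Size([1, 6, 768])
cls_vector = output.last_hidden_state[0, -1]  # the <cls> vector, shape [768]
```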

The output of the above code is below. We can see the step-by-step outputs. We give an input sentence and get:
input_ids (the token ids in the XLNet vocabulary)
token_type_ids (segment token indices: 0 -> sentence A, 1 -> sentence B)
attention_mask (mask telling which positions to attend to, e.g. 0 on padding token indices; 1 means not masked).
At the output we have the [CLS] token, which we can use as a whole-sentence embedding. We can also dump the individual word vectors.

inputs vectors : {
'input_ids': tensor([[ 17, 150, 2514, 10782, 4, 3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 2]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
output dictionary keys : odict_keys(['last_hidden_state', 'mems'])
last hidden state : torch.Size([1, 6, 768])
[CLS] torch.Size([1, 768])
input_ids tokens tensor([[ 17, 150, 2514, 10782, 4, 3]])
id -> token ['▁', 'i', '▁eat', '▁apple', '<sep>', '<cls>']
tokens : i eat apple<sep><cls>
-> torch.Size([768])
i -> torch.Size([768])
eat -> torch.Size([768])
apple -> torch.Size([768])
<sep> -> torch.Size([768])
<cls> -> torch.Size([768])

Now I will create a very simple sample dataset and try different visualizations. I created a set of 30 sentences with the word "right". In sentences 1–10 it means direction, in 11–20 it means suitable for, and in 21–30 it means a correct/true action.

For a sentence embedding we can use the special token [CLS] or the mean of the output vectors. The quality of the embedding changes according to your problem, so try both on your specific problem.

First I will try a simple embedding with the [CLS] token. Below is the code calling the method that outputs by [CLS]. The sentences are distributed randomly, of course; the network is not trained on our dataset. But you can still see that some sentences align nicely with sentences of similar context. Also, since I am taking the whole embedding of each sentence here, I do not expect them to align perfectly according to the different meanings of "right".

Get embedding of sentence by special token [CLS]
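The exact method is in the linked notebook; a minimal sketch of the idea, with a hypothetical helper name and reusing the tokenizer and model from above, could be:

```python
def sentence_embedding_cls(sentence, tokenizer, model):
    # Run the sentence through XLNet and keep only the <cls> vector.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)
    # Unlike BERT, XLNet places <cls> at the end of the sequence
    return output.last_hidden_state[0, -1]
```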
Distribution of sentence embeddings

Let’s try the other approach: get the sentence embedding as the mean of all output tokens. Again, the sentences are distributed randomly. (Check the code for the details of the methods.)

Get sentence embedding by mean of outputs
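Again a sketch rather than the original gist; the helper name is hypothetical:

```python
def sentence_embedding_mean(sentence, tokenizer, model):
    # Average all token vectors of the last hidden layer into one 768-dim vector.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)
    return output.last_hidden_state[0].mean(dim=0)
```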
Distribution of sentence embeddings

Now we have sentence embeddings computed in two ways. At the end of the post, we will train the network and the embeddings will change.

*** In the visuals above, the embeddings have dimension 768. I must reduce this to a lower dimension for comparison. When we do dimensionality reduction with PCA, t-SNE, or both, we lose some of the data. As far as I have read, best practice is to first reduce the high dimension to 50 with PCA and then apply t-SNE. Here I cannot do that because this is a small synthetic dataset, but you can see this block in the code and try it on a real problem.
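As a sketch, the PCA-then-t-SNE reduction mentioned above could look like this (scikit-learn assumed; on our 30-sentence set the PCA step is skipped):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def reduce_to_2d(embeddings: np.ndarray) -> np.ndarray:
    # embeddings: (n_sentences, 768) array of sentence vectors.
    # PCA to 50 dims needs more than 50 samples; on a small set
    # like ours, skip it and apply t-SNE directly.
    if embeddings.shape[0] > 50:
        embeddings = PCA(n_components=50).fit_transform(embeddings)
    # perplexity must be smaller than the number of samples
    return TSNE(n_components=2, perplexity=5).fit_transform(embeddings)
```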

XLNet is a Permutation Language Model, which very simply means training the network with different orderings of the words in a sentence. (Check for tutorials that explain this long concept.) But do not forget: we do not use every permutation, because that would generate too many combinations; only a subset is used. So here I make a hypothesis: if the context is long, since there will be lots of permutations and we are not using all of them, the quality of the embedding will decrease. Now I will test this idea. I will begin with simple sentences, make them more complicated, and show that the quality of the embedding diminishes. (In a pure attention model, like the Transformer encoder, I would not expect this, because it sees all inputs at the same time.)

Simple Sentences with 3 different “right” meaning and their distribution.
Cosine table for sentence x sentence
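The cosine table itself is straightforward to compute; a minimal sketch with PyTorch:

```python
import torch
import torch.nn.functional as F

def cosine_table(embeddings: torch.Tensor) -> torch.Tensor:
    # embeddings: (n_sentences, 768) -> (n_sentences, n_sentences) similarities
    normed = F.normalize(embeddings, dim=1)   # L2-normalize each row
    return normed @ normed.T                  # dot products = cosine similarities
```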

We can see that each sentence is near its most similar sentences. Now I claim that if I make the sentences longer without changing their meaning, the similarity of the "right" embeddings will diminish even as the content increases. (The "right" meanings will be more scattered: near items will drift apart, or unrelated items will come closer.)

Cosine table for sentence x sentence

As you can see, for the first two sentences the similarity within the pair increased, but it also increased against the others, so their distribution got worse. If we make the sentences even longer we get even worse embeddings. Check below to see how the sentences drift further away from each other. This was a small POC to understand the concept. Of course, this is a synthetic set just for testing; I do not claim this to be the general case.

Cosine table for sentence x sentence

Training XLNet

Now I will train this network with the sentences I showed before. I group the sentences according to the meaning of the word "right", and at the end I expect the "right" embeddings to align much better than before.

I will use a very simple network architecture and get the embedding of the word "right" (line 10), and the network will learn better and better embeddings through training. You can also test with the full sentence embedding; it will be slower. Also be careful: when I say a better embedding, I do not mean generally better, I mean better for this specific problem and dataset. I am teaching the network how to align vectors according to my loss function.
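The actual training code is in the repo; a minimal sketch of such a head, with hypothetical class and argument names, and assuming the position of "right" is known for each sentence, could be:

```python
import torch
import torch.nn as nn
from transformers import XLNetModel

class RightMeaningClassifier(nn.Module):
    # Hypothetical sketch: a linear head on the embedding of the word "right",
    # trained to separate its 3 meanings (direction / suitable / correct).
    def __init__(self, n_classes=3):
        super().__init__()
        self.xlnet = XLNetModel.from_pretrained("xlnet-base-cased")
        self.head = nn.Linear(768, n_classes)

    def forward(self, input_ids, attention_mask, right_pos):
        hidden = self.xlnet(input_ids=input_ids,
                            attention_mask=attention_mask).last_hidden_state
        # Pick the hidden vector at the position of "right" in each sentence
        right_vec = hidden[torch.arange(hidden.size(0)), right_pos]
        return self.head(right_vec)

# Trained with nn.CrossEntropyLoss against the 3 meaning labels.
```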

Now we can check what happened to the "right" embedding after training.

Code for generating sentence embedding by [CLS] with trained network.
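Reusing the hypothetical classifier from the sketch above, re-embedding a sentence after training is just a forward pass through its encoder:

```python
# model: the trained RightMeaningClassifier from the sketch above
model.eval()
inputs = tokenizer(sentence, return_tensors="pt")  # one of the 30 "right" sentences
with torch.no_grad():
    hidden = model.xlnet(**inputs).last_hidden_state
embedding_after_training = hidden[0, -1]  # <cls> is still the last token
```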

Above you can see that the embeddings aligned very well according to the meaning (our label). We can also check the sentence embedding by mean below; we can see that the mean is not as good as using [CLS].
What we achieved with training is this: we showed the network samples of sentences containing the word "right" and penalized it for its wrong guesses. So it learned what kind of vector it must create when it sees a context like the ones in our sentences. In the end, our "right" vectors became totally separable.

Code for generating sentence embedding by mean with trained network.

In this post, I tried to show the building blocks of XLNet usage. The methods here are optimized for visualization, not for production use (methods should process batched input, not one sentence at a time). But this could be a good starting point to see what embeddings your real dataset will produce with XLNet. I suggest trying something simple like my approach when you begin a problem: first try your set without a trained model, then train the model, do the same again, and check whether the network works as you expect.
