
Analysing time complexity of sentence-transformers’ model.encode

neha tamore


sentence-transformers has been the talk of the NLP town since the initial release of Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. The library has seen wide adoption thanks to its ability to embed phrases, sentences and even paragraphs effectively, which has made it a popular choice across a variety of applications in industry. Once you have an appropriate vector representation of your text data, the opportunities are limitless, be it text similarity, semantic search, or paraphrase mining.

This blog is intended to help NLP practitioners squeeze performance out of the sentence-transformers library while working on huge datasets. Specifically, it will cover:

  • HuggingFace transformer vs sentence-transformer
  • Using sentence-transformers on CPU/GPU
  • Using parallel processing
  • Effect of sequence length
  • Effect of batch_size

HuggingFace transformer vs sentence-transformer

sentence-transformers uses models from the HuggingFace transformers library and provides an efficient and reliable way to encode sentences.

For example, a ~30-line implementation of mean pooling over token embeddings using the bert-base-uncased model from HuggingFace transformers can be replaced by a mere 3 lines of code with sentence-transformers.

Code for generating sentence embeddings
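Below is a minimal sketch of the two approaches, assuming bert-base-uncased for the manual mean-pooling route and sentence-transformers/all-MiniLM-L6-v2 for the library route:

# Manual route: mean pooling over token embeddings with HuggingFace transformers
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
hf_model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["This is a sentence", "Each sentence becomes one fixed-size vector"]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = hf_model(**encoded).last_hidden_state   # (batch, seq_len, hidden)
mask = encoded["attention_mask"].unsqueeze(-1).float()          # zero out padding tokens
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

# Library route: the sentence-transformers equivalent in 3 lines
from sentence_transformers import SentenceTransformer
st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = st_model.encode(sentences)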

Here is the performance comparison of the two approaches:

Time taken to embed 100,000 sentences using sentence-transformers/all-MiniLM-L6-v2

Clearly, the sentence-transformers library has been optimized for creating sentence-level representations, leading to a more than 2x speedup. One of the crucial contributing factors behind this efficiency is smart batching: sentences are sorted by length before batching, so each batch needs minimal padding.

Using sentence-transformers on CPU/GPU

While using sentence-transformers to encode billions of sentences, speed and memory considerations need to be evaluated.

GPUs provide a huge speedup and can be enabled directly via the device parameter, which accepts any PyTorch device string (e.g. cpu, cuda, cuda:0).
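For example, a minimal sketch using the same model as elsewhere in this post:

from sentence_transformers import SentenceTransformer

# Any PyTorch device string works here: "cpu", "cuda", "cuda:0", ...
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")
embeddings = model.encode(["This is a sentence"])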

For mammoth datasets, parallel processing can be handy to lower the overall compute time. Multiprocessing in sentence-transformers can be enabled as follows:

  • On a multi-core machine: spawning 1 process per CPU core
  • On a multi-GPU machine: starting multiple processes (1 per GPU). This approach gives a nearly linear speed-up when encoding large text collections.

from sentence_transformers import SentenceTransformer
import time

if __name__ == '__main__':  # required, since the pool spawns child processes
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    sentences_list = ["This is a sentence"] * 1_000_000
    start_time = time.time()

    # Start the multi-process pool on all available CUDA devices or CPU cores
    pool = model.start_multi_process_pool()

    # Compute the embeddings using the multi-process pool
    emb = model.encode_multi_process(sentences_list, pool)

    model.stop_multi_process_pool(pool)
    print("Time taken:", time.time() - start_time)

Effect of batch_size

In sentence-transformers, data is broken into batches of 32 by default; this can be tuned based on the CPU/GPU and the memory available. Increasing batch_size to maximise memory utilisation might not always be a good strategy, as the sequence lengths and the padding required for smart batching also play a role in the total compute time.
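The batch size is passed directly to encode; a minimal sketch (the value 64 here is only illustrative):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = ["This is a sentence"] * 10_000

# batch_size defaults to 32; larger batches improve GPU utilisation
# only as long as each padded batch still fits in memory
embeddings = model.encode(sentences, batch_size=64)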

Time taken to encode ~1.31M sentences using sentence-transformers/all-MiniLM-L6-v2 for varying batch sizes

Effect of sequence length

If smart batching alone yields a 2x speedup (measured on a fairly small dataset of 10k examples), sequence length clearly plays a pivotal role in compute time. But what's the time complexity? For transformer models like BERT, RoBERTa, DistilBERT, etc., the runtime and the memory requirement grow almost quadratically with the input length, owing to the self-attention mechanism, which computes pairwise interactions between all tokens.

Another important aspect is that every model has a different maximum sequence length (determined by its training architecture). As a rule of thumb, 512 tokens (the max_seq_length for BERT) correspond to roughly 300–400 English words. Texts longer than max_seq_length are truncated, so this truncation should be kept in mind while encoding longer sentences or paragraphs.
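The configured limit can be inspected, and lowered, via the max_seq_length attribute; a minimal sketch:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Tokens beyond this limit are silently truncated (256 for this model)
print(model.max_seq_length)

# It can be lowered, e.g. to speed up encoding of short texts,
# but it cannot meaningfully exceed what the underlying model supports
model.max_seq_length = 128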

Time taken to encode 100K sentences using sentence-transformers/all-MiniLM-L6-v2 for varying seq lengths

Summing it up:

  1. The sequence length (number of tokens) of the sentences contributes to the memory requirement and the compute time of the embeddings.
  2. sentence-transformers is better optimized for encoding sentences than extending plain token embeddings from HuggingFace transformers.
  3. Increasing batch_size lowers the time taken to encode by improving GPU utilization, and should be chosen based on the available CUDA memory. The default batch size is 32.
  4. Beyond a threshold, increasing the batch_size adversely impacts the time taken to encode the data.
  5. max_seq_length varies across pre-trained models, and longer sentences are truncated.
  6. The sentence-transformers library tries to use a GPU by default, if one is available.
  7. Multiple GPUs can be used to encode massive datasets. Likewise, multiple cores can be used on a multi-core machine.
  8. sentence-transformers can also be used on streaming data; see the sketch below. This is especially handy when you don't want to wait for an extremely large dataset to download, or when you want to limit the amount of memory used. More info about dataset streaming: https://huggingface.co/docs/datasets/stream
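A minimal sketch of encoding a streamed dataset in chunks, assuming the HuggingFace datasets library in streaming mode; the dataset name, column, and chunk size below are only illustrative:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
stream = load_dataset("ag_news", split="train", streaming=True)

batch = []
for example in stream:
    batch.append(example["text"])
    if len(batch) == 1_000:            # encode in fixed-size chunks
        embeddings = model.encode(batch)
        # ... store or index the embeddings here ...
        batch = []
if batch:                              # flush the final partial chunk
    embeddings = model.encode(batch)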

That’s it for today! Thanks for reading :)
