Sentence Embeddings¶

We'll see how we can represent sentences using vectors in a high-dimensional space, and how we measure and visualize similarity in that space.

Example based on https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py

Install and Import¶

Press the Run button below (next to "3 cells hidden")

In [ ]:
# Install the needed libraries
!pip install -q sentence-transformers
In [ ]:
# Set up TensorBoard to view the embeddings.

import tensorflow as tf
import tensorboard as tb
from torch.utils.tensorboard import SummaryWriter

# Work around a known conflict between TensorFlow and PyTorch's
# SummaryWriter.add_embedding by pointing tf.io.gfile at
# TensorBoard's stub implementation.
tf.io.gfile = tb.compat.tensorflow_stub.io.gfile

%load_ext tensorboard
In [ ]:
# Import libraries we'll need
from sentence_transformers import SentenceTransformer, util
import os
import csv
import time

Load Model and Data¶

In this example, we download a large set of questions from Quora and then find similar questions in this set.

Press the Run button below.

In [ ]:
# Load the model for computing sentence embeddings. We use one trained for detecting similar questions.
model = SentenceTransformer('all-MiniLM-L6-v2')
#model = SentenceTransformer('all-mpnet-base-v2')
In [ ]:
# We download the Quora Duplicate Questions Dataset (https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)
# and find similar questions in it
url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 5000 # Limit our corpus to the first 5k unique questions


# Download the dataset if we don't have a local copy yet
if not os.path.exists(dataset_path):
    print("Download dataset")
    util.http_get(url, dataset_path)

# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        corpus_sentences.add(row['question1'])
        corpus_sentences.add(row['question2'])
        if len(corpus_sentences) >= max_corpus_size:
            break
corpus_sentences = list(corpus_sentences)

Compute Sentence Vectors¶

We tell the model to compute the embeddings for each sentence. This will take about a minute.

In [ ]:
corpus_sentences[:3]
In [ ]:
len(corpus_sentences)
In [ ]:
corpus_embeddings = model.encode(corpus_sentences, batch_size=64, show_progress_bar=True, convert_to_tensor=True)
In [ ]:
corpus_embeddings[2].shape

Visualize Sentence Vectors¶

Run the two cells below to launch a viewer to show these embeddings.

Switch to UMAP mode (bottom-left pane).

Try rotating the view by dragging. Notice that some points that appeared to be on top of each other were actually in different places; they only looked nearby because we were taking a 2D picture of a 3D space. By analogy, even the 3D view is a picture of a much higher-dimensional space (384 dimensions in this case).

Rotate the view around until you can clearly see a clump of points that isn't overlapped with some other points. It's easiest to see these on the outside edges of the "ball" of data. Mouse around that clump to see what the sentences are. Try to identify a characteristic that those sentences have in common. Also think about what's different among those sentences: what does the embedding projection not capture?

Next, try clicking on an individual sentence. Look on the right pane: this is "getting a tape measure out" and looking at distances (or similarities) in the original space.

In [ ]:
# Write the embeddings to a file so that the projector can view them.
writer = SummaryWriter()
writer.add_embedding(corpus_embeddings, metadata=corpus_sentences)
writer.close()
In [ ]:
%tensorboard --logdir=runs

Find Clusters¶

The approach we'll use here looks for "communities" of sentences. It tries to find groups of highly-similar sentences. It doesn't try to assign every sentence to a community.

There are two parameters that we can configure:

  1. How similar do sentences need to be? If a sentence isn't similar enough to a community, it won't get included.
  2. How big do communities need to be? If a community is too small, it won't get reported.
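
To build intuition for what community_detection is doing, here is a minimal sketch of the idea in plain PyTorch. This is not the library's actual implementation (the real one is faster and sorts its output); the function name and the greedy single pass here are ours, for illustration only.

In [ ]:
# A rough sketch of community detection (illustrative, not the real implementation)
import torch

def sketch_community_detection(embeddings, min_community_size=5, threshold=0.75):
    # Normalize rows so the dot product of two rows equals their cosine similarity
    emb = torch.nn.functional.normalize(embeddings, dim=1)
    cos_scores = emb @ emb.T
    communities = []
    assigned = set()
    for i in range(len(emb)):
        if i in assigned:
            continue
        # Every sentence similar enough to sentence i (including i itself)
        members = [m for m in torch.where(cos_scores[i] >= threshold)[0].tolist()
                   if m not in assigned]
        if len(members) >= min_community_size:
            communities.append(members)
            assigned.update(members)
    return communities

# e.g. sketch_community_detection(corpus_embeddings) returns lists of sentence indices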
In [ ]:
start_time = time.time()

# Two parameters to tune:
# min_community_size: only report communities that contain at least this many sentences
# threshold: treat sentence pairs with a cosine similarity above this value as similar
clusters = util.community_detection(corpus_embeddings, min_community_size=5, threshold=0.75)

print("Clustering done after {:.2f} sec".format(time.time() - start_time))

# For each community, print the first 3 and last 3 sentences
for i, cluster in enumerate(clusters):
    print("\nCommunity {} ({} sentences)".format(i+1, len(cluster)))
    for sentence_id in cluster[0:3]:
        print("\t", corpus_sentences[sentence_id])
    print("\t", "...")
    for sentence_id in cluster[-3:]:
        print("\t", corpus_sentences[sentence_id])

How does it work?¶

Now we have a vector for each sentence (in this case, each question). They are stored in an object called a tensor. Each row of the tensor corresponds to a sentence. The elements in that row are the vector for that sentence.

In [ ]:
corpus_embeddings
In [ ]:
corpus_embeddings.shape

Here is how we can get out the vector for a sentence.

In [ ]:
# Let's look for a few example sentences by keyword
gmail_sents = [(i, sent) for i, sent in enumerate(corpus_sentences) if 'password' in sent.lower() and 'gmail' in sent.lower()]
gmail_sents
In [ ]:
sentence_idx = gmail_sents[0][0]
print("Getting the vector for sentence {}: \"{}\"".format(sentence_idx, corpus_sentences[sentence_idx]))
vec = corpus_embeddings[sentence_idx]

print("The vector has", len(vec), "elements.")

Looking for Similar Vectors¶

Let's compute the similarity of this vector with every other vector. We do this by multiplying corresponding elements of the two vectors and adding up the results. (This is called a dot product.) It turns out we can do this for all sentences at once with a single matrix multiplication: multiplying the matrix of embeddings (one row per sentence) by our vector computes the dot product of every row with that vector.

In [ ]:
similarity_scores = corpus_embeddings.matmul(vec)
similarity_scores
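
As a sanity check, we can confirm that one entry of this result matches the multiply-and-add recipe computed by hand. The index 0 below is an arbitrary choice; any sentence would do.

In [ ]:
# Compute the dot product with one arbitrary sentence "by hand"
other_idx = 0
by_hand = (corpus_embeddings[other_idx] * vec).sum()
print(by_hand.item(), similarity_scores[other_idx].item())  # the two numbers should match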

Which sentences have the most similar vectors? Let's find the sentences corresponding to the top k similarity scores.

In [ ]:
[corpus_sentences[i] for i in similarity_scores.topk(15).indices]
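
To see how strong each match is, we can also print the similarity score next to each sentence. This is just an illustrative variation on the cell above.

In [ ]:
# Show each top match together with its similarity score
top_results = similarity_scores.topk(15)
for score, idx in zip(top_results.values, top_results.indices):
    print("{:.3f}\t{}".format(score.item(), corpus_sentences[idx]))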