Example fasttext text similarity deployment

In this example we'll use fasttext's pre-trained word embeddings and scipy's cosine distance to score the similarity of pairs of text. Unlike a bag-of-words model, these word embeddings allow the similarity score to account for synonyms and related words.

Setting up the environment

To get started, install the latest versions of fasttext and modelbit:

pip install --upgrade fasttext modelbit

Then download the fasttext pre-trained word embeddings, and reduce the dimensionality from 300 to 30 so the model can fit in RAM:

import fasttext, fasttext.util

fasttext.util.download_model('en', if_exists='ignore')
model = fasttext.load_model('cc.en.300.bin')
fasttext.util.reduce_model(model, 30)
model.save_model("cc.en.30.bin")
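To see why the reduction helps, note that a dense float32 embedding matrix needs rows × dimensions × 4 bytes. The sketch below is back-of-envelope arithmetic only: the row count of 4 million is an illustrative assumption, not a figure from this guide, and a real model file holds more than a single matrix.

```python
BYTES_PER_FLOAT32 = 4

def embedding_matrix_bytes(rows: int, dim: int) -> int:
    """Memory for a dense float32 matrix with `rows` vectors of length `dim`."""
    return rows * dim * BYTES_PER_FLOAT32

# ASSUMPTION: 4 million rows is a made-up round number used for illustration.
ASSUMED_ROWS = 4_000_000

full = embedding_matrix_bytes(ASSUMED_ROWS, 300)
reduced = embedding_matrix_bytes(ASSUMED_ROWS, 30)

print(f"300 dims: {full / 1e9:.2f} GB")
print(f" 30 dims: {reduced / 1e9:.2f} GB")
```

Whatever the actual row count, cutting the dimension from 300 to 30 shrinks such a matrix tenfold.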

Finally, log in to Modelbit from the notebook you'll use for feature engineering and deployment:

import modelbit
mb = modelbit.login()

Building the similarity function

First, we load the model we just created, cc.en.30.bin. We use the model's get_sentence_vector to get a numpy.ndarray representing the combined word embeddings for each sentence. Then we call scipy's spatial.distance.cosine, which returns the cosine distance between the two sentence vectors. Because the score is a distance, sentences with lower scores are more similar.
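For reference, scipy's spatial.distance.cosine computes 1 minus the cosine of the angle between two vectors. The following is a minimal pure-Python sketch of that formula, so it runs without scipy or a fasttext model. The helper name cosine_distance and the 3-dimensional toy vectors are hypothetical stand-ins for real sentence embeddings.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy vectors standing in for sentence embeddings.
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_distance([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 0, unrelated (orthogonal) vectors score 1, and opposite vectors score 2, which is why a lower score indicates more similar text.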

To deploy to Modelbit, we pass our fasttext_similar function, along with the required Python and system packages, to mb.deploy():

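import fasttext
# Import the submodule explicitly; a bare "import scipy" may not load scipy.spatial.
import scipy.spatial.distance

# Load the reduced 30-dimensional model saved earlier.
model = fasttext.load_model("cc.en.30.bin")

def fasttext_similar(text1: str, text2: str) -> float:
    # Cosine distance between the two sentence vectors: lower means more similar.
    return scipy.spatial.distance.cosine(
        model.get_sentence_vector(text1),
        model.get_sentence_vector(text2))

mb.deploy(fasttext_similar,
          python_packages=["fasttext==0.9.2", "scipy==1.9.3"],
          system_packages=["g++"])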

Using the similarity function

We can then call our similarity function from SQL to get the similarity scores for text content in our warehouse:

with example_text as (
    select 'the cat is lazy' as text_col_1, 'the kitten is sleeping' as text_col_2
    union all select 'a monster ate my lunch', 'a bicycle has two wheels'
)

select
    text_col_1,
    text_col_2,
    fasttext_similar_latest(text_col_1, text_col_2) as similarity_score
from example_text

The results show that the first pair of sentences has a lower score than the second, and is therefore more similar:

text_col_1             | text_col_2               | similarity_score
the cat is lazy        | the kitten is sleeping   | 0.170992136
a monster ate my lunch | a bicycle has two wheels | 0.440790832
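Since the score is a cosine distance, sorting in ascending order ranks pairs from most to least similar. Here is a small sketch using the two result rows above as hard-coded data. The rows list and variable names are illustrative only and are not part of the Modelbit API.

```python
# Rows copied from the example query result:
# (text_col_1, text_col_2, similarity_score)
rows = [
    ("a monster ate my lunch", "a bicycle has two wheels", 0.440790832),
    ("the cat is lazy", "the kitten is sleeping", 0.170992136),
]

# Lower cosine distance means more similar, so sort ascending by score.
ranked = sorted(rows, key=lambda row: row[2])
most_similar = ranked[0]

for text1, text2, score in ranked:
    print(f"{score:.3f}  {text1!r} vs {text2!r}")
```

The same ordering can be produced in the warehouse by adding an ascending order by on similarity_score to the query.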