Example fasttext text similarity deployment
In this example we'll use fasttext's pre-trained word embeddings and scipy's cosine distance to score the similarity of pairs of text. Unlike a bag-of-words model, these word embeddings allow the similarity score to account for synonyms and related words.
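To illustrate the limitation of a bag-of-words model mentioned above, here is a minimal sketch using only the Python standard library. The function name bow_cosine_distance and the example phrases are hypothetical and chosen for illustration; the sketch assumes non-empty, whitespace-separated text.

```python
import math
from collections import Counter


def bow_cosine_distance(text1: str, text2: str) -> float:
    """Cosine distance between simple word-count vectors (illustrative only)."""
    counts1 = Counter(text1.lower().split())
    counts2 = Counter(text2.lower().split())
    # Only words that appear in both texts contribute to the dot product.
    dot = sum(counts1[word] * counts2[word] for word in counts1)
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    return 1.0 - dot / (norm1 * norm2)


# Neither pair shares a word, so both get the same bag-of-words distance,
# even though the first pair is much closer in meaning.
related = bow_cosine_distance("lazy cat", "sleepy kitten")
unrelated = bow_cosine_distance("lazy cat", "two wheels")
print(related, unrelated)
```

Because word counts only match on exact tokens, "cat" and "kitten" look as unrelated as "cat" and "wheels". Word embeddings place related words near each other, which is what lets the score reflect meaning.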
Setting up the environment
To get started, install the latest versions of fasttext and modelbit:
pip install --upgrade fasttext modelbit
Then download the fasttext pre-trained word embeddings and reduce their dimensionality from 300 to 30 so the model can fit in RAM:
import fasttext, fasttext.util
fasttext.util.download_model('en', if_exists='ignore')
model = fasttext.load_model('cc.en.300.bin')
fasttext.util.reduce_model(model, 30)
model.save_model("cc.en.30.bin")
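As a rough illustration of why reducing the dimensionality helps, the sketch below estimates the size of a float32 embedding matrix at 300 and at 30 dimensions. The row count is an assumption made for this example (roughly 2 million words plus 2 million subword buckets) and is not stated above; the helper embedding_matrix_bytes is hypothetical. A real model file also contains other data, so treat the numbers as an order-of-magnitude estimate.

```python
BYTES_PER_FLOAT32 = 4


def embedding_matrix_bytes(rows: int, dim: int) -> int:
    """Approximate size of a dense float32 matrix with the given shape."""
    return rows * dim * BYTES_PER_FLOAT32


# Assumption for illustration: about 2M word rows plus 2M subword-bucket rows.
ASSUMED_ROWS = 4_000_000

full = embedding_matrix_bytes(ASSUMED_ROWS, 300)
reduced = embedding_matrix_bytes(ASSUMED_ROWS, 30)

print(f"300 dimensions: {full / 1e9:.2f} GB")
print(f" 30 dimensions: {reduced / 1e9:.2f} GB")
```

Memory for the embedding matrix scales linearly with the number of dimensions, so going from 300 to 30 dimensions shrinks it by about a factor of ten under these assumptions.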
Finally, log in to Modelbit from the notebook you'll use for feature engineering and deployment:
import modelbit
mb = modelbit.login()
Building the similarity function
First, we load the model we just created, cc.en.30.bin. We use the model's get_sentence_vector to get a numpy.ndarray representing the combined word embeddings for each sentence. Then we call scipy's spatial.distance.cosine to get the cosine distance between the two sentence vectors, which serves as our similarity score. Sentences with lower scores are more similar.
To deploy to Modelbit, we pass our fasttext_similar function, along with the required Python and system packages, to mb.deploy():
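import fasttext
import scipy.spatial.distance  # import the submodule explicitly so scipy.spatial.distance resolves

model = fasttext.load_model("cc.en.30.bin")

def fasttext_similar(text1: str, text2: str) -> float:
    # Cosine distance between the two sentence vectors; lower means more similar.
    return scipy.spatial.distance.cosine(
        model.get_sentence_vector(text1),
        model.get_sentence_vector(text2))

mb.deploy(fasttext_similar,
          python_packages=["fasttext==0.9.2", "scipy==1.9.3"],
          system_packages=["g++"])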
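To make the score returned by fasttext_similar easier to interpret, here is a minimal pure-Python sketch of the cosine distance formula, 1 - (u · v) / (‖u‖ ‖v‖), which is what scipy's spatial.distance.cosine computes. The helper cosine_distance and the toy vectors are hypothetical, and the sketch assumes non-zero vectors of equal length.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Vectors pointing the same way have a distance of approximately 0.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors have a distance of 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
# Vectors pointing in opposite directions have a distance of 2.
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

The score therefore ranges from 0 to 2 and depends only on the direction of the vectors, not their length, which is why lower scores indicate more similar sentences.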
Using the similarity function
We can then call our similarity function from SQL to get the similarity scores for text content in our warehouse:
with example_text as (
    select 'the cat is lazy' as text_col_1, 'the kitten is sleeping' as text_col_2
    union all select 'a monster ate my lunch', 'a bicycle has two wheels'
)
select
    text_col_1,
    text_col_2,
    fasttext_similar_latest(text_col_1, text_col_2) as similarity_score
from example_text
The results show that the first pair of sentences is more similar than the second, since its score is lower:
| text_col_1 | text_col_2 | similarity_score |
|---|---|---|
| the cat is lazy | the kitten is sleeping | 0.170992136 |
| a monster ate my lunch | a bicycle has two wheels | 0.440790832 |