Startec

Semantic Search Using Vectors/Embeddings For Noobs

May 17, 3:12 PM · 5 min read

If, like me, you started hearing about semantic search, vectors and embeddings only after LLMs (Large Language Models) were launched and find these terms confusing, I hope this blog brings you some clarity.

What is Semantic Search

Semantic search in Natural Language Processing (NLP) refers to the process of understanding the meaning or intent behind a user's search query and retrieving relevant information based on that understanding. Unlike traditional keyword-based search, which matches queries to documents based on exact word matches, semantic search aims to comprehend the context and semantics of the query to generate more accurate and contextually relevant results.
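To make the contrast concrete, here is a minimal sketch of a naive keyword-based search. The names keyword_search and tokenize and the tiny stop-word list are made up for this illustration and do not come from any library. It shows how a question can fail to match the document that answers it simply because they share no words.

```python
import re

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "it", "do", "does", "you",
              "how", "what", "this", "of", "for", "per"}

def tokenize(text):
    """Lowercase the text and return its set of non-stop-word tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

def keyword_search(query, documents):
    """Return (index, shared_word_count) for the document that shares
    the most words with the query."""
    query_words = tokenize(query)
    scores = [len(query_words & tokenize(doc)) for doc in documents]
    best = max(range(len(documents)), key=lambda i: scores[i])
    return best, scores[best]

documents = [
    "This software is used to handle your finances",
    "It costs 5000 rupees per year",
]

# Shares the word "software" with the first document.
print(keyword_search("Can you explain the use of this software", documents))
# Shares no words with either document, although the second one answers it.
print(keyword_search("How much do you charge?", documents))
```

The second query gets a score of 0 for every document, because "charge" and "costs" are different words even though they mean nearly the same thing. This is the gap that semantic search tries to close.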

The next question is how to make computers understand semantic information. Humans have very high cognitive capabilities, so they can easily understand semantics in multiple languages, but making a computer understand semantics is challenging.

In this blog, we are going to see how semantic information is captured using vectors/embeddings. In my previous blog, I showed how CountVectorizer and TF-IDF work; now we are going to see a more advanced yet simple and easy way to do semantic search.

What is a vector

Mathematically, a vector is a quantity that has both magnitude and direction.

Single Dimensional Vector
Here vectors A, B, C and D have magnitudes of 4 and 2; A, B and D have the same direction, but C has a different direction. These are one-dimensional vectors.

Multi Dimensional Vector
In mathematics, unlike in physics, a vector can have n dimensions; such vectors are called multi-dimensional vectors (each arrow in the above figure is a dimension). The all-MiniLM-L6-v2 model that we are going to use in this blog generates a 384-dimensional vector, that is, a list of 384 numbers. The information stored in these dimensions is used to find semantic similarity.
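As a rough illustration of how the numbers in a vector can express similarity, here is a small sketch of cosine similarity in plain Python. The three-dimensional vectors and their variable names are invented for this example; they are not real model output, which would have 384 numbers per sentence.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "embeddings", for illustration only.
price_question = [0.9, 0.1, 0.0]
price_answer = [0.8, 0.2, 0.1]
feature_answer = [0.1, 0.9, 0.3]

# Vectors pointing in nearly the same direction score close to 1.
print(cosine_similarity(price_question, price_answer))
# Vectors pointing in different directions score much lower.
print(cosine_similarity(price_question, feature_answer))
```

The idea is that sentences with similar meaning should end up as vectors pointing in similar directions, so comparing directions becomes a way of comparing meanings.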

Sentence Transformers

SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. It provides easy methods to compute embeddings (dense vector representations) for sentences, paragraphs and images.

Now we are going to see how to generate embeddings and do a semantic search using a pre-trained model from sentence transformers.

Pretrained Sentence Transformer Model - all-MiniLM-L6-v2

The model is available through the sentence-transformers Python library, which can be installed with pip:

pip install -U sentence-transformers


We are going to use the all-MiniLM-L6-v2 model which is a lightweight yet powerful model.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')


Now we can take a few question-answer sentences and generate embeddings from them and do a semantic search on them.

# Q&A sentences
question_answers = [
    "Q : What is this software used for? A : This software is used to handle your finances and provide useful suggestions",
    "Q : How much does it cost per year? A : It costs 5000 rupees per year",
    "Q : Is there a premium version available? A : Yes it is available for a cost of 7000 rupees per year",
    "Q : Why should I choose this rather than product Y? A : Our product outperforms in W and Z",
]
# Sentences are encoded by calling model.encode()
question_answer_embeddings = model.encode(question_answers, convert_to_tensor=True)


The encode function generates one embedding per sentence. Passing convert_to_tensor=True makes it return the embeddings as a PyTorch tensor rather than the default NumPy array.

That was easy!

Now we will ask a question and find semantically relevant content from the embeddings generated.

question = ['Can you explain the use of this software']
question_embeddings = model.encode(question, convert_to_tensor=True)


We also need to generate an embedding for our question, using the same model, so that it can be compared with the Q&A embeddings.

from sentence_transformers.util import semantic_search
hits = semantic_search(question_embeddings, question_answer_embeddings, top_k=1)


We are using a utility function called semantic_search. By default it scores the question embedding against every Q&A embedding with cosine similarity and returns, for each question, a list of hits sorted by score; each hit holds a corpus_id and a score. You can also compare the vectors with another metric, such as dot product, through its score_function parameter.
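To show roughly what such a search does, here is a simplified sketch in plain Python. It is not the library's implementation: toy_semantic_search, toy_cos_sim and toy_dot_score are hypothetical stand-ins, the vectors are made up, and unlike the real function this version handles a single query and works on plain lists instead of tensors.

```python
import math

def toy_dot_score(u, v):
    """Dot product of two vectors (hypothetical stand-in)."""
    return sum(a * b for a, b in zip(u, v))

def toy_cos_sim(u, v):
    """Cosine similarity: dot product divided by both vector lengths."""
    norm_u = math.sqrt(toy_dot_score(u, u))
    norm_v = math.sqrt(toy_dot_score(v, v))
    return toy_dot_score(u, v) / (norm_u * norm_v)

def toy_semantic_search(query_vec, corpus_vecs, top_k=1,
                        score_function=toy_cos_sim):
    """Score every corpus vector against the query, best hits first."""
    hits = [
        {"corpus_id": i, "score": score_function(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:top_k]

# Made-up vectors; the third one is long but points elsewhere.
query = [0.9, 0.1, 0.0]
corpus = [[0.8, 0.2, 0.1], [0.1, 0.9, 0.3], [3.0, 3.0, 3.0]]

print(toy_semantic_search(query, corpus))
print(toy_semantic_search(query, corpus, score_function=toy_dot_score))
```

With cosine similarity the first corpus vector wins because it points in nearly the same direction as the query. With dot product the long third vector wins, since dot product also rewards magnitude. For vectors that all have length 1, the two metrics give the same ranking.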

Hits

print([question_answers[hit['corpus_id']] for hit in hits[0]])



question = ['How much do you charge?']
question_embeddings = model.encode(question, convert_to_tensor=True)
hits = semantic_search(question_embeddings, question_answer_embeddings, top_k=1)
print([question_answers[hit['corpus_id']] for hit in hits[0]])



As the examples above show, neither question exactly matches any entry in question_answers, yet we can still find the entry that is closest in meaning to our input.

There are many other models that generate even more powerful embeddings, and the quality of a semantic search depends directly on the quality of the embeddings.
Happy Learning :))
www.linkedin.com/in/praveenr2998

