Semantic Search for Customer Search Queries

Nish · October 13, 2024

Using Semantic Search to Understand Customer Search Queries and Need States

In today’s data-driven landscape, understanding customer needs is crucial to delivering personalized experiences. In marketing, defined personas, otherwise known as need states, are used to identify the core reasons consumers buy your products. Once you understand your customers’ needs, you can map them to need states, giving you a more complete understanding of your customer base.

One way to approach this challenge is by performing semantic search on customer search queries to map them against customer need states. This method helps us better capture the intent behind each query, driving more relevant outcomes for both businesses and consumers.

A typical workflow of applying semantic search functionality to customer search queries.

The Challenge

Traditionally, matching search queries to customer needs was done using keyword-based systems. However, this approach often fails to account for the nuances of human language, such as synonyms, contextual meanings, or more complex queries. Customers might express the same need in a variety of ways, making it hard for standard systems to deliver accurate results. This gap motivated us to explore a more advanced solution.

Semantic search takes customer queries to the next level by analyzing the meaning behind the words, rather than relying on an exact match. We implemented a system that aligns search terms with underlying customer need states, which are personas capturing a set of customer motivations, behaviors, and expectations.

To achieve this, we used advanced natural language processing (NLP) models that identify and interpret the intent behind different queries. By doing so, our system can match customers with the most relevant need state, whether they use industry jargon or casual language. This approach not only bridges the gap between diverse search queries but also ensures more accurate, personalized results.
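To make this concrete, here is a minimal sketch (the two query strings are invented purely for illustration) of how two differently worded queries with the same intent end up close together in embedding space:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Two queries that express the same need with almost no shared keywords
queries = ["affordable running shoes", "cheap sneakers for jogging"]
a, b = model.encode(queries, convert_to_numpy=True)

# Cosine similarity between the two embeddings
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {cosine:.3f}")  # expected to be relatively high

A keyword-based system would score this pair near zero overlap; the embedding comparison captures the shared intent.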

Libraries

We decided to use faiss, developed by Meta’s Fundamental AI Research group. It provides powerful similarity search algorithms that we can use out of the box. The sentence_transformers library is also useful since search queries typically take the form of short sentences; converting these into embedding vectors is a critical part of the search, and sentence_transformers handles this well. numpy and pandas are used for standard data preprocessing and computation.

import faiss
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd

Code Implementation

The idea here is to define functionality that flexibly allows us to embed and add data to a corpus, then perform similarity search between the defined corpus and an input query. A class makes sense here, as there is a logical grouping, and it keeps things modular, maintainable, and extensible down the line.

class SemanticSearch:
    """
    A class for performing semantic search using sentence embeddings and FAISS.

    Parameters
    ----------
    model_name : str, optional
        The name of the SentenceTransformer model to use.
        Default is 'all-MiniLM-L6-v2'.

    Attributes
    ----------
    model : SentenceTransformer
        The SentenceTransformer model used for encoding texts.
    indices : dict
        A dictionary of FAISS indices for each corpus.
    embeddings : dict
        A dictionary of embeddings for each corpus.
    data : dict
        A dictionary of original data for each corpus.
    """

    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.indices = {}
        self.embeddings = {}
        self.data = {}

    def add_corpus(self, corpus_name, corpus_data):
        """
        Adds a corpus to the semantic search system.

        Parameters
        ----------
        corpus_name : str
            The name of the corpus.
        corpus_data : list of str
            The data (texts) of the corpus.

        Returns
        -------
        None
        """
        embeddings = self.model.encode(corpus_data, convert_to_numpy=True)
        # Normalize embeddings for cosine similarity
        faiss.normalize_L2(embeddings)
        index = faiss.IndexFlatIP(embeddings.shape[1])
        index.add(embeddings)
        self.indices[corpus_name] = index
        self.embeddings[corpus_name] = embeddings
        self.data[corpus_name] = corpus_data

    def search(self, query_texts, target_corpus, k=1, threshold=None):
        """
        Searches for the most similar texts in the target corpus to the given query texts.

        Parameters
        ----------
        query_texts : list of str
            The list of query texts.
        target_corpus : str
            The name of the corpus to search in.
        k : int, optional
            The number of top results to return. Default is 1.
        threshold : float, optional
            The minimum similarity score to consider. If None, all results are returned.

        Returns
        -------
        list of list of tuple
            A list where each element corresponds to a query text and contains
            a list of tuples (matched_text, similarity_score).
        """
        # Encode and normalize query texts
        query_embeddings = self.model.encode(query_texts, convert_to_numpy=True)
        faiss.normalize_L2(query_embeddings)
        index = self.indices[target_corpus]

        # Perform the search
        distances, indices = index.search(query_embeddings, k)
        results = []
        for i in range(len(query_texts)):
            res = []
            for j in range(k):
                sim = distances[i][j]  # Inner product of normalized vectors = cosine similarity
                idx = indices[i][j]
                if idx == -1:
                    # FAISS pads results with -1 when k exceeds the corpus size
                    continue
                if threshold is None or sim >= threshold:
                    res.append((self.data[target_corpus][idx], sim))
            results.append(res)
        return results

# Initialize the semantic search
semantic_search = SemanticSearch()

# Assume 'loaded_list' contains your search queries (list of strings)
# Assume 'needslist' contains your need states (list of strings)

# Add corpora to the semantic search
semantic_search.add_corpus('search_queries', loaded_list)
semantic_search.add_corpus('need_states', needslist)
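Here, loaded_list and needslist are assumed to be loaded elsewhere; for a self-contained trial run, small dummy lists (invented here purely for illustration) work just as well:

# Illustrative stand-ins for real data
loaded_list = [
    "cheap sneakers for jogging",
    "best face cream for dry skin",
    "protein powder for muscle gain",
]
needslist = [
    "affordable athletic footwear",
    "skincare and hydration",
    "fitness and nutrition",
]

semantic_search.add_corpus('search_queries', loaded_list)
semantic_search.add_corpus('need_states', needslist)

Key aspects of the implementation: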
  • Model Selection: Uses SentenceTransformer to encode texts into embeddings, leveraging a pre-trained model such as ‘all-MiniLM-L6-v2’ for fast, efficient, and accurate semantic understanding. This can easily be swapped out for an alternative better suited to your domain.
  • Normalization: Embeddings are normalized using faiss.normalize_L2 so that inner-product search is equivalent to cosine similarity (see the snippet after this list). Using cosine similarity gives the chosen threshold a consistent, interpretable meaning.
  • Indexing: Utilizes FAISS (faiss.IndexFlatIP) for exact brute-force similarity search, allowing fast retrieval of the top-k similar items based on cosine similarity. Approximate index types are also available and scale better, but can be less accurate.
  • Corpus Management: Maintains separate indices, embeddings, and original data for each corpus, enabling flexible and scalable handling of multiple datasets.
  • Threshold Filtering: Allows optional thresholding on similarity scores to filter out less relevant results, enhancing the quality of search results.
  • Versatile Search: Supports searching in both directions (need states to search queries and vice versa), demonstrating the flexibility of the search functionality.
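As a quick sanity check on the normalization point above, the following sketch (with random vectors standing in for real embeddings) verifies that inner-product search over L2-normalized vectors reproduces cosine similarity:

import faiss
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8)).astype('float32')
query = rng.normal(size=(1, 8)).astype('float32')

# Cosine similarity computed directly, before normalization
direct = (vectors @ query.T).ravel() / (
    np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
)

# Same result via FAISS: normalize in place, then inner-product search
faiss.normalize_L2(vectors)
faiss.normalize_L2(query)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)
scores, ids = index.search(query, 5)

print(np.allclose(sorted(direct)[::-1], scores.ravel(), atol=1e-5))  # True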

Example 1: Given a need state, find top k closest search queries

For exploratory analysis, it can be worthwhile to see the top customer search queries for each need state. This builds a greater understanding of which types of search queries you can expect to link to a need state, allowing you to revise your need states if they aren’t capturing the search queries you’d like.

# Parameters
threshold = 0.8  # Adjust the threshold as needed
k = 3           # Number of top results to retrieve

results_need_to_queries = semantic_search.search(
    query_texts=needslist,
    target_corpus='search_queries',
    k=k,  # Number of closest search queries to retrieve
    threshold=threshold
)

# Create a DataFrame with need states and their top k search queries
need_states = needslist
top_search_queries_list = []
similarity_scores_list = []

for res in results_need_to_queries:
    top_queries = [item[0] for item in res]  # Extract the search queries
    similarity_scores = [item[1] for item in res]  # Extract similarity scores
    top_search_queries_list.append(top_queries)
    similarity_scores_list.append(similarity_scores)

# Create the DataFrame
df_need_to_queries = pd.DataFrame({
    'need_state': need_states,
    'top_search_queries': top_search_queries_list,
    'similarity_scores': similarity_scores_list
})

# Display the first few rows of the DataFrame
print(df_need_to_queries.head())
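Since top_search_queries holds a list per row, an optional follow-up step (assuming pandas 1.3+, which supports exploding multiple columns together) is to flatten the DataFrame into one row per (need state, query) pair for easier inspection:

# Optional: one row per (need_state, matched query) pair
df_exploded = df_need_to_queries.explode(['top_search_queries', 'similarity_scores'])
print(df_exploded.head())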

Example 2: Given a search query, find the closest need state

This is the ultimate aim, as it allows you to tie each customer search query to its closest matching need state.

# Parameters
threshold = 0.8  # Adjust the threshold as needed
k = 1           # Number of top results to retrieve

results_query_to_need = semantic_search.search(
    query_texts=loaded_list,
    target_corpus='need_states',
    k=k,  # Only the closest need state
    threshold=threshold
)

# Create a DataFrame with search queries and their closest need state
search_queries = loaded_list
closest_need_states = []
similarity_scores = []

for res in results_query_to_need:
    if res:
        closest_need_states.append(res[0][0])  # The need state text
        similarity_scores.append(res[0][1])    # The similarity score
    else:
        closest_need_states.append(None)
        similarity_scores.append(None)

# Create the DataFrame
df_query_to_need = pd.DataFrame({
    'search_query': search_queries,
    'closest_need_state': closest_need_states,
    'similarity_score': similarity_scores
})

# Display the first few rows of the DataFrame
print(df_query_to_need.head())
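As an illustrative follow-up, you can check how queries distribute across need states, and (if the search is re-run with threshold=None so every score is present) how coverage varies with the threshold:

# Distribution of queries across matched need states
print(df_query_to_need['closest_need_state'].value_counts(dropna=False))

# Coverage at candidate thresholds (scores are cosine similarities)
scores = pd.to_numeric(df_query_to_need['similarity_score'], errors='coerce')
for t in [0.5, 0.6, 0.7, 0.8]:
    print(f"threshold={t:.1f}: {(scores >= t).mean():.1%} of queries matched")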

Impact on Customer Personalization

By integrating semantic search, we have been able to significantly improve how customer queries are mapped to need states. This allows businesses to provide more tailored search results, product recommendations, and solutions that align with what customers are actually looking for. This enhanced understanding of customer intent leads to a more personalized experience, increasing satisfaction and conversion rates.

Conclusion

Through the use of semantic search, we’ve been able to better connect customer search queries with the underlying customer need states, enabling more relevant and personalized interactions. As this technology evolves, we look forward to refining our models further to capture even more subtle nuances in customer behavior and intent.

Citation Information

If you find this content useful & plan on using it, please consider citing it using the following format:

@misc{nish-blog,
  title = {Semantic Search for Customer Search Queries},
  author = {Nish},
  howpublished = {\url{https://www.nishbhana.com/Semantic-Search/}},
  note = {[Online; accessed]},
  year = {2024}
}
