Fuzzy Text Search with MongoDB and Python

· 02/07/2018 · MongoDB ·

So you need to add fuzzy search matching to your Python & MongoDB project. Though MongoDB offers quite a few handy text search features out of the box, any search for "Dostoyevsky" requires you to spell good ol' Fyodor's name exactly right.
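
For illustration, here's roughly what the built-in route looks like in PyMongo: a text index gets you stemming and stop-word handling, but a near-miss spelling simply matches nothing. (The books collection here is a placeholder.)

from pymongo import MongoClient, TEXT

# Placeholder collection holding documents with title/author fields.
books = MongoClient()["library"]["books"]
books.create_index([("title", TEXT), ("author", TEXT)])

books.find({"$text": {"$search": "Dostoyevsky"}})  # exact spelling: matches
books.find({"$text": {"$search": "Dostoevski"}})   # close, but returns nothing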

There are likely dozens of solutions to this problem, including switching to another database that supports fuzzy matching. That may be the right answer for you, but for most projects it's a tall order just to gain one feature.

Here's the solution I use for Serial Reader's search. It may not be the best, but it works well enough.

New search term collection

The goal is to end up with a new database collection, call it search_terms, where we can store the keywords we want to search against, the original "clean" keyword, and the ObjectId of the source item.


{
  "keyword": "SEARCH_TERM",
  "original": "Search Term",
  "item": ObjectId("item_id")
}

In this way we can have any number of keywords pointing to any given source item, and we need only index the 'keyword' field for speedy queries.

(I opted to store only the ObjectId of the target object but you could obviously store more data to avoid an additional database hit to get the target object's information.)
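
Setting that up in PyMongo is brief. A minimal sketch, with the connection details and database name as placeholders:

from pymongo import MongoClient

# Hypothetical connection and database name; swap in your own.
db = MongoClient()["serialreader"]
search_terms_collection = db["search_terms"]

# The only field we query against, so the only index we need.
search_terms_collection.create_index("keyword")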

Double Metaphone and NYSIIS

To make sure "colour" matches "color" in our search_terms collection, we can use phonetic algorithms to create representations of our search terms that match a wider range of spellings.

There are many options available, but I've seen the best results using Double Metaphone and the New York State Identification and Intelligence System (NYSIIS), both available in the Fuzzy Python package.

Here's our color/colour example...


>>> import fuzzy
>>> dmeta = fuzzy.DMetaphone()

>>> dmeta("color")
['KLR', None]
>>> dmeta("colour")
['KLR', None]

>>> fuzzy.nysiis("color")
u'CALAR'
>>> fuzzy.nysiis("colour")
u'CALAR'

Building the search term collection

We'll run through each search term and create a Double Metaphone and NYSIIS representation of each. Those representations will serve as the keyword values in our search_terms collection.


for word in terms:
    if len(word) <= 2 or word in stop_words:
        # Skip short words or ignorable words
        continue
    fuzzy_terms = []
    fuzzy_terms.append(dmeta(word)[0])  # Double Metaphone
    fuzzy_terms.append(fuzzy.nysiis(word))  # NYSIIS
    for term in fuzzy_terms:
        search_terms_collection.insert_one({
            "keyword": term,
            "original": word,
            "item": item["_id"]
        })

Your strategy for determining which words to use as your terms will vary. For Serial Reader, I use book titles, authors, and a group of manually added terms (helpful for getting certain titles to show up for "Sherlock Holmes" queries, for example). I split each term into single words and throw out stop words.
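
As a rough sketch of that splitting step (the stop word list is trimmed for brevity, and the title/author fields are assumptions about your schema):

stop_words = {"a", "an", "and", "in", "of", "the"}  # trimmed example list

def extract_terms(item):
    # Single words pulled from an item's title and author.
    words = []
    for field in (item.get("title", ""), item.get("author", "")):
        words.extend(field.lower().split())
    return [w for w in words if len(w) > 2 and w not in stop_words]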

It's also important to consider how to keep this collection maintained going forward. If your list of items is hundreds or thousands long, I've found rebuilding the entire collection via cron job during low usage times takes a few seconds at most.
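
One possible shape for that rebuild job, reusing the extract_terms helper sketched above (a fancier version could write to a temporary collection and rename it so searches never see a half-built index):

def rebuild_search_terms(items_collection):
    # Drop everything and repopulate from scratch.
    search_terms_collection.drop()
    for item in items_collection.find():
        for word in extract_terms(item):
            for term in (dmeta(word)[0], fuzzy.nysiis(word)):
                search_terms_collection.insert_one({
                    "keyword": term,
                    "original": word,
                    "item": item["_id"],
                })
    search_terms_collection.create_index("keyword")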

Searching our fuzzy collection

When a user provides a search query, we transform the words in the query the same way we transformed the original terms.


search_words = search_query.split(" ")
fuzzy_terms = []
for word in search_words:
    if word in stop_words:
        continue
    fuzzy_terms.append(dmeta(word)[0])  # Double Metaphone
    fuzzy_terms.append(fuzzy.nysiis(word))  # NYSIIS
# Match the fuzzy keywords, or fall back to a case-insensitive regex
# against the original terms. (Consider re.escape() on user input.)
results = search_terms_collection.find(
    {"$or": [
        {"keyword": {"$in": fuzzy_terms}},
        {"original": {
            "$regex": search_query, "$options": "i"}}
    ]}
)

Notice I also run a case-insensitive regex search of the original query against the original keywords to cast an even wider net.

Sorting our results

You may find you need a bit of extra work to bubble the right results to the top. A search for "War Worlds" should return "The War of the Worlds" before "The Lost World" and "War & Peace" to provide the best UX.

A good way to achieve this, I've found, is to use the Levenshtein Python module to calculate distance values between the user's search query and the results' original keywords.

We can find the best (lowest) distance for each returned item. I build a map of these values, keyed by item id, then attach them to the original items fetched from the database, allowing me to sort the whole list by ascending distance.


from Levenshtein import distance

result_map = {}
for result in results:
    result_item = str(result["item"])
    keyword_distance = float(
        distance(search_query, result["original"])
    )
    if result_item not in result_map:
        result_map[result_item] = keyword_distance
    else:
        result_map[result_item] = min(
            keyword_distance, result_map[result_item]
        )
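
With that map in hand, we can fetch the matched items and order them best-first. A minimal sketch, where items_collection is a hypothetical collection holding the source documents:

from bson import ObjectId

# Fetch each matched item once, then sort lowest distance first.
item_ids = [ObjectId(key) for key in result_map]
items = list(items_collection.find({"_id": {"$in": item_ids}}))
items.sort(key=lambda it: result_map[str(it["_id"])])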

Sorting by this distance value allows you to further weight the value by other important stats, like popularity or relevance to the particular user.
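
For example, the sort key could divide the edit distance by a popularity score so better-known titles win ties. (A sketch; the popularity field is hypothetical.)

items.sort(
    key=lambda it: result_map[str(it["_id"])] / (1.0 + it.get("popularity", 0))
)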

And that's it! You can try out the search feature in Serial Reader to see how this approach performs.

I've found that building your own fuzzy search collection this way allows an enjoyable amount of control over which terms are searchable and the order in which results are returned.
