Solr Sorting Tidbit

The basic requirement was a fuzzy search over a series of business names and locations, using Levenshtein-based fuzzy matching to handle typos. Results whose name started with the query term were to be displayed first, followed by any other matches.

For our purposes, let’s say I had the following data set:

**Example Data Set**

| ID | Name | Location |
|----|------|----------|
| 1 | Boone Moving Service | Raleigh, NC |
| 2 | Bone’s Moving | New York, NY |
| 3 | Boone Moving | Los Angeles, CA |
| 4 | Jimmy’s Movers | Boone, NC |

If a user were to search using the phrase “Boone Moving”, we would expect all of these records to match, because each of the two terms in the phrase, “Boone” and “Moving”, should match a term in either the Name or the Location field.
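To make the typo tolerance concrete, here is a minimal sketch of Levenshtein (edit) distance in plain Ruby. It is only an illustration of the idea; Solr and Lucene have their own implementation, and the `levenshtein` helper name is mine.

```ruby
# Classic dynamic-programming Levenshtein (edit) distance.
# Illustrative only; this is not the code Lucene runs.
def levenshtein(a, b)
  # previous[j] is the distance between the prefix of a handled so far
  # and the first j characters of b.
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # insertion
        previous[j - 1] + cost  # substitution
      ].min
    end
    previous = current
  end
  previous.last
end

puts levenshtein("boone", "bone")   # 1: a single dropped "o"
puts levenshtein("boone", "boone")  # 0: identical strings
```

A one-character typo such as “Bone” for “Boone” is a distance of 1, which is the kind of near miss a fuzzy search is meant to tolerate.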

My application is built in Rails, and I am using the Sunspot gem for searching. Sunspot gives us some nice functionality, including a DSL for indexing and searching ActiveRecord model objects. I was using the default schema.xml from Sunspot, and my initial work used Sunspot’s standard keyword search.

So my first question: why did we get only two results instead of all four? Well, Solr’s default query parser is the Lucene parser, which uses Levenshtein distance for fuzzy searches. However, as I mentioned, I am using Sunspot, which uses the dismax parser, and dismax does not support fuzzy searching like this. So I changed my code to set the Solr parameters manually.

Let’s dig into this code a little bit. In the custom_search method, I am using Sunspot’s search method, but instead of setting up keywords, I am manually setting up the Solr parameters. I break the search query into tokens and then require that each token be found in at least one field. The “~” appended to each token tells Lucene to allow fuzzy matching on that term. The result is that this matches all of our records. At this point it is a little ugly, since I’m bypassing some Sunspot functionality, but it is not too bad.
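The token handling described above can be sketched in plain Ruby. Treat this as an illustration rather than the actual custom_search method: the `fuzzy_query` helper is hypothetical, name_text comes from the default schema, and location_text is my assumed name for the Location field. The tokens are lowercased on the assumption that the indexed terms are lowercased too.

```ruby
# Field names are assumptions: name_text is the field discussed in this
# post, location_text is a hypothetical counterpart for Location.
FUZZY_FIELDS = %w[name_text location_text].freeze

# Build a Lucene-syntax query string that requires every token to
# fuzzily match at least one of the given fields.
# Note: Lucene special characters in the input are not escaped here.
def fuzzy_query(query, fields = FUZZY_FIELDS)
  tokens = query.downcase.split
  clauses = tokens.map do |token|
    # "~" asks Lucene for a fuzzy match on this term.
    alternatives = fields.map { |field| "#{field}:#{token}~" }
    # "+" makes the whole group required.
    "+(#{alternatives.join(' OR ')})"
  end
  clauses.join(" ")
end

puts fuzzy_query("Boone Moving")
# +(name_text:boone~ OR location_text:boone~) +(name_text:moving~ OR location_text:moving~)
```

Each parenthesized group is one token, and the leading “+” is what makes every token mandatory while still letting it match in either field.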

Now on to the next problem: sorting. The client requirement was that we give priority to businesses whose names start with the query terms. So if I searched for “Boone Moving”, I would want the businesses named “Boone Moving” and “Boone Moving Service” to show up higher than “Jimmy’s Movers”, which is merely located in Boone.

In the default schema.xml, the name_text field uses the StandardTokenizerFactory, which breaks our names into individual tokens based on word-boundary rules. But in order to match against the whole name, we do not want it broken up. So the first thing I needed was a field that is not tokenized at all. To do this, we want to use a [KeywordTokenizerFactory](http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory) and copy our business name into that field.
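To see why the untokenized copy matters, here is a rough simulation in plain Ruby. The two helpers are simplified stand-ins of my own, not Solr’s real analyzers, and I am assuming the untokenized field is also lowercased.

```ruby
# Rough stand-in for a tokenized field: lowercase, then split into words.
# Solr's real tokenizer handles many more cases than this regex does.
def standard_tokens(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Rough stand-in for a keyword (untokenized) field: the whole value is
# kept as a single lowercased term (lowercasing is an assumption).
def keyword_token(text)
  text.downcase
end

names = ["Boone Moving Service", "Bone's Moving", "Boone Moving", "Jimmy's Movers"]
prefix = "boone moving"

names.each do |name|
  tokenized_hit = standard_tokens(name).any? { |term| term.start_with?(prefix) }
  untokenized_hit = keyword_token(name).start_with?(prefix)
  puts "#{name}: tokenized=#{tokenized_hit} untokenized=#{untokenized_hit}"
end
```

No single token can start with a two-word prefix, so the tokenized field never matches it. The untokenized term does, but only for the two names that actually begin with “Boone Moving”.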

With this new schema.xml, we need to restart Solr and reindex all data so that the new field is populated. Now we can tweak our custom_search method to use a highly weighted prefix query that boosts the score of businesses whose name starts with the query.

Success! The key change is that we have added an OR clause, `(_query_:"{!prefix f=name_untokenized v=$qq}")^10`, and an additional parameter in our hash, `:qq=>query.downcase`. This new clause tells Solr to give a big boost to any result where the query is a prefix of the untokenized version of our name. This is somewhat similar to the boost query option in the dismax parser. Because we are using this as an OR, records that do not match the prefix query are not excluded, but those that do match get a big boost. The result is the ordering we wanted.
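As a minimal sketch of how those two parameters fit together, the hash can be assembled like this. The `boosted_params` helper and the way the fuzzy clause is passed in and wrapped in parentheses are my assumptions; only the prefix clause and the qq parameter come from the description above.

```ruby
# The boosted prefix clause, single-quoted so "$qq" reaches Solr untouched.
PREFIX_BOOST = '(_query_:"{!prefix f=name_untokenized v=$qq}")^10'.freeze

# Combine an existing fuzzy clause with the boosted prefix clause.
# fuzzy_clause is whatever Lucene-syntax query was built earlier.
def boosted_params(query, fuzzy_clause)
  {
    # OR keeps non-prefix matches in the results; ^10 lifts prefix matches.
    q: "(#{fuzzy_clause}) OR #{PREFIX_BOOST}",
    # $qq in the prefix clause is resolved by Solr from this parameter.
    qq: query.downcase
  }
end

params = boosted_params("Boone Moving", "+(name_text:boone~) +(name_text:moving~)")
puts params[:q]
puts params[:qq]
```

Keeping the raw query in a separate qq parameter means the prefix clause sees the whole phrase, spaces included, rather than individual tokens.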