<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="https://blog.aabech.no/rss/xslt"?>
<rss xmlns:a10="http://www.w3.org/2005/Atom" version="2.0">
  <channel>
    <title>Lars-Erik's blog</title>
    <link>https://blog.aabech.no/</link>
    <description>Ramblings about Umbraco, .net and JavaScript development. With a sprinkle of other stuff.</description>
    <generator>Articulate, blogging built on Umbraco</generator>
    <item>
      <guid isPermaLink="false">1070</guid>
      <link>https://blog.aabech.no/archive/building-a-spell-checker-for-search-in-umbraco/</link>
      <category>umbraco</category>
      <category>search</category>
      <category>lucene</category>
      <category>examine</category>
      <title>Building a spell checker for search in Umbraco</title>
      <description>&lt;p&gt;I spent the day building a spell checker for a search UI I've been improving. Turns out it was pretty easy when leaning on &lt;a href="https://lucenenet.apache.org/index.html"&gt;Lucene.Net Contrib&lt;/a&gt;.
I'll show you the gist of it in this article.&lt;/p&gt;
&lt;p&gt;Make sure you have well-proofread data to work with, though. While writing the tests for my checker, I kept getting silly suggestions and wrong results. It turned out the data itself had quite a few typos. On the bright side, it allowed us to fix the typos.&lt;/p&gt;
&lt;h3&gt;Building a dictionary&lt;/h3&gt;
&lt;p&gt;So the first thing you need to provide spell checking is a dictionary. For my solution I went with all the words that exist in the site, so I wouldn't suggest words that don't have results. We already have something that provides this functionality, namely &lt;a href="https://github.com/Shazwazza/Examine"&gt;Examine&lt;/a&gt;. Examine lets you add indexes with custom indexers, and that's a perfect fit for our needs. This dictionary, however, only needs one field: word.&lt;/p&gt;
&lt;p&gt;Here's the basic code for an indexer that indexes all the words in a site:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public class SpellCheckIndexer : BaseUmbracoIndexer
{
    // May be extended to find words from more types
    protected override IEnumerable&amp;lt;string&amp;gt; SupportedTypes
    {
        get
        {
            yield return IndexTypes.Content;
        }
    }

    protected override void AddDocument(Dictionary&amp;lt;string, string&amp;gt; fields, IndexWriter writer, int nodeId, string type)
    {
        var doc = new Document();
        List&amp;lt;string&amp;gt; cleanValues = new List&amp;lt;string&amp;gt;();

        // This example just cleans HTML, but you could easily clean up json too
        CollectCleanValues(fields, cleanValues);
        var allWords = String.Join(&amp;quot; &amp;quot;, cleanValues);

        // Make sure you don't stem the words. You want the full terms, but no whitespace or punctuation.
        doc.Add(new Field(&amp;quot;word&amp;quot;, allWords, Field.Store.NO, Field.Index.ANALYZED));

        writer.UpdateDocument(new Term(&amp;quot;__id&amp;quot;, nodeId.ToString(CultureInfo.InvariantCulture)), doc);
    }

    protected override IIndexCriteria GetIndexerData(IndexSet indexSet)
    {
        var indexCriteria = indexSet.ToIndexCriteria(DataService);
        return indexCriteria;
    }

    private void CollectCleanValues(Dictionary&amp;lt;string, string&amp;gt; fields, List&amp;lt;string&amp;gt; cleanValues)
    {
        var values = fields.Values;
        foreach (var value in values)
            cleanValues.Add(CleanValue(value));
    }

    private static string CleanValue(string value)
    {
        return HttpUtility.HtmlDecode(value.StripHtml());
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To make it actually create an index you have to add it and an IndexSet to Examine's configuration. &lt;/p&gt;
&lt;p&gt;/config/ExamineSettings.config:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;Examine&amp;gt;
    &amp;lt;ExamineIndexProviders&amp;gt;
        &amp;lt;providers&amp;gt;

            &amp;lt;!-- Existing providers... --&amp;gt;

            &amp;lt;add name=&amp;quot;SpellCheckIndexer&amp;quot; type=&amp;quot;YourAssembly.SpellCheckIndexer, YourAssembly&amp;quot;
                analyzer=&amp;quot;Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net&amp;quot;
            /&amp;gt;
        &amp;lt;/providers&amp;gt;
    &amp;lt;/ExamineIndexProviders&amp;gt;
    &amp;lt;!-- Search Providers - configured later --&amp;gt;
&amp;lt;/Examine&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;/config/ExamineIndex.config:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;ExamineLuceneIndexSets&amp;gt;
    &amp;lt;IndexSet SetName=&amp;quot;SpellCheckIndexSet&amp;quot; IndexPath=&amp;quot;~/App_Data/TEMP/ExamineIndexes/{machinename}/SpellCheck/&amp;quot;&amp;gt;
        &amp;lt;IndexAttributeFields&amp;gt;
            &amp;lt;add Name=&amp;quot;nodeName&amp;quot;/&amp;gt;
        &amp;lt;/IndexAttributeFields&amp;gt;
        &amp;lt;IndexUserFields&amp;gt;
            &amp;lt;!-- Add the properties you want to extract words from here --&amp;gt;
            &amp;lt;add Name=&amp;quot;body&amp;quot;/&amp;gt;
            &amp;lt;add Name=&amp;quot;summary&amp;quot;/&amp;gt;
            &amp;lt;add Name=&amp;quot;description&amp;quot;/&amp;gt;
        &amp;lt;/IndexUserFields&amp;gt;
    &amp;lt;/IndexSet&amp;gt;
&amp;lt;/ExamineLuceneIndexSets&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we're actually ready to create a word index. Open the backoffice and the developer section. Navigate to the Examine Management tab. You should see the &amp;quot;SpellCheckIndexer&amp;quot;. Click it and select &amp;quot;Index info &amp;amp; tools&amp;quot;. You might already have documents in the index. If so, we're good to go, otherwise click &amp;quot;Rebuild index&amp;quot;. If you want, you can have a look at the data using &lt;a href="http://www.getopt.org/luke/"&gt;Luke&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://blog.aabech.no/media/1003/luke.png" alt="Indexed terms shown with Luke" /&gt;&lt;/p&gt;
&lt;h3&gt;Casting the spell&lt;/h3&gt;
&lt;p&gt;Now we're ready to let the Lucene Contrib &lt;code&gt;SpellChecker&lt;/code&gt; do its magic. It needs to get the dictionary from our index, so let's give it a searcher from Examine. We'll configure that as a regular searcher in ExamineSettings.config:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;ExamineSearchProviders defaultProvider=&amp;quot;ExternalSearcher&amp;quot;&amp;gt;
    &amp;lt;providers&amp;gt;
        &amp;lt;add name=&amp;quot;SpellCheckSearcher&amp;quot;
            type=&amp;quot;UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine&amp;quot;
            analyzer=&amp;quot;Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net&amp;quot;
        /&amp;gt;
    &amp;lt;/providers&amp;gt;
&amp;lt;/ExamineSearchProviders&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we've got everything we need to build the checker. I've written some unit-tests for it, so the constructor takes a &lt;code&gt;BaseLuceneSearcher&lt;/code&gt; as an argument. For testing, I just create one from an index I know exists. For Umbraco I instantiate it as a singleton using the searcher from Examine.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public class UmbracoSpellChecker
{
    private static readonly object lockObj = new object();
    private static UmbracoSpellChecker instance;
    public static UmbracoSpellChecker Instance
    {
        get
        {
            lock(lockObj)
            { 
                if (instance == null)
                {
                    instance = new UmbracoSpellChecker((BaseLuceneSearcher)ExamineManager.Instance.SearchProviderCollection[&amp;quot;SpellCheckSearcher&amp;quot;]);
                    instance.EnsureIndexed();
                }
            }
            return instance;
        }
    }

    private readonly BaseLuceneSearcher searchProvider;
    private readonly SpellChecker.Net.Search.Spell.SpellChecker checker;
    private readonly IndexReader indexReader;
    private bool isIndexed;

    public UmbracoSpellChecker(BaseLuceneSearcher searchProvider)
    {
        this.searchProvider = searchProvider;
        var searcher = (IndexSearcher)searchProvider.GetSearcher();
        indexReader = searcher.GetIndexReader();
        checker = new SpellChecker.Net.Search.Spell.SpellChecker(new RAMDirectory(), new JaroWinklerDistance());
    }

    private void EnsureIndexed()
    {
        if (!isIndexed)
        { 
            checker.IndexDictionary(new LuceneDictionary(indexReader, &amp;quot;word&amp;quot;));
            isIndexed = true;
        }
    }

    public string Check(string value)
    {
        EnsureIndexed();

        var existing = indexReader.DocFreq(new Term(&amp;quot;word&amp;quot;, value));
        if (existing &amp;gt; 0)
            return value;

        var suggestions = checker.SuggestSimilar(value, 10, null, &amp;quot;word&amp;quot;, true);

        var jaro = new JaroWinklerDistance();
        var leven = new LevenshteinDistance();
        var ngram = new NGramDistance();

        var metrics = suggestions.Select(s =&amp;gt; new
        {
            word = s,
            freq = indexReader.DocFreq(new Term(&amp;quot;word&amp;quot;, s)),
            jaro = jaro.GetDistance(value, s),
            leven = leven.GetDistance(value, s),
            ngram = ngram.GetDistance(value, s)
        })
        .OrderByDescending(metric =&amp;gt;
            (
                (metric.freq/100f) +
                metric.jaro +
                metric.leven +
                metric.ngram
            )
            / 4f
        )
        .ToList();

        return metrics.Select(m =&amp;gt; m.word).FirstOrDefault();
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The constructor keeps a reference to our word index and creates a &lt;code&gt;SpellChecker&lt;/code&gt; that will do its work in memory. The first time we ask for suggestions, &lt;code&gt;EnsureIndexed&lt;/code&gt; passes our word index to the &lt;code&gt;SpellChecker&lt;/code&gt; and lets it build its magic &amp;quot;back-end&amp;quot;. That back-end could also live on disk, but for this site the performance is good enough to build it live.&lt;/p&gt;
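&lt;p&gt;If you'd rather keep the back-end on disk, a minimal sketch could look like the one below. It assumes Lucene.Net's &lt;code&gt;FSDirectory&lt;/code&gt; in place of the &lt;code&gt;RAMDirectory&lt;/code&gt;. The &lt;code&gt;DiskSpellCheckerFactory&lt;/code&gt; name and the folder path are made up for illustration, and the sketch is untested on a real site.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System.IO;
using System.Web.Hosting;
using Lucene.Net.Store;
using SpellChecker.Net.Search.Spell;

// Hypothetical helper, not part of Examine or Lucene.Net
public static class DiskSpellCheckerFactory
{
    public static SpellChecker.Net.Search.Spell.SpellChecker Create()
    {
        // Assumption: any folder the site is allowed to write to will do
        var path = HostingEnvironment.MapPath(&amp;quot;~/App_Data/TEMP/SpellCheckBackend/&amp;quot;);

        // FSDirectory stores the spell index in files instead of in memory
        var spellIndex = FSDirectory.Open(new DirectoryInfo(path));

        return new SpellChecker.Net.Search.Spell.SpellChecker(spellIndex, new JaroWinklerDistance());
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The constructor of &lt;code&gt;UmbracoSpellChecker&lt;/code&gt; could then call &lt;code&gt;DiskSpellCheckerFactory.Create()&lt;/code&gt; instead of creating the checker with a &lt;code&gt;RAMDirectory&lt;/code&gt;.&lt;/p&gt;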
&lt;p&gt;The only suggestion code you actually need in &lt;code&gt;Check&lt;/code&gt; is basically just:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;        var suggestion = checker.SuggestSimilar(value, 1);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It will give you a feasible word if the value has a typo. However, in many cases it's not the one you're after. You can increase the number of suggestions you want to get a few more, but for search suggestions, we really just want one. Changing the &lt;code&gt;StringDistance&lt;/code&gt; algorithm passed to the &lt;code&gt;SpellChecker&lt;/code&gt; also varies the output, so you should experiment with that.&lt;/p&gt;
&lt;p&gt;In my case there was a typo in the source data. I was testing against the Norwegian word for bathtub: &amp;quot;badekar&amp;quot;. My test phrase was &amp;quot;badkear&amp;quot; and I kept getting suggestions for &amp;quot;badkar&amp;quot;. That word snuck in there due to one spelling mistake in one document. So I figured word frequency should take precedence. But sorting the suggestions by word frequency alone turned out to be a bad idea. &amp;quot;Baderom&amp;quot;, the Norwegian word for &amp;quot;bathroom&amp;quot;, is more frequent and won over &amp;quot;badekar&amp;quot;, which was the word I was after.&lt;/p&gt;
&lt;p&gt;I spun up all the string distance algorithms and put the results, including the document frequency of the terms, in Excel. Experimenting a bit showed that the frequency divided by 100, added to the sum of the distance results, was a good metric. After ordering by that result and taking the first, my checker is pretty intelligent when providing suggestions.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://blog.aabech.no/media/1004/spellcheck_excel.png" alt="Evaluating algorithm with Excel" /&gt;&lt;/p&gt;
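&lt;p&gt;To see why dividing the frequency by 100 helps, here's a small stand-alone sketch of the same formula. The numbers are made up for illustration and are not the values from the spreadsheet. The &lt;code&gt;ScoringExample&lt;/code&gt; class is hypothetical too.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;
using System.Linq;

public static class ScoringExample
{
    // Same formula as in Check: the frequency is scaled down so it only nudges the result
    public static float Score(int docFreq, float jaro, float leven, float ngram)
    {
        return (docFreq / 100f + jaro + leven + ngram) / 4f;
    }

    public static void Main()
    {
        // Made-up frequencies and similarity scores, for illustration only
        var candidates = new[]
        {
            new { Word = &amp;quot;badkar&amp;quot;, Freq = 1, Jaro = 0.95f, Leven = 0.95f, Ngram = 0.95f },
            new { Word = &amp;quot;baderom&amp;quot;, Freq = 60, Jaro = 0.6f, Leven = 0.6f, Ngram = 0.6f },
            new { Word = &amp;quot;badekar&amp;quot;, Freq = 25, Jaro = 0.9f, Leven = 0.9f, Ngram = 0.9f }
        };

        var ranked = candidates.OrderByDescending(c =&amp;gt; Score(c.Freq, c.Jaro, c.Leven, c.Ngram));

        foreach (var candidate in ranked)
        {
            var score = Score(candidate.Freq, candidate.Jaro, candidate.Leven, candidate.Ngram);
            Console.WriteLine(&amp;quot;{0}: {1:0.000}&amp;quot;, candidate.Word, score);
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With these numbers &amp;quot;badekar&amp;quot; comes out on top, ahead of both the rare but very similar &amp;quot;badkar&amp;quot; and the frequent but less similar &amp;quot;baderom&amp;quot;. Real data will need its own round of experimenting.&lt;/p&gt;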
&lt;h3&gt;Getting results&lt;/h3&gt;
&lt;p&gt;Here's a snippet from the search controller that builds up the response model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;var spellChecker = UmbracoSpellChecker.Instance; 
// Check returns null when there are no suggestions, so fall back to the original term
var checkedTerms = phrase.Split(' ').Select(t =&amp;gt; spellChecker.Check(t) ?? t);
var didYouMean = String.Join(&amp;quot; &amp;quot;, checkedTerms);
result.DidYouMean = didYouMean;

// Uses Lucene's TopScoreDocCollector
var totalHits = GetTotalHits(query);

if (result.DidYouMean != phrase)
{
    if (totalHits == 0)
    {
        query = CreateQuery(result.DidYouMean);
        totalHits = GetTotalHits(query);
        result.Modified = true;
        result.Query = result.DidYouMean;
    }
    else
    {
        result.HasAlternative = true;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When searching for &amp;quot;badkear&amp;quot; now, I get &lt;code&gt;result.Modified == true&lt;/code&gt;. I didn't get any hits for that word, but my spell checker found the word &amp;quot;badekar&amp;quot;. I can happily go search for that and tell my user I modified their search: &amp;quot;Did not find any hits for 'badkear', showing results for 'badekar'.&amp;quot;&lt;/p&gt;
&lt;p&gt;When searching for &amp;quot;dujs&amp;quot;, a misspelled version of the Norwegian word for &amp;quot;shower&amp;quot;, I find that we have another typo in our data. I get &lt;code&gt;result.Modified == false&lt;/code&gt;, but &lt;code&gt;result.HasAlternative == true&lt;/code&gt;. My spell checker dutifully asks if I meant &amp;quot;dusj&amp;quot;: &amp;quot;Showing 1 result. Did you mean 'dusj'?&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://blog.aabech.no/media/1005/dujs.png" alt="Did you mean dusj?" /&gt;&lt;/p&gt;
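&lt;p&gt;Turning those flags into the messages above could be sketched roughly like this. The &lt;code&gt;SearchMessages&lt;/code&gt; class, the &lt;code&gt;Describe&lt;/code&gt; method and its parameters are hypothetical. They only mirror the values the controller snippet sets, and the exact wording is up to you.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;

// Hypothetical helper that maps the spell check outcome to a message for the user
public static class SearchMessages
{
    public static string Describe(string phrase, string didYouMean, bool modified, bool hasAlternative, int totalHits)
    {
        // No hits for the original phrase, so the corrected phrase was searched instead
        if (modified)
            return String.Format(&amp;quot;Did not find any hits for '{0}', showing results for '{1}'.&amp;quot;, phrase, didYouMean);

        // The original phrase had hits, but the checker still has another suggestion
        if (hasAlternative)
            return String.Format(&amp;quot;Showing {0} result(s). Did you mean '{1}'?&amp;quot;, totalHits, didYouMean);

        return String.Format(&amp;quot;Showing {0} result(s).&amp;quot;, totalHits);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the controller this might be called as &lt;code&gt;SearchMessages.Describe(phrase, result.DidYouMean, result.Modified, result.HasAlternative, totalHits)&lt;/code&gt;.&lt;/p&gt;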
&lt;p&gt;There's a lot more you can do with this to provide even more accurate suggestions. For my purposes however, this was all I needed. Leaning on Lucene Contrib when you need some fancy search functionality will almost always save the day. &lt;/p&gt;
&lt;p&gt;If you want to know more about Lucene, I recommend getting the book called &amp;quot;Lucene in Action&amp;quot;.&lt;/p&gt;
</description>
      <pubDate>Fri, 13 May 2016 13:19:11 Z</pubDate>
      <a10:updated>2016-05-13T13:19:11Z</a10:updated>
    </item>
  </channel>
</rss>