Welcome to part 3 of my Sitecore Solr eDisMax series. In part one I introduced Solr eDisMax and how to get started with it in your Sitecore solution. In part two I introduced Boosting, Tuning & Debugging. In this post I will show you how to build type-ahead search functionality with EdgeNGram.
What is an EdgeNGram?
Before we can talk about EdgeNGram, we need to understand NGram.
An NGram is a Solr filter that supports partial matches anywhere in a token, including mid-word. For example, if you search for “itecor”, an NGram filter could return matches for “Sitecore”.
At a glance this sounds great, but it is not how humans search. We search for whole words, or at least start typing from the beginning of a word. No one would search Google for “itecore” while looking for “Sitecore” on purpose; it would most likely be a typo or misspelling. (That case is better handled by implementing Solr spell checking, which I will cover in the next post in this series.)
This is where EdgeNGram comes in. EdgeNGram is another Solr filter that builds matches for each word in your field, but only starting at the beginning of each word, which aligns better with how humans search.
Revisiting our example above, if you search for “sit” an EdgeNGram filter will tokenize your query as:
and if you have the word “sitecore” in your content, it would also tokenize in the index the same way
So Solr will find a match on any documents that contain “Sit”, “Site”, and “Sitecore”. This allows you to search on partial words, which is perfect for type-ahead functionality (i.e. showing results immediately as the user types, via AJAX).
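To make the difference concrete, here is a minimal Python sketch of the two gram styles. It is only an illustration of the idea, not Solr's implementation: the function names are my own, the gram sizes (1 to 50) are assumed for the example, and a real field may run other filters (such as stemming) before the grams are produced, so the tokens in an actual index can differ.

```python
def ngrams(word, min_size, max_size):
    """All substrings of `word` whose length is between min_size and max_size."""
    grams = []
    for size in range(min_size, min(max_size, len(word)) + 1):
        for start in range(len(word) - size + 1):
            grams.append(word[start:start + size])
    return grams


def edge_ngrams(word, min_size, max_size):
    """Only the substrings that start at the first character of `word`."""
    return [word[:size] for size in range(min_size, min(max_size, len(word)) + 1)]


word = "sitecore"
edge = edge_ngrams(word, 1, 50)
print(edge)
# ['s', 'si', 'sit', 'site', 'sitec', 'siteco', 'sitecor', 'sitecore']

# A prefix such as "sit" is one of the edge n-grams, so it can match.
print("sit" in edge)                   # True
# A mid-word fragment is only produced by the plain NGram variant.
print("core" in edge)                  # False
print("core" in ngrams(word, 1, 50))   # True
```

The edge variant produces far fewer tokens per word than the plain NGram variant, which also keeps the index smaller.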
For more examples of EdgeNGram tokenization, I recommend reading the official Solr documentation.
To create an EdgeNGram, we have to make some simple modifications to the Solr schema config. (I also recommend adding these changes to your master index schema, and to any additional Solr indexes you might have.)
In Solr 7.2.1, we will want to modify the “managed-schema” (no extension) file, e.g.: C:\projects\Habitat-1.7\solr-7.2.1\server\solr\habitat_web_index\conf\managed-schema
We will add four new XML configuration elements that work together to create our EdgeNGram definition.
1. Dynamic Field
This will define a new dynamic field type for our EdgeNGram. This is how we define a type suffix in Solr, such as “_t” for text (e.g. “title_t”).
<dynamicField name="*_ngram" type="suggest_ngram" indexed="true" stored="false" />
The type “suggest_ngram” will be defined later in the “field type” section below.
(For brevity's sake, I named my suffix “ngram”. This could be confused with an actual NGram, so feel free to rename it to anything you like, such as “*_edgengram”.)
2. Field
We must explicitly define the new field where our EdgeNGram data will actually be stored. Let's name it “predictive_title_ngram”. Notice how it ends with our new “_ngram” dynamic field suffix:
<field name="predictive_title_ngram" type="suggest_ngram" indexed="true" stored="true" />
3. Copy Field
The copy field does all the heavy lifting for us. This tells Solr to automatically populate/build our new “predictive_title_ngram” field whenever the “title_t” field is updated:
<copyField source="title_t" dest="predictive_title_ngram"/>
In the example, the “title_t” field is an out-of-the-box Sitecore “Title” field.
We can add more EdgeNGram-enabled fields by creating additional fields and corresponding copy fields.
4. Field Type
Finally, we must actually define our EdgeNGram field type. This is the value of the “type” attribute in our new dynamic field.
When you define a new field type, you must define which tokenizers and filters run at both index time (what's stored in the index field) and query time (how the user’s search text is manipulated before Solr executes it as a search).
First I will explain the interesting bits – please scroll down to see the entire field definition.
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^A-Za-z0-9=&gt;&lt;]+" replacement=" " />
This replaces any run of characters that are not alphanumeric (or one of =, > and <) with a single space. Note that the angle brackets must be written as XML entities inside the attribute value. Then we use the solr.WhitespaceTokenizerFactory tokenizer, which splits on whitespace and so discards any extra spaces the replacement might have added.
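As a rough illustration of what this char filter and tokenizer pair does to a title, here is a small Python sketch. It reuses the same character class as the pattern above; the function name and the sample title are made up for the example, and Python's `str.split()` only approximates Solr's whitespace tokenizer.

```python
import re

# Same character class as the charFilter pattern in the schema.
NON_ALLOWED = re.compile(r"[^A-Za-z0-9=><]+")


def clean_and_split(text):
    """Replace disallowed runs with a space, then split on whitespace."""
    replaced = NON_ALLOWED.sub(" ", text)
    # str.split() with no arguments collapses repeated whitespace and drops
    # leading/trailing blanks, similar to what a whitespace tokenizer yields.
    return replaced.split()


print(clean_and_split("Sitecore: Solr, eDisMax (part-3)!"))
# ['Sitecore', 'Solr', 'eDisMax', 'part', '3']
```

Punctuation never reaches the later filters, so a title like “part-3” is indexed as two separate tokens.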
We will use the solr.EdgeNGramFilterFactory filter to generate and store the EdgeNGram for us in the field.
<fieldType name="suggest_ngram" class="solr.TextField" positionIncrementGap="100">
  <similarity class="solr.BM25SimilarityFactory">
    <!-- defaults: k1=1.2 (term freq) b=0.75 (field Norm) -->
    <float name="k1">1.2</float>
    <float name="b">0.75</float>
  </similarity>
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^A-Za-z0-9=&gt;&lt;]+" replacement=" " />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_suggest_ngram.txt" format="snowball" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index_suggest_ngram.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords_suggest_ngram.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&amp;" replacement=" ampersand " />
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^A-Za-z0-9=&gt;&lt;&amp;]+" replacement=" " />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_suggest_ngram.txt" format="snowball" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_query_suggest_ngram.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords_suggest_ngram.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
Copy the text between the comment tags:
<!-- BEGIN: CUSTOM - suggest_ngram --> <!-- END: CUSTOM - suggest_ngram -->
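One detail of the field type that is easy to miss is that only the index analyzer contains the EdgeNGram filter. The Python sketch below is a heavily simplified model of that asymmetry. It assumes only the char filter, whitespace tokenizer, lowercase and edge n-gram steps; stop words, synonyms, possessives and stemming are left out, the function names are hypothetical, and requiring every query token to match is my own simplification rather than Solr's scoring or minimum-match logic.

```python
import re

# Character class from the schema's charFilter: keep letters, digits, =, > and <.
NON_ALLOWED = re.compile(r"[^A-Za-z0-9=><]+")


def index_tokens(text, min_gram=1, max_gram=50):
    """Index side: clean, split, lowercase, then expand into edge n-grams."""
    tokens = []
    for word in NON_ALLOWED.sub(" ", text).lower().split():
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:size])
    return tokens


def query_tokens(text):
    """Query side: same cleanup, but the typed text is NOT expanded into grams."""
    text = text.replace("&", " ampersand ")
    return NON_ALLOWED.sub(" ", text).lower().split()


def matches(query, title):
    """Simplification: every query token must equal one indexed gram."""
    indexed = set(index_tokens(title))
    tokens = query_tokens(query)
    return bool(tokens) and all(token in indexed for token in tokens)


title = "Sitecore Experience Platform"
print(matches("Sit", title))    # True
print(matches("exp", title))    # True
print(matches("core", title))   # False
```

Because the grams already exist in the index, the text the user has typed so far only needs to equal one of them, which is what makes the lookup cheap enough to run on every keystroke.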
How to Query EdgeNGram
With our custom field set up, let's test querying it directly in the Solr admin panel.
- q = “sit”
- fl = “IGNORE” (Purposely naming a nonexistent field so that no document fields are returned, which makes the screen easier to read)
- qf = “predictive_title_ngram”
- hl (highlighting) = CHECKED
- edismax = CHECKED
(Note that we are only querying the one field “predictive_title_ngram” and not “title_t”. Typically when implementing EdgeNGram we only query the single field, as mixing it with other fields tends to lead to strange and hard-to-debug results. If you have more complex requirements, you should consider implementing your functionality as additional queries instead of trying to shoehorn it into a single one.)
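The same settings can be expressed as a request URL, which is handy when you want to call Solr from a script instead of the admin panel. This is a minimal sketch that only builds the URL and does not send it. The host and port are assumptions (a local Solr on its default port), the core name is taken from the example path earlier in this post, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumptions: a local Solr on the default port and the core name from the
# example path earlier in this post. Adjust both for your environment.
SOLR_BASE = "http://localhost:8983/solr"
CORE = "habitat_web_index"


def build_typeahead_url(user_text):
    """Build a select URL that mirrors the admin panel settings listed above."""
    params = {
        "q": user_text,
        "defType": "edismax",            # the "edismax" checkbox
        "qf": "predictive_title_ngram",  # query only the EdgeNGram field
        "fl": "IGNORE",                  # nonexistent field, hides document bodies
        "hl": "true",                    # the "hl" checkbox
    }
    return f"{SOLR_BASE}/{CORE}/select?{urlencode(params)}"


print(build_typeahead_url("sit"))
# http://localhost:8983/solr/habitat_web_index/select?q=sit&defType=edismax&qf=predictive_title_ngram&fl=IGNORE&hl=true
```

Using `urlencode` rather than string concatenation means that spaces and special characters typed by the user are escaped correctly.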
After executing the search, scroll to the “highlighting” results, and notice that Solr found our results in the field “predictive_title_ngram”, with <em> tags wrapping each match it found (used for highlighting – more on this in a future post):
How to use with Search Demo Site
Back in the first post of this series I linked to my eDisMax Search Demo site. The demo site comes with a pre-built search datasource query item:
/sitecore/content/Global/Solr Queries/Default Edge N Gram
and a pre-built page with this datasource already set.
/sitecore/content/Home/Predictive Search Results
No custom code is needed to leverage the EdgeNGram functionality with my demo site. To enable it, just enter the field name “predictive_title_ngram” as the query field (qf) and you are all set.
Change the datasource of the “Search Results” rendering to point to this datasource:
Save, publish, etc. Then when you browse to the page on the demo site /predictive-search-results?q=sit you will get results for “Sitecore”. Scroll down to the “highlights” section which really “highlights” (pun intended) our completed result well: