Sunday, October 26, 2014

Solr and multi-word synonyms - one more way to handle them

On a new project, building an eCommerce site for one of our clients, I was asked to tune the existing search to meet new requirements:
  1. if a document contains the whole search query or a phrase from it, it must be ranked higher than the others;
  2. the retrieved documents must contain all words from the search query;
  3. query-time synonyms: both one word and multi-word.
The first two seem simple, but the third one… Well… It blows up everything.

The problem of multi-word synonyms is well described in the posts of John Berryman and Mike McCandless, so I won’t dig into the details here. The main point is that SynonymFilter doesn’t work well with term positions when a synonym contains more than one word. After some brainstorming about how to deal with this problem I made a decision: if I can’t use positions the way I want, I should not use them at all. In other words, all matching logic should use the simplest TermQueries and their boolean combinations.

With this rule as the basis for all my actions I defined three steps to achieve the goal:
  1. match all documents with any part of the search query and with multi-word synonyms;
  2. apply a filter to keep only the required documents;
  3. add boosts.
According to this plan the main problem is solved in the first step, and the other two become technical details.


In the first step I want Solr to return documents that contain any word from the search phrase. This is a simple disjunction. For example, for the search phrase "orange room freshener" the query looks like:
[orange] OR [room] OR [freshener]
Here and below I use square brackets ([ ]) to show tokens that will be transformed into TermQueries, because Solr uses quotes (" ") for a PhraseQuery, with all its problems.

To apply a multi-word synonym (e.g. "room freshener, air freshener") without using positions, the target words ("room" and "freshener") have to be combined into a single token ([room freshener]). Solr has ShingleFilterFactory designed for this purpose.

At index time I use it to generate all possible shingles. The product fields are usually quite short, so this doesn’t increase the index size much. At query time I generate shingles in a similar way:
[orange] OR [room] OR [freshener] OR [orange room] OR [room freshener] OR [orange room freshener]
I don’t use ShingleFilterFactory for generating query tokens, because it complicates adding boosts, which I’ll cover later.
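The client-side shingle generation described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `shingles` is made up for this example, and the query is split on whitespace without any further analysis (lowercasing, stemming and so on).

```python
def shingles(query):
    """Return every contiguous word sequence of the query, shortest first."""
    words = query.split()
    result = []
    # size is the number of words in the shingle, start is its first word
    for size in range(1, len(words) + 1):
        for start in range(len(words) - size + 1):
            result.append(" ".join(words[start:start + size]))
    return result


print(shingles("orange room freshener"))
# ['orange', 'room', 'freshener', 'orange room', 'room freshener', 'orange room freshener']
```

A query of n words produces n(n+1)/2 shingles, so the token list grows quadratically with the query length.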

SynonymFilterFactory must parse multi-word synonyms as a single token. For this it should be configured with KeywordTokenizerFactory:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" format="solr" ignoreCase="true" tokenizerFactory="solr.KeywordTokenizerFactory" />
Therefore after applying the synonym the query looks like:
[orange] OR [room] OR [freshener] OR [orange room] OR ([room freshener] OR [air freshener]) OR [orange room freshener]
The final query to be processed by Lucene may look like:
brand:(orange OR room OR freshener OR orange\ room OR room\ freshener OR air\ freshener OR orange\ room\ freshener)
OR name:(orange OR room OR freshener OR orange\ room OR room\ freshener OR air\ freshener OR orange\ room\ freshener)
OR category:(orange OR room OR freshener OR orange\ room OR room\ freshener OR air\ freshener OR orange\ room\ freshener)
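Building such a query on the client could look roughly like the sketch below. The synonym table, the helper names (`expand`, `escape`, `field_query`) and the one-directional synonym lookup are assumptions made for this example. The escaping only handles spaces; a real client would also have to escape the other Lucene query syntax characters.

```python
# Hypothetical synonym table: token -> additional tokens to search for.
SYNONYMS = {"room freshener": ["air freshener"]}

# Shingles of "orange room freshener", written out to keep the sketch standalone.
TOKENS = ["orange", "room", "freshener",
          "orange room", "room freshener", "orange room freshener"]


def expand(token):
    """Return the token followed by its synonyms, if it has any."""
    return [token] + SYNONYMS.get(token, [])


def escape(token):
    """Escape spaces so the query parser keeps a shingle as one term."""
    return token.replace(" ", "\\ ")


def field_query(field, tokens):
    """Build field:(t1 OR t2 OR ...) with synonyms expanded in place."""
    terms = []
    for token in tokens:
        for variant in expand(token):
            terms.append(escape(variant))
    return field + ":(" + " OR ".join(terms) + ")"


full_query = " OR ".join(field_query(f, TOKENS) for f in ["brand", "name", "category"])
print(full_query)
```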
At this point I believe the same can be done with standard Solr features, the (e)DisMax handler and the ShingleFilterFactory mentioned above, but I haven’t tried it myself. Filtering and boosting, which I discuss below, are where things are not as good as I had hoped.


The next step is to filter out all documents that don’t contain all of the words from the search phrase. The simple ways (an FQ for each word, or an FQ with the whole phrase) don’t work for the same reason: they don’t work with multi-word synonyms. The way to handle this is similar to the main query: shingles. But while the main query should match all documents with any shingle, the filter must match only documents that contain some combination of shingles that covers the original search phrase.

For our query "orange room freshener" there are 4 such combinations:
[orange] AND [room] AND [freshener];
[orange room] AND [freshener];
[orange] AND [room freshener];
[orange room freshener].
To build the final filter query I combine all of them with disjunction (synonym is also applied):
(orange AND room AND freshener)
OR (orange\ room AND freshener)
OR (orange AND (room\ freshener OR air\ freshener))
OR (orange\ room\ freshener)
Such a filter query may be used as the main query if you don’t use boosts, or you can expand it to run against several fields, but this is outside the scope of this post.
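The combinations that cover the phrase are simply all the ways to cut the word sequence into contiguous pieces, so they can be enumerated recursively. The sketch below is one possible way to do it, not necessarily how my client code was written; the synonym table and all function names are assumptions made for this example, and only spaces are escaped.

```python
# Hypothetical synonym table: token -> additional tokens to search for.
SYNONYMS = {"room freshener": ["air freshener"]}


def segmentations(words):
    """Return every way to split the words into contiguous shingles."""
    if not words:
        return [[]]
    result = []
    for size in range(1, len(words) + 1):
        head = " ".join(words[:size])
        for rest in segmentations(words[size:]):
            result.append([head] + rest)
    return result


def clause(token):
    """One escaped term, or a parenthesised OR group if it has synonyms."""
    variants = [t.replace(" ", "\\ ") for t in [token] + SYNONYMS.get(token, [])]
    if len(variants) == 1:
        return variants[0]
    return "(" + " OR ".join(variants) + ")"


def filter_query(query):
    """OR together one AND group per segmentation of the query."""
    groups = []
    for segmentation in segmentations(query.split()):
        groups.append("(" + " AND ".join(clause(t) for t in segmentation) + ")")
    return " OR ".join(groups)


print(filter_query("orange room freshener"))
```

A phrase of n words has 2^(n-1) such segmentations (4 for three words, 8 for four), so the filter grows quickly with long queries.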


The last remaining step is boosting. After everything done so far it’s quite simple. Boosts can be added to the main query to boost documents with longer shingles:
[orange] OR [room] OR [freshener] OR [orange room]^200 OR ([room freshener] OR [air freshener])^200 OR [orange room freshener]^500
Or they may be added as separate boost queries built from shingles, for use with the (e)DisMax handler.
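Attaching boosts by shingle length is easy once the tokens are generated on the client. The sketch below reproduces the bracket notation used in this post rather than real Lucene syntax. The boost values for two and three words come from the example above; the table name `BOOSTS`, the synonym table and the function name are assumptions for illustration.

```python
# Hypothetical synonym table: token -> additional tokens to search for.
SYNONYMS = {"room freshener": ["air freshener"]}

# Boost per shingle length in words; single words get no boost.
BOOSTS = {2: 200, 3: 500}


def boosted_query(tokens):
    """Join tokens with OR, boosting each clause by its shingle length."""
    clauses = []
    for token in tokens:
        variants = ["[" + v + "]" for v in [token] + SYNONYMS.get(token, [])]
        if len(variants) == 1:
            text = variants[0]
        else:
            text = "(" + " OR ".join(variants) + ")"
        boost = BOOSTS.get(len(token.split()))
        if boost is not None:
            text += "^" + str(boost)
        clauses.append(text)
    return " OR ".join(clauses)


tokens = ["orange", "room", "freshener",
          "orange room", "room freshener", "orange room freshener"]
print(boosted_query(tokens))
```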


The described approach does what it was created for: it makes it possible to search with multi-word synonyms while keeping the ability to affect scoring with boosts. But it’s not ideal. The main issue is that Solr doesn’t help to build such queries. I had to do almost all of the work on the client side, and we ran into several issues:
  1. the query is quite big and may overflow the servlet container’s HTTP header buffer (Jetty in my case), although this can be handled by sending the queries with POST or by increasing the buffer size (I did the latter);
  2. stopwords are not handled well. Solr can’t remove a stopword from a multi-word token, so queries with stopwords are not processed as I would like.
These problems are not critical in my case, but if you are going to use the method described here, you should keep them in mind.
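For the first issue, the POST alternative means putting the query parameters into a form-encoded request body instead of the URL. I went with the bigger buffer myself, so the following is only a sketch using Python's standard library; the Solr URL, the core name "products" and the function name are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_select_request(select_url, params):
    """Create a POST request that carries the Solr parameters in its body."""
    body = urlencode(params).encode("utf-8")
    return Request(
        select_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_select_request(
    "http://localhost:8983/solr/products/select",  # hypothetical URL
    {"q": "name:orange", "fq": "(orange AND room)"},
)
print(request.get_method(), len(request.data))
```

The request object can then be passed to urllib.request.urlopen to actually send it.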
