Br0sk: How to split long concatenated words in a Solr index

Sunday, August 26, 2012

The problem:
Finding records by searching for the individual words that make up a concatenated word. Let's make up an example. We have a record that includes the word flagpole. A user searches for flag pole and finds nothing. The problem here is that there is no way for Solr to know how to split the word flagpole into two words (at least not that I know of).

How can we solve this problem in Solr?

The solution:
The solution is not perfect and is pretty manual. You maintain a list of words that will be extracted from concatenated words and indexed for the record alongside the original word. To do so you use a filter called DictionaryCompoundWordTokenFilterFactory, which lets you do exactly that. You set this up at index time.

First add the filter to the field type in your schema file (most likely the standard text field type). You need to add this in the index analyzer section of the field type:
<filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="english-compound-words.txt" minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="true"/>
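
As a rough sketch of where that line goes, the index analyzer of the field type might look something like the fragment below. The tokenizer and the lowercase filter are assumptions for illustration (use whatever your text field already has); only the compound word filter line comes from the setup described here.

```xml
<!-- Sketch only: index-time analyzer of a text field type. -->
<analyzer type="index">
  <!-- Assumed tokenizer; keep the one your field type already uses. -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Assumed: lowercase tokens so they line up with a lowercase word list. -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Adds dictionary words found inside longer tokens as extra tokens. -->
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="english-compound-words.txt" minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="true"/>
</analyzer>
```

Tokens shorter than minWordSize are left alone, and extracted subwords have to be between minSubwordSize and maxSubwordSize characters long.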

When that is done you can define the list of words to be found and extracted from the concatenated words. This is the file named in the dictionary attribute (english-compound-words.txt above). The file takes one word per line.

Our file would look like this for flag pole:

flag
pole

After reindexing our Solr database we would be able to find the record by searching for flagpole or flag pole. The slight drawback with this approach is that you would also find the record just by searching for the word flag or the word pole. This means that matching is less restrictive, which should be fine as long as your relevancy is set up in a decent way. The other drawback is of course that this list has to be maintained manually. Sure, you can write some scripts to find candidate words for this filter, but it is still a manual job. Solr doesn't magically do it for you. The upside is that you have full control over what you add to the file.

What if it were the other way around? Someone added the record with the two words written apart, as in flag pole. Now we wouldn't find the record by searching for flagpole.

The answer is to combine the compound word filter with synonyms, and then it works both ways.

Add a synonym filter to the query analyzer section of the text field type:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

In the synonyms file we add this:

flagpole=>flag pole

Restart Solr and you should find the record by typing in the word flagpole. It will still work with combinations like flag pole, flag or just pole.
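
For reference, the query side could be sketched like this. As before, the tokenizer and lowercase filter are assumed for illustration; the synonym filter line is the one shown above. The compound word filter is deliberately left out of this analyzer, since it is only applied at index time.

```xml
<!-- Sketch only: query-time analyzer of the same text field type. -->
<analyzer type="query">
  <!-- Assumed tokenizer and lowercase filter, mirroring the index side. -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Rewrites flagpole to flag pole using the mapping in synonyms.txt. -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
</analyzer>
```

In synonyms.txt the => form is an explicit mapping: the term on the left is replaced by the terms on the right.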

If someone has a better way of solving this problem please reply in the comments.

3 comments:

Leon Vlieger, 1:47 am
Okay, not having a background in Apache or Solr, but being a fanatical Autohotkey and Perl user... this seems like a very complicated solution.

You write:
"We have a record that includes the word flagpole. A user searches for flag pole and finds nothing. The problem here is that there is no way for Solr to know how to split the word flagpole up in two words(at least not that I know of)."

Indeed, Solr doesn't know how to split up "flagpole". But how about doing it the other way around? You can bounce the query "flag pole" off your database, do a search-and-replace on the search query, removing all spaces, and then bounce the resulting "flagpole" off your database as well. And, voila, we now have a match.

But, I hear you say, what about search queries with more than one space in them? True, this solution works best for simple queries where you have only two words.

The other question I have is, why does Solr treat "flag pole" as one word? Can it not treat the search query as two words and do a search that matches anything containing the string "flag" or "pole"?

Perhaps I am asking for something that any experienced Solr user knows isn't possible, but this is an outsider's perspective... ;)
Unknown, 9:41 am
Hi Leon,

Good questions. I will try to answer them.
------------------------
"Indeed, Solr doesn't know how split up "flagpole". But how about doing it the other way around? You can bounce the query "flag pole" off your database, do a search-and-replace on the search query, removing all spaces, and then bounce the resulting "flagpole" off your database as well. And, voila, we now have match."
------------------------

That would mean two queries to the search server. That is possible but not really acceptable. Also, if the two queries return different results, you have to merge the results and their sorting on the client side. That complicates things heavily. The whole point of this solution is that you can change which words are being split out without having to change the client-side code at all. As you can see, the example of finding just one record with the word flagpole is greatly simplified. I can see how your solution could also work as an OR search, where the client builds the different combinations and creates the OR query programmatically before sending it down. This adds complexity, and you would still need to maintain a word list to be able to split "flagpole" into flag and pole.

------------------------
"The other question I have is, why does Solr treat "flag pole" as one word? Can it not treat the search query as two words and do a search that matches anything containing the string "flag" or "pole"?"
------------------------

If Solr finds "flagpole" at index time it will think it is one word (it works on a per-word basis with stemming and all that). You can set word delimiters like - or even uppercase letters. But for a word like flagpole it is impossible for Solr to know that it can be split into two words (maybe there is a clever stemmer out there that can do it, but I don't think so). You can of course run wildcard searches, but I really want to use a clean Dismax query.

To sum it up, I implemented it using Solr functionality so that the client side could stay completely unchanged and no extra complexity had to be built in.

When it comes to searching I always try to do most of the work in the purpose-built search engine. As soon as you start fiddling with these things on the client it tends to get messy. I think a rule of thumb could be that if it can be configured in Solr, don't try to recreate it on the client side. With Perl and AutoHotkey searching a flat text file or similar, I totally understand if you need to do it on the client, since they don't have a search server like Solr built in to handle it for you.

After all, this solution is not very complex. It is basically a change to the schema file and two flat text files to add words to.

Does my answer shed some light on why I did it this way?
Unknown, 10:46 am
Looking deeper into this, maybe this filter is even more suitable: http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html

I am going to investigate this further. It is incredible how little information about these filters you can find on the web. If anybody has any useful links to documentation about these filters, please post them here.
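
Going by the Javadoc behind that link, a declaration might look roughly like the untested sketch below. The file name hyphenator.xml is a placeholder for a hyphenation pattern file, and the tokenizer is assumed for illustration.

```xml
<!-- Untested sketch based on the factory's Javadoc. -->
<analyzer type="index">
  <!-- Assumed tokenizer; keep the one your field type already uses. -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- hyphenator is required and points to a hyphenation pattern file (placeholder name).
       dictionary is optional and can limit the output to known words. -->
  <filter class="solr.HyphenationCompoundWordTokenFilterFactory" hyphenator="hyphenator.xml" dictionary="english-compound-words.txt" minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="true"/>
</analyzer>
```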