Update: Download the plugin on GitHub.
It’s a pretty common scenario when working with a Solr-powered search engine: you have a list of synonyms, and you want user queries to match documents with synonymous terms. Sounds easy, right? Why shouldn’t queries for “dog” also match documents containing “hound” and “pooch”? Or even “Rover” and “canis familiaris”?
As it turns out, though, Solr doesn’t make synonym expansion as easy as you might like. And there are lots of good ways to shoot yourself in the foot.
The SynonymFilterFactory
Solr provides a cool-sounding SynonymFilterFactory, which can be fed a simple text file containing comma-separated synonyms. You can even choose whether to expand your synonyms reciprocally or to specify a particular directionality.
For instance, you can make “dog,” “hound,” and “pooch” all expand to “dog | hound | pooch,” or you can specify that “dog” maps to “hound” but not vice-versa, or you can make them all collapse to “dog.” This part of the synonym handling is very flexible and works quite well.
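To make the three behaviors concrete, here is a rough Python sketch of how a single line of a synonyms file could be interpreted. This is not Solr's actual parser: the function name `parse_synonym_line` is made up for illustration, and it ignores comments, escaping, and case handling.

```python
def parse_synonym_line(line, expand=True):
    """Map each input term to the list of terms it is rewritten to.

    Hypothetical helper for illustration only; not part of Solr.
    """
    if "=>" in line:
        # Explicit one-way mapping: left-hand terms become right-hand terms.
        left, right = line.split("=>", 1)
        sources = [t.strip() for t in left.split(",")]
        targets = [t.strip() for t in right.split(",")]
        return {src: targets for src in sources}
    terms = [t.strip() for t in line.split(",")]
    if expand:
        # Reciprocal expansion: every term maps to the whole group.
        return {t: terms for t in terms}
    # No expansion: every term collapses to the first term in the group.
    return {t: [terms[0]] for t in terms}


print(parse_synonym_line("dog, hound, pooch"))                # reciprocal
print(parse_synonym_line("dog => hound"))                     # one-way
print(parse_synonym_line("dog, hound, pooch", expand=False))  # collapse
```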
Where it gets complicated is when you have to decide where to fit the SynonymFilterFactory: into the query analyzer or the index analyzer?
Index-time vs. query-time
The graphic below summarizes the basic differences between index-time and query-time expansion. Our problem is specific to Solr, but the choice between these two approaches can apply to any information retrieval system.
Your first, intuitive choice might be to put the SynonymFilterFactory in the query analyzer. In theory, this should have several advantages:
- Your index stays the same size.
- Your synonyms can be swapped out at any time, without having to update the index.
- Synonyms work instantly; there’s no need to re-index.
However, according to the Solr docs, this is a Very Bad Thing to Do(™), and apparently you should put the SynonymFilterFactory into the index analyzer instead, despite what your instincts would tell you. They explain that query-time synonym expansion has three negative side effects:
- Multi-word synonyms won’t work as phrase queries.
- The IDF of rare synonyms will be boosted, causing unintuitive results.
- Multi-word synonyms won’t be matched in queries.
This is kind of complicated, so it’s worth stepping through each of these problems in turn.
Multi-word synonyms won’t work as phrase queries
At Health On the Net, our search engine uses MeSH terms for query expansion. MeSH is a medical ontology that works pretty well to provide some sensible synonyms for the health domain. Consider, for example, the synonyms for “breast cancer”:
breast neoplasm
breast neoplasms
breast tumor
breast tumors
cancer of breast
cancer of the breast
So in a normal SynonymFilterFactory setup with expand="true", a query for “breast cancer” becomes:
+((breast breast breast breast breast cancer cancer) (cancer neoplasm neoplasms tumor tumors) breast breast)
…which matches documents containing “breast neoplasms,” “cancer of the breast,” etc.
However, this also means that, if you’re doing a phrase query (i.e. “breast cancer” with the quotes), your document must literally match something like “breast cancer breast breast” in order to work.
Huh? What’s going on here? Well, it turns out that the SynonymFilterFactory isn’t expanding your multi-word synonyms the way you might think. Intuitively, if we were to represent this as a finite-state automaton, you might think that Solr is building up something like this (ignoring plurals):
But really it’s building up this:
And your poor, unlikely document must match all four terms in sequence. Yikes.
Similarly, the mm parameter (minimum “should” match) in the DisMax and EDisMax query parsers will not work as expected. In the example above, setting mm=100% will require that all four terms be matched:
+((breast breast breast breast breast cancer cancer) (cancer neoplasm neoplasms tumor tumors) breast breast)~4
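The shape of that parsed query can be reproduced with a few lines of Python. This is only a sketch of the stacking behavior described above, not Lucene code; the stopword set and the helper name `stack_by_position` are assumptions for illustration.

```python
from itertools import zip_longest

STOPWORDS = {"of", "the"}  # assumed stopword list for this illustration

PHRASES = [
    "breast cancer",
    "breast neoplasm",
    "breast neoplasms",
    "breast tumor",
    "breast tumors",
    "cancer of breast",
    "cancer of the breast",
]


def stack_by_position(phrases):
    """Put the i-th token of every phrase at position i, dropping stopwords."""
    token_lists = [p.split() for p in phrases]
    stacks = []
    for column in zip_longest(*token_lists):
        stacks.append(
            sorted(t for t in column if t is not None and t not in STOPWORDS)
        )
    return stacks


stacks = stack_by_position(PHRASES)
for position, terms in enumerate(stacks):
    print(position, terms)
```

The result has four positions, which is why a phrase query (or mm=100%) ends up demanding four matching terms in a row.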
The IDF of rare synonyms will be boosted
Even if you don’t have multi-word synonyms, the Solr docs mention a second good reason to avoid query-time expansion: unintuitive IDF boosting. Consider our “dog,” “hound,” and “pooch” example. In this case, a query for any one of the three will be expanded into:
+(dog hound pooch)
Since “hound” and “pooch” are much less common words, though, this means that documents containing them will always be artificially high in the search results, regardless of the query. This could create havoc for your poor users, who may be wondering why weird documents about hounds and pooches are appearing so high in their search for “dog.”
Index-time expansion supposedly fixes this problem by giving the same IDF values for “dog,” “hound,” and “pooch,” regardless of what the document originally said.
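A quick calculation shows the size of the effect. The sketch below uses the IDF formula from Lucene's classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)). The corpus size and document frequencies are invented numbers, and the "after expansion" figure assumes no document contains more than one of the three words.

```python
import math


def classic_idf(doc_freq, num_docs):
    """IDF as defined by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 100_000  # hypothetical corpus size
doc_freqs = {"dog": 5_000, "hound": 200, "pooch": 20}  # hypothetical counts

# Query-time expansion: each term keeps its own document frequency,
# so the rare synonyms score much higher than the common one.
for term, df in doc_freqs.items():
    print(term, round(classic_idf(df, NUM_DOCS), 2))

# Index-time expansion: every document containing one term now contains
# all three, so they share a single document frequency (assuming no overlap).
shared_df = sum(doc_freqs.values())
print("after index-time expansion:", round(classic_idf(shared_df, NUM_DOCS), 2))
```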
Multi-word synonyms won’t be matched in queries
Finally, and most seriously, the SynonymFilterFactory will simply not match multi-word synonyms in user queries if you do any kind of tokenization. This is because the tokenizer breaks up the input before the SynonymFilterFactory can transform it.
For instance, the query “cancer of the breast” will be tokenized by the StandardTokenizerFactory into [“cancer”, “of”, “the”, “breast”], and only the individual terms will pass through the SynonymFilterFactory. So in this case no expansion will take place at all, assuming there are no synonyms for the individual terms “cancer” and “breast.”
Edit: I’ve been corrected on this. Apparently, the bug is in the Lucene query parser (LUCENE-2605) rather than the SynonymFilterFactory.
Other problems
I initially followed Solr’s suggestions, but I found that index-time synonym expansion created its own issues. Obviously there’s the problem of ballooning index sizes, but besides that, I also discovered an interesting bug in the highlighting system.
When I searched for “breast cancer,” I found that the highlighter would mysteriously highlight “breast cancer X Y,” where “X” and “Y” could be any two words that followed “breast cancer” in the document. For instance, it might highlight “breast cancer frauds are” or “breast cancer is to.”
After reading through this Solr bug, I discovered it’s because of the same issue above concerning how Solr expands multi-word synonyms.
With query-time expansion, it’s weird enough that your query is logically transformed into the spaghettified graph above. But picture what happens with index-time expansion, if your document contains e.g. “breast cancer treatment options”:
This is literally what Lucene thinks your document looks like. Synonym expansion has bought you more than you bargained for, with some Dada-esque results! “Breast tumor the options” indeed.
Essentially, Lucene now believes that a query for “cancer of the breast” (4 tokens) is the same as “breast cancer treatment options” (4 tokens) in your original document. This is because the tokens are just stacked one on top of the other, losing any information about which term should be followed by which other term.
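Here is a toy Python model of that stacking, to show why the odd phrases match. It is a simplification, not Lucene's index format: the helper names are made up, each position is just a set of terms, and synonym tokens that would run past the end of the document are dropped.

```python
SYNONYMS = [
    "breast cancer",
    "breast neoplasm",
    "breast neoplasms",
    "breast tumor",
    "breast tumors",
    "cancer of breast",
    "cancer of the breast",
]


def expand_document(doc_tokens, synonyms, start):
    """Stack every synonym's tokens onto the document, beginning at `start`."""
    stacks = [{tok} for tok in doc_tokens]
    for phrase in synonyms:
        for offset, tok in enumerate(phrase.split()):
            if start + offset < len(stacks):
                stacks[start + offset].add(tok)
    return stacks


def phrase_matches(stacks, phrase):
    """True if each word of the phrase is found at consecutive positions."""
    words = phrase.split()
    return any(
        all(w in stacks[i + k] for k, w in enumerate(words))
        for i in range(len(stacks) - len(words) + 1)
    )


doc = "breast cancer treatment options".split()
stacks = expand_document(doc, SYNONYMS, start=0)

print(phrase_matches(stacks, "cancer of the breast"))      # spans all 4 tokens
print(phrase_matches(stacks, "breast tumor the options"))  # the Dada-esque one
print(phrase_matches(stacks, "options treatment"))         # order still matters
```

Because “cancer of the breast” matches across all four positions, a highlighter working from these positions would mark “breast cancer treatment options” as the hit.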
Query-time expansion does not trigger this bug, because Solr is only expanding the query, not the document. So Lucene still thinks “cancer of the breast” in the query only matches “breast cancer” in the document.
Update: there’s a name for this phenomenon! It’s called “sausagization.”
Back to the drawing board
All of this wackiness led me to the conclusion that Solr’s built-in mechanism for synonym expansion was seriously flawed. I had to figure out a better way to get Solr to do what I wanted.
In summary, index-time expansion and query-time expansion were both unfeasible using the standard SynonymFilterFactory, since they each had separate problems:
Index-time
- Index size balloons.
- Synonyms don’t work instantly; documents must be re-indexed.
- Synonyms cannot be instantly replaced.
- Multi-word synonyms cause arbitrary words to be highlighted.
Query-time
- Phrase queries do not work.
- IDF values for rare synonyms are artificially boosted.
- Multi-word synonyms won’t be matched in queries.
I began with the assumption that the ideal synonym-expansion system should be query-based, due to the inherent downsides of index-based expansion listed above. I also realized there’s a more fundamental problem with how Solr has implemented synonym expansion that should be addressed first.
Going back to the “dog”/”hound”/”pooch” example, there’s a big issue usability-wise with treating all three terms as equivalent. A “dog” is not exactly the same thing as a “pooch” or a “hound,” and certain queries might really be looking for that exact term (e.g. “The Hound of the Baskervilles,” “The Itchy & Scratchy & Poochy Show”). Treating all three as equivalent feels wrong.
Also, even with the recommended approach of index-time expansion, IDF weights are thrown out of whack. Every document that contains “dog” now also contains “pooch”, which means we have permanently lost information about the true IDF value for “pooch”.
In an ideal system, a search for “dog” should include documents containing “hound” and “pooch,” but it should still prefer documents containing the actual query term, which is “dog.” Similarly, searches for “hound” should prefer “hound,” and searches for “pooch” should prefer “pooch.” (I hope I’m not saying anything controversial here.) All three should match the same document set, but deliver the results in a different order.
Solution
My solution was to move the synonym expansion from the analyzer’s tokenizer chain to the query parser. So instead of expanding queries into the crazy intercrossing graphs shown above, I split it into two parts: the main query and the synonym query. Then I combine the two with separate, configurable weights, specify each one as “should occur,” and then wrap them both in a “must occur” boolean query.
So a search for “dog” is parsed as:
+((dog)^1.2 (hound pooch)^1.1)
The 1.2 and the 1.1 are the independent boosts, which can be configured as input parameters. The document must contain one of “dog”, “hound,” or “pooch”, but “dog” is preferred.
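As a sketch of the structure (not the plugin's actual Java code), the parsed form can be assembled like this in Python. The function name `build_synonym_query` is hypothetical, and the result is only the printed representation of the query, not a real Lucene query object.

```python
def build_synonym_query(original, synonyms, original_boost=1.2, synonym_boost=1.1):
    """Render 'original OR synonyms' as two boosted 'should' clauses
    wrapped in one 'must' clause."""
    main_clause = f"({original})^{original_boost}"
    if not synonyms:
        # Nothing to expand: only the original query is required.
        return f"+({main_clause})"
    synonym_clause = f"({' '.join(synonyms)})^{synonym_boost}"
    return f"+({main_clause} {synonym_clause})"


print(build_synonym_query("dog", ["hound", "pooch"]))
# prints: +((dog)^1.2 (hound pooch)^1.1)
```

Keeping the two boosts independent is what lets documents with the literal query term rank above documents that only contain a synonym.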
Handling synonyms in this way also has another interesting side effect: it eliminates the problem of phrase queries not working. In the case of “breast cancer” (with the quotes), the query is parsed as:
+(("breast cancer")^1.2 (("breast neoplasm") ("breast tumor") ("cancer ? breast") ("cancer ? ? breast"))^1.1)
(The question marks appear because of the stopwords “of” and “the.”)
This means that a query for “breast cancer” (with the quotes) will also match documents containing the exact sequence “breast neoplasm,” “breast tumor,” “cancer of the breast,” and “cancer of breast.”
I also went one step beyond the original SynonymFilterFactory and built up all possible synonym combinations for a given query. So, for instance, if the query is “dog bite” and the synonyms file contains:
dog,hound,pooch
bite,nibble
… then the query will be expanded into:
dog bite
hound bite
pooch bite
dog nibble
hound nibble
pooch nibble
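The combination step is essentially a Cartesian product over the alternatives for each word. A minimal Python sketch, assuming single-word synonyms only and using a hypothetical helper name `expand_query` (the order of the results may differ from the listing above):

```python
from itertools import product


def expand_query(query, synonym_groups):
    """Return every combination of per-word alternatives for the query."""
    alternatives = []
    for word in query.split():
        # Use the word's synonym group if it has one, else the word alone.
        group = next((g for g in synonym_groups if word in g), [word])
        alternatives.append(group)
    return [" ".join(combo) for combo in product(*alternatives)]


groups = [["dog", "hound", "pooch"], ["bite", "nibble"]]
for variant in expand_query("dog bite", groups):
    print(variant)
```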
Try it yourself!
The code I wrote is a simple extension of the ExtendedDisMaxQueryParserPlugin, called the SynonymExpandingExtendedDisMaxQueryParserPlugin (long enough name?). I’ve only tested it to work with Solr 3.5.0, but it ought to work with any version that has EDisMax.
Edit: the instructions below are deprecated. Please follow the “Getting Started” guide on the GitHub page instead.
Here’s how you can use the parser:
- Drop this jar into your Solr’s lib/ directory.
- Add this definition to your solrconfig.xml:
- Add defType=synonym_edismax to your query URL parameters, or set it as the default in solrconfig.xml.
- Add the following query parameters. The first one is required:
<queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
  <!-- TODO: figure out how we wouldn't have to define this twice -->
  <str name="luceneMatchVersion">LUCENE_34</str>
  <lst name="synonymAnalyzers">
    <lst name="myCoolAnalyzer">
      <lst name="tokenizer">
        <str name="class">solr.StandardTokenizerFactory</str>
      </lst>
      <lst name="filter">
        <str name="class">solr.ShingleFilterFactory</str>
        <str name="outputUnigramsIfNoShingles">true</str>
        <str name="outputUnigrams">true</str>
        <str name="minShingleSize">2</str>
        <str name="maxShingleSize">4</str>
      </lst>
      <lst name="filter">
        <str name="class">solr.SynonymFilterFactory</str>
        <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
        <str name="synonyms">my_synonyms_file.txt</str>
        <str name="expand">true</str>
        <str name="ignoreCase">true</str>
      </lst>
    </lst>
    <!-- add more analyzers here, if you want -->
  </lst>
</queryParser>
The analyzer you see defined above is the one used to split the query into all possible alternative synonyms. Synonyms that are exactly the same as the original query will be ignored, so feel free to use expand=true if you like.
This particular configuration (StandardTokenizerFactory + ShingleFilterFactory + SynonymFilterFactory) is just the one that I found worked the best for me. Feel free to try a different configuration, but something really fancy might break the code, so I don’t recommend going too far.
For instance, you can configure the ShingleFilterFactory to output shingles (i.e. word N-grams) of any size you want, but I chose shingles of size 1-4 because my synonyms typically aren’t longer than 4 words. If you don’t have any multi-word synonyms, you can get rid of the ShingleFilterFactory entirely.
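If shingles are new to you, this Python sketch shows roughly what a shingle filter configured as above emits for a four-word query. It is an approximation written for illustration (the function name `shingles` is not a Solr API), and it ignores details such as token separators and position increments.

```python
def shingles(tokens, min_size=2, max_size=4, output_unigrams=True):
    """Return all word n-grams of the requested sizes, in token order."""
    sizes = list(range(min_size, max_size + 1))
    if output_unigrams:
        sizes = [1] + sizes
    result = []
    for start in range(len(tokens)):
        for size in sizes:
            if start + size <= len(tokens):
                result.append(" ".join(tokens[start:start + size]))
    return result


for shingle in shingles("cancer of the breast".split()):
    print(shingle)
```

Each shingle is then looked up as a whole in the synonyms file, which is how a multi-word entry like “cancer of the breast” can be recognized.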
(I know that this XML format is different from the typical one found in schema.xml, since it uses lst and str tags to configure the tokenizer and filters. Also, you must define the luceneMatchVersion a second time. I’ll try to find a way to fix these problems in a future release.)
Param | Type | Default | Summary |
synonyms | boolean | false | Enable or disable synonym expansion entirely. Enabled if true. |
synonyms.analyzer | String | null | Name of the analyzer defined in solrconfig.xml to use. (E.g. in the example above, it’s myCoolAnalyzer). This must be non-null, if you define more than one analyzer. |
synonyms.originalBoost | float | 1.0 | Boost value applied to the original (non-synonym) part of the query. |
synonyms.synonymBoost | float | 1.0 | Boost value applied to the synonym part of the query. |
synonyms.disablePhraseQueries | boolean | false | Enable or disable synonym expansion when the user input contains a phrase query (i.e. a quoted query). |
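Putting the parameters together, a request might be built like this. The host, port, and handler path are assumptions about a default local Solr install; the parameter names come from the table above and the analyzer name from the example configuration.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr instance; adjust for your setup.
base_url = "http://localhost:8983/solr/select"

params = {
    "q": "dog",
    "defType": "synonym_edismax",
    "synonyms": "true",
    "synonyms.analyzer": "myCoolAnalyzer",
    "synonyms.originalBoost": "1.2",
    "synonyms.synonymBoost": "1.1",
}

url = base_url + "?" + urlencode(params)
print(url)
```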
Future work
Note that the parser does not currently expand synonyms if the user input contains complex query operators (i.e. AND, OR, +, and -). This is a TODO for a future release.
I also plan on getting in contact with the Solr/Lucene folks to see if they would be interested in including my changes in an upcoming version of Solr. So hopefully patching won’t be necessary in the future.
In general, I think my approach to synonyms is more principled and less error-prone than the built-in solution. If nothing else, though, I hope I’ve demonstrated that making synonyms work in Solr isn’t as cut-and-dried as one might think.
As usual, you can fork this code on GitHub!
Posted by lulucas on December 10, 2012 at 3:39 PM
Hello,
Thank you very much for your nice work !
I try to run your class but I get the following error:
GRAVE: java.lang.IllegalAccessError: class org.apache.solr.search.SynonymExpandingExtendedDismaxQParser cannot access its superclass org.apache.solr.search.ExtendedDismaxQParser
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
This message is the same on a solr server 3.5.0 and 3.6.1.
Do you have any idea about the problem?
Thanks in advance.
Posted by Nolan Lawson on December 10, 2012 at 7:09 PM
Hi there,
Could you provide some more info about your situation? Full stacktrace, servlet container you’re running (e.g. Tomcat, Jetty), your Java version, etc.
I found this page, which says it may be a problem with Tomcat 7. Let me know if modifying your web.xml fixed the problem for you.
For the record, I used Java 6, Tomcat 6, and Solr 3.5.0.
Cheers,
Nolan
Posted by lulucas on December 11, 2012 at 9:10 AM
Thank you for your reply,
My environment :
JAVA : java version “1.6.0_18” (OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1))
TOMCAT : 7.0.23
SOLR : 3.6.1
I then added the tag metadata in the /opt/tomcat-master-dev/conf/web.xml file :
The full Java stack Trace :
11 déc. 2012 08:54:31 org.apache.solr.common.SolrException log
GRAVE: java.lang.IllegalAccessError: class org.apache.solr.search.SynonymExpandingExtendedDismaxQParser cannot access its superclass org.apache.solr.search.ExtendedDismaxQParser
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:615)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:334)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:388)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:419)
at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:441)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1612)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1606)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1639)
at org.apache.solr.core.SolrCore.initQParsers(SolrCore.java:1556)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:555)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:480)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:332)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:216)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:161)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:96)
at org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:277)
at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:258)
at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:103)
at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4624)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5281)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:866)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:842)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:615)
at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:649)
at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1581)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
The problem is still there…
Posted by Nolan Lawson on December 11, 2012 at 10:52 AM
I think I’ve found the culprit. It looks like in Solr 3.5.0, ExtendedDismaxQParserPlugin defines “queryFields” to be package-private, whereas in Solr 3.6.1, it becomes private. So my code fails because I try to access the superclass’s “queryFields” (here).
I’ve filed this as a bug on the GitHub page, and I’ll try to address it as soon as I can. Unfortunately this seems like a pretty nasty problem, and I don’t know if there’s any way I can make it work without gutting the offending lines of code. But I’ll look into it.
Posted by lulucas on December 11, 2012 at 12:25 PM
I also think the problem is at this private access to the field “queryFields”.
I hope you can find a solution (my Java skills stop here, sorry).
Thank you very much for your promptness !
Posted by Matthias Samwald on December 21, 2012 at 4:20 PM
I was just about to start experimenting with this when I found your blog via Google (small world? well maybe not that many people are playing with Solr and MeSH after all). Very helpful post!
– Matthias (working in the same project as Nolan)
Posted by Nolan Lawson on December 23, 2012 at 9:33 PM
Fancy meeting you here! I think you’re right that it’s just our field of semantically-advanced, health-related search engines that’s small. :)
Posted by Kannan on January 4, 2013 at 10:30 PM
Thanks much for the nice work.
We also were not happy with the solr synonym handling and were thinking along the lines of writing a query parser and glad that we found this post.
We are using solr 4.0. I got the source code for the SynonymExpandingExtendedDismaxQParserPlugin from github and fixed few package imports (from solr to lucene — modeled after ExtendedDismaxQParserPlugin) and fixed couple of issues because of the new ResourceLoaderAware class, but hit into the private queryFields error.
Would appreciate if you have any update on this issue.
Also curious to find out if you contacted solr/lucene folks and
their reaction.
Posted by Nolan Lawson on January 5, 2013 at 8:49 PM
I actually was able to fix the queryFields problem, but I just hadn’t merged my changes into the master branch yet, because during my testing of Solr 3.6.1 I ran into a (probably) unrelated issue. It’s committed to the master branch now.
If you have other changes for Solr 4.0, though, then please merge your code with mine (up to the latest commit), test it, and if it works in Solr 4.0, then please send me a pull request in GitHub. Hopefully we can make this work for Solr 3.5, 3.6, 3.6.1, and 4.0 in one fell swoop!
I did contact the Lucene/Solr folks. It remains to be seen if they will integrate my changes, but if not, then I’m also happy to just consider this as a separate plug-in.
Posted by Kannan on January 8, 2013 at 9:31 PM
Thanks. Once we have working version of the code with solr 4.0, will send the code to you.
Posted by AB on January 9, 2013 at 5:23 AM
hi Nolan
I set out with great anticipation to use your jar since we have clients of various industries that run into the need for exactly this kind of parser. I built the jar from your github source with this environment:
Apache Maven 3.0.4 (r1232337; 2012-01-17 00:44:56-0800)
Maven home: C:\apache\apache-maven-3.0.4
Java version: 1.7.0_07, vendor: Oracle Corporation
Java home: C:\Program Files\Java\jdk1.7.0_07\jre
Default locale: en_US, platform encoding: Cp1252
OS name: “windows 7”, version: “6.1”, arch: “amd64”, family: “windows”
Solr 3.5
but I keep hitting that error running with Jetty:
————————————————-
Jan 08, 2013 4:43:47 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.IllegalAccessError: class org.apache.solr.search.SynonymExpandingExtendedDismaxQParser cannot access its superclass org.apache.solr.search.ExtendedDismaxQParser
————————————————-
And now for the good news, it runs fine with Tomcat 6 apparently
————————————————-
Jan 08, 2013 4:50:54 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding ‘file:/C:/apache/solr350/contrib/querytimesynonymparser/hon-lucene-synonyms-1.0.jar’ to classloader
Jan 08, 2013 4:50:54 PM org.apache.solr.core.SolrConfig
————————————————-
Thank you for creating this little gem :)
AB
Posted by AB on January 9, 2013 at 10:56 PM
Sadly, I rejoiced a little too soon. Tomcat loads it really nicely, but as soon as you open up your solr in the browser, back to square one with a fat HTTP500 error
HTTP Status 500 – Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. If you want solr to continue after configuration errors, change: false in solr.xml
————————— java.lang.IllegalAccessError: class org.apache.solr.search.SynonymExpandingExtendedDismaxQParser cannot access its superclass org.apache.solr.search.ExtendedDismaxQParser at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(Unknown Source) at java.security.SecureClassLoader.defineClass(Unknown Source) at java.net.URLClassLoader.defineClass(Unknown Source) at java.net.URLClassLoader.access$100(Unknown Source) at
etc …
Posted by Otis Gospodnetic on January 14, 2013 at 5:02 PM
@Nolan:
> I did contact the Lucene/Solr folks. It remains to be seen if they will integrate my
> changes, but if not, then I’m also happy to just consider this as a separate plug-in.
Please do file the JIRA issue with your patch and set Fix Version to 4.2. Thanks.
Posted by tandula on January 15, 2013 at 12:01 AM
By placing the files apache-solr-solrj-3.5.0.jar & apache-solr-core-3.5.0.jar in /example/lib , I was able to get it past the original complaints.
Now the compile error is the following.
SEVERE: org.apache.solr.common.SolrException: Error Instantiating QParserPlugin,
solr.SynonymExpandingExtendedDismaxQParserPlugin is not a org.apache.solr.search.QParserPlugin
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:421)
at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:441)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1612)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1606)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1639)
at org.apache.solr.core.SolrCore.initQParsers(SolrCore.java:1556)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:555)
QUESTION: Nolan, which exact solr version did you write this jar against? Perhaps I’m using a version of QParser that is different than yours :) I’ll get this thing to work
AB
Posted by Nolan Lawson on January 18, 2013 at 10:53 AM
Hi guys,
I think this discussion is starting to outgrow WordPress. I’d prefer for us to document this in GitHub, so I created a new GitHub issue to track all the compatibility problems with Solr 3.6.0, 3.6.1, and 4.0.
I can confirm seeing the same errors myself. And I would greatly appreciate any Solr guru who could help shed some light on these problems. :)
– Nolan
Posted by Okke Klein on January 18, 2013 at 12:47 PM
If you need help, I suggest you make a Jira issue in the Solr project like Otis also suggested. You can upload non-working patches so others can try to fix them.
Good luck. Looking forward to testing this feature.
Posted by Kevin Schaper on January 18, 2013 at 8:26 PM
Thanks for the blog post! It’s nice to have a confirmation of the strange behavior I’m getting with edismax, q.op=AND and synonym expansion of multi-word ontology terms.
An option I’ve considered is to move the term expansion out of Solr and into the application layer – that way I can carry forward your idea of a reduced boost for synonyms and also reduce the boost score based on how far down the DAG tree a child ontology term is.
I’m working with Solr for a model organism database – always nice to find more bio/med search people!
Posted by AB on January 29, 2013 at 1:37 AM
Nolan
I got it to work in run-time, and I want to just compliment you on how stunningly it works. I tested it with 3 word phrases and it broke it into single words, and then to ensure it finds the phrases exactly, I added quotes in the synonyms_extended.txt file. It worked like a charm.
How you ask?
I put the code for your synonym parser in with the rest of the actual Solr 3.5 source, and compiled the entire Solr source. Then I took that solr.war file and replaced my old one.
The issues that we keep seeing here have to do with how Jetty (and Tomcat, for that matter) loads the solr.war file, which is the envelope for the jars. Since the war file had no previous knowledge of the synonym parser being an extension of the QParser, well, you get the picture.
My next project is how to write it without needing to recompile solr.war, so that it’s just a synonymexpander.jar file I can drop into the lib folder, restart solr and be done and ready to go with the stock solr.war.
Oh yes, I also added a requestHandler to do the work, so my solr queries are clean like this localhost:8983/solr/ab/?q=shoes
explicit
xml
Hon Synonyms
synonym_edismax
*:*
30
*,score
2<-1 5<-2 6<90%
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
true
myCoolAnalyzer
1.0
1.2
false
hope this helps
AB
Posted by Nolan Lawson on January 29, 2013 at 2:13 PM
Hi there,
Yes, this helps a ton. Unfortunately, until we find a solution using a drop-in JAR file, it sounds like your “compile along with Solr” solution is the only reasonable one. Since this is the case, though, and since I’ve gotten so much positive feedback on this code, I will file an issue in the Solr JIRA itself to add my code as a patch.
I’m also taking the liberty of adding your comments to the GitHub issue. Please, folks, restrict your comments/bugfixes/me-too’s to GitHub! :) Thanks.
Posted by Nolan Lawson on January 30, 2013 at 12:26 AM
All right, compatibility bug fixed, documentation improved, and JIRA issue filed. Let’s see if this code can make it into an upcoming version of Solr.
Posted by Philippe Lequerre on February 25, 2013 at 9:33 PM
Interesting post.
How does Solr handle synonyms when the term is ambiguous? Let’s take “book” which can be both a noun and a verb. Is Solr going to return synonyms of both meanings?
Unless Solr fully disambiguates in context, which, to the best of my knowledge, it doesn’t, expanding query terms to synonyms will, in this case return even more irrelevant results than just using the query term.
Posted by AB on February 25, 2013 at 9:44 PM
Dear Philippe
Do you know of an algorithm that deals with slang and words in context like you describe? as an example, Google returns results for Books, when searching for Book. and Booking sites, when you search for “Book It” or Booking.
Solr has a well-oiled Protected Words factory to deal with cases such as “Book It” and Booking. Enter what you need into the protected words file, ensure the type of the field uses the Protected Words factory, and voilà, Solr won’t stem and dismember the terms, but they have to appear as-is in your data for this to work. Then you can make synonyms between booking, book it, booking it, making a booking, etc.
HTH
AB
Posted by Philippe Lequerre on February 26, 2013 at 1:28 PM
Our own semantic platform (Inbenta) handles disambiguation in context and it does it for several languages. Your example of “booking”, “book it”, “booking it” is fine but it is, pardon my French, trivial. In all these instances “book” has only one meaning, the meaning of “reserve”, so there’s no need to disambiguate. What if the Content has documents containing expressions like “ship a book” and “book a ship”?
Posted by nimnio on February 27, 2013 at 12:27 AM
Neither Solr nor Nolan’s open-source improvement deals with synonym semantics, nor do they claim to. Your “constructive criticism” seems no more than advertising.
Posted by Philippe Lequerre on February 27, 2013 at 2:11 PM
??? I am sorry but I responded to a post by AB asking for an algorithm doing what I describe. It’s not advertising, it’s answering a question.
Posted by nimnio on February 27, 2013 at 5:30 PM
Oh, I see now. Sorry, Philippe. The indentation didn’t make it clear that your comment was a reply, so I thought you were commenting on the overall blog post. Once again, sorry.
Posted by Philippe Lequerre on February 27, 2013 at 5:33 PM
No problem. We’re cool.
Posted by Jim on February 28, 2013 at 12:04 AM
Have you been able to do any work on the below statement?
“Note that the parser does not currently expand synonyms if the user input contains complex query operators (i.e. AND, OR, +, and -). This is a TODO for a future release.”
I am interested in using the Synonym handler you have created, but I need to add some additional information to the query to get the correct results. I have your code setup and it works as described thanks… just need a little more.
Posted by Nolan Lawson on March 4, 2013 at 12:29 PM
I currently don’t have any intention to solve the complex query operator problem (if you’re using complex operators, you’re probably not a naïve end-user who needs synonym expansion in the first place!), but you are welcome to submit a patch on GitHub if you’d like. :)
Posted by veggen on March 23, 2013 at 6:32 PM
Thanks a lot for the great work!
Is it safe to use your lib with Solr 4.2?
Posted by Nolan Lawson on May 5, 2013 at 11:36 PM
Yep, I just confirmed that the branch I wrote for Solr 4.1 also works fine in Solr 4.2. I.e. the unit tests pass.
Posted by art on June 7, 2013 at 6:34 PM
Could you explain what the effect of back pack=>backpack is?
Posted by Nolan Lawson on June 12, 2013 at 9:19 AM
It transforms all instances of “back pack” to “backpack,” but not in the opposite direction.
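To make the directionality concrete, here is a minimal Python sketch of how the two rule styles behave. It is a simplified model of the synonyms file format for illustration only, not Solr's actual parser; `parse_rule` is a hypothetical helper name, and comments, escaping, and the expand option are ignored.

```python
def parse_rule(line):
    """Parse one synonyms-file line into a dict of input phrase -> output phrases.

    Simplified model (an assumption, not Solr's real implementation).
    """
    if "=>" in line:
        # Explicit mapping: left-hand phrases are replaced by the right-hand phrases.
        left, right = line.split("=>", 1)
        sources = [p.strip() for p in left.split(",") if p.strip()]
        targets = [p.strip() for p in right.split(",") if p.strip()]
        return {src: targets for src in sources}
    # Comma-separated group: every phrase maps to the whole group.
    group = [p.strip() for p in line.split(",") if p.strip()]
    return {phrase: group for phrase in group}


one_way = parse_rule("back pack=>backpack")
two_way = parse_rule("back pack,backpack")

print(one_way.get("back pack"))  # ['backpack']
print(one_way.get("backpack"))   # None: nothing maps in the opposite direction
print(two_way.get("backpack"))   # ['back pack', 'backpack']
```

With the comma-separated form, both phrases map to the whole group, which is the reciprocal behavior; the `=>` form only rewrites the left-hand side.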
Posted by jhsuh on June 25, 2013 at 2:08 AM
This is exactly what I was looking for in my search system.
But I have one question about a problem I found.
This query parser uses the raw query itself twice.
For example, I search for the query “ny”, which has synonyms. When I check the parsed query, “ny” appears in it twice.
The same is true for the query “new york”. When I check the scores of the matched documents, those documents get an advantage.
I think the raw query doesn’t need to be searched again in the synonym part of the query. Please consider changing this.
$ tail synonyms.txt
Television, Televisions, TV, TVs
#notice we use “gib” instead of “GiB” so any WordDelimiterFilter coming
#after us won’t split it into two words.
#Synonym mappings can be used for spelling correction too
pixima => pixma
new york, nyc,ny, new york city
dog => hound, pooch, canis familiaris, man’s best friend
fc => football club
ml => major league
http://127.0.0.1:8983/solr/select?qf=Title_t&q=ny%20beaches&defType=synonym_edismax&synonyms=true&debugQuery=true&q.op=AND&synonyms.constructPhrases=true
+((((Title_t:ny) (Title_t:beaches))~2) ((+(((Title_t:"new york city") (Title_t:beaches))~2)) (+(((Title_t:"new york") (Title_t:beaches))~2)) (+(((Title_t:ny) (Title_t:beaches))~2)) (+(((Title_t:nyc) (Title_t:beaches))~2))))
http://127.0.0.1:8983/solr/select?qf=Title_t&q=new%20york%20beaches&defType=synonym_edismax&synonyms=true&debugQuery=true&synonyms.originalBoost=1.1&synonyms.synonymBoost=0.9&q.op=AND&synonyms.constructPhrases=true
+((((Title_t:new) (Title_t:york) (Title_t:beaches))~3^1.1) (((+(((Title_t:"new york city") (Title_t:beaches))~2)) (+(((Title_t:"new york") (Title_t:beaches))~2)) (+(((Title_t:ny) (Title_t:beaches))~2)) (+(((Title_t:nyc) (Title_t:beaches))~2)))^0.9))
7.8 = (MATCH) sum of:
3.3000002 = (MATCH) sum of:
1.1 = (MATCH) weight(Title_t:new in 10044) [MinimalScoreDefaultSimilarity],
result of: 1.1 = score(doc=10044,freq=1.0 = termFreq=1.0),
product of: 1.1 = queryWeight, product of: 1.0 = idf(docFreq=83, maxDocs=144370) 1.1 = queryNorm 1.0 = fieldWeight in 10044,
product of: 1.0 = tf(freq=1.0),
with freq of: 1.0 = termFreq=1.0 1.0 = idf(docFreq=83, maxDocs=144370) 1.0 = fieldNorm(doc=10044)
1.1 = (MATCH) weight(Title_t:york in 10044) [MinimalScoreDefaultSimilarity],
result of: 1.1 = score(doc=10044,freq=1.0 = termFreq=1.0),
product of: 1.1 = queryWeight, product of: 1.0 = idf(docFreq=46, maxDocs=144370) 1.1 = queryNorm 1.0 = fieldWeight in 10044, product of: 1.0 = tf(freq=1.0),
with freq of: 1.0 = termFreq=1.0 1.0 = idf(docFreq=46, maxDocs=144370) 1.0 = fieldNorm(doc=10044)
1.1 = (MATCH) weight(Title_t:beaches in 10044) [MinimalScoreDefaultSimilarity],
result of: 1.1 = score(doc=10044,freq=1.0 = termFreq=1.0),
product of: 1.1 = queryWeight,
product of: 1.0 = idf(docFreq=5, maxDocs=144370) 1.1 = queryNorm 1.0 = fieldWeight in 10044,
product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 1.0 = idf(docFreq=5, maxDocs=144370) 1.0 = fieldNorm(doc=10044)
4.5 = (MATCH) sum of:
4.5 = (MATCH) sum of:
3.6 = (MATCH) weight(Title_t:"new york" in 10044) [MinimalScoreDefaultSimilarity],
result of: 3.6 = score(doc=10044,freq=1.0 = phraseFreq=1.0),
product of: 1.8 = queryWeight,
product of: 2.0 = idf(),
sum of: 1.0 = idf(docFreq=83, maxDocs=144370) 1.0 = idf(docFreq=46, maxDocs=144370) 0.9 = queryNorm 2.0 = fieldWeight in 10044,
product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = phraseFreq=1.0 2.0 = idf(), sum of: 1.0 = idf(docFreq=83, maxDocs=144370) 1.0 = idf(docFreq=46, maxDocs=144370) 1.0 = fieldNorm(doc=10044)
0.9 = (MATCH) weight(Title_t:beaches in 10044) [MinimalScoreDefaultSimilarity],
result of: 0.9 = score(doc=10044,freq=1.0 = termFreq=1.0),
product of: 0.9 = queryWeight,
product of: 1.0 = idf(docFreq=5, maxDocs=144370) 0.9 = queryNorm 1.0 = fieldWeight in 10044, product of: 1.0 = tf(freq=1.0),
with freq of: 1.0 = termFreq=1.0 1.0 = idf(docFreq=5, maxDocs=144370) 1.0 = fieldNorm(doc=10044)
Posted by jhsuh on June 25, 2013 at 6:45 AM
I have one more question about synonym_edismax.
Actually, we set the field types in schema.xml as shown below, and when we index or query each field, Solr works with those settings.
But for synonym_edismax, I can set only one tokenizer. How should I do a synonym search across fields that have different types (and are tokenized differently)?
…
Posted by Niki on August 9, 2013 at 3:55 PM
Great post. Have you read the post from Mike McCandless ‘Lucene’s TokenStreams are actually graphs!’ at http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html. I think the problem you are having with highlighting extra terms is an effect of a problem he called sausagization! LOL. I have a similar situation: we are adding semantic tags that can span terms, and like you, I came to almost the exact same solution, although I didn’t think to play around with the boost. That’s a great idea.
Posted by Nolan Lawson on October 22, 2013 at 11:38 PM
Thanks for the link. “Sausagization” perfectly describes the highlighting problem I mentioned above.
Since the author says that fixing this problem would require some fairly low-level changes in Lucene’s indexer, it sounds like my solution is still pretty useful for the time being!
Posted by aowen on September 30, 2013 at 4:13 PM
i’m using solr 4.3.1 and want to use your solution. unfortunately i get the following output with http://localhost:8983/solr/select/?q=dog&debugQuery=on&qf=text&defType=synonym_edismax&synonyms=true
dog
dog
(+(() (((+())/no_coord) ((+())/no_coord) ((+())/no_coord) ((+())/no_coord))))/no_coord
+(() ((+()) (+()) (+()) (+())))
SynonymExpandingExtendedDismaxQParser
why isn’t it something like: +(DisjunctionMaxQuery((text:dog))…….
Posted by aowen on September 30, 2013 at 4:45 PM
sorry, everything is fine. it was a typo in the request
Posted by phranco on April 25, 2014 at 8:37 AM
I am getting the same kind of result, what was the typo in the request?
Posted by Nolan Lawson on April 25, 2014 at 9:09 AM
Not sure, but there are instructions in the readme on running the Python tests. It will build up the config and query automatically.
Posted by Nolan Lawson on April 25, 2014 at 9:10 AM
Ah, it could be qf=text. I think I used the name field instead.
Posted by Aaron on October 10, 2013 at 9:21 PM
Hi Nolan,
First of all, thanks for the great job done! Your synonyms expanding parser is very helpful and definitely a big step forward for the synonym handling in Solr. Were you able to make stemming work with your parser? What if you want “dogs” to be expanded without explicitly specifying plural form in the synonyms list?
Posted by Nolan Lawson on October 22, 2013 at 11:29 PM
Unfortunately, since the synonym expansion occurs before the query is processed by the query analyzer, plurals aren’t handled automatically. You’d have to either:
1) manually include plurals in your synonyms file, or
2) tweak the “synonym analyzer” and add a stemmer in the tokenization/analysis chain. (Your mileage may vary; I’ve never experimented with it myself.)
Hope that helps!
Posted by Peter Robsen on October 17, 2013 at 8:44 PM
Hi, great job, this is very useful for us. Now I’m trying to get the synonyms from a web service that we already have. Can you help me achieve this?
Posted by Nolan Lawson on October 22, 2013 at 11:30 PM
Because of the way the SynonymFilterFactory is designed, I believe you’d have to download all the synonyms from your web service and save them as a text file (as described in the SynonymFilterFactory documentation).
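As a rough illustration of that export step, here is a minimal Python sketch that writes synonym groups in the comma-separated, one-group-per-line format. The names `format_synonym_lines`, `write_synonyms_file`, and the sample data are hypothetical; how you actually fetch the groups from your web service is left out.

```python
def format_synonym_lines(groups):
    """Turn a list of synonym groups into comma-separated lines, one group per line."""
    return [",".join(group) for group in groups]


def write_synonyms_file(groups, path):
    """Write the formatted groups to a plain-text synonyms file."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in format_synonym_lines(groups):
            handle.write(line + "\n")


# Hypothetical data standing in for whatever your web service returns.
groups_from_service = [
    ["dog", "hound", "pooch"],
    ["tv", "television"],
]
write_synonyms_file(groups_from_service, "synonyms.txt")
```

You would then point the synonym filter configuration at the generated file.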
Posted by Steve on December 20, 2013 at 8:06 AM
Nolan,
Nice work, thanks. Do you know whether anyone’s tried applying a similar design for a raw Lucene environment/repository (i.e. subclassing QueryParser)?
Thanks!
Posted by Nolan Lawson on December 22, 2013 at 6:03 PM
No, although if you wanted to replicate what I did, the code itself would be pretty straightforward. Basically, I just built up a lattice, expanding the query into every possible synonym combination (e.g. dog bites -> dog nibbles, pooch bites, pooch nibbles).
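The combination step described above can be sketched in a few lines of Python. This is an illustration of the idea, not the plugin's actual code: the `SYNONYMS` table and `expand_query` are hypothetical, and only single-word terms are handled.

```python
from itertools import product

# Hypothetical synonym table for illustration; the plugin reads these from a synonyms file.
SYNONYMS = {
    "dog": ["pooch"],
    "bites": ["nibbles"],
}


def expand_query(query, synonyms):
    """Return every combination of original terms and synonyms, original query first."""
    # One list of alternatives per query term, with the original term first.
    alternatives = [[term] + synonyms.get(term, []) for term in query.split()]
    return [" ".join(combo) for combo in product(*alternatives)]


print(expand_query("dog bites", SYNONYMS))
# ['dog bites', 'dog nibbles', 'pooch bites', 'pooch nibbles']
```

Note that the number of combinations is the product of the number of alternatives at each position, so it grows quickly for long queries with many synonyms.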
Posted by Rimas on January 3, 2014 at 4:22 AM
Hi Nolan,
Very nice tool, thanks. Is it possible to integrate synonyms in some other way than saving them as a text file, as described in the SynonymFilterFactory documentation? For example, what if the synonyms or other related concepts are in XML files (as a thesaurus)? If so, where could such a synonym source be read?
Posted by Nolan Lawson on February 10, 2014 at 1:02 AM
Sorry, this plugin is really pretty tightly coupled to the SynonymFilterFactory file format.
Posted by tandula on January 23, 2014 at 4:20 PM
hi Nolan
It’s Jan 2014, Solr 4.6 is out, and in the Wiki, yours is the first mentioned and recognized 3rd party Query parser extension.
http://wiki.apache.org/solr/QueryParser
Way to go Nolan!
AB – Anria Billavara
Posted by Nolan Lawson on January 24, 2014 at 8:50 AM
That’s great! It’s nice to see folks getting so much use out of the plugin.
Posted by Alexander on February 10, 2014 at 3:45 PM
Hello,
This plugin gives the following error
maxClauseCount is set to 5100
org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 5100
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:142)
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:133)
at org.apache.solr.search.SynonymExpandingExtendedDismaxQParser.applySynonymQueries(SynonymExpandingExtendedDismaxQParserPlugin.java:416)
at org.apache.solr.search.SynonymExpandingExtendedDismaxQParser.attemptToApplySynonymsToQuery(SynonymExpandingExtendedDismaxQParserPlugin.java:379)
at org.apache.solr.search.SynonymExpandingExtendedDismaxQParser.parse(SynonymExpandingExtendedDismaxQParserPlugin.java:351)
at org.apache.solr.search.QParser.getQuery(QParser.java:142)
at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:142)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:710)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:744)
500
I think this is because the synonym list is too big. Is it possible to use this kind of expanding in index time to avoid this error?
Posted by Manuel Le Normand on February 18, 2014 at 8:46 AM
Hi Nolan,
I’m thinking of adapting your queryParser to deal with word similarities. These are quasi-synonyms with different similarity scores, determined by how many hierarchy levels separate the terms (the scores are their boosts). It has a hierarchical structure built of bags of words (marked by {}) and terms, for example:
{countries} => {countries in europe}, {countries in asia}, australia, usa
{countries in europe} => france, england
{countries in asia} => china, japan, israel
{celebrity} => britney spears, madonna
We expect the following query: q={!synonym_edismax}(countries or celebrity) ==> q=max( (australia usa)^2 OR (france, england)^1 OR (china, japan, israel)^1) OR max(britney spears or madonna)
Before I dive deep into the code, I wanted to know whether this queryParser would be adaptable for this need.
Second of all, what do you think of contributing the code to the Solr project so you wouldn’t have to worry about maintaining it?
Thanks
Posted by Nolan Lawson on March 16, 2014 at 1:54 PM
Unfortunately, the synonym plugin wasn’t really designed for hierarchical synonyms or meronyms/holonyms as you describe. You can attach different synonym files to different fields in order to achieve what you want, but it’d be kinda hacky.
As for Solr, apparently it’ll be included in 4.8! https://issues.apache.org/jira/browse/SOLR-4381
Posted by Rimas on April 16, 2014 at 6:39 AM
Can you explain how different synonym files can be attached to different fields using your synonym plugin?
Posted by Nolan Lawson on April 16, 2014 at 9:33 AM
Check out the example config. Where it says “myCoolAnalyzer”, you can add multiple tags with whatever analyzers you want, e.g. “myCoolAnalyzer2”, “myCoolAnalyzer3”, etc. Then when you query, you just specify the analyzer you want to use with the synonyms.analyzer option. Unfortunately you’ll have to do a separate query for each analyzer, though.
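As a sketch of the "separate query for each analyzer" approach, the following Python snippet builds one request URL per analyzer. The host, path, and analyzer names are assumptions to adapt to your own setup; `build_query_urls` is a hypothetical helper, and other parameters you may need (such as qf) are left out.

```python
from urllib.parse import urlencode

# Assumed values for illustration: adjust the host, path, and analyzer names
# to match your own Solr installation and solrconfig.xml.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"
ANALYZERS = ["myCoolAnalyzer", "myCoolAnalyzer2"]


def build_query_urls(user_query, analyzers):
    """Build one select URL per synonym analyzer."""
    urls = []
    for analyzer in analyzers:
        params = {
            "q": user_query,
            "defType": "synonym_edismax",
            "synonyms": "true",
            "synonyms.analyzer": analyzer,
        }
        urls.append(SOLR_SELECT_URL + "?" + urlencode(params))
    return urls


for url in build_query_urls("dog", ANALYZERS):
    print(url)
```

Each URL can then be requested separately and the results merged in whatever way suits your application.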
Posted by Bernd Wölfel on July 28, 2014 at 9:06 AM
Hi Nolan,
this is an awesome plugin, exactly what I need for my project. Unfortunately I am not able to get it to run with the example config from GitHub. I’m using solr-4.6.1 with your latest plugin version (on a Sun-Java6 VM).
The configuration besides that is pretty simple, and the plugin itself works nicely, but it cannot find the one and only “MyCoolAnalyzer” with its defined values from solrconfig. It always gives me NoAnalyzerSpecified/AnalyzerNotFound no matter what I do (I even got the Java source and statically set the name to look for to “MyCoolAnalyzer”). The collection just stays empty.
I guess it is a pretty dumb mistake, but if anyone could give me a pointer in the right direction I’d appreciate it very much.
Thank you!
Best,
Bernd
Posted by Bernd Wölfel on July 31, 2014 at 7:56 AM
Sorry, I figured my issue out as well (misallocated synonyms file) – Thank you so much for the great Plugin!
Posted by Nolan Lawson on August 9, 2014 at 2:03 PM
Hi Bernd, if you check out the code you will see that there are Python scripts to set up a little Solr server and run the tests automatically. It even downloads the Solr binaries for you, so you don’t have to do anything except have Python and Java installed. If you compare that setup to yours, I’m sure you’ll see what the issue is! Cheers.
Posted by Ashutosh on July 31, 2014 at 2:33 AM
Hi Nolan,
Thanks for the great plugin. I tried your parser. It’s expanding the queries as expected, but I’m getting zero results. Also, I must mention that I’m just starting with Solr and Lucene.
Here’s a part of the result for the query http://127.0.0.1:3898/solr-webapp/collection1/select?q=crowd%20finance&qf=text&defType=synonym_edismax&synonyms=true&wt=json&synonyms.originalBoost=1.2&synonyms.synonymBoost=1.1&debugQuery=on
{
responseHeader: {
status: 0,
QTime: 18,
params: {
debugQuery: “on”,
synonyms.synonymBoost: “1.1”,
q: “crowd finance”,
qf: “text”,
synonyms: “true”,
wt: “json”,
synonyms.originalBoost: “1.2”,
defType: “synonym_edismax”
}
},
response: {
numFound: 0,
start: 0,
docs: [ ]
},
debug: {
rawquerystring: “crowd finance”,
querystring: “crowd finance”,
parsedquery: “SynonymExpandingExtendedDismaxQuery(custom(boost(+(((text:crowd) (text:finance))~2) (title:”crowd finance”~10^25.0) (concept_tags:”crowd finance”~10^25.0) (content:”crowd finance”~10),product(pow(int(share_count),const(1.5)),0.08/(3.16E-8float(ms(const(1406797200000),date(earliest_known_date)))+0.05)))))”,
parsedquery_toString: “custom(boost(+(((text:crowd) (text:finance))~2) (title:”crowd finance”~10^25.0) (concept_tags:”crowd finance”~10^25.0) (content:”crowd finance”~10),product(pow(int(share_count),const(1.5)),0.08/(3.16E-8float(ms(const(1406797200000),date(earliest_known_date)))+0.05))))”,
explain: { },
queryToHighlight: [
“org.apache.lucene.search.BooleanClause:((text:crowd) (text:finance))~2^1.2”,
“org.apache.lucene.search.BooleanClause:((+(((text:crowd) +(text:business) +(text:finance))~1) (title:”crowd business and finance”~10^25.0) (concept_tags:”crowd business and finance”~10^25.0) (content:”crowd business and finance”~10)))^1.1”,
“org.apache.lucene.search.BooleanClause:((+(((text:crowd) (text:funding))~2) (title:”crowd funding”~10^25.0) (concept_tags:”crowd funding”~10^25.0) (content:”crowd funding”~10)))^1.1”,
“org.apache.lucene.search.BooleanClause:((+(((text:crowd) (text:business) (text:finance))~3) (title:”crowd business finance”~10^25.0) (concept_tags:”crowd business finance”~10^25.0) (content:”crowd business finance”~10)))^1.1”,
“org.apache.lucene.search.BooleanClause:((+(((text:crowd) (text:financial))~2) (title:”crowd financial”~10^25.0) (concept_tags:”crowd financial”~10^25.0) (content:”crowd financial”~10)))^1.1”
],
expandedSynonyms: [
“crowd business and finance”,
“crowd business finance”,
“crowd finance”,
“crowd finance”,
“crowd financial”,
“crowd funding”
],
mainQueryParser: [
“QParser”,
“ExtendedDismaxQParser”,
“altquerystring”,
null,
“boost_queries”,
null,
“parsed_boost_queries”,
[ ],
“boostfuncs”,
null
],
synonymQueryParser: [
“QParser”,
“ExtendedDismaxQParser”,
“altquerystring”,
null,
“boost_queries”,
null,
“parsed_boost_queries”,
[ ],
“boostfuncs”,
null
],
…
Am I missing something obvious?
Thanks.
Posted by Ashutosh on July 31, 2014 at 5:36 AM
Sorry, found the culprit. It seems I hadn’t set the parameters of edismax correctly.
Thanks a ton.
Posted by Frédéric on August 29, 2014 at 10:00 AM
Hey Nolan, long time no see!!!
Guess what… I’m currently working on my medical thesis, which concerns some kind of search engine for patient documentation. The thing is supposed to be well suited to in-hospital physicians (whose effectiveness on EHRs is quite poor in general, I must admit…).
Anyways, I was running some query on Google with “mesh lucene semantic” (I think) and I ended up here! And I very much enjoyed it, I must say!!! ;)
I’m gonna delve into it right away, so thanks for the work!
I hope we’ll see each other again! Cheers
Frédéric (HUG)
Posted by Nolan Lawson on October 4, 2014 at 8:15 AM
Salut Fréd! That’s awesome; glad to know you’re still doing well at the HUG. :)
The synonyms project is still going strong, and still the most popular project on the HON’s GitHub page. So yeah, it filled a neat little void in the Solr ecosystem.
Take care; hope your thesis goes well!
Posted by rajeshhazari on December 9, 2014 at 9:06 AM
Hi Nolan,
Thanks for your great plugin; it completely satisfies our requirement. I tried your synonym parser with the config and synonyms dictionary below, but I have an issue with the expansion of multi-term/phrase synonyms.
solr-version : 4.9
and the example_synonyms.txt looks like
dog,hound,pooch,canis familiaris,man’s best friend
back pack,backpack
e-commerce,electronic commerce
bfo, brystforstørrende operation
blood bones ,Blood and Bones
swedish turnips,rutabagas
jigglypuff, purin
aaafoo,aaabar
bbbfoo,bbbfoo bbbbar
cccfoo,cccbar cccbaz
fooaaa bar,baraaa bar2,bazaaa bar3
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
Absolutely Mindy,Mindy Thomas
ACC, Atlantic Coast Conference
AFC, American Football Conference
When I query for ‘q=dog’ (or any single term like q=gb, q=television or q=backpack), synonym expansion works fine. My final query looks like this
for ‘q=dog’ or any single term: http://localhost:8585/solr/select?q=dog&rows=1&wt=json&indent=true&debugQuery=true&defType=synonym_edismax&synonyms=true&synonyms.constructPhrases=true
The same query fails to expand synonyms for any phrase query, e.g. q=”Absolutely+Mindy”
http://localhost:8585/solr/select?q=%22Absolutely+Mindy%22&rows=1&wt=json&indent=true&debugQuery=true&defType=synonym_edismax&synonyms=true&synonyms.constructPhrases=true
Even the synonyms from your example list don’t expand, i.e.
q=”back pack” or q=”swedish turnips”
Sample debug response:
“responseHeader”:{
“status”:0,
“QTime”:5,
“params”:{
“debugQuery”:”true”,
“indent”:”true”,
“synonyms.constructPhrases”:”true”,
“q”:”\”Absolutely Mindy\””,
“synonyms”:”true”,
“wt”:”json”,
“defType”:”synonym_edismax”,
“rows”:”0″}},
“response”:{“numFound”:347,”start”:0,”docs”:[]
},
“spellcheck”:{
“suggestions”:[
“correctlySpelled”,true]},
“debug”:{
“rawquerystring”:”\”Absolutely Mindy\””,
“querystring”:”\”Absolutely Mindy\””,
“parsedquery”:”(+DisjunctionMaxQuery((textSpell:\”absolutely mindy\”)))/no_coord”,
“parsedquery_toString”:”+(textSpell:\”absolutely mindy\”)”,
“explain”:{},
“expandedSynonyms”:[“\”Absolutely Mindy\””],
“reasonForNotExpandingSynonyms”:[
“name”,”DidntFindAnySynonyms”,
“explanation”,”No synonyms found for this query. Check your synonyms file.”],
“mainQueryParser”:[
“QParser”,”ExtendedDismaxQParser”,
“altquerystring”,null,
“boost_queries”,null,
“parsed_boost_queries”,[],
“boostfuncs”,null],
“synonymQueryParser”:[
“QParser”,”ExtendedDismaxQParser”,
“altquerystring”,null,
“boostfuncs”,null],
………………
I am looking into your code to find the cause of the issue.
Let me know where I am going wrong, and whether you need any more info on this issue.
Posted by rajeshhazari on December 9, 2014 at 9:12 AM
Hi Nolan,
Posted by rajeshhazari on December 9, 2014 at 9:21 AM
Hi Nolan,
Please ignore/delete my previous comment, sorry about the wacky comment.
Posted by rajeshhazari on December 9, 2014 at 11:52 AM
Hi Nolan,
Thanks for your great plug-in; it completely satisfies our requirement.
Please ignore/delete my previous comments.
solr-version : 4.9
Plugin version jar : hon-lucene-synonyms-1.3.4-solr-4.3.0.jar
Your plugin is working as expected after I added the “filter” config in the parser.
I tried your synonym parser with the synonym below, and there seems to be an issue with expansion. Please see the debug response below.
query :
http://localhost:8585/solr/select?q=fooaaa+bar&rows=0&wt=json&indent=true&debugQuery=true&defType=synonym_edismax&synonyms=true
Synonym : fooaaa, fooaaa bar
“debug”: {
“rawquerystring”: “fooaaa bar”,
“querystring”: “fooaaa bar”,
“parsedquery”: “(+((DisjunctionMaxQuery((textSpell:fooaaa)) DisjunctionMaxQuery((textSpell:bar))) (((+(DisjunctionMaxQuery((textSpell:fooaaa)) DisjunctionMaxQuery((textSpell:bar)) DisjunctionMaxQuery((textSpell:bar))))/no_coord)) (((+DisjunctionMaxQuery((textSpell:fooaaa)))/no_coord))))/no_coord”,
“parsedquery_toString”: “+(((textSpell:fooaaa) (textSpell:bar)) ((+((textSpell:fooaaa) (textSpell:bar) (textSpell:bar))) ((+(textSpell:fooaaa))))”,
“explain”: {},
“queryToHighlight”: [
“org.apache.lucene.search.BooleanClause:(textSpell:fooaaa) (textSpell:bar)”,
“org.apache.lucene.search.BooleanClause:(+((textSpell:fooaaa) (textSpell:bar) (textSpell:bar)))”,
“org.apache.lucene.search.BooleanClause:(+(textSpell:fooaaa))”
],
“expandedSynonyms”: [
“fooaaa”,
“fooaaa bar”,
“fooaaa bar bar”
],
I guess i am not missing any required configs :).
Please let me know if you need more info on this issue.
Posted by Nolan Lawson on December 10, 2014 at 8:10 AM
Hi Rajesh,
Could you please file an issue on the GitHub page? It’s probably a more appropriate spot for such issues: https://github.com/healthonnet/hon-lucene-synonyms/issues
Also I should let you know that this plugin is no longer supported, although of course since it’s open-source I’m always happy to receive pull requests! :)
Posted by rajeshhazari on December 10, 2014 at 9:57 AM
Thanks for letting me know that this plugin is no longer supported. Can you please share an alternative, if you have one?
Created issue #48 to track this.
Posted by Okke Klein on February 21, 2015 at 3:17 AM
Shame it is not supported anymore. Really useful plugin. Anyone tested this with Solr5 yet?
Posted by Biswajit on April 6, 2015 at 8:53 AM
Hi,
Is the reason this plugin is no longer supported that it’s now integrated into the main Solr build, as you mentioned in the Jira https://issues.apache.org/jira/browse/SOLR-4381 ?
Posted by Okke Klein on June 23, 2015 at 7:37 AM
Doesn’t look like a final solution will be integrated anytime soon. Luckily this solution still works with Solr 5.2.1.
Posted by Martin on July 31, 2015 at 8:30 AM
As Okke said, this seems to work well with Solr 5.2.1. I had a bit of trouble getting Solr to load hon-lucene-synonyms-1.3.5-solr-4.3.0.jar properly, but in the end I put it in a folder under the new folder server/solr/lib and modified my java command which launches start.jar to include the parameter -DsharedLib=/path/to/server/solr/lib , and then everything worked well.
Posted by Pankaj Patil on January 13, 2016 at 4:02 PM
Thanks for the work, Nolan. It’s a really interesting approach and I am eager to try it. However, I am getting the following error when I run it on Solr 4.10.2. Do you know what could be wrong?
ERROR – 2016-01-13 23:54:51.253; org.apache.solr.servlet.SolrDispatchFilter; An Error was wrapped in another exception – please report complete stacktrace on SOLR-6161
org.apache.solr.common.SolrException: SolrCore ‘collection1’ is not available due to init failure: null
at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:307)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:255)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:249)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
… 1 more
Caused by: java.lang.AbstractMethodError
at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:675)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:855)
… 8 more
WARN – 2016-01-13 23:54:51.254; org.eclipse.jetty.servlet.ServletHandler; Error for /solr/
java.lang.AbstractMethodError
at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:675)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:855)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:255)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:249)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Posted by Pankaj Patil on January 20, 2016 at 8:14 PM
My apologies for the previous post. I was using the wrong version of the jar, and downloaded the correct jar for my Solr version (4.10.2) from https://github.com/healthonnet/hon-lucene-synonyms. It worked like a charm.
Many thanks for the great work!
Posted by Kapil Jhamb on June 30, 2016 at 11:13 AM
Hi ,
This is really an amazing solution for query-time synonym expansion. I have integrated this plugin with Solr 4.2.1 and it works great. Are there any scaling issues with the synonym feature? That is, can it handle thousands of entries in the example_synonym_file.txt? For instance, in my use case there are 63k entries. How will this affect Solr performance?
Thanks
Kapil
Posted by Plamen M Todorov on July 1, 2016 at 9:35 AM
Hi Nolan, first thank you for the great plugin! We are testing it for our KM system and so far it’s working perfectly.
I am having only one problem with it: when using it in embedded function queries in constructed fields, I don’t seem to be able to engage the parser. This is the query I’m testing:
=/select
params={q=happy opera&defType=synonym_edismax&qf=CONTENT^1.0 KCTITLE^10.0 METADATA_CONTENT^2.0&fl=*,+score,KCTITLE_FULL:exists(query({!q.op=AND type=synonym_edismax synonyms=true v='KCTITLE:(happy opera)'}))&synonyms=true&sort=score asc&rows=25&wt=json&wt=json&wt=javabin&debugQuery=true}
The query is ‘happy opera’ and for ‘happy’ there is a defined synonym ‘joyful’, so the query correctly matches the document that has ‘joyful opera’ in the KCTITLE field. The problem is with the dynamic field KCTITLE_FULL, which I construct in the fl param. Its purpose is to check whether ALL terms from the query are matched in the KCTITLE field (as opposed to ANY, or to being matched in other fields), so I do the following:
KCTITLE_FULL:exists(query({!q.op=AND type=synonym_edismax synonyms=true v='KCTITLE:(happy opera)'}))
This should return true if both happy and opera are found in the KCTITLE field and false otherwise. Unexpectedly, it returns false, even though KCTITLE contains ‘joyful’ and ‘opera’ and joyful=happy. I know for sure the query is correct because when v='KCTITLE:(joyful opera)', KCTITLE_FULL returns true, which means it’s not an issue with the synonym_edismax and the q.op parameter.
I suspect the synonym_edismax is not being activated at all in the embedded query for some reason, or the synonyms=true parameter is not being sent. Any idea why this is not working? I’ve compiled and am using the plugin on Solr 5.4.0 (I had to modify the code cosmetically, nothing that affects functionality).
Thanks in advance
Posted by Plamen M Todorov on July 5, 2016 at 6:37 AM
Never mind, found I was missing the qf parameter. Working ok now, thanks!
Posted by Ethan Zhang on August 1, 2016 at 11:31 PM
Hi Nolan,
Thank you very much for your great plugin. I am currently using it in our solr-based search engine for synonym handling, and it works like a charm!
Just one question here: I found that for queries with multiple terms, the expanded synonyms always contain the original query, even if I specify an explicit mapping (“=>”) in the synonym.txt file.
For example, the synonym.txt: “foo bar => buz”,
Query: q=foo+bar&qf=name_en&fl=name_en&defType=synonym_edismax&synonyms=true&debugQuery=true
What I expected is that the query “foo bar” would turn into “buz”. But the debug info shows the original query is still used to do the search:
foo bar
foz baz
Is there a way I can get rid of the original query?
Posted by JD on September 9, 2016 at 6:36 AM
I know this blog post is a bit outdated, but still.. quite an amazing and detailed explanation. Greatly appreciated!!
V/r,
^_^
Posted by Lalit Joshi on September 11, 2016 at 8:07 AM
Hi Nolan,
My name is Lalit Joshi. I am having a problem with Solr synonym search. My synonym search works fine as long as the phrase is at most 11 characters long, including spaces. So if I create a synonym for the phrase “Marketing Technology” like “Marketing Technology,mark,martech”, it won’t work until I reduce the phrase “Marketing Technology” to 11 characters. For example, “Marketing t,mark, martech” works for me. So I just want your help on whether there is some phrase size limit for synonym searching. Also, I have added the synonym filter only in the query analyzer and not at index time.
Please suggest.
Posted by Multi-Word Synonyms: Solr Adds Query-Time Support | Lucidworks on April 18, 2017 at 10:22 AM
[…] Blog: https://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ […]
Posted by Multi-term in Solr with Auto Phrasing TokenFilter | Lucidworks on June 1, 2017 at 3:38 PM
[…] Jack Krupansky’s proposal, John Berryman’s excellent summary and Nolan Lawson’s query parser solution). Basically, what it boils down to is a problem with parallel term positions in the […]
Posted by Enhanced multi-word synonyms and phrase search in hybris | hybrismart | SAP hybris under the hood on August 9, 2017 at 9:12 AM
[…] The details of the problems are greatly explained in this article. […]
Posted by Stephen on September 12, 2017 at 9:43 AM
Thank you. Took my synonym filter out of query analyzer and everything started working a little more in alignment with my expectations.
Posted by Esteban on December 6, 2017 at 6:49 AM
Hello, What would be your updated strategy for SOLR 7 and on?
http://lucene.472066.n3.nabble.com/Solr-6-4-0-and-deprecated-SynonymFilterFactory-td4318599.html
Posted by VivekMandlik on May 30, 2023 at 4:33 AM
Hi Nolan,
I recently came across your blog post on synonyms in Solr, and I wanted to express my gratitude for providing such informative content. As a newcomer to Solr, I found your blog to be a valuable resource in understanding synonyms and their implementation.
However, I encountered an issue while using synonyms at query time in Solr 8.11.2. Prior to reading your blog, I had successfully implemented the functionality in Solr 6.2.0, but I faced certain challenges when attempting to replicate the process in Solr 8.11.2. Specifically, I encountered the exception “org/apache/lucene/analysis/util/ResourceLoaderAware” among various others.
In light of this, I was wondering if you could provide some guidance regarding the necessary changes required to make your plugin compatible with Solr 8.11.2. I believe that certain modifications need to be made to adapt the plugin to the newer version, but I am unsure about the specific changes and updates required.
I have already reviewed the documentation and release notes for the plugin, but I couldn’t find any explicit information about the compatibility with Solr 8.11.2. Therefore, I would greatly appreciate it if you could offer some insight into the necessary adjustments or point me in the right direction.
If possible, I kindly request you to share any relevant code modifications or updates that need to be applied to successfully utilize your plugin with Solr 8.11.2. Additionally, if you have any recommended resources or suggestions for resolving the exception “org/apache/lucene/analysis/util/ResourceLoaderAware,” I would be extremely grateful.
I understand that you might be busy, but any assistance you can provide would be immensely valuable to me in overcoming this challenge. Thank you once again for your time and for sharing your expertise through your blog. I look forward to hearing from you and learning more about the necessary steps to make your plugin work with Solr 8.11.2.