Sunday, October 26, 2014

Solr and multi-word synonyms - one more way to handle them

In a new project, building a new eCommerce site for one of our clients, I was asked to tune the existing search according to new requirements:
  1. if a document contains the whole search query or a phrase from it, it must be ranked higher than the others;
  2. the retrieved documents must contain all words from the search query;
  3. query-time synonyms: both single-word and multi-word.
The first two seem simple, but the third one… Well… It blows up everything.

The problem of multi-word synonyms is well described in the posts of John Berryman and Mike McCandless, so I won’t dig into the details here. The main point is that SynonymFilter doesn’t work well with term positions if a synonym contains more than one word. After some brainstorming about how to deal with this problem I made a decision: if I can’t use positions the way I want, I should not use them at all. In other words, all matching logic should use the simplest TermQueries and their combinations with boolean operations.

With this rule as the basis for all my actions, I defined three steps to achieve the goal:
  1. match all documents with any part of the search query and with multi-word synonyms;
  2. apply a filter to keep only the required documents;
  3. add boosts.
According to this plan, the main problem is solved in the first step, and the others become purely technical issues.


In the first step I want Solr to return documents that contain any word from the search phrase. It’s just a simple disjunction. For example, for the search phrase "orange room freshener" the query looks like this:
[orange] OR [room] OR [freshener]
Here and below I use square brackets ([ ]) to show tokens that will be transformed into TermQueries, because Solr uses quotes (" ") for a PhraseQuery with all its problems.

To apply a multi-word synonym (e. g. "room freshener, air freshener") without using positions, the target words ("room" and "freshener") must be combined into a single token ([room freshener]). Solr has ShingleFilterFactory, which is designed for this purpose.

At index time I use it to generate all possible shingles. Product fields are usually quite short, so this doesn’t increase the index size much. At query time I generate shingles in a similar way:
[orange] OR [room] OR [freshener] OR [orange room] OR [room freshener] OR [orange room freshener]
I don’t use ShingleFilterFactory to generate the query tokens, because it complicates adding boosts, which I’ll describe later.
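Generating the query-time shingles on the client side can be sketched in Python as follows. This is only an illustration, not the code used in the project; the function name shingles is hypothetical, and the phrase is assumed to be already split on whitespace and normalized.

```python
def shingles(phrase):
    """Return every contiguous word n-gram of the phrase, shortest first."""
    words = phrase.split()
    result = []
    for size in range(1, len(words) + 1):
        # Slide a window of `size` words over the phrase.
        for start in range(len(words) - size + 1):
            result.append(" ".join(words[start:start + size]))
    return result


tokens = shingles("orange room freshener")
# Square brackets mark tokens that become TermQueries, as in the text above.
query = " OR ".join("[%s]" % token for token in tokens)
print(query)
```

A phrase of n words produces n(n+1)/2 shingles, so the main query grows quadratically with the length of the search phrase.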

SynonymFilterFactory must parse multi-word synonyms as a single token. For this it should be configured with KeywordTokenizerFactory:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" format="solr" ignoreCase="true" tokenizerFactory="solr.KeywordTokenizerFactory" />
Therefore after applying the synonym the query looks like:
[orange] OR [room] OR [freshener] OR [orange room] OR ([room freshener] OR [air freshener]) OR [orange room freshener]
The final query to be processed by Lucene may look like:
brand:(orange OR room OR freshener OR orange\ room OR room\ freshener OR air\ freshener OR orange\ room\ freshener)
OR name:(orange OR room OR freshener OR orange\ room OR room\ freshener OR air\ freshener OR orange\ room\ freshener)
OR category:(orange OR room OR freshener OR orange\ room OR room\ freshener OR air\ freshener OR orange\ room\ freshener)
At this point I believe the same can be done with standard Solr features, the (e)DisMax handler and the ShingleFilterFactory mentioned above, but I haven’t tried it myself. Below I talk about filtering and boosting, and things there are not as good as I’d hoped.
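If the synonym expansion is also done on the client side, building the per-field query can be sketched like this in Python. The SYNONYMS table, the helper names and the field list are assumptions made for the example; in the setup described above the expansion is performed by SynonymFilterFactory. Only spaces are escaped here, while real input would need the other Lucene special characters escaped as well.

```python
# Hypothetical synonym groups; each set lists interchangeable tokens.
SYNONYMS = [{"room freshener", "air freshener"}]


def expand(token, synonym_groups):
    """Return the token followed by its synonyms, if it has any."""
    for group in synonym_groups:
        if token in group:
            return [token] + sorted(group - {token})
    return [token]


def escape(token):
    """Escape spaces so a multi-word token is parsed as a single term."""
    return token.replace(" ", "\\ ")


def field_query(field, tokens, synonym_groups):
    """Build field:(t1 OR t2 ...) over the tokens and their synonyms."""
    terms = []
    for token in tokens:
        for variant in expand(token, synonym_groups):
            if variant not in terms:
                terms.append(variant)
    return "%s:(%s)" % (field, " OR ".join(escape(t) for t in terms))


tokens = ["orange", "room", "freshener",
          "orange room", "room freshener", "orange room freshener"]
full_query = " OR ".join(field_query(f, tokens, SYNONYMS)
                         for f in ("brand", "name", "category"))
print(full_query)
```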


The next step is to filter out all documents that don’t contain every word from the search phrase. The simple approaches (an FQ for each word, or an FQ with the whole phrase) don’t work for the same reason: they can’t handle multi-word synonyms. The way to handle them is similar to the main query: shingles. But whereas the main query should match all documents with any shingle, the filter must match only documents that contain some combination of shingles that covers the original search phrase.

For our query "orange room freshener" there are 4 such combinations:
[orange] AND [room] AND [freshener];
[orange room] AND [freshener];
[orange] AND [room freshener];
[orange room freshener].
To build the final filter query I combine all of them with disjunction (synonym is also applied):
(orange AND room AND freshener)
OR (orange\ room AND freshener)
OR (orange AND (room\ freshener OR air\ freshener))
OR (orange\ room\ freshener)
Such a filter query may be used as the main query if you don’t use boosts, or you can expand it to run against several fields, but this is out of the scope of this post.
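The covering combinations are all the ways to split the phrase into contiguous shingles, so they can be enumerated recursively. Below is a minimal Python sketch; the function names and the SYNONYMS mapping are hypothetical, and only spaces are escaped.

```python
# Hypothetical synonym table: shingle -> list of equivalent tokens.
SYNONYMS = {"room freshener": ["air freshener"]}


def segmentations(words):
    """Yield every way to split the word list into contiguous shingles."""
    if not words:
        yield []
        return
    for size in range(1, len(words) + 1):
        head = " ".join(words[:size])
        for rest in segmentations(words[size:]):
            yield [head] + rest


def clause(shingle, synonyms):
    """Render one shingle, OR-ed with its synonyms if it has any."""
    variants = [shingle] + synonyms.get(shingle, [])
    escaped = [v.replace(" ", "\\ ") for v in variants]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + " OR ".join(escaped) + ")"


def filter_query(phrase, synonyms):
    """OR together the AND-clauses of every covering combination."""
    combos = segmentations(phrase.split())
    return " OR ".join(
        "(" + " AND ".join(clause(s, synonyms) for s in combo) + ")"
        for combo in combos
    )


print(filter_query("orange room freshener", SYNONYMS))
```

The number of combinations doubles with each additional word (2^(n-1) for a phrase of n words), which is worth keeping in mind for long search phrases.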


The last remaining step is boosting. After everything done so far, it’s quite simple. Boosts can be added to the main query to favor documents that match longer shingles:
[orange] OR [room] OR [freshener] OR [orange room]^200 OR ([room freshener] OR [air freshener])^200 OR [orange room freshener]^500
Or they may be added as separate boost queries built with shingles for using with (e)DisMax handler.
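Adding the boosts while the main query is being assembled could look like the following Python sketch. The boost values are taken from the example above; the helper name boosted and the SYNONYMS mapping are hypothetical.

```python
# Boost per shingle length in words, taken from the example above.
BOOST_BY_SIZE = {2: 200, 3: 500}

# Hypothetical synonym table: shingle -> list of equivalent tokens.
SYNONYMS = {"room freshener": ["air freshener"]}


def boosted(shingle, synonyms, boosts=BOOST_BY_SIZE):
    """Render a shingle with its synonyms and a length-based boost."""
    variants = [shingle] + synonyms.get(shingle, [])
    body = " OR ".join("[%s]" % v for v in variants)
    if len(variants) > 1:
        body = "(" + body + ")"
    boost = boosts.get(len(shingle.split()))
    return body if boost is None else "%s^%d" % (body, boost)


tokens = ["orange", "room", "freshener",
          "orange room", "room freshener", "orange room freshener"]
print(" OR ".join(boosted(t, SYNONYMS) for t in tokens))
```

Note that a synonym group gets the boost of the original shingle, so "air freshener" is scored like "room freshener".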


The described approach does what it was created for: it makes it possible to search with multi-word synonyms while keeping the ability to affect scoring with boosts. But it’s not ideal. The main issue is that Solr doesn’t help to build such queries. I had to do almost all of the work on the client side, and we ran into several issues:
  1. the query is quite big and may overflow the servlet container’s HTTP header buffer (Jetty in my case), although this can be handled by sending the queries with POST or by increasing the buffer size (I did the latter);
  2. stopwords are not handled well. Solr can’t remove a stopword from a multi-word token, so queries with stopwords are not processed as I would like.
These problems are not critical in my case, but if you are going to use the method described here, you should keep them in mind.

Monday, November 5, 2012

OSX, VirtualBox, NAT and static DHCP



Update Oct, 2014: the guide was slightly modified to support new OSX 10.10 Yosemite. The changes are minor, so I believe it still works for the previous versions, but I didn’t check it.

The issue is that VirtualBox doesn't let you assign static IP addresses to VMs based on, for example, their MAC addresses.

In general the solution is quite simple:
  1. run DHCP server on the host on the VM-ethernet interface (vboxnet0);
  2. allow access from VMs to the external network via NAT.
Step by step solution
en0 - ethernet interface
en1 - WiFi interface
192.168.56.0/24 - internal virtual network for VMs
  1. in VirtualBox disable DHCP-server;
  2. for each VM set network adapter to Host-only;
  3. install (I used brew) dnsmasq;
  4. in dnsmasq conf-file (/usr/local/etc/dnsmasq.conf) configure DHCP-settings;
    • set the property dhcp-leasefile=/usr/local/etc/dnsmasq.leases;
    • the interface to listen on (interface=vboxnet0);
    • IP-range for dynamic IP-addresses (dhcp-range=...);
    • static IPs for particular MAC-addresses(dhcp-host=...);
  5. configure dnsmasq to start as daemon;
  6. enable IP forwarding
    • sudo sysctl -w net.inet.ip.forwarding=1
  7. in /etc/pf.conf add line after nat-anchor "com.apple/*"
    • nat on { en0 en1 } from 192.168.56.0/24 to any -> { (en0) (en1) }
  8. load the rules into pf (pfctl must be run as root)
    • sudo pfctl -F all -f /etc/pf.conf
  9. enable pf with the command
    • sudo pfctl -e
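
The DHCP part of dnsmasq.conf from step 4 can be produced with a small Python sketch like the one below. The MAC addresses, the IP addresses, the lease time and the function name are made-up values for illustration; only the option names (interface, dhcp-leasefile, dhcp-range, dhcp-host) are the dnsmasq settings mentioned above.

```python
def render_dnsmasq_conf(interface, leasefile, dhcp_range, static_hosts):
    """Render the DHCP-related lines of dnsmasq.conf as a string."""
    start, end, lease_time = dhcp_range
    lines = [
        "interface=%s" % interface,
        "dhcp-leasefile=%s" % leasefile,
        # Pool for VMs that have no static assignment.
        "dhcp-range=%s,%s,%s" % (start, end, lease_time),
    ]
    # One dhcp-host line per VM: MAC address, then the fixed IP address.
    for mac, ip in sorted(static_hosts.items()):
        lines.append("dhcp-host=%s,%s" % (mac, ip))
    return "\n".join(lines) + "\n"


conf = render_dnsmasq_conf(
    interface="vboxnet0",
    leasefile="/usr/local/etc/dnsmasq.leases",
    dhcp_range=("192.168.56.100", "192.168.56.200", "12h"),  # example pool
    static_hosts={
        "08:00:27:aa:bb:01": "192.168.56.10",  # hypothetical VM
        "08:00:27:aa:bb:02": "192.168.56.11",  # hypothetical VM
    },
)
print(conf)
```

The MAC address of each VM is shown in its network adapter settings in VirtualBox.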

Some additional hints

IP forwarding can be enabled permanently. To do this, add the following to /etc/sysctl.conf (create this file if it doesn't exist):
  • net.inet.ip.forwarding=1
To enable pf on boot open file /System/Library/LaunchDaemons/com.apple.pfctl.plist and add -e to ProgramArguments:
        <key>ProgramArguments</key>
        <array>
                <string>pfctl</string>
                <string>-f</string>
                <string>/etc/pf.conf</string>
                <string>-e</string>
        </array>

To run dnsmasq as daemon create file /Library/LaunchDaemons/homebrew.mxcl.dnsmasq.plist with this content:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
  <dict>
    <key>Label</key>
    <string>homebrew.mxcl.dnsmasq</string>
    <key>ProgramArguments</key>
    <array>
      <string>/usr/local/sbin/dnsmasq</string>
      <string>--keep-in-foreground</string>
    </array>
    <key>KeepAlive</key>
    <true/>
  </dict>
</plist>

Then register it:
  • sudo launchctl load /Library/LaunchDaemons/homebrew.mxcl.dnsmasq.plist

The virtual network interface vboxnet0 is created only when you run one of the VMs. To create it on boot I wrote a script and placed it in ~/.scripts/vboxnet0.sh:
#!/bin/bash

VBoxManage list hostonlyifs > /dev/null
VBoxManage hostonlyif ipconfig vboxnet0 --ip 192.168.56.1 > /dev/null

Don't forget to execute chmod a+x ~/.scripts/vboxnet0.sh

Then I've created ~/Library/LaunchAgents/virtualbox.vboxnet0.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
  <dict>
    <key>Label</key>
    <string>virtualbox.vboxnet0</string>
    <key>ProgramArguments</key>
    <array>
      <string>/Users/Oleg/.scripts/vboxnet0.sh</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
  </dict>
</plist>
*Don't forget to change the username to yours :-)

And registered it:
  • launchctl load ~/Library/LaunchAgents/virtualbox.vboxnet0.plist
With this in place, dnsmasq no longer crashes with the message that there's no interface vboxnet0.

Tuesday, October 30, 2012

OSX: Russian "Ё" in the PC keyboard layout

The Windows PC keyboard layout is familiar to most of us. There's one in Lion and Mountain Lion, but there's an issue with the Russian layout: the letter "Ё" is not in its standard place under the "~" button. I was going to write about how to create a custom keyboard layout, but there are a lot of manuals about it already. So I've decided to share mine: Russian - PC - yo.zip
This archive contains two files:
  • Russian - PC - yo.keylayout
  • Russian - PC - yo.icns
Just put them into "/Library/Keyboard Layouts/" directory and reboot. 

Friday, October 26, 2012

Homebrew and GUI applications



Homebrew is a package manager that provides ports of many popular Linux utilities.
It installs all software into the /usr/local/bin directory. To use these versions by default, this directory should come before the standard /usr/bin in the PATH environment variable.
For the Terminal it's simple. Just add into ~/.bash_profile the following line:
  • PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:$PATH
But it doesn't work for GUI applications that use these utilities (e. g. IDE uses Maven and SVN).
To fix this I've changed PATH in two places.

The first place is /etc/launchd.conf:
  • setenv PATH /usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin

The second place is /etc/paths:
/usr/local/bin
/usr/bin
/bin
/usr/sbin
/sbin

Reboot after these changes are made.
With these changes almost all applications work fine. "Almost" because there's one application specific to my project that doesn't see SVN 1.7. As a workaround I run it from the Terminal.
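
To check which order an application actually sees, the PATH value can be inspected with a few lines of Python. This is just a convenience sketch; the function name precedes is hypothetical.

```python
import os


def precedes(path_value, first, second):
    """Return True if `first` is listed in PATH before `second`."""
    entries = path_value.split(":")
    if first not in entries:
        return False
    if second not in entries:
        return True
    return entries.index(first) < entries.index(second)


# Check the environment of the current process.
current = os.environ.get("PATH", "")
print(precedes(current, "/usr/local/bin", "/usr/bin"))
```

Run from the Terminal, it reflects ~/.bash_profile; run from a tool launched by a GUI application, it reflects the environment that the application received.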

Wednesday, October 24, 2012

OSX software for common tasks

When you start using a new operating system, it's painful to do the usual tasks, because the software is unfamiliar. To make this process easier I want to publish the list of software I'm using (or have used).

Almost all software is free. For the paid software there are free analogues.

Terminal

Nice terminal with a lot of preferences and features. I like splitting windows vertically and saving workspaces.

File Manager 

muCommander
I used it initially. But it's buggy, so I've moved to the next one.

ForkLift2 ($19.99) (AppStore)
I was lucky to get it when it was free. I like how it stores bookmarks for the remote hosts, how it connects to my home NAS with WebDav and several other features.

Text editor

TextWrangler (AppStore)
With the default TextEdit it's hard for me to edit XML and properties files. So I've found this one.

Package manager

Homebrew
Most Linux utilities are available with "brew install [utility]". For example, "brew install wget" for wget, which isn't available on OSX by default.

VNC-client

VNC Viewer
I like how it scales the remote desktop view when the window size changes.

Screen Sharing (/System/Library/CoreServices/Screen Sharing.app)
I've just found it. The first impression is very good.

OpenVPN client

Tunnelblick
It works, but it replaces the default DNS servers with the ones it gets from the OpenVPN server. If these servers are down, nothing works :-). So I've moved to the next one.

Viscosity ($9.99)
I didn't notice any problems while using it.

Window manager

Scaling the window to full screen with double-click on its header + several other nice features.

Application launcher

Launching applications with "Control+Space" and first letter of the application.

I haven't used it. Other OSX users say that it's almost the same as Quicksilver. I put it here, because it's available in AppStore.

Notification center

Growl ($3.99) (AppStore)
Managing notifications from different applications in one unified way.

Context manager

ControlPlane
A context is a set of preferences and/or actions that depends on the environment (office, home, etc.). This application changes them automatically based on rules. For example, it changes my default printer depending on whether I'm at home or at work. Or, when I'm at home, it mounts my NAS volumes automatically.

The first note

I've created this blog to share my experience and to get some feedback.