Distributed Posting List Joins

From CommerceNet Wiki

Jump to: navigation, search

by Kragen Sitaker; see also Publications

Distributed posting list joins (aka inverted list intersections) are the biggest unsolved problem for a geographically-distributed full-text web-search engine (according to the Nutch FAQ, if I read it correctly, in the section entitled "Will Nutch be a distributed, P2P-based search engine?"). Here are some thoughts which I think include a workable solution.

I talked about posting list colocation before, in a post entitled "distributed peer-to-peer full-text web searching", in February 2004, on kragen-tol. Posting list colocation seems similar to the coincidence problem for pub-sub rendezvous discussed in Consistent Hashing Multicast --- I wonder if there's a way to cut down the numbers in a similar way? Maybe as discussed in Loo, Huebsch, et al.'s [http://iptps04.cs.ucsd.edu/papers/loo-hybrid.pdf "The Case for a Hybrid P2P Search Infrastructure"] you could avoid moving large posting lists around by moving queries to them instead of moving them to queries? Hmm...

That paper also mentions something about join indexes, which sound like they could be a useful optimization for frequent or expensive joins.

In the same conference, Shi, Yang, et al. published a paper entitled [http://iptps04.cs.ucsd.edu/papers/shi-keysearch.pdf "Making Peer-to-Peer Keyword Searching Feasible Using Multi-level Partitioning"]. This brilliant paper proposes a solution to the distributed posting-list join problem, and this solution appears to me to be a major advance that should make peer-to-peer keyword searching of filenames work noticeably better, but it seems to me that it doesn't yet solve the problem for full-text web search.

Basically it just puts each group of documents in its own group of nodes (called a level-L subgroup in the paper) which is assumed to be tightly connected. Unfortunately the second-largest posting list in a query must be able to be transmitted over the network within the group within the required response time. Putting some realistic numbers on this, "http" and "www" might each have 16GB of total posting-list data, each group of nodes might have 1Mbps of bandwidth available to any member, and you might want 1-second response time. This means you must divide your document space into at least 128000 groups, and send any query to all 128000 groups. (The paper includes a tree multicast protocol that makes it possible to do this with reasonable latency.) This unfortunately is comparable to the total number of nodes that can be expected to participate in such a system, so when you try to use this technique to replace Google, it becomes indistinguishable from the partition-by-document technique used by Gnutella, and requires five orders of magnitude too much work from the network as a whole for most queries. This will cause it to perform poorly under many simultaneous queries, which doesn't appear to be considered in the paper.

The paper actually claims that its result is practical for full-text web search, but its performance results are simulated. They find on the order of 30GB of total data transferred for a search of 3 * 10^9 documents in a network of 16384 nodes, or about 1.8MB transferred per node on average. On a 1Mbps link, that's 15 seconds. Accordingly, the latency numbers presented stop around 2 seconds per search with L=4 and 100 million documents. (The paper is ambiguous, but I believe the 30GB is per search, not a total over several queries.)

So here's a slight tweak on this idea which, I think, makes it practical: the aggregate bandwidth available to a particular posting list should be proportional to the size of the posting list. We achieve this by distributing posting lists for more common terms over a larger number of nodes.

My "more on distributed peer-to-peer full-text web searching" post in February 2004 gives a challenge problem: 18 million postings for "criteria", one million postings each for "autism" and "eye-contact", with an intersection of only 4200, in under a second.

Suppose your chosen strategy was to distribute the 2M postings for "autism" and "eye-contact" to the machines where the "criteria" postings lived, then have those machines do the merge and send all the remaining results to a single collection point, in under a second; and suppose that all the machines involved have only half a megabit per second available in each direction. Each posting is perhaps five bytes, so we have 10 megabytes of posting data to scatter/gather in a second, or 80 megabits; that means that the "autism" and "eye-contact" postings must be scattered among 80 machines each, for a total of 12500 postings per machine. Each of those machines then generates a half-megabit spike of traffic for a second sending its load to the "criteria" machines (of which there are 1440, so each sender must send to 18-20 destinations); those machines filter the incoming packets in a streaming fashion against their local hash tables of "criteria" postings, leaving only two small lists of autism&criteria and eye-contact&criteria that must be merged.

If all these machines must maintain in-RAM hash tables of all posting list parts they hold, and they're not allowed to hold more than 12500 postings of any particular posting list (so they can transmit them over the network in a second), then each hash table might be 12500 * (8 (pointer table bytes) + 4 (nextpointer bytes) + 8 (docid bytes) = 20 bytes) = 250 kB. This means that a 1GB machine can be responsible for 4000 full-sized posting-list parts, or 50 million postings. (Maybe use a sorted list instead and save a factor of 4?)

For reference, my current mailbox contains 300MB, 1.2 million terms, 14 million postings, and 36000 documents (mail messages). If those numbers are typical, this scheme needs one network node per 130 thousand documents or 1GB of text. This scales linearly with RAM on those machines, but that still means we can only index the documents that fit in these machines' RAM. This limitation doesn't change with available bandwidth.

Unfortunately, you also need enough network nodes that the longest posting list can be distributed at 12500 postings per node, and practically speaking, I think that means that you need one node per 12500 documents. This limitation doesn't change with available RAM, but it does change with available bandwidth. You'd need about 5 megabits (per acceptable response time, e.g. per second, or per two seconds) each way for it to equal the limitation of 1GB RAM.

So that's an interesting pair of limitations. In this scheme, we can only effectively use about 200MB of RAM per megabit of bandwidth, or vice versa, and either one buys us about 25000 documents or 200MB indexed.

1GB RAM now retails around $200, or about $40 per 200MB.

The Shi et al. paper discusses ways to dramatically reduce bisection bandwidth requirements; these are not discussed here on the theory that the planet-wide internet probably has sufficient bisection bandwidth. However, their geographic clustering technique does apply to the design discussed here.

I'm guessing that right now core bandwidth costs about $20 per bidirectional megabit per second per month. (This means that every two months, the bandwidth cost equals the memory cost, and that's only going to get shorter.) Indexing the 40 terabytes of data on the Web in this fashion would require 40 000 one-gigabyte-RAM, five-megabit machines, which would cost about $100 each per month to run if they were doing searches flat-out all the time. Burst bandwidth is cheaper than sustained bandwidth, though.

It's very reasonable to expect 40 000 machines, or even ten times that number, to participate in a global peer-to-peer network.

If you were going to build this system today, the user interface should probably be in Korean, then translated into Japanese, then possibly translated into Chinese and English; although there are more potential participants in China and English-speaking countries, they tend to lag far behind Korean and Japanese participants in available bandwidth.

Personal tools