联系人数据库中快速子串搜索的算法(Algorithm for fast substring search in a contact database)

我有一个大约有100万个名字和地址的数据库。 应该公开该数据库,以便在网页上进行“Google Suggest”式即时搜索。 我正在寻找一种有效的算法/数据结构,可以帮助我实现这一目标。

使这比使用Trie或Generalized Suffix Tree更困难的原因是它必须支持一些遗漏的查询。 例如,当用户键入“Elvis Pr”时,应该建议“Elvis Aaron Presley”。

我希望得到整个索引在内存中(我有大约4GB的RAM用于此)。

该应用程序是用Java编写的,因此链接到基于Java的库被认为是非常有用的。 我一直在看Lucene和MG4J ,但我还没弄清楚我可以用哪种类型的索引来解决我的问题。

I have a database with approximately 1 million names and addresses. This database should be exposed for a "Google Suggest"-like instant search on a web-page. I'm looking for an efficient algorithm/data structure which can help me achieve this.

What makes this more difficult than just using a Trie or a Generalized Suffix Tree, is that it must support queries with some of the names left out. For instance when a user types "Elvis Pr", "Elvis Aaron Presley" should be suggested.

I'm hoping to get the whole index in memory (I have about 4GB of RAM to be used for this).

The application is written in Java, so links to Java-based libraries are considered extra helpful. I've been looking a bit at Lucene and MG4J, but I haven't figured out which type of indexing I could use for my problem.

最满意答案

看看Cleo ,看起来你有一个非常相似的用例。 他们使用Bloom过滤器搜索前缀,使用Forward索引删除误报。

Have a look at Cleo, it looks you have a pretty similar use case. They use a Bloom filter to search prefixes and a Forward index to remove false positives.

更多推荐