How to practically use a keyword analyzer in Azure Search?

This is somewhat related to, and a continuation of, this question: Azure Search Analyzer

I want to use a keyword analyzer for collections of words.

We have documents (products) with different fields like product_name, brand, category and so on. To implement keyword-based ranking (scoring), I would like to add a Collection(Edm.String) field which contains different (untokenized!) keywords, like "brown teddy" or "green bean". To achieve this, I thought about using a keyword analyzer with the following definition:

    // field definition:
    {
      "name": "keyWordList",
      "type": "Collection(Edm.String)",
      "analyzer": "keywordAnalyzer"
    }
    ...
    "analyzers": [
      {
        "name": "keywordAnalyzer",
        "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
        "tokenizer": "keywordTokenizer",
        "tokenFilters": [ "lowercase", "classic" ]
      }
    ]
    ...
    "tokenizers": [
      {
        "name": "keywordTokenizer",
        "@odata.type": "#Microsoft.Azure.Search.KeywordTokenizer"
      }
    ]

Now, after uploading some documents, I can't find them by entering exactly the chosen keywords. For example, there is a document with the following field data:

"keyWordList": [ "Blue Bear", "blue bear", "blue bear123" ]

I'm not able to get any result with the following search query:

{ search:"blue bear", count:"true", queryType:"full" }

Here is what I tried as well:

- using the predefined keyword analyzer instead of a customized one -> no success
- instead of using Collection(Edm.String), testing with a normal String field containing only one keyword -> no success
- splitting up the analyzer in the field definition block into searchAnalyzer="lowercaseAnalyzer" and filterAnalyzer="keywordAnalyzer", and vice versa -> no success

In the end, the only way I could get a result was by sending the whole search phrase as a single term. But this should be done by the analyzer, right?!

{ search:"\"blue bear\"", count:"true", queryType:"full" }

Users don't know whether they are searching for an existing keyword or performing a tokenized search, so requiring quotes is not an option.

Is there any solution to this issue? Or is there maybe a better / easier approach for this kind of keyword (high-scoring) search?

Thanks!

Best answer

Short answer:

The behavior you're observing is correct.

Semantically, your search query blue bear means: find all documents that match the term blue or the term bear. Since you are using the keyword tokenizer, the terms that you indexed are blue bear and blue bear123. The individual terms blue and bear don't exist in your index. That's why only the phrase query returns the result you are expecting.


Long answer:

Let me explain how the analyzer is applied during query processing and how it's applied during document indexing.

On the indexing side, the analyzer you defined processes elements of the keyWordList collection independently. The terms that end up in your inverted index are:

blue bear (since you're using the lowercase filter, blue bear and Blue Bear are normalized to the same term).

blue bear123

As you'd expect, blue bear is one term - not split into two on the space - since you're using the keyword tokenizer. The same applies to blue bear123.
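The indexing behavior described above can be sketched with a small toy model. This is only an illustration, not how Azure Search is implemented: the function names are hypothetical, and the "classic" token filter from the question is left out because it does not change these particular inputs.

```python
def keyword_lowercase_analyze(text):
    """Toy stand-in for a keyword tokenizer plus lowercase filter.

    The keyword tokenizer emits the entire input as one token,
    and the lowercase filter then lowercases that token.
    """
    return [text.lower()]


def index_terms(values):
    """Analyze each collection element independently and collect the terms."""
    terms = set()
    for value in values:
        terms.update(keyword_lowercase_analyze(value))
    return terms


terms = index_terms(["Blue Bear", "blue bear", "blue bear123"])
print(sorted(terms))  # ['blue bear', 'blue bear123']
```

Note that neither blue nor bear appears as a term on its own, which is what matters for the query side.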

On the query processing side, two things happen:

Your search query is rewritten to: blue|bear (find documents with blue or bear). This is because searchMode=any is used by default. If you used searchMode=all, your search query would be rewritten to blue+bear (find documents with blue and bear).

The query parser takes your search query string and separates query operators (such as +, |, * etc.) from query terms. Then it decomposes the search query into subqueries of supported types, e.g., terms followed by the suffix operator '*' become a prefix query, quoted terms become a phrase query, etc. Terms that are not preceded or followed by any of the supported operators become individual term queries.

In your example, the query parser decomposed your query string blue|bear into two term queries with terms blue and bear respectively. The search engine looks for documents that match any of those queries (searchMode=any).

Query terms of the identified subqueries are processed by the search analyzer.

In your example, terms blue and bear are processed by the analyzer individually. They are not modified since they are already lowercase. None of those tokens exist in your index, thus no results are returned.

If your query looked as follows: "Blue Bear" (with quotes), it would be rewritten to "Blue Bear" - notice no change; the OR operator is not put between the words since now you're looking for a phrase. The query parser passes the entire phrase (two words) to the analyzer, which in turn outputs a single, lowercased term: blue bear. This token matches what's in your index.
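To make the order of operations concrete, here is a minimal sketch of the query side under simplifying assumptions. The parser below only understands whitespace-separated terms and double-quoted phrases, the index contents are hard-coded from the example, and all names are hypothetical; the real query parser supports many more operators.

```python
import re

# Terms produced at indexing time by keyword tokenizer + lowercase filter.
INDEX_TERMS = {"blue bear", "blue bear123"}


def analyze(text):
    """Toy keyword + lowercase analyzer: one lowercased token per input."""
    return [text.lower()]


def parse(query):
    """Split the query into subqueries BEFORE any analyzer runs."""
    subqueries = []
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', query):
        if phrase:
            subqueries.append(("phrase", phrase))
        elif word:
            subqueries.append(("term", word))
    return subqueries


def search(query, search_mode="any"):
    """Analyze each subquery on its own, then combine the matches."""
    hits = [
        all(token in INDEX_TERMS for token in analyze(text))
        for _kind, text in parse(query)
    ]
    combine = any if search_mode == "any" else all
    return combine(hits)


print(parse("blue bear"))      # [('term', 'blue'), ('term', 'bear')]
print(search("blue bear"))     # False
print(search('"Blue Bear"'))   # True
```

The unquoted query never reaches the analyzer as a whole string, so the keyword tokenizer has no chance to keep blue bear together.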

The key lesson here is that the query parser processes the query string before the analyzers are applied. The analyzers are applied to individual terms of subqueries identified by the query parser.

I hope this helps you understand the behavior you're observing. Note, you can test the output of your custom analyzer using the Analyze API.
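As a rough sketch of how such a test call could look, the snippet below builds an Analyze API request with Python's standard library. The service name, index name, API key, and the api-version value are placeholders and assumptions; check the REST reference for the version your service supports. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_analyze_request(service, index, api_key, text, analyzer,
                          api_version="2020-06-30"):
    """Build (but do not send) a POST request to the index's analyze endpoint.

    service, index and api_key are placeholders for your own values;
    api_version is an assumption and may need to be adjusted.
    """
    url = (
        f"https://{service}.search.windows.net"
        f"/indexes/{index}/analyze?api-version={api_version}"
    )
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


request = build_analyze_request(
    "my-service", "products", "<admin-api-key>", "Blue Bear", "keywordAnalyzer"
)
print(request.full_url)

# To actually call the service (requires a real service and key):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

With the analyzer from the question, the response should contain a single token for the whole input, which is a quick way to confirm what ends up in the index.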
