Elasticsearch analyzer: tokenization

2025-01-22 08:19:30 #elasticSearch


Elasticsearch version: 6.5.0 (see the official documentation)

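1. Full-text indexing

The basic idea of full-text search: extract part of the information from unstructured data (the full text), reorganize it so that it has some structure, and then search that structured data. This is what makes the search fast.

This process of building an index first and then searching the index is called full-text search.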

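2. Inverted index

Unstructured data stores which strings each file contains, that is, file -> string: a mapping from files to strings.

What we want to search for is which files contain a given string, that is, string -> file: a mapping from strings to files.

The string-to-file mapping is the reverse of the file-to-string mapping, so an index that stores this information is called an inverted index.

An inverted index stores, for each word, where it occurs in a document or a set of documents. In other words, it is the "string-to-file mapping" (term -> document, dictionary -> document).

The dictionary stores the terms parsed out of a document field. Producing those terms is the job of tokenization.

For example: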

Content of document A:
    《机械设计（修订版）》是普通高等教育“十一五”国家级规划教材，是在第一版基础上并参照教育部高等学校机械基础课程教学指导分委员会最新提出的高等学校“机械设计课程教学基本要求（修订稿）”。

Suppose that tokenization stores the following terms in the dictionary:
	机械, 设计, 机械设计

That is: '机械', '设计', '机械设计' -> document A

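Suppose there are two documents: doc-1 contains three keywords (China, United States, South Korea), and doc-2 contains four (China, United States, Germany, United Kingdom).

Document-to-term table

Document	Terms
doc-1	China, United States, South Korea
doc-2	United Kingdom, China, United States, Germany

Term-to-document table (inverted index)

Term	Documents
China	doc-1, doc-2
United States	doc-1, doc-2
South Korea	doc-1
United Kingdom	doc-2
Germany	doc-2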

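To find the documents that contain the keyword "United States", look up that term: the result is doc-1 and doc-2.

This conversion from "which words a document contains" to "which documents a word belongs to" is where the name "inverted" comes from.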
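
The inversion described above can be sketched in a few lines of Java. This is only an illustration, assuming the doc-1 and doc-2 keyword lists from the example (with the country names written in English). The class and method names are made up for this sketch, and a real Lucene index stores far more than a list of document IDs per term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {

    // Turn a document -> terms mapping into a term -> documents mapping.
    static Map<String, List<String>> invert(Map<String, List<String>> forward) {
        Map<String, List<String>> inverted = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : forward.entrySet()) {
            for (String term : doc.getValue()) {
                List<String> postings = inverted.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document at most once per term.
                if (!postings.contains(doc.getKey())) {
                    postings.add(doc.getKey());
                }
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps doc-1 before doc-2, so posting lists come out in that order.
        Map<String, List<String>> forward = new LinkedHashMap<>();
        forward.put("doc-1", Arrays.asList("China", "United States", "South Korea"));
        forward.put("doc-2", Arrays.asList("United Kingdom", "China", "United States", "Germany"));

        Map<String, List<String>> inverted = invert(forward);
        System.out.println(inverted.get("United States")); // [doc-1, doc-2]
        System.out.println(inverted.get("South Korea"));   // [doc-1]
    }
}
```

Looking up a keyword is then a single map access, which is why searching the index is fast compared with scanning every document.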

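3. Analyzer

Elasticsearch full-text search can only find terms that actually exist in the inverted index. It is therefore important that a document at index time and the query string at search time go through the same analysis process, so that the query terms can match the terms in the inverted index.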
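
A small Java sketch shows why this matters. The analyzer here is a hypothetical toy (lowercase, then split on whitespace), not any real Elasticsearch analyzer, and the class name is invented for the illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SameAnalyzerSketch {

    // Toy analyzer (an assumption for this sketch): lowercase the text and split on whitespace.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Index time: only the analyzed terms end up in the inverted index.
        Set<String> indexedTerms = new HashSet<>(analyze("Full Text Search"));

        // Search time without analysis: the raw string is not a term in the index.
        System.out.println(indexedTerms.contains("Search")); // false

        // Search time with the same analysis: the query term is found.
        System.out.println(indexedTerms.containsAll(analyze("Search"))); // true
    }
}
```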

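Every field in a document can specify an analyzer, and different fields can use different analyzers.

In Elasticsearch, not every field type is analyzed.

Strings come in two field types: keyword and text.

keyword: Elasticsearch does no processing on the value at index (write) time.
text: Elasticsearch analyzes the value before indexing (writing) it.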

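4. Default analyzer

An analyzer used for indexing or searching can be defined at three levels: field, index, and global default.

Elasticsearch checks the following places in order until it finds an analyzer it can use.

At index time
① the analyzer in the field's mapping configuration
② the analyzer configured on the index
③ the standard analyzer
At search time
① the analyzer defined in the search query itself
② the analyzer defined in the field's mapping
③ the analyzer configured on the index
④ the standard analyzer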
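
The lookup order listed above amounts to "take the first analyzer that is configured, otherwise fall back to standard". The following Java sketch models only that list. The class and method names are hypothetical, null stands for "not configured", and the real lookup inside Elasticsearch has more steps than this.

```java
import java.util.Arrays;
import java.util.Objects;

class AnalyzerFallbackSketch {

    // Return the first configured (non-null) analyzer name, or "standard" if none is set.
    static String firstConfigured(String... candidates) {
        return Arrays.stream(candidates)
                .filter(Objects::nonNull)
                .findFirst()
                .orElse("standard");
    }

    // Index time: field mapping, then index setting, then standard.
    static String forIndexing(String fieldAnalyzer, String indexAnalyzer) {
        return firstConfigured(fieldAnalyzer, indexAnalyzer);
    }

    // Search time: query, then field mapping, then index setting, then standard.
    static String forSearching(String queryAnalyzer, String fieldAnalyzer, String indexAnalyzer) {
        return firstConfigured(queryAnalyzer, fieldAnalyzer, indexAnalyzer);
    }

    public static void main(String[] args) {
        System.out.println(forIndexing("ik_max_word", null));              // ik_max_word
        System.out.println(forSearching("ik_smart", "ik_max_word", null)); // ik_smart
        System.out.println(forSearching(null, null, null));                // standard
    }
}
```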

To distinguish the index-time analyzer from the search-time analyzer, Elasticsearch supports the search_analyzer mapping parameter.

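5. The es-ik analyzer

Download the IK analyzer
The Elasticsearch version is 6.5.0, so use version 6.5.0 of elasticsearch-analysis-ik as well.
Download: elasticsearch-analysis-ik-6.5.0.zip

Install elasticsearch-analysis-ik
Create an 'analysis-ik' directory inside the plugins directory of the Elasticsearch installation.
For example, my plugins path is '/home/tool/elasticsearch/elasticsearch-6.5.0/plugins', so the new directory is:
'/home/tool/elasticsearch/elasticsearch-6.5.0/plugins/analysis-ik'
Put the downloaded zip file in this directory and unzip it. After unzipping, delete the zip file.

Start
Then start Elasticsearch as usual.

Test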

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "2018年5月全球编程语言排行榜"
}

Response

{
  "tokens" : [
    {
      "token" : "2018年",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "TYPE_CQUAN",
      "position" : 0
    },
    {
      "token" : "5月",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "TYPE_CQUAN",
      "position" : 1
    },
    {
      "token" : "全球",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "编程",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "语言",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "排行榜",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}

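5.1. IK analyzer modes

The IK analyzer has two modes:

Mode	Description
ik_max_word	Splits the text at the finest granularity
ik_smart	Splits the text at a coarse granularity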

5.1.1. ik_max_word

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "2018年5月全球编程语言排行榜"
}

Response:

{
  "tokens" : [
    {
      "token" : "2018",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "ARABIC",
      "position" : 0
    },
    {
      "token" : "年",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "COUNT",
      "position" : 1
    },
    {
      "token" : "5",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "ARABIC",
      "position" : 2
    },
    {
      "token" : "月",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "COUNT",
      "position" : 3
    },
    {
      "token" : "全球",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "编程",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "语言",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "排行榜",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "排行",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "榜",
      "start_offset" : 15,
      "end_offset" : 16,
      "type" : "CN_CHAR",
      "position" : 9
    }
  ]
}

5.1.2. ik_smart

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "2018年5月全球编程语言排行榜"
}

Response:

{
  "tokens" : [
    {
      "token" : "2018年",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "TYPE_CQUAN",
      "position" : 0
    },
    {
      "token" : "5月",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "TYPE_CQUAN",
      "position" : 1
    },
    {
      "token" : "全球",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "编程",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "语言",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "排行榜",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}

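6. Example

6.1. Mapping settings

Index name: bookdatas

Type: bookType

The fields name, publish, type, author and info are all of type text.

Index-time analyzer: ik_max_word (finest-grained splitting)

Search-time analyzer: ik_smart (coarse-grained splitting)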

PUT /bookdatas
{
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1
    },
    "mappings": {
        "bookType": {
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart"
                },
                "publish": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart"
                },
                "type": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart"
                },
                "author": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart"
                },
                "info": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart"
                },
                "price": {
                    "type": "integer"
                }
            }
        }
    }
}

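6.2. Importing data

curl -H 'Content-Type: application/x-ndjson' -XPOST '10.47.181.141:9200/bookdatas/bookType/_bulk?pretty' --data-binary @Elasticssearch测试图书数据.json

Note: in my experience, data imported directly with curl ran into problems in analyzed queries. It is better to write the data through a program.

6.3. Testing tokenization

With the data imported, tokenization can be tested against it.

First, use the _analyze API to see how the four characters '全国高等' are tokenized.

The mapping above uses ik_smart (coarse-grained splitting) as the search-time analyzer, so the test uses ik_smart as well.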

GET /bookdatas/_analyze/
{
  "analyzer": "ik_smart",
  "text": "全国高等"
}

Tokenization result:

The query text is tokenized into '全国' and '高等'.

{
  "tokens" : [
    {
      "token" : "全国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "高等",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}

Search with a match query:

GET /bookdatas/bookType/_search
{
  "size": 100,
  "query": {
    "match": {
      "name": "全国高等"
    }
  }
}

Search results

The result set is large, so only a few hits are shown. Look at the values of the name field.

{
	"took": 14,
	"timed_out": false,
	"_shards": {
		"total": 5,
		"successful": 5,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": 347,
		"max_score": 4.114977,
		"hits": [{
			"_index": "bookdatas",
			"_type": "bookType",
			"_id": "524",
			"_score": 4.114977,
			"_source": {
				"author": "张燕瑾",
				"price": 0,
				"publish": "",
				"name": "全国高等院校本科教材全国高等院校专升本教材：中国古代小说专题",
				"id": 524,
				"type": "大学通用",
				"info": "....."
			}
		},
		{
			"_index": "bookdatas",
			"_type": "bookType",
			"_id": "765",
			"_score": 3.9878383,
			"_source": {
				"author": "李晓燕",
				"price": 33,
				"publish": "高等教育出版社",
				"name": "全国高等院校本科教材全国高等院校专升本教材：教育法学",
				"id": 765,
				"type": "大学教材",
				"info": "....."
			}
		},
		
		// ..........................
		
		{
			"_index": "bookdatas",
			"_type": "bookType",
			"_id": "353",
			"_score": 0.723893,
			"_source": {
				"author": "",
				"price": 29,
				"publish": "高等教育出版社",
				"name": "面向21世纪课程教材 高等学校经济管理类基础课程教材：西方经济学（微观经济学）",
				"id": 353,
				"type": "大学教材",
				"info": "....."
			}
		},
		{
			"_index": "bookdatas",
			"_type": "bookType",
			"_id": "473",
			"_score": 0.6573355,
			"_source": {
				"author": "董惠良，李相波",
				"price": 11,
				"publish": "高等教育出版社",
				"name": "21世纪经济与管理类通用教材 普通高等教育十一五国家级规划教材配套用书：会计学学习指导习题与实训",
				"id": 473,
				"type": "会计、审计",
				"info": ""
			}
		}]
	}
}

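Looking at the same query in the Kibana Search Profiler (only 5 results shown):

The search text '全国高等' is tokenized into '全国' and '高等', and the two terms are combined into a BooleanQuery.

7. match query DSL

A match query analyzes the query text. By default, a document is returned if any one of the resulting terms matches.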

GET /bookdatas/bookType/_search
{
  "size": 100, 
  "query": {
    "match": {
      "name": "全国高等"
    }
  }
}

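7.1. Boolean operators

Setting operator to 'or' or 'and' controls how the terms produced by analysis are combined in the query.

7.1.1. and

Tokens: 全国, 高等

In the returned documents, the name field must contain both '全国' and '高等'.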

GET /bookdatas/bookType/_search
{
  "size": 1000,
  "query": {
    "match": {
      "name": {
        "query": "全国高等",
        "operator": "and"
      }
    }
  },
  "_source": "name"
}

Result

The search returned 39 documents.

7.1.2. or

Tokens: 全国, 高等

In the returned documents, the name field contains '全国' or '高等'.

GET /bookdatas/bookType/_search
{
  "size": 1000,
  "query": {
    "match": {
      "name": {
        "query": "全国高等",
        "operator": "or"
      }
    }
  },
  "_source": "name"
}

Result

The search returned 347 documents.
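
The difference between the two operators can be pictured as set operations on posting lists: 'or' takes the union and 'and' takes the intersection, which is why 'and' returns fewer documents. The Java sketch below assumes a toy index with made-up document IDs; the ASCII terms "national" and "higher" stand in for '全国' and '高等'. It ignores scoring, which the real match query also performs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class MatchOperatorSketch {

    // operator "or": a document matches if it contains at least one query term (union).
    static Set<Integer> matchOr(Map<String, Set<Integer>> index, List<String> terms) {
        Set<Integer> result = new TreeSet<>();
        for (String term : terms) {
            result.addAll(index.getOrDefault(term, new TreeSet<>()));
        }
        return result;
    }

    // operator "and": a document matches only if it contains every query term (intersection).
    static Set<Integer> matchAnd(Map<String, Set<Integer>> index, List<String> terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> postings = index.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        // Hypothetical posting lists: term -> IDs of the documents whose name contains it.
        Map<String, Set<Integer>> index = new HashMap<>();
        index.put("national", new TreeSet<>(Arrays.asList(1, 2)));
        index.put("higher", new TreeSet<>(Arrays.asList(2, 3)));

        List<String> query = Arrays.asList("national", "higher");
        System.out.println(matchOr(index, query));  // [1, 2, 3]
        System.out.println(matchAnd(index, query)); // [2]
    }
}
```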

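8. Searching from Java

/**
 * Full-text search using a match query.
 *
 * @param indexName index to search
 * @param typeName  mapping type to search
 * @param key       name of the field to match
 * @param value     text to search for
 * @param size      maximum number of hits to return
 * @return the search response
 */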
public SearchResponse fullTextSearch(String indexName, String typeName, String key, Object value, int size) {

    SearchRequest searchRequest = createSearchRequest(indexName);
    searchRequest.types(typeName);

    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.size(size);

    MatchQueryBuilder matchQuery = QueryBuilders.matchQuery(key, value);
    //matchQuery.analyzer("ik_smart");
    matchQuery.operator(Operator.OR);

    SearchSourceBuilder matchQueryBuilder = searchSourceBuilder.query(matchQuery);

    searchRequest.source(matchQueryBuilder);

    return fullTextSearchProcess(searchRequest);
}

/**
 * Executes a full-text search request.
 *
 * @param request the search request to execute
 * @return the search response
 */
private SearchResponse fullTextSearchProcess(SearchRequest request) {
    return elasticsearchActionImpl.execute(
            (HighClientAction<SearchResponse>) restHighLevelClient ->
                    restHighLevelClient.search(request, RequestOptions.DEFAULT)
    );
}

9. Reference

聊聊Elasticsearch中的文本分析 (A look at text analysis in Elasticsearch)