Elasticsearch: Hands-On Practice and Optimization

2025-01-22 08:19:30 2k words #elasticSearch

Elasticsearch version: 6.5.0
Elasticsearch: The Definitive Guide
Elasticsearch Get API
Elasticsearch usage recommendations

1. Book index

PUT /bookdatas
{
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1
    },
    "mappings": {
        "bookType": {
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart"
                },
                "publish": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart"
                },
                "type": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart"
                },
                "author": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart"
                },
                "info": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart"
                },
                "price": {
                    "type": "integer"
                },
                "sceneId": {
                    "type": "keyword"
                }
            }
        }
    }
}

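2. Performance optimization

The optimizations here cover what can be anticipated up front. More often, tuning requires analysis of the actual workload and business requirements.

2.1. Mapping-level optimization

2.1.1. Create indices by time range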

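For data such as logs, create indices by time range instead of storing everything in one huge index. For example, use one index per day (logs_2014-10-24).

To delete old data, simply delete the old index.

Old indices (old data) can have their segments merged to improve query efficiency.

With the alias mechanism, you can switch flexibly between multiple indices.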
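As a rough illustration of the naming scheme, the sketch below builds daily index names in the logs_2014-10-24 pattern and lists the ones that fall outside a retention window. The class name, the "logs" prefix, and the two-day retention are assumptions made for this example; actually deleting the indices is left to whichever client you use.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

class DailyIndexNames {

    // yyyy-MM-dd, matching the logs_2014-10-24 pattern.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ISO_LOCAL_DATE;

    /** Index name for one day, e.g. logs_2014-10-24. */
    static String indexFor(String prefix, LocalDate day) {
        return prefix + "_" + DAY.format(day);
    }

    /** Names of daily indices older than the retention window (candidates for deletion). */
    static List<String> expired(String prefix, LocalDate oldest, LocalDate today, int retentionDays) {
        List<String> names = new ArrayList<>();
        LocalDate cutoff = today.minusDays(retentionDays);
        for (LocalDate d = oldest; d.isBefore(cutoff); d = d.plusDays(1)) {
            names.add(indexFor(prefix, d));
        }
        return names;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2014, 10, 24);
        System.out.println(indexFor("logs", today));
        // prints [logs_2014-10-20, logs_2014-10-21]
        System.out.println(expired("logs", LocalDate.of(2014, 10, 20), today, 2));
    }
}
```

Because each day lives in its own index, dropping expired data is one cheap index deletion per name rather than a delete-by-query over a shared index.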

2.2. Query-level optimization

2.2.1. realtime

When using GetRequest, set realtime to false.

By default, the get API is realtime, and is not affected by the refresh rate of the index (when data will become visible for search). In case where stored fields are requested (see stored_fields parameter) and the document has been updated but is not yet refreshed, the get API will have to parse and analyze the source to extract the stored fields. In order to disable realtime GET, the realtime parameter can be set to false.

GetRequest getRequest = new GetRequest(personIndex, id.toString());
getRequest.realtime(false);

2.2.2. timeout

Set a timeout on search queries.

SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.query(queryBuilder);
searchSourceBuilder.timeout(TimeValue.timeValueMillis(300L));

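2.2.3. Custom _id

When indexing (writing) data, prefer the _id that Elasticsearch generates automatically over your own business ID or an auto-increment ID.

With a custom ID (business ID or auto-increment ID), Elasticsearch must first check whether the same ID already exists in the target shard. In effect, every index operation is preceded by a lookup by ID.

That said, custom IDs are not all downside.

A custom ID can be combined with routing: setting a routing value at query time (routing + id) lets the request go to a single shard instead of every shard, which speeds up retrieval.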
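To see why routing narrows a search, recall the routing formula from the Definitive Guide: shard = hash(routing) % number_of_primary_shards, where routing defaults to the document _id. The sketch below is only an illustration: String.hashCode() stands in for Elasticsearch's real hash function, so the shard numbers will not match a real cluster, and the "book-N" IDs and the "scene-42" routing value are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class RoutingSketch {

    /**
     * shard = hash(routing) % number_of_primary_shards.
     * String.hashCode() is a stand-in hash used for illustration only.
     */
    static int shardFor(String routing, int numberOfPrimaryShards) {
        return Math.floorMod(routing.hashCode(), numberOfPrimaryShards);
    }

    /** The distinct shards that hold documents indexed with the given routing values. */
    static Set<Integer> shardsTouched(List<String> routingValues, int numberOfPrimaryShards) {
        Set<Integer> shards = new TreeSet<>();
        for (String routing : routingValues) {
            shards.add(shardFor(routing, numberOfPrimaryShards));
        }
        return shards;
    }

    public static void main(String[] args) {
        int primaryShards = 5; // number_of_shards of the bookdatas index

        List<String> byId = new ArrayList<>();      // default: routing = _id
        List<String> byScene = new ArrayList<>();   // explicit routing value shared by all docs
        for (int i = 1; i <= 20; i++) {
            byId.add("book-" + i);
            byScene.add("scene-42");
        }

        System.out.println("routed by _id:     shards " + shardsTouched(byId, primaryShards));
        System.out.println("routed by sceneId: shards " + shardsTouched(byScene, primaryShards));
    }
}
```

Documents that share a routing value always resolve to the same shard, so a search that supplies that value only has to visit one shard.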

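2.2.4. Use a constant_score filter

Use a filter: a constant_score filter takes scoring out of the picture.

For exact-match queries where you only need to include or exclude documents and have no use for relevance scores, use a constant_score filter. It skips the scoring phase entirely.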

GET /my_store/products/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "price" : 19
                }
            }
        }
    }
}

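2.2.4.1. Performance difference between queries and filters

In general, a filter outperforms a scoring query and behaves more consistently.

In filter context, the query runs as a "non-scoring" or "filtering" query: it only determines whether a document matches, and the answer is yes or no.

In query context, the query becomes a "scoring" query: like a non-scoring query it determines whether the document matches, but it must also determine how well the document matches.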

2.2.4.2. DSL

GET /bookdatas/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "price": "19"
        }
      }
    }
  }
}

2.2.4.3. Java Client

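/**
 * Exact-match query: no relevance score is computed; documents are only
 * included or excluded.
 * <p>
 * constant_score
 *
 * @param fieldName name of the field to match
 * @param value     exact value to match
 * @return the search response
 */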
@Override
public SearchResponse termSearch(String fieldName, Object value) {

    ConstantScoreQueryBuilder scoreQueryBuilder = new ConstantScoreQueryBuilder(new TermQueryBuilder(fieldName, value));

    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.query(scoreQueryBuilder);

    SearchRequest searchRequest = new SearchRequest(indices);
    searchRequest.types(typeName);
    searchRequest.source(searchSourceBuilder);

    return search(searchRequest);
}

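2.2.5. Use scroll for deep pagination

When retrieving large amounts of data, the query-then-fetch process supports pagination with the from and size parameters, but only within limits (by default, from + size may not exceed index.max_result_window, which is 10,000).

Results must be sorted on each shard first, then merged and sorted again before being returned. With a large enough from value, the sorting becomes very expensive in CPU, memory, and bandwidth. Deep paging is therefore strongly discouraged.

To avoid deep paging, use a scroll query when a large amount of data must be returned.

A scroll query retrieves large numbers of documents from Elasticsearch efficiently, without the cost of deep paging. It lets us initialize a search once and then pull the results batch by batch.

Note: scroll works from a snapshot of the result set taken when the search starts, so it may not show "real-time" data.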
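The cost of deep paging is easy to put in numbers. Each shard has to produce its top from + size hits, and the coordinating node then sorts shards * (from + size) hits only to keep size of them. The sketch below just does that arithmetic for the 5-shard bookdatas index; the class and method names are illustrative.

```java
class DeepPagingCost {

    /** Hits each shard must collect and sort for one page. */
    static long hitsPerShard(long from, long size) {
        return from + size;
    }

    /** Hits the coordinating node must merge and sort for one page. */
    static long hitsAtCoordinator(int shards, long from, long size) {
        return shards * hitsPerShard(from, size);
    }

    public static void main(String[] args) {
        int shards = 5; // number_of_shards of the bookdatas index
        long size = 10;

        // First page: 5 * (0 + 10) = 50 hits merged to return 10.
        System.out.println(hitsAtCoordinator(shards, 0, size));      // prints 50

        // Page starting at hit 10,000: 5 * (10000 + 10) = 50050 hits merged to return 10.
        System.out.println(hitsAtCoordinator(shards, 10_000, size)); // prints 50050
    }
}
```

The work grows with from, not with the page size the caller actually sees, which is why scroll is the better fit for walking a large result set.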

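2.3. Indexing-level optimization

2.3.1. Send data to Elasticsearch from multiple threads

A single thread sending bulk requests cannot make full use of the cluster's indexing capacity.

For bulk requests of a fixed size, the optimal number of threads can be found by testing: increase the thread count step by step until load or CPU on the cluster machines is saturated.

Use the nodes stats API to check CPU and load on each node (see the nodes stats parameter guide).

For example, start by sending bulk requests from 2 threads and watch load and CPU; if they are not saturated, add more threads. Once the count reaches N and load and CPU are saturated, use N threads for bulk requests to get the best indexing throughput.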
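A minimal sketch of the idea, using only the JDK: documents are split into fixed-size batches and the batches are handed to a thread pool whose size is the number you tune. The bulkSender callback is a placeholder for whatever actually issues the bulk request (for example a call on your Elasticsearch client); it, the class name, and the batch size are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class ParallelBulkSketch {

    /** Splits docs into batches of at most batchSize documents. */
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(new ArrayList<>(docs.subList(i, Math.min(i + batchSize, docs.size()))));
        }
        return batches;
    }

    /** Sends every batch through bulkSender on `threads` workers; returns the number of batches sent. */
    static <T> int sendAll(List<T> docs, int batchSize, int threads, Consumer<List<T>> bulkSender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<T> batch : partition(docs, batchSize)) {
                pending.add(pool.submit(() -> bulkSender.accept(batch)));
            }
            for (Future<?> f : pending) {
                try {
                    f.get(); // wait for the batch and surface any failure from the worker
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException("bulk batch failed", e);
                }
            }
            return pending.size();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            docs.add("doc-" + i);
        }
        AtomicInteger indexed = new AtomicInteger();
        // Placeholder sender: counts documents instead of calling Elasticsearch.
        int batches = sendAll(docs, 100, 2, batch -> indexed.addAndGet(batch.size()));
        System.out.println(batches + " batches, " + indexed.get() + " docs"); // prints 10 batches, 1000 docs
    }
}
```

To tune, keep the batch size fixed and rerun with 2, 4, 8, ... threads while watching the nodes stats output, stopping at the count where load or CPU saturates.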

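2.3.2. Increase refresh_interval

By default, every shard is refreshed automatically once per second, but not every use case needs refreshes that frequent.

When indexing a large number of documents, where indexing speed matters more than near-real-time search, increase this interval to reduce how often each index is refreshed.

The setting can be changed dynamically through a REST request and takes effect immediately:

# Set the refresh interval of the specified index to 60 seconds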
PUT /bookdatas/_settings
{
  "refresh_interval": "60s"
}
  
# After the change, check it with a GET request
GET /bookdatas/_settings
  
# response:
{
  "bookdatas" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "60s",
        "number_of_shards" : "5",
        "provided_name" : "bookdatas",
        "number_of_replicas" : "1",
        ....
      }
    }
  }
}

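The REST command that targets all indices is deliberately not shown here. In production, avoid changing settings across all indices!

2.3.3. Disable refresh and replicas during the initial load

When initializing an index, for example when importing a large amount of data for the first time, set refresh_interval to -1 (refresh disabled) and number_of_replicas to 0, then restore the intended values once all the data has been imported.

# Disable refresh for the specified index (refresh_interval = -1)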
PUT /bookdatas/_settings
{
  "refresh_interval": -1
}

# After the change, check it with a GET request
GET /bookdatas/_settings

# response:
{
  "bookdatas" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "-1",
        "number_of_shards" : "5",
        "provided_name" : "bookdatas",
        "number_of_replicas" : "1",
        ....
      }
    }
  }
}

# Change the number of replica shards dynamically (use 0 during the import; 1 restores the original value)
PUT /bookdatas/_settings
{
  "number_of_replicas": 1
}

As before, the REST command that targets all indices is not shown here. In production, avoid changing settings across all indices!