Elasticsearch Performance Optimization

Posted by Zeusro on December 26, 2018

Use routing to pin data to a known shard

Reference: Elasticsearch's routing feature
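A sketch of the idea (index name, type, and routing key here are hypothetical): pass the same routing value when writing and when searching, and the request only touches the one shard that value hashes to, instead of fanning out to every shard:

```json
# All documents routed with user42 land on the same shard
PUT /orders/order/1?routing=user42
{
  "user": "user42",
  "total": 18.5
}

# The search hits only that single shard
GET /orders/_search?routing=user42
{
  "query": {
    "term": { "user": "user42" }
  }
}
```

Note that with custom routing, documents with the same _id but different routing values can land on different shards, so the routing value must be supplied consistently on every read and write.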

Nodes

If you're just experimenting, don't spin up a pile of nodes; Elasticsearch is a memory hog.

coordinating

Coordinating nodes are the entry point for requests.

node.master: false
node.data: false
node.ingest: false

master

Handles master election and decides where shards are allocated.

node.master: true
node.data: false
node.ingest: false

data

Data nodes hold the shards.

node.master: false
node.data: true
node.ingest: false

ingest

Ingest nodes run ingest pipelines (pre-processing documents before they are indexed).

node.master: false
node.data: false
node.ingest: true
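To verify which roles each node actually took on, you can query the _cat/nodes API; the node.role column abbreviates the roles (m = master-eligible, d = data, i = ingest, and a bare - means coordinating-only):

```json
GET /_cat/nodes?v&h=name,node.role,master
```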

Causes of degraded performance

  1. Your clients are simply sending too many queries too quickly in a fast burst, overwhelming the queue. You can monitor this with Node Stats over time to see if it’s bursty or smooth
  2. You’ve got some very slow queries which get “stuck” for a long time, eating up threads and causing the queue to back up. You can enable the slow log to see if there are queries that are taking an exceptionally long time, then try to tune those
  3. There may potentially be “unending” scripts written in Groovy or something. E.g. a loop that never exits, causing the thread to spin forever.
  4. Your hardware may be under-provisioned for your workload, and bottlenecking on some resource (disk, cpu, etc)
  5. A temporary hiccup from your iSCSI target, which causes all the in-flight operations to block waiting for the disks to come back. It wouldn’t take a big latency hiccup to seriously backup a busy cluster… ES generally expects disks to always be available.
  6. Heavy garbage collections could cause problems too. Check Node Stats to see if there are many/long old gen GCs running

Parameter settings

For the _all field: if your use case has no need for it, our usual advice is to disable it, or enable it only for selected fields.

  • shard

On small-spec nodes, keep each shard under 30 GB; on higher-spec nodes, under 50 GB.

For log-analysis scenarios or very large indices, keep each shard under 100 GB.

The shard count (including replicas) should match the node count as closely as possible: equal to it, or an integer multiple of it.

We generally recommend no more than 5 shards of the same index on a single node.
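As a worked example (index name hypothetical): on a 3-node cluster, 3 primaries with 1 replica each gives 6 shards in total, exactly 2 per node, which satisfies the "equal to or an integer multiple of the node count" rule above:

```json
PUT /goods-v2
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```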

Query optimization

  • Fetch only the fields you need

Just like in a relational database: don't select *.

GET /product/goods/109524071?filter_path=_source.zdid
{
  "_source" : {
    "zdid" : 48
  }
}

_source filtering works similarly, but unlike filter_path, the response still carries the document's default metadata fields:

GET /product/goods/109524071?_source_include=zdid
{
  "_index" : "product",
  "_type" : "goods",
  "_id" : "109524071",
  "_version" : 4,
  "found" : true,
  "_source" : {
    "zdid" : 48
  }
}
_source=false
_source_include=zdid
_source_exclude

Note: _source and filter_path cannot be used together.

  • Disable dynamic mapping when creating a new index
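A minimal sketch (index and type names hypothetical; on 6.x the dynamic setting nests under the type name, while 7.x+ puts it directly under mappings). "strict" rejects documents containing unmapped fields, whereas "false" indexes the document but silently ignores the unknown fields:

```json
PUT /goods-v2
{
  "mappings": {
    "goods": {
      "dynamic": "strict"
    }
  }
}
```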

Index aliases

System configuration tuning

thread_pool:
    bulk:
        queue_size: 2000
    search:
        queue_size: 2000
indices:
  query:
    bool:
      max_clause_count: 50000
  recovery:
    max_bytes_per_sec:

queue_size limits how many requests can queue up (default 1000); the exact setting names vary slightly between versions. Thread-pool settings can also be passed directly as startup parameters (mounting a config file is its own kind of hassle, after all).
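For example, node settings can be passed with the -E flag at startup instead of editing elasticsearch.yml (values here are illustrative; on newer versions the bulk pool is named write):

```shell
./bin/elasticsearch \
  -Ethread_pool.search.queue_size=2000 \
  -Ethread_pool.bulk.queue_size=2000
```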

References:

  1. Configuring the cluster
  2. Updating cluster settings
  3. Thread pool settings

Other lessons

In practice, Elasticsearch workloads mostly index rarely and search often, so optimizing for search is usually where the effort pays off.

Best practices for logs

If losing logs is acceptable, store them on a single node with 0 replica shards.

Name log indices xx-<date>, so that cleanup is simply deleting whole indices.
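For example (index pattern and retention period hypothetical), with daily indices named tracing-<date>, expiring an entire month is one cheap metadata operation instead of a document-by-document delete:

```json
DELETE /tracing-2018.09.*
```

Wildcard deletes work as long as action.destructive_requires_name has not been set to true.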

delete by query, on the other hand, makes me want to die every single time…

POST /tracing/_delete_by_query?conflicts=proceed
{
	"query": {
		"range": {
			"@timestamp": {
				"lt": "now-90d",
				"format": "epoch_millis"
			}
		}
	}
}

GET /_tasks?actions=*delete*

Troubleshooting

Unassigned Shards

Solution: create a new index with number_of_replicas set to 0, then migrate the data with _reindex. After the migration, set number_of_replicas back. _reindex takes a size (batch size) parameter; tuning it as needed may speed things up.

Note: you can check on the task with GET _tasks?actions=indices:data/write/reindex&detailed

References

  1. A 10x performance boost for Elasticsearch reindex
  2. Fixing unassigned shards in an Elasticsearch cluster that reroute can't recover
  3. tasks API

reindex

There are tricks to reindex, too.

# Disable replicas
PUT geonames/_settings
{
  "settings": {
    "index": {
      "number_of_replicas": "0"
    }
  }
}
# Disable refresh; while it is disabled, _count results won't update
json='{"index":{"refresh_interval":"-1"}}'
curl -XPUT 0.0.0.0:9200/geonames/_settings -H 'Content-Type: application/json' -d "$json"

# You can also cancel midway
curl -XPOST 0.0.0.0:9200/_tasks/mHCg6HqYTqqd12nIDFDk1w:2977/_cancel

# Restore the refresh setting
json='{"index":{"refresh_interval":null}}'
curl -XPUT 0.0.0.0:9200/geonames/_settings -H 'Content-Type: application/json' -d "$json"

gc overhead

[2019-01-04T08:41:09,538][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-onekey-3] [gc][159] overhead, spent [276ms] collecting in the last [1s]

Root cause: the cluster is overloaded to the point of falling over.

  • Index stuck yellow for a long time

Solution: set number_of_replicas to 0, then set it back, to manually trigger replica recovery.

PUT geonames/_settings
{
  "settings": {
    "index": {
      "number_of_replicas": "0"
    }
  }
}
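Once the index has recovered, set the replica count back (assuming here that the original value was 1):

```json
PUT geonames/_settings
{
  "settings": {
    "index": {
      "number_of_replicas": "1"
    }
  }
}
```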
  • Rolling restarts

_rolling_restarts

  • Slow-log analysis

Slow logs come in two kinds, search and indexing, and the thresholds can be set per index or at the cluster level.

PUT _settings
{
        "index.indexing.slowlog.threshold.index.debug" : "10ms",
        "index.indexing.slowlog.threshold.index.info" : "50ms",
        "index.indexing.slowlog.threshold.index.warn" : "100ms",
        "index.search.slowlog.threshold.fetch.debug" : "100ms",
        "index.search.slowlog.threshold.fetch.info" : "200ms",
        "index.search.slowlog.threshold.fetch.warn" : "500ms",
        "index.search.slowlog.threshold.query.debug" : "100ms",
        "index.search.slowlog.threshold.query.info" : "200ms",
        "index.search.slowlog.threshold.query.warn" : "1s"
}

References:

  1. Notes on collecting ES slow queries
  2. Manually moving shards with reroute

No alive nodes found in your cluster

This one needs case-by-case analysis; check the ES logs. One possibility is the 1000-connection concurrency limit.

Tools

elasticHQ

References:

  1. How to build an enterprise-grade search solution with Elasticsearch?
  2. Didi's multi-cluster Elasticsearch architecture in practice
  3. From platform to middle platform: Elasticsearch at Ant Financial