ElasticStack 官方文档阅读(数据查询搜索篇)

ES 索引管理文档源

  1. 2022-02-23 - Elasticsearch Guide [8.0] » Index modules (未实践)

2022-03-24

section_11 - 文档的简单的CRUD操作的端点使用

// section_11 
//# 文档的简单CRUD操作端点 : index / create / read / update / delete , 五项操作
//## index : PUT my_index/_doc/1 (不存在则创建文档,已存在则删除旧文档、写入新文档,并且版本号递增)
//## create : PUT my_index/_create/1 (指定文档ID,若已存在则版本冲突、创建失败) OR POST my_index/_doc (自动生成文档ID)
//## read : GET my_index/_doc/1
//## update : POST my_index/_update/1
//## delete : DELETE my_index/_doc/1

//-> index 方式创建索引,并指定文档ID创建文档
PUT test-user-01/_doc/1
{
"name": "daqiang2"
}

//-> 获取文档,看到版本号递增
GET test-user-01/_doc/1

//-> create 方式(可以端点也可以参数指定)创建文档,并指定文档ID, 如果重复调用,则版本号冲突而失败
PUT test-user-02/_doc/1?op_type=create
{
"name": "daqiang",
"age": 22
}
PUT test-user-02/_create/1
{
"name": "daqiang"
}

GET /test-user-02/_doc/1

// 端点直接使用, 可以指定id,如果不指定,则自动生成ID
POST test-user-03/_doc/1
{
"name": "daqiang"
}

GET /test-user-03/_search

//-> update 更新文档, 指定单个存在的文档,更新只能对字段做增量修改,如果数据没有变化,则版本号不会变化
POST /test-user-02/_update/1
{
"doc": {
"name": "miaoa",
"comment": "I want to good at English."
}
}

GET /test-user-02/_doc/1

//-> delete 删除单个文档 : DELETE my_index/_doc/1 ; 下面直接删除测试索引做清理
DELETE /test-user-01
DELETE /test-user-02
DELETE /test-user-03
DELETE /test-user-04

section_11 - 文档的Bulk操作

//# bulk api : 集中批量调用各种操作,并且可以针对不同的索引进行操作;index/create/update 操作占两行(第一行指定索引及文档元数据,第二行为文档内容),delete 操作只占一行;某条操作出错不会中断后续操作,响应中会逐条返回每个操作对应的结果
POST _bulk
{"index":{"_index":"test-01","_id":1}}
{"f1":"v1"}
{"delete":{"_index":"test-01","_id":2}}
{"create":{"_index":"test-02","_id":"3"}}
{"f1":"v2"}
{"update":{"_id":"1","_index":"test"}}
{"doc":{"f1":"v2"}}
{"udpate": {"_id":"1", "_index": "test-01"}}
{"doc": {"f1": "ff"}}

section_11 - 批量读取

// - mget : 批量读取
GET /_mget
{
"docs": [
{
"_index": "test",
"_id": "1"
},
{
"_index": "test-01",
"_id": "1"
}
]
}

GET /test-01/_mget
{
"ids": [
"1",
"1"
]
}

// - msearch : 批量搜索
POST /kibana_sample_data_ecommerce/_msearch
{}
{"query":{"match_all":{}},"size":1}
{"index":"kibana_sample_data_flights"}
{"query":{"match_all":{}},"size":1}

section_13 - 分词器处理

  1. 分词器包括:

    • Standard Analyzer(默认内置),按词切分,小写处理
    • Simple Analyzer 按照非字母切分(符号被过滤掉),小写处理
    • Stop Analyzer 小写处理,停用词过滤
    • Whitespace Analyzer 按照空格切分,不转小写
    • Keyword Analyzer 不做分词,直接将输入作为输出
    • Pattern Analyzer 正则表达式切分,默认以 \W+ (非字母数字字符)分隔,小写处理
    • Language - 提供30多种常见语言的分词器
      • english : 只是保留词根
      • 中文 : 需要安装分词器插件,下载后放入 plugins 目录,或执行 elasticsearch-plugin install xxx 安装;使用 elasticsearch-plugin list 查看插件列表
    • Custom Analyzer 自定义分词器 : 以文本 Mastering Elasticsearch &amp; Elasticsearch in Action 为例
      • 定义 Character Filters (预处理原始文本,如去掉 &amp; 等杂项内容)
      • 定义 Tokenizer (根据规则切分单词)
      • 定义 Token Filter (对切分出的单词做过滤、转换,如去掉停用词 in)
  2. 分词使用时机 : 索引过程 和 搜索过程均可使用

  3. 目的: 把全文本转换为一系列单词的过程(term/token)

  4. 可以使用 /_analyze api , 如

    GET /_analyze
    {
    "analyzer": "standard",
    "text": "Mastering Elasticsearch , Elasticsearch in Action"
    }

    // 指定索引中的字段进行分词测试,索引已经设置分词器
    GET /test-index/_analyze
    {
    "field": "title",
    "text": "Mastering Elasticsearch , Elasticsearch in Action"
    }

    // 自定义分词器
    GET /_analyze
    {
    "tokenizer": "standard",
    "filter": [],
    "text": "Mastering Elasticsearch"
    }

    // 测试 1: standard
    GET /_analyze
    {
    "analyzer": "standard",
    "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }


    GET /_analyze
    {
    "analyzer": "simple",
    "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }


    GET /_analyze
    {
    "analyzer": "whitespace",
    "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }


    GET /_analyze
    {
    "analyzer": "keyword",
    "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

    GET /_analyze
    {
    "analyzer": "pattern",
    "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

    GET /_analyze
    {
    "analyzer": "stop",
    "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

    // english 分词器 : 保留词根
    GET /_analyze
    {
    "analyzer": "english",
    "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }


    // ###### 中文 分词 #######
    GET /_analyze
    {
    "analyzer": "icu_analyzer",
    "text": "他说的确实在理"
    }

    GET /_analyze
    {
    "analyzer": "standard",
    "text": "他说的确实在理"
    }

    // ik 支持自定义词库,支持热更新分词词典?
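    // ik 分词器测试示意(需要先安装 analysis-ik 插件,且版本与 ES 匹配): ik_max_word 与 ik_smart 的切分粒度不同
    POST /_analyze
    {
    "analyzer": "ik_max_word",
    "text": "他说的确实在理"
    }

    POST /_analyze
    {
    "analyzer": "ik_smart",
    "text": "他说的确实在理"
    }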

2022-03-25

  1. 数据导入问题

    • Kibana 的样例数据, 直接Web操作即可

    • GroupLens 数据集导入(先安装 Logstash),并配置以下数据转换过程;要注意的是需要认证(先到 Kibana 中建立角色和用户并授予写入权限),并且考虑移除多余的字段

      // logstash -f logstash.conf
      input {
      file {
      path => "/home/es/ml-latest-small/movies.csv"
      start_position => "beginning"
      sincedb_path => "/dev/null"
      }
      }

      filter {
      csv {
      separator => ","
      columns => ["id","content","genre"]
      }

      mutate {
      split => { "genre" => "|" }
      remove_field => ["path", "host","@timestamp","message"]
      }

      mutate {
      split => ["content", "("]
      add_field => { "title" => "%{[content][0]}"}
      add_field => { "year" => "%{[content][1]}"}
      }

      mutate {
      convert => {
      "year" => "integer"
      }
      strip => ["title"]
      remove_field => ["path", "host","@timestamp","message","content","log","@version","event"]
      }
      }
      output {
      elasticsearch {
      cacert => '/home/es/certs/http_ca.crt'
      hosts => ["https://172.16.10.131:9200"]
      index => "movies"
      document_id => "%{id}"
      user => logstash_internal
      password => "dev123"
      }
      stdout {}
      }
    • 也可以通过kibana,web页面直接上传各种格式的文件,进行数据导入

  2. Precision (查准率) 和 Recall (查全率/召回率) 作为搜索质量检测重要标准,是怎么计算的?

    • precision : 已经返回的搜索结果中,相关的文档所占比例
    • recall : 已返回的搜索结果中相关的文档,在所有的实际相关文档中的所占比例
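    • 举例(按上面的定义做简单推算): 假设一次查询返回 10 条结果,其中 7 条相关,而索引中实际相关的文档共 20 条,则 precision = 7/10 = 0.7,recall = 7/20 = 0.35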
  3. URI Search

    • URI Query String Syntax :

      // GET /movies/_search?q=2012&df=title&sort=year:desc&from=0&size=10&timeout=1s
      {
      "profile": true
      }
      // 指定默认字段(df)上进行查询(含有2012的),按年倒序分页,并设置超时时间
      GET /movies/_search?q=2012&df=title&sort=year:desc&from=0&size=20&timeout=1s
      {
      "profile": true
      }
      // 泛查询,如果不指定字段,则针对_all,在所有字段上进行搜索
      GET /movies/_search?q=2012
      {
      "profile": true
      }
      GET /movies/_search?q=title:2012
      {
      "profile": true
      }
      // 使用引号,Phrase查询
      GET /movies/_search?q=title:"Beautiful Mind"
      {
      "profile": true
      }

      // 如果这样,则Mind的部分,成为泛查询
      GET /movies/_search?q=title:Beautiful Mind
      {
      "profile": true
      }

      // 如果要做Term查询,则需要括号包裹
      GET /movies/_search?q=title:(Beautiful Mind)
      {
      "profile": true
      }

      GET /movies/_search?q=title:(Beautiful NOT Mind)
      {
      "profile": true
      }

      // AND 查询,两个单词都要包括
      GET /movies/_search?q=title:(Beautiful %2BMind)
      {
      "profile": true
      }

      // 范围查询格式
      GET /movies/_search?q=year:>=1980
      {
      "profile": true
      }

      // 通配符
      GET /movies/_search?q=title:b*
      {
      "profile": true
      }

      // 模糊匹配 : 容忍一定的编辑距离(~2 表示最多2处差异)
      GET /movies/_search?q=title:beautial~2
      {
      "profile": true
      }

      // 设定slop值(短语中词项之间允许的间隔)
      GET /movies/_search?q=title:"Lord Rings"~2
      {
      "profile": true
      }
    • _cat 端点使用:

      // GET /_cat/indices/kibana*?v&s=index
      // GET /_cat/indices?v&health=green
      // GET /_cat/indices?v&s=docs.count:desc
      // GET /_cat/indices?v
  4. 查询表达式 - Query DSL 使用

    • source filtering : 可以过滤返回结果中 _source 里存储的字段(示例见下方 script field 代码之后)
    • match all : 匹配全部文档
    • script field : 通过脚本动态计算出新的返回字段
    // 脚本字段
    GET /kibana_sample_data_ecommerce/_search
    {
    "script_fields": {
    "new_field": {
    "script": {
    "lang": "painless",
    "source": "doc['order_date'].value + '_hello'"
    }
    }
    },
    "query": {
    "match_all": {}
    }
    }
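    // 针对上面提到的 source filtering 与 match all 的补充示意(字段名取自 kibana_sample_data_ecommerce 样例数据): 只返回 _source 中指定的字段
    GET /kibana_sample_data_ecommerce/_search
    {
    "_source": ["order_date", "currency"],
    "query": {
    "match_all": {}
    },
    "size": 2
    }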
    • Match
    // DSL  - Match, 分词查询
    GET /movies/_search
    {
    "query": {
    "match": {
    "title": "Beautiful Mind"
    }
    }
    }

    GET /movies/_search
    {
    "query": {
    "match": {
    "title": {
    "query": "Beautiful Mind",
    "operator": "and"
    }
    }
    }
    }
  • Match Phrase

    // match phrase 查询,要求词项按顺序出现,可以通过 slop 设置词项之间允许的间隔
    GET /movies/_search
    {
    "query": {
    "match_phrase": {
    "title": "one love"
    }
    }
    }

    GET /movies/_search
    {
    "query": {
    "match_phrase": {
    "title": {
    "query": "one love",
    "slop": 1
    }
    }
    }
    }
  • Query String Query : 将URI格式变为 结构化查询格式

    POST movies/_search
    {
    "query": {
    "query_string": {
    "default_field": "title",
    "query": "Discovery"
    }
    }
    }

    POST movies/_search
    {
    "query": {
    "query_string": {
    "fields": ["title", "genre"],
    "query": "(These AND Drama) OR (Generation AND Action)"
    }
    }
    }
  • Simple Query String Query : 类似于上面的 Query String,但是会忽略错误的语法,只支持部分查询语法;不支持 AND OR NOT 关键字,会将它们当做普通字符串处理(可以用 default_operator 或 + | - 符号表达逻辑)

    POST movies/_search
    {
    "query": {
    "simple_query_string": {
    "query": "Adven ",
    "fields": ["title", "genre"],
    "default_operator": "AND"
    }
    }
    }
  1. Mapping 设置

    • 更改Mapping的规则: 新增字段可以直接修改,已有字段的类型变更则需要重建索引

    • 可以对mapping 的 dynamic 开关进行设置 : true / false / strict, 影响新字段的文档写入和索引行为

      • 如果是 true : 则可以新增字段,并对该字段进行索引和查询

      • 如果是false : 则新字段的数据可以正常写入文档,但该字段不会被索引、不能被查询,mapping中也不会有该字段的设置

        // 验证关闭动态映射后,索引新字段不会再生成映射,但是可以正常存入数据,但是不能查询该字段
        PUT /my-map-test/_doc/3
        {
        "firstname": "Dong",
        "lastname": "FuQiang",
        "logindate": "2029-02-01T10:30:00"
        }

        PUT /my-map-test/_mapping
        {
        "dynamic": false
        }

        PUT /my-map-test/_doc/4
        {
        "newField": "new fields"
        }

        GET /my-map-test/_mapping

        GET /my-map-test/_search
        {
        "query": {
        "term": {
        "newField": {
        "value": "new fields"
        }
        }
        }
        }
      • 如果是strict : 则不能新增字段,写入包含新字段的文档会直接报错,也不能查询该字段(见下面的示例)
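        // 一个最小示意(基于上面已创建的 my-map-test 索引): 将 dynamic 设为 strict 后,再写入带新字段的文档会返回 strict_dynamic_mapping_exception
        PUT /my-map-test/_mapping
        {
        "dynamic": "strict"
        }

        // 预期失败: newField2 未在 mapping 中定义(newField2 为演示用的假设字段名)
        PUT /my-map-test/_doc/5
        {
        "newField2": "value"
        }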

    • 设定mapping

      • 设置字段不被索引 : "index": true|false

        DELETE users

        PUT users
        {
        "mappings": {
        "properties": {
        "firstname": {
        "type": "text"
        },
        "lastname": {
        "type": "text"
        },
        "mobile": {
        "type": "text",
        // 设置字段不被索引
        "index": false
        }
        }
        }
        }


        PUT users/_doc/1
        {
        "firstname": "Run",
        "lastname": "Yiming",
        "mobile": "1234567"
        }

        // 执行异常,不能对未索引的字段进行查询
        GET users/_search
        {
        "query": {
        "match": {
        "mobile": "1234567"
        }
        }
        }
      • 索引选项配置 : index_options: docs(只记录 doc id)| freqs(再记录词频)| positions(text 类型的默认值,记录 doc id、term frequency、term position)| offsets(再多记录 character offsets);记录的越多,需要的存储空间越大
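        // index_options 的一个最小示意(索引名 my-option-test 为演示用的假设名称): 为 text 字段记录到 offsets 级别
        PUT my-option-test
        {
        "mappings": {
        "properties": {
        "comment": {
        "type": "text",
        "index_options": "offsets"
        }
        }
        }
        }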

      • 自定义空值 : null_value

      • copy_to : 将多个字段的内容拷贝到一个目标字段上进行搜索;copy_to 的目标字段不会出现在 _source 中,但是需要增加额外存储

        DELETE users
        PUT users
        {
        "mappings": {
        "properties": {
        "firstname": {
        "type": "text",
        "copy_to": "fullname"
        },
        "lastname": {
        "type": "text",
        "copy_to": "fullname"
        },
        "mobile": {
        "type": "keyword",
        // 设置空值的替代值 null_value
        "null_value": "NULL"
        }
        }
        }
        }

        PUT users/_doc/2
        {
        "firstname": "Run",
        "lastname": "Yiming",
        "mobile": null
        }

        // copy_to 将字段内容合并在一起查询,但是返回结构中并没有这个字段
        GET users/_search?q=fullname:(Run Yiming)
        {
        "query": {
        "bool": {
        "should": [
        {
        "match": {
        "mobile": "NULL"
        }
        }
        ]
        }
        }
        }

        // 尝试,使用数组结构时,如果使用了 copy_to 的效果,则只是最后一个元素起作用
        DELETE /my-index
        PUT my-index
        {
        "mappings": {
        "properties": {
        "comments": {
        "type": "nested",
        "properties": {
        "author": {
        "type": "text",
        "copy_to": "full_name"
        },
        "tags": {
        "type": "nested",
        "properties": {
        "interests" : {
        "type": "text",
        "copy_to": "full_name"
        }
        }
        }
        }
        }
        }
        }
        }

        PUT my-index/_doc/1?refresh
        {
        "comments": [
        {
        "author": "kimchy",
        "tags": [
        {
        "interests": "t1"
        }
        ]
        }
        ]
        }

        PUT my-index/_doc/2?refresh
        {
        "comments": [
        {
        "author": "kimchy",
        "tags": [
        {
        "interests": "t1"
        },
        {
        "interests": "t2"
        }
        ]
        },
        {
        "author": "nik9000",
        "tags": [
        {
        "interests": "t2"
        }
        ]
        }
        ]
        }

        PUT my-index/_doc/3?refresh
        {
        "comments": [
        {
        "author": "kimchy"
        }
        ]
        }


        POST /my-index/_search
        {
        "query": {
        "term": {
        "full_name": {
        "value": "t1"
        }
        }
        }
        }

        POST /my-index/_search
        {
        "query": {
        "match": {
        "full_name": "t1 kimchy"
        }
        }
        }
        POST my-index/_search
        {
        "query": {
        "nested": {
        "path": "comments",
        "query": {
        "bool": {
        "must_not": [
        {
        "term": {
        "comments.author": "nik9000"
        }
        }
        ]
        }
        }
        }
        }
        }

      • 数组类型 : 没有专门的数组类型,但是任何字段都可以包含多个相同的数据类型的值
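        // 数组写入的一个最小示意(索引名 array-demo 为演示用的假设名称): 同一字段直接写入多个同类型的值即可
        PUT array-demo/_doc/1
        {
        "tags": ["search", "elasticsearch"]
        }

        GET array-demo/_search
        {
        "query": {
        "match": {
        "tags": "search"
        }
        }
        }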

2022-03-26

  1. 多字段特性

    • 对某个字段设置多个子字段(fields),分别使用不同的类型或分词器,如下例中 company 增加 keyword 子字段、comment 增加 english 分词的子字段
    PUT products
    {
    "mappings": {
    "properties": {
    "company": {
    "type": "text",
    "fields": {
    "keyword": {
    "type": "keyword"
    }
    }
    },
    "comment": {
    "type": "text",
    "fields": {
    "english_comment": {
    "type": "text",
    "analyzer": "english",
    "search_analyzer": "english"
    }
    }
    }
    }
    }
    }
  2. 自定义分词器

    • Character Filter : 对原始文本做预处理,多个可以串联使用
    • Tokenizer : 分词器
    • Token Filters: 分词过滤器
    • 各个分词器使用效果验证:
    // 自定义 html 分析器
    POST _analyze
    {
    "tokenizer": "keyword",
    "char_filter": ["html_strip"],
    "text": "<b>hello word</b>"
    }

    // 完成目标字符的替换
    POST _analyze
    {
    "tokenizer": "standard", // 以字母切分文本
    "char_filter": [
    {
    "type": "mapping", // 映射器转换
    "mappings": [
    "- => _"
    ]
    }
    ],
    "text": "123-456, I-test! test-9990 650-555-1234"
    }


    POST _analyze
    {
    "tokenizer": "standard",
    "char_filter": [
    {
    "type": "mapping",
    "mappings": [
    ":) => happy",
    ":( => sad"
    ]
    }
    ],
    "text": ["I am felling :)", "Felling:( today!"]
    }

    // 正则表达式替换
    POST _analyze
    {
    "tokenizer": "standard",
    "char_filter": [
    {
    "type": "pattern_replace",
    "pattern": "http://(.*)",
    "replacement": "$0"
    }
    ],
    "text": "http://www.elastic.co"
    }

    // 路径分词器
    POST _analyze
    {
    "tokenizer": "path_hierarchy",
    "text": "/user/ymruan/a/b/c/d/e"
    }

    // whitespace 切分 + stop 停用词过滤;注意大写的 The 不会被当作停用词,需要先加 lowercase 转小写(对比下一个例子)
    POST _analyze
    {
    "tokenizer": "whitespace",
    "filter": ["stop"],
    "text": ["The rain in Spain falls mainly on the plain."]
    }

    POST _analyze
    {
    "tokenizer": "whitespace",
    "filter": [
    "lowercase",
    "stop"
    ],
    "text": [
    "The rain in Spain falls mainly on the plain."
    ]
    }

    // 在索引中自定义一个分词器
    PUT my_index
    {
    "settings": {
    "analysis": {
    "analyzer": {
    "my_custmer_analyzer": { // 声明自定义分析器,引用自定义的三大组件
    "type": "custom",
    "char_filter": [
    "emoticons"
    ],
    "tokenizer": "punctuation",
    "filter": [
    "lowercase",
    "english_stop"
    ]
    }
    },
    "tokenizer": {
    "punctuation": { // 自定义一个正则分词器,识别标点符号
    "type": "pattern",
    "pattern": "[ .,!?]"
    }
    },
    "char_filter": {
    "emoticons": {
    "type": "mapping",
    "mappings": [ // 自定义一个表情过滤转换器
    ":) => _happy_",
    ":( => _sad_"
    ]
    }
    },
    "filter": {
    "english_stop": {
    "type": "stop",
    "stopwords": "_english_" // 增加一个自定义停用词
    }
    }
    }
    }
    }

    // 测试索引的分词效果
    POST my_index/_analyze
    {
    "analyzer": "my_custmer_analyzer",
    "text": ["I'm a :) person, and you?"]
    }
  3. Index Template : 在索引自动创建时套用预设配置,保证新索引的设置一致、完善;模板可以创建多个,ES 会按优先级自动合并设置

    // 优先级 创建索引请求 > 高order index template > 低order值 index template > 默认的setting
    PUT _template/temp_default
    {
    "index_patterns": ["*"],
    "order": 0,
    "version": 1,
    "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
    }
    }

    PUT _template/temp_test
    {
    "index_patterns": ["test*"],
    "order": 1,
    "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 2
    },
    "mappings": {
    "date_detection": false, // 关闭日期自动检测
    "numeric_detection": true
    }
    }

    // 查看 template 信息
    GET /_template/temp*


    PUT ttemp/_doc/1
    {
    "somenumber": "1",
    "somedate": "2019/01/01"
    }

    GET ttemp/_mapping


    PUT /testtemplate/_doc/1
    {
    "somenumber": "1",
    "somedate": "2019/01/01"
    }

    // 发现命中索引模板的文档,字段已经被设置的template影响
    GET /testtemplate/_mapping

  4. Dynamic Template : Index Template 用来控制索引的创建设置,而动态模板,用来根据字段的名称,控制字段类型的设置(如:is开头的都是bool类型,long开头的都是值类型)

    • 需要在mapping中进行设置

      PUT my_index/_doc/1
      {
      "fistName": "Ruan",
      "isVIP": "true"
      }

      GET my_index/_mapping
      DELETE my_index
      // 使用动态模板,识别特定格式的字段
      PUT my_index
      {
      "mappings": {
      "dynamic_templates": [
      {
      "string_as_boolean": {
      "match_mapping_type": "string",
      "match": "is*",
      "mapping": {
      "type": "boolean"
      }
      }
      },
      {
      "string_as_keywords": {
      "match_mapping_type": "string",
      "mapping": {
      "type": "keyword"
      }
      }
      }
      ]
      }
      }

      PUT my_index/_doc/1
      {
      "fistName": "Ruan",
      "isVIP": "true"
      }

      // 验证动态模板: 匹配到目标类型,使用匹配规则,匹配到字段(fistName,isVIP),然后转化设置的映射类型进行字段的设置
      GET my_index/_mapping
    • 更综合性的动态模板的设置,可以对字段进行复杂的过滤和匹配,然后设置

    // 对字段进行规则匹配,成功后则进行 copy_to 的设置,将多个字段合并到一个字段进行搜索
    PUT my_index
    {
    "mappings": {
    "dynamic_templates": [
    {
    "full_name": {
    "path_match": "name.*",
    "path_unmatch": "*.middle",
    "mapping": {
    "type": "text",
    "copy_to": "full_name"
    }
    }
    }
    ]
    }
    }

    PUT my_index/_doc/1
    {
    "name": {
    "first": "John",
    "middle": "Winston",
    "last": "Lennon"
    }
    }

    GET /my_index/_mapping

    GET /my_index/_search?q=full_name:(John Le)

2022-03-27

  1. 聚合查询 : Bucket 、Metric、Pipeline、Matrix

  2. Bucket & Metric 简单使用

    GET kibana_sample_data_flights/_search
    {
    "size": 0,
    "aggs": {
    "flight_test": {
    "terms": {
    "field": "DestCountry",
    "size": 3
    }
    }
    }
    }

    GET kibana_sample_data_flights/_search
    {
    "size": 0,
    "aggs": {
    "flight_test": {
    "terms": {
    "field": "DestCountry"
    },
    "aggs": {
    "avg_price": {
    "avg": {
    "field": "AvgTicketPrice"
    }
    },
    "max_price": {
    "max": {
    "field": "AvgTicketPrice"
    }
    },
    "min_price": {
    "min": {
    "field": "AvgTicketPrice"
    }
    },
    "weather": {
    "terms": {
    "field": "DestWeather",
    "size": 5
    }
    }
    }
    }
    }
    }

2022-03-28 section_45 - 47

  1. s45 - 桶聚合和指标聚合

    • 桶聚合 : terms、range、histogram 等,按规则把文档划分到不同的桶
    • 指标聚合 : 单值分析(min, max, avg, sum, cardinality)、多值分析(stats, extended stats, percentile, percentile rank, top hits;top hits 的用法见本节代码末尾的补充示例)
    • 实例
    DELETE /employees

    PUT /employees
    {
    "mappings": {
    "properties": {
    "age": {
    "type": "integer"
    },
    "gender": {
    "type": "keyword"
    },
    "job": {
    "type": "text",
    "fields": {
    "keyword": {
    "type": "keyword",
    "ignore_above": 50
    }
    }
    },
    "name": {
    "type": "keyword"
    },
    "salary": {
    "type": "integer"
    }
    }
    }
    }

    GET /employees/_settings

    PUT /employees/_bulk
    { "index" : { "_id" : "1" } }
    { "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
    { "index" : { "_id" : "2" } }
    { "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
    { "index" : { "_id" : "3" } }
    { "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
    { "index" : { "_id" : "4" } }
    { "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
    { "index" : { "_id" : "5" } }
    { "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
    { "index" : { "_id" : "6" } }
    { "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
    { "index" : { "_id" : "7" } }
    { "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
    { "index" : { "_id" : "8" } }
    { "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
    { "index" : { "_id" : "9" } }
    { "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
    { "index" : { "_id" : "10" } }
    { "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
    { "index" : { "_id" : "11" } }
    { "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
    { "index" : { "_id" : "12" } }
    { "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
    { "index" : { "_id" : "13" } }
    { "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
    { "index" : { "_id" : "14" } }
    { "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
    { "index" : { "_id" : "15" } }
    { "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
    { "index" : { "_id" : "16" } }
    { "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
    { "index" : { "_id" : "17" } }
    { "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
    { "index" : { "_id" : "18" } }
    { "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
    { "index" : { "_id" : "19" } }
    { "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
    { "index" : { "_id" : "20" } }
    { "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}
    { "index": {"_id": "21"}}
    {"name": "Daqiang", "age": null, "job": "CTO", "gender": null, "salary": 1000}



    # // 找出最低工资 、 最高工资、平均工资
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "min_salary": {
    "min": {
    "field": "salary"
    }
    },
    "max_salary": {
    "max": {
    "field": "salary"
    }
    },
    "avg_salary": {
    "avg": {
    "field": "salary"
    }
    }
    }
    }

    # // 一次聚合,多值输出
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "stats_salary": {
    "stats": {
    "field": "salary"
    }
    },
    "ext_stats_salary": {
    "extended_stats": {
    "field": "salary"
    }
    }
    }
    }

    # term aggregation : text 字段需要打开 fielddata 才能直接做聚合;或者使用 keyword(子)字段,其默认支持 doc_values

    # // 对keyword 聚合查询
    POST employees/_search
    {
    "size": 0,
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword", // 对text字段直接聚合,是不可以的(除非修改mapping如下)
    "size": 100
    }
    }
    }
    }

    # 修改text字段的terms分词设置,打开fielddata
    PUT /employees/_mapping
    {
    "properties": {
    "job": {
    "type": "text",
    "fielddata": true
    }
    }
    }

    POST employees/_search
    {
    "size": 0,
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job", // 改成了对分词进行桶聚合
    "size": 100
    }
    }
    }
    }

    # 比较 job.keyword 和 job 的term 基数聚合结果,分桶的总数不同
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "distinct_count_job": {
    "cardinality": {
    "field": "job"
    }
    },
    "distinct_count_job_keyword": {
    "cardinality": {
    "field": "job.keyword"
    }
    }
    }
    }

    # 性别的keyword 聚合
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "gender_agg": {
    "terms": {
    "field": "gender",
    "size": 10
    }
    }
    }
    }

    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "age_agg": {
    "terms": {
    "field": "age",
    "size": 10
    }
    }
    }
    }

    // terms 聚合的预热设置: eager_global_ordinals 让字段的全局序数在 refresh 时预先构建好(此处仅作示意,索引已存在时需先删除再重建)
    PUT /employees
    {
    "mappings": {
    "properties": {
    "age": {
    "type": "keyword",
    "eager_global_ordinals": true
    }
    }
    }
    }


    // range 和 histogram 聚合
    POST employees/_search
    {
    "size": 0,
    "aggs": {
    "salary_range": {
    "range": {
    "field": "salary",
    "ranges": [
    {
    "to": 10000
    },
    {
    "from": 10000,
    "to": 20000
    },
    {
    "key": ">20000",
    "from": 20000
    }
    ]
    }
    }
    }
    }

    // 直方图 ,间隔聚合
    POST employees/_search
    {
    "size": 0,
    "aggs": {
    "salary_histogram": {
    "histogram": {
    "field": "salary",
    "interval": 5000,
    "extended_bounds": {
    "min": 0,
    "max": 100000
    }
    }
    }
    }
    }


    // 嵌套聚合
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "job_salary_stats": {
    "terms": {
    "field": "job.keyword",
    "size": 10
    },
    "aggs": {
    "salary_stats": {
    "stats": {
    "field": "salary"
    }
    }
    }
    }
    }
    }

    // 多级嵌套聚合
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "job_sex_salary": {
    "terms": {
    "field": "job.keyword",
    "size": 10
    },
    "aggs": {
    "sex_salary": {
    "terms": {
    "field": "gender",
    "size": 10
    },
    "aggs": {
    "salary_stats": {
    "stats": {
    "field": "salary"
    }
    }
    }
    }
    }
    }
    }
    }
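    // 上面提到的 top hits 多值分析的一个补充示意: 按工种分桶后,取每个桶内工资最高的前 2 名员工
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword",
    "size": 10
    },
    "aggs": {
    "top_salary_hits": {
    "top_hits": {
    "size": 2,
    "sort": [
    {
    "salary": "desc"
    }
    ]
    }
    }
    }
    }
    }
    }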
  2. s46 - pipeline 聚合分析

    • 通过对其他聚合分析的结果再做一次分析,找到目标聚合结果
    • 分为两类 :
      • Sibling (结果和现有分析结果同级): Max 、Min 、 Avg 、Sum、Stats、Percentiles
      • Parent (结果内嵌到现有的聚合分析结果中): Derivative (求导)、Cumulative Sum(累计求和,补充示例见本节代码末尾)、Moving Function(滑动窗口)
    • 实例
    GET employees/_doc/21
    // 找到平均工资最低和最高的工种
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "job_class": {
    "terms": {
    "field": "job.keyword",
    "size": 2
    },
    "aggs": {
    "salary_avg": {
    "avg": {
    "field": "salary"
    }
    }
    }
    },
    "min_salary_by_job": {
    "min_bucket": {
    "buckets_path": "job_class>salary_avg"
    }
    },
    "max_salary_by_job": {
    "max_bucket": {
    "buckets_path": "job_class>salary_avg"
    }
    }
    }
    }

    // 平均工资统计分析
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword",
    "size": 10
    },
    "aggs": {
    "avg_salary": {
    "avg": {
    "field": "salary"
    }
    }
    }
    },
    "stats_salary_by_job": {
    "stats_bucket": {
    "buckets_path": "jobs>avg_salary"
    }
    },
    "percentiles_salary_by_job": {
    "percentiles_bucket": {
    "buckets_path": "jobs>avg_salary"
    }
    }
    }
    }

    // 按照年龄对平均工资求导
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "age_histogram": {
    "histogram": {
    "field": "age",
    "min_doc_count": 0,
    "interval": 1
    },
    "aggs": {
    "salary_avg": {
    "avg": {
    "field": "salary"
    }
    },
    "salary_avg_derivative": {
    "derivative": {
    "buckets_path": "salary_avg"
    }
    }
    }
    }
    }
    }
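    // 上面提到的 Cumulative Sum(累计求和)的一个补充示意: 在年龄直方图上对平均工资做累计求和
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "age_histogram": {
    "histogram": {
    "field": "age",
    "interval": 5
    },
    "aggs": {
    "salary_avg": {
    "avg": {
    "field": "salary"
    }
    },
    "salary_cumulative_sum": {
    "cumulative_sum": {
    "buckets_path": "salary_avg"
    }
    }
    }
    }
    }
    }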
  3. s47 - 作用范围和排序

    • 聚合的默认作用范围是 query 的查询范围
    • 但是,ES 还支持对聚合的作用范围做进一步调整 : Filter 、 Post Filter 、 Global(分别见下面实例)
    • 实例:
    // query
    POST employees/_search
    {
    "size": 0,
    "query": {
    "range": {
    "age": {
    "gte": 20
    }
    }
    },
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword"
    }
    }
    }
    }
    // filter aggregation
    POST employees/_search
    {
    "size": 0,
    "aggs": {
    "older_person": {
    "filter": {
    "range": {
    "age": {
    "gte": 30
    }
    }
    },
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword",
    "size": 10
    }
    }
    }
    },
    "all_jobs": {
    "terms": {
    "field": "job.keyword"
    }
    }
    }
    }


    // post filter : 聚合计算完成后,再对返回的命中文档(hits)做过滤,不影响聚合结果
    POST employees/_search
    {
    "size": 100,
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword"
    }
    }
    },
    "post_filter": {
    "match": {
    "job.keyword": "Dev Manager"
    }
    }
    }

    // global 聚合中忽略query查询条件
    POST /employees/_search
    {
    "size": 0,
    "query": {
    "range": {
    "age": {
    "gte": 40
    }
    }
    },
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword",
    "size": 10
    }
    },
    "all": {
    "global": {},
    "aggs": {
    "salary_avg": {
    "avg": {
    "field": "salary"
    }
    }
    }
    }
    }
    }

    // agg 排序, term 排序默认以文档数排序
    POST employees/_search
    {
    "size": 0,
    "query": {
    "range": {
    "age": {
    "gte": 20
    }
    }
    },
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword",
    "size": 10,
    "order": [
    {
    "_count": "asc"
    },
    {
    "_key": "desc"
    }
    ]
    }
    }
    }
    }

    // 以子指标聚合为排序依据
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword",
    "size": 10,
    "order": [
    {
    "avg_salary": "desc"
    }
    ]
    },
    "aggs": {
    "avg_salary": {
    "avg": {
    "field": "salary"
    }
    }
    }
    }
    }
    }


    // 以子指标聚合为排序依据
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword",
    "size": 10,
    "order": [
    {
    "stats_salary.min": "desc"
    }
    ]
    },
    "aggs": {
    "stats_salary": {
    "stats": {
    "field": "salary"
    }
    }
    }
    }
    }
    }

2022-03-29 section_24

  1. 词项和全文搜索 : Term 是表达语义的最小单位;搜索和利用统计语言模型进行自然语言处理都需要处理 Term

    • 复合查询 : Constant Score 转为 Filter
    POST /products/_bulk
    { "index": { "_id": 1 }}
    { "productID" : "XHDK-A-1293-#fJ3","desc":"iPhone" }
    { "index": { "_id": 2 }}
    { "productID" : "KDKE-B-9947-#kL5","desc":"iPad" }
    { "index": { "_id": 3 }}
    { "productID" : "JODL-X-1937-#pV7","desc":"MBP" }

    POST /products/_search
    {
    "query": {
    "term": {
    "productID.keyword": {
    "value": "XHDK-A-1293-#fJ3"
    }
    }
    }
    }

    // 使用 constant_score 避免算分,并且利用filter缓存
    POST /products/_search
    {
    "query": {
    "constant_score": {
    "filter": {
    "term": {
    "productID.keyword": {
    "value": "XHDK-A-1293-#fJ3"
    }
    }
    }
    }
    }
    }

    • term 的查询需要注意索引阶段分词的影响
    POST /products/_bulk
    { "index": { "_id": 1 }}
    { "productID" : "XHDK-A-1293-#fJ3","desc":"iPhone" }
    { "index": { "_id": 2 }}
    { "productID" : "KDKE-B-9947-#kL5","desc":"iPad" }
    { "index": { "_id": 3 }}
    { "productID" : "JODL-X-1937-#pV7","desc":"MBP" }

    // 数据被分词器做了小写转化,所以大写查不到
    POST /products/_search
    {
    "query": {
    "term": {
    "desc": {
    //"value": "iPhone"
    "value": "iphone"
    }
    }
    }
    }
    POST /products/_search
    {
    "query": {
    "term": {
    "productID": {
    //"value": "xhdk-a-1293-#fJ3",
    "value": "xhdk"
    }
    }
    }
    }
    POST /products/_search
    {
    "query": {
    "term": {
    "productID.keyword": {
    "value": "XHDK-A-1293-#fJ3"
    }
    }
    }
    }

    POST /_analyze
    {
    "analyzer": "standard",
    "text": ["XHDK-A-1293-#fJ3"]
    }
  2. Term Level Query: Term Query / Range Query / Exists Query / Prefix Query / Wildcard Query

  3. 默认情况下,Term查询不对输入条件做分词,而是作为一个整体,查找倒排索引,并且使用相关度算分公式为每个包含该词项的文档进行相关度算分

  4. 可以通过 Constant Score 将查询转换成一个 Filtering , 避免相关性算分的消耗,并且可以利用缓存

  5. 基于全文的查询 : Match Query / Match Phrase Query / Query String Query

    • 索引和搜索都会进行分词,查询词会进行分词后,分别查询, 汇总各自得分
    • Match Query
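    // Match Query 的一个补充示意(基于上面的 movies 索引): 默认各分词后的词项之间是 or 关系,可以用 operator 或 minimum_should_match 收紧匹配条件
    GET /movies/_search
    {
    "query": {
    "match": {
    "title": {
    "query": "Beautiful Mind",
    "minimum_should_match": 2
    }
    }
    }
    }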

2022-03-30/31 section_25 - section_29

  1. 结构化搜索,指的是对结构化的数据进行搜索, 主要分为布尔、时间、日期、数字、文本

    • 结构化搜索,可以做精确匹配或者部分匹配, 如: Term查询 和 Prefix 查询
    // 结构化搜索,精确
    DELETE products
    POST /products/_bulk
    { "index": { "_id": 1 }}
    { "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
    { "index": { "_id": 2 }}
    { "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
    { "index": { "_id": 3 }}
    { "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
    { "index": { "_id": 4 }}
    { "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }

    GET products/_mapping
    // 对布尔值 match 查询,含有算分
    POST /products/_search
    {
    "profile": true,
    "explain": true,
    "query": {
    "term": {
    "avaliable": "true"
    }
    }
    }

    // 无算分
    POST /products/_search
    {
    "profile": true,
    "explain": true,
    "query": {
    "constant_score": {
    "filter": {
    "term": {
    "avaliable": "true"
    }
    }
    }
    }
    }

    GET products/_search
    {
    "query": {
    "constant_score": {
    "filter": {
    "range": {
    "price": {
    "gte": 20,
    "lte": 30
    }
    }
    },
    "boost": 1.2
    }
    }
    }

    // Date Math Expressions : y 年、M 月、w 周、d 天、H/h 小时、m 分钟、s 秒
    GET /products/_search
    {
    "query": {
    "constant_score": {
    "filter": {
    "range": {
    "date": {
    "gte": "now-234y"
    }
    }
    },
    "boost": 1.2
    }
    }
    }

    // Exists
    POST /products/_search
    {
    "query": {
    "constant_score": {
    "filter": {
    "exists": {
    "field": "date"
    }
    },
    "boost": 1.2
    }
    }
    }



    POST /movies/_bulk
    { "index": { "_id": 1 }}
    { "title" : "Father of the Bridge Part II","year":1995, "genre":"Comedy"}
    { "index": { "_id": 2 }}
    { "title" : "Dave","year":1993,"genre":["Comedy","Romance"] }
    // 处理多值字段,term 查询是 包含,而不是等于,如果要求对多值进行严格匹配,可以单独加一个数组计数字段,也可以runtime_field
    POST /movies/_search
    {
    "query": {
    "constant_score": {
    "filter": {
    "term": {
    "genre.keyword": "Comedy"
    }
    }
    }
    }
    }
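    // 针对上面注释中提到的数组精确匹配问题的一个示意做法(genre_count 为演示用的假设字段,需要在写入文档时额外维护): 同时约束包含 Comedy 且数组长度为 1
    POST /movies/_search
    {
    "query": {
    "bool": {
    "filter": [
    {
    "term": {
    "genre.keyword": "Comedy"
    }
    },
    {
    "term": {
    "genre_count": 1
    }
    }
    ]
    }
    }
    }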
  2. 相关性和相关性算分

    • ES5 之前默认使用 TF-IDF 进行相关性算分,目前采用 BM25 算法,目标是计算出查询语句和文档之间的匹配程度
    • TF (词频) : Term Frequency,检索词在文档中出现的频率 = 该词项出现次数/文档总词数;如果是多个检索词,则分别计算 TF 后求和作为整体 TF;停用词虽然出现很多次,但对相关度几乎没有贡献,不考虑其 TF 值
    • IDF (逆文档频率): DF 是检索词在所有文档中出现的频率,IDF = log(全部文档数量/检索词出现过的文档总数),即一个词项出现在越多的文档里,IDF 值越小
    • TF-IDF 算法把 TF 的简单求和变为加权求和 : 将各个检索词项的 TF * IDF 相加
    • BM25 算法引入全部文档的平均长度,用每个文档长度与平均长度的比值作为影响 TF 的因子进行调节(两种算法的公式整理见本节末尾)
    • 可以在索引设置中,定制相似性设置,并制定算分函数
    PUT /my-index
    {
    "settings": {
    "index": {
    "similarity": {
    "my_similarity": {
    "type": "DFR",
    "basic_model": "g",
    "after_effect": "l",
    "normalization": "h2",
    "normalization.h2.c": "3.0"
    }
    }
    }
    },
    "mappings": {
    "properties": {
    "title": {
    "type": "text",
    "similarity": "my_similarity"
    }
    }
    }
    }
    • explain 查看算分 : 可以看到算分公式中各个因子的取值、来源及对应解释,方便手动计算进行校验
    // 通过 Explain API 查看TF-IDF
    PUT testscore/_bulk
    { "index": { "_id": 1 }}
    { "content":"we use Elasticsearch to power the search" }
    { "index": { "_id": 2 }}
    { "content":"we like elasticsearch" }
    { "index": { "_id": 3 }}
    { "content":"The scoring of documents is caculated by the scoring formula" }
    { "index": { "_id": 4 }}
    { "content":"you know, for search" }

    // 每个命中的文档,都会进行一次详细的算分过程,并通过explanation进行输出
    POST /testscore/_search
    {
    "explain": true,
    "query": {
    "match": {
    "content": "you"
    // "content": "elasticsearch"
    // "content": "the"
    //"content": "the elasticsearch"
    }
    }
    }
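    // 公式整理(对应上面 TF-IDF / BM25 的描述,Lucene/ES 的实际实现在细节上可能略有出入):
    // TF-IDF: score(q,d) = \sum_{t \in q} tf(t,d) \cdot \log( N / df(t) )
    // BM25:   score(q,d) = \sum_{t \in q} IDF(t) \cdot \frac{tf(t,d) \cdot (k_1 + 1)}{tf(t,d) + k_1 \cdot (1 - b + b \cdot |d| / avgdl)}
    // 其中 N 为文档总数,df(t) 为包含词项 t 的文档数,|d| 为文档长度,avgdl 为全部文档的平均长度,k_1、b 为可调参数(ES 默认 k_1=1.2,b=0.75)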
  3. Query Context & Filter Context

    • Query 会进行相关性的算分,而Filter则只回答是否并且可以进行缓存
    • bool查询 : must 和 should有算分,must_not 和 filter 属于Filter Context,不进行算分
    • 如果结构化查询中,对数组对象进行查询,es是包含查询,而不是精确相等,所以可以在设置mapping时,增加一个计数器字段
    • 存疑???bool 查询语句的结构,会影响相关度的算分 (同一层级下的竞争字段,具有相同的权重;嵌套的bool查询可以改变算分的影响) ,如:
    PUT /animals/_doc/1
    {"content": "2 running Quick brown-foxes leap over lazy dog in the summer evening."}

    GET /animals/_search
    {"explain":true,"query":{"bool":{"should":[{"term":{"content":{"value":"brown"}}},{"term":{"content":{"value":"red"}}},{"term":{"content":{"value":"quick"}}},{"term":{"content":{"value":"dog"}}}]}}}

    GET /animals/_search
    {"explain":true,"query":{"bool":{"should":[{"term":{"content":{"value":"quick"}}},{"term":{"content":{"value":"dog"}}},{"bool":{"should":[{"term":{"content":{"value":"brown"}}},{"term":{"content":{"value":"red"}}}]}}]}}}
    • 控制字段的 Boosting : boost 大于1则提升权重,0到1之间则降低权重; 小于0,则贡献负分
    DELETE blogs
    POST /blogs/_bulk
    { "index": { "_id": 1 }}
    {"title":"Apple iPad", "content":"Apple iPad,Apple iPad" }
    { "index": { "_id": 2 }}
    {"title":"Apple iPad,Apple iPad", "content":"Apple iPad" }

    POST blogs/_search
    {
    "query": {
    "bool": {
    "should": [
    {
    "match": {
    "title": {
    "query": "apple,ipad",
    "boost": 4 // 让命中的评分增益排名往前
    }
    }
    }
    ]
    }
    }
    }
    // 要求苹果公司的产品信息优先展示
    DELETE news
    POST /news/_bulk
    { "index": { "_id": 1 }}
    { "content":"Apple Mac" }
    { "index": { "_id": 2 }}
    { "content":"Apple iPad" }
    { "index": { "_id": 3 }}
    { "content":"Apple employee like Apple Pie and Apple Juice" }

    POST /news/_search
    {
    "query": {
    "bool": {
    "must": [
    {
    "match": {
    "content": "apple"
    }
    }
    ]
    }
    }
    }

    POST /news/_search
    {
    "query": {
    "bool": {
    "must": [
    {
    "match": {
    "content": "apple"
    }
    }
    ],
    "must_not": [ // 排除了该条件的文档,不会返回,对比一下 boosting 查询
    {
    "match": {
    "content": "pie"
    }
    }
    ]
    }
    }
    }

    // 使用 boosting 查询,则降低权重往后排,但是可以返回
    POST /news/_search
    {
    "query": {
    "boosting": {
    "positive": {
    "match": {
    "content": "apple"
    }
    },
    "negative": { // 弱化
    "match": {
    "content": "pie"
    }
    },
    "negative_boost": 0.5 // 减益因子,让弱化匹配向后排
    }
    }
    }
  4. 多字段 - 单字符串查询 (dis_max): 跨多个字段查询同一个字符串时,需要考虑评分叠加的影响,对比 dis_max 和 bool should 的差别

    DELETE blogs
    PUT /blogs/_doc/1
    {
    "title": "Quick brown rabbits",
    "body": "Brown rabbits are commonly seen."
    }

    PUT /blogs/_doc/2
    {
    "title": "Keeping pets healthy",
    "body": "My quick brown fox eats rabbits on a regular basis."
    }


    // 普通 should 查询: 文档1 在 title 和 body 中都出现了 brown,两个字段的算分会叠加;文档2 只在 body 中出现,但完整包含了 brown fox,直观上与查询更相关,评分却可能更低
    POST blogs/_search
    {
    "explain": true,
    "query": {
    "bool": {
    "should": [
    {
    "match": {
    "title": "Brown fox"
    }
    },
    {
    "match": {
    "body": "Brown fox"
    }
    }
    ]
    }
    }
    }

    // title 和 body 字段的查询在 should 中会叠加计分,但我们希望突出单个最佳匹配,所以使用 Disjunction Max Query: 任一子查询匹配到的文档都会返回,评分则取其中最佳匹配子查询的评分
    POST /blogs/_search
    {
    "query": {
    "dis_max": { // 这样一来,每个文档的最佳匹配的分值会返回
    "queries": [
    {
    "match": {
    "title": "Brown fox"
    }
    },
    {
    "match": {
    "body": "Brown fox"
    }
    }
    ]
    }
    }
    }

    // 设置 tie_breaker, 打破平局,让最佳评分获得非最佳评分的增益,以区分其他文档
    POST /blogs/_search
    {
    "query": {
    "dis_max": {
    "queries": [
    {
    "match": {
    "title": "Quick pets"
    }
    },
    {
    "match": {
    "body": "Quick pets"
    }
    }
    ]
    //,"tie_breaker": 0.1 // 如果没有此设置,则本查询结果的各文档分值一样。可以做到: 获得最佳的匹配语句的评分;然后,让其他匹配项与tie_breaker值相乘;最终汇总两者评分规范化输出。这个值介于0-1之间,0代表使用最佳匹配返回;1代表每个匹配项同等重要;
    }
    }
    }
  5. 多字段 - 单字符串查询 : 使用 Multi Match 处理字段间的竞争与关联关系(竞争指文档之间比较计分,关联指文档内部多个字段匹配时计分会相互叠加影响)

    • Best Fields : 文档评分取最佳匹配字段(评分最高的子查询)的评分,并且可以指定 tie_breaker (让其他字段的评分部分参与) 和 minimum_should_match
    • Most Fields : 最多匹配,可以让多个匹配字段共同起作用,匹配的字段项越多越好,并可以突出设置一些权重,提升算分
    • Cross Fields : 整合全部字段的内容进行跨字段搜索,匹配的算分越多越好
    // 和上面的 dis_max 有相似之处
    POST /blogs/_search
    {
    "query": {
    "multi_match": {
    "type": "best_fields", // 默认best
    "query": "Quick pets", // 两个文档评分相同
    "tie_breaker": 0.2,
    "minimum_should_match": 2,
    "fields": ["title", "body"]
    }
    }
    }
    // 案例: 使用 english 分词器后时态信息丢失、精确度降低,要靠增加子字段,并通过 most_fields 让尽可能多的字段参与匹配算分,得出结果
    PUT /titles
    {
    "mappings": {
    "properties": {
    "title": {
    "type": "text",
    "analyzer": "english"
    }
    }
    }
    }

    POST /titles/_bulk
    {"index": {"_id": 1}}
    {"title": "My dog barks"}
    {"index": {"_id": 2}}
    {"title": "I see a lot of barking dogs on the road "}

    GET titles/_search
    {
    "query": {
    "match": {
    "title": "barking dogs" // 英语的时态语法丢失,导致相关性精确度降低
    }
    }
    }

    DELETE /titles
    // 使用子字段,使用不同的分词器进行设置
    PUT /titles
    {
    "mappings": {
    "properties": {
    "title": {
    "type": "text",
    "analyzer": "english",
    "fields": {
    "std": {
    "type": "text",
    "analyzer": "standard"
    }
    }
    }
    }
    }
    }

    GET /titles/_search
    {
    "query": {
    "multi_match": {
    "query": "barking dogs",
    "type": "most_fields", // 将两个字段的算分进行叠加
    "fields": ["title", "title.std"]
    }
    }
    }

    GET /titles/_search
    {
    "query": {
    "multi_match": {
    "query": "barking dogs",
    "type": "most_fields", // 将两个字段的算分进行叠加
    "fields": ["title^2", "title.std"] // 有时候,需要控制某些字段单独做一些权重提升
    }
    }
    }

    // Cross Fields: treat several fields as one combined field when matching and scoring
    PUT /address/_doc/1
    {
    "street": "5 Poland Street",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "W1V 3DG"
    }

    PUT /address/_doc/2
    {
    "street": "5 Poland Street",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "W1S 3DG"
    }

    POST /address/_search
    {
    "query": {
    "multi_match": {
    "query": "Poland Street W1V",
    "type": "cross_fields",
    "fields": ["street", "city", "country", "postcode"]
    }
    }
    }

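    One more knob worth noting, as a small hedged sketch reusing the same address documents: cross_fields can be combined with "operator": "and", which requires every term in the query string to be found in at least one of the listed fields.

    POST /address/_search
    {
      "query": {
        "multi_match": {
          "query": "Poland Street W1V",
          "type": "cross_fields",
          "operator": "and", // every term must appear in at least one of the fields
          "fields": ["street", "city", "country", "postcode"]
        }
      }
    }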

2022-04-02 section_30 - section_31

  1. Multi-language and Chinese tokenization for search

    • Natural-language text rarely matches query terms exactly, so matching needs to cover stemming to word roots, token normalization (removing diacritics), synonyms, misspellings, and homophones/homographs

    • Multi-language storage and querying : store the same content in multiple fields, one per analyzer (see the ik + pinyin sketch after the installation examples below)

      • Detect the user's context: the browser language, geolocation, or even a probabilistic guess of the language being typed, and search accordingly
      • A single text can mix languages; e.g. German words inside English text tend to score higher because they are rarer
    • English tokenization has its own questions: is You're one token or two, and should Half-baked be split or kept whole

    • Chinese tokenization must handle compound ambiguity, overlapping ambiguity and true ambiguity, e.g. 中华人民共和国 、美国会通过对台售武法案、上海仁和服装厂

      • Approaches include minimum-length segmentation, statistical language models, and statistics-based machine-learning algorithms (HMM, CRF, SVM, deep learning, etc.) that take context into account

      • HanLP analyzer (https://www.hanlp.com/), IK analyzer, and Pinyin analyzer (for searching Chinese characters by pinyin)

      • Remote dictionaries can be configured, e.g. remote extension dictionaries and stop-word dictionaries

      • IK supports hot-reloading of its dictionaries

      • Installation and configuration:

        <!-- elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/release/download/hanlp.zip -->
        <!-- ES_HOME/config/analysis-hanlp/hanlp-remote.xml -->
        <properties>
        <comment>HanLP Analyzer remote dictionary config</comment>
        <!-- remote extension dictionary -->
        <entry key="remote_ext_dict">words_location</entry>
        <!-- remote extension stop-word dictionary -->
        <entry key="remote_ext_stopwords">stop_words_location</entry>
        </properties>
        # With the ik analyzer plugin installed, map a field to use ik_max_word for indexing and ik_smart for searching
        curl -XPOST http://localhost:9200/index/_mapping -H 'Content-Type:application/json' -d'
        {
        "properties": {
        "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
        }
        }
        }
        '
        // HanLP analyzers (no 8.0 build of the plugin was found, so this part was not tried hands-on)
        // hanlp_standard
        // hanlp_index : index-oriented segmentation
        // hanlp_nlp : NLP segmentation
        // hanlp_n_short : N-shortest-path segmentation
        // hanlp_dijkstra : shortest-path segmentation
        // hanlp_crf : CRF segmentation
        // hanlp_speed : high-speed dictionary-based segmentation
        POST _analyze
        {
        "analyzer": "hanlp_standard",
        "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
        }

        POST /_analyze
        {
        "analyzer": "pinyin",
        "text": ["刘德华"]
        }

        POST _analyze
        {
        "analyzer": "ik_max_word",
        "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
        }


        // Pinyin analyzer
        DELETE artists
        PUT /artists
        {
        "settings": {
        "analysis": {
        "analyzer": {
        "user_name_analyzer": {
        "tokenizer": "whitespace",
        "filter": "pinyin_first_letter_and_full_pinyin_filter"
        }
        },
        "filter": {
        "pinyin_first_letter_and_full_pinyin_filter": {
        "type": "pinyin",
        "keep_first_leeter": true,
        "keep_full_pinyin": false,
        "keep_none_chinese": true,
        "keep_original": false,
        "limit_first_letter_length": 16,
        "lowercase": true,
        "trim_whitespace": true,
        "keep_none_chinese_in_first_letter": true
        }
        }
        }
        }
        }

        GET /artists/_analyze
        {
        "text": ["刘德华 张学友 郭富城 黎明 四大天王"],
        "analyzer": "user_name_analyzer"
        }
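        As a follow-up to the multi-field storage idea above, a minimal sketch that indexes one field with an IK analyzer and a pinyin sub-field, so the same content can be searched by Chinese text or by pinyin. Assumptions: both the analysis-ik and analysis-pinyin plugins are installed, the index name mixed-demo is made up here, and with the pinyin plugin's default settings the first-letter form ("ldh") is indexed and should match.

        PUT /mixed-demo
        {
          "mappings": {
            "properties": {
              "name": {
                "type": "text",
                "analyzer": "ik_max_word",
                "fields": {
                  "pinyin": {
                    "type": "text",
                    "analyzer": "pinyin"
                  }
                }
              }
            }
          }
        }

        PUT /mixed-demo/_doc/1
        {
          "name": "刘德华"
        }

        // search either the Chinese field or its pinyin sub-field
        POST /mixed-demo/_search
        {
          "query": {
            "multi_match": {
              "query": "ldh",
              "fields": ["name", "name.pinyin"]
            }
          }
        }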

  2. Space Jam search example - TMDB API

    • Python script: create the index with the default analyzer, import the data and query; relevance is poor
    • Python script: rebuild the index with the english analyzer and query again; relevance improves
    • Add highlighting to inspect how the query matched (see the sketch below)
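    A minimal sketch of the highlighting step, assuming the TMDB data was imported into an index named tmdb with title and overview fields (the same index the search-template example below uses):

    POST /tmdb/_search
    {
      "query": {
        "multi_match": {
          "query": "basketball with cartoon aliens",
          "fields": ["title", "overview"]
        }
      },
      "highlight": {
        "fields": {
          "title": {},
          "overview": {}
        }
      }
    }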

2022-04-09 section_32

  1. Search Template

    • Decouples the ES query from the application, so query tuning and application logic can evolve separately
    • How to use a search template:
    // Create a search template
    POST _scripts/tmdb
    {
    "script": {
    "lang": "mustache",
    "source": {
    "size": 20,
    "_source": [
    "title", "overview"
    ],
    "query": {
    "multi_match": {
    "query": "{{q}}",
    "fields": ["title", "overview"],
    // "fields": ["title^10", "overview"] // 后台升级模板,也不会影响前端使用模板进行查询
    }
    }
    }
    }
    }
    GET _scripts/tmdb
    // DELETE _scripts/tmdb // remove the stored template when it is no longer needed


    // Render the template to verify it
    POST _render/template
    {
    "id": "tmdb",
    "params": {
    "q": "basketball with catoon aliens"
    }
    }

    // Search using the template
    POST tmdb/_search/template
    {
    "id": "tmdb", // 使用search template id,进行搜索
    "params": {
    "q": "basketball with cartoon aliens"
    }
    }

    // Templates can also be used for multi-search, e.g. my-index/_msearch/template (see the sketch below)
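    A small sketch of the multi-search form, reusing the stored tmdb template; the second query string is just an illustrative example:

    POST tmdb/_msearch/template
    {}
    {"id": "tmdb", "params": {"q": "basketball with cartoon aliens"}}
    {}
    {"id": "tmdb", "params": {"q": "space jam"}}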
  2. Index Alias for zero-downtime operations and maintenance (section 32) (an atomic alias-swap sketch follows the examples below)

    GET _cat/aliases
    PUT /movies-2019/_doc/1
    {
    "name": "AAA",
    "time": "2019-01-01 00:00:00",
    "rating": 100
    }

    GET /movies-2019
    // Add aliases so the underlying indices can be maintained without interrupting search
    POST _aliases
    {
    "actions": [
    {
    "add": {
    "index": "movies-2019",
    "alias": "movies-latest"
    }
    },
    {
    "add": {
    "index": "movies",
    "alias": "movies-latest"
    }
    }
    ]
    }

    // Check that the alias now appears in the index settings
    GET /movies-2019

    // DELETE /movies-2019 // run this clean-up later; the index is still needed for the filtered-alias example below

    // Search through the alias
    POST movies-latest/_search
    {
    "size": 10,
    "query": {
    "match_all": {}
    }
    }

    // An alias can embed a filter that is applied to every query made through it
    POST _aliases
    {
    "actions": [
    {
    "add": {
    "index": "movies-2019",
    "alias": "movies-latest-highrate",
    "filter": {
    "range": {
    "rating": {
    "gte": 100
    }
    }
    }
    }
    }
    ]
    }

    // List aliases matching a pattern
    GET _alias/mov*

    // Searching through this alias applies the predefined filter before returning results
    POST movies-latest-highrate/_search
    {
    "query": {
    "match_all": {}
    }
    }
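    The zero-downtime point is that an alias can be switched from an old index to a new one atomically, so clients searching movies-latest never see an intermediate state. A minimal sketch; the target index movies-2020 is made up here to illustrate the swap:

    POST _aliases
    {
      "actions": [
        { "remove": { "index": "movies-2019", "alias": "movies-latest" } },
        { "add": { "index": "movies-2020", "alias": "movies-latest" } }
      ]
    }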
  3. Function Score Query (section 33)

    • After the query runs, each matching document is rescored to produce a new score
      • weight : apply a fixed weight that is factored into the score (see the weight sketch after the boost_mode example below)
      • Field Value Factor : use a numeric field of the document (e.g. popularity or number of likes) to modify _score
      • Random Score : give each user a different, but per-user consistent, random ordering
      • Decay functions : score by a field's distance from a target value; the closer, the higher the score
      • Script Score : a custom script takes full control of the scoring logic (see the script_score sketch after the examples below)
    • Hands-on : rank blogs with more votes higher while still respecting the text-relevance score; formula: new score = original score * number of votes

    // Example 1: boost blogs by popularity on top of the base text match

    DELETE blogs

    // The documents have identical text, so the effect of the function score is easy to observe
    PUT /blogs/_doc/1
    {
    "title": "About popularity",
    "content": "In this post we will talk about...",
    "votes": 0
    }

    PUT /blogs/_doc/2
    {
    "title": "About popularity",
    "content": "In this post we will talk about...",
    "votes": 100
    }

    PUT /blogs/_doc/3
    {
    "title": "About popularity",
    "content": "In this post we will talk about...",
    "votes": 1000000
    }

    POST /blogs/_search
    {
    "query": {
    "multi_match": { // 由于文档都一样,所以,基础匹配评分都一样
    "query": "popularity",
    "fields": [
    "title",
    "content"
    ]
    }
    }
    }

    // Scoring here: new score = original score * votes
    // Two problems follow: 1> what if votes is 0? 2> what if votes is very large?
    // The general idea is to keep the gaps between results from blowing up, i.e. to smooth the function
    POST /blogs/_search
    {
    "query": {
    "function_score": {
    "query": {
    "multi_match": {
    "query": "popularity",
    "fields": [
    "title",
    "content"
    ]
    }
    },
    "field_value_factor": { // 可以提升投票数
    "field": "votes"
    // ,"modifier": "log1p" // 因为纯粹的求积运算,对分数的影响放大效果超出合理范围,所以通过更改作用函数,达到影响平滑,支持: none、log、log1p、log2p、ln、ln1p、ln2p、square、sqrt、reciprocal
    //,"factor": 0.5 // 增加因子, 如 new_score = old_score + log(1 + factor * vote_count)
    }
    }
    }
    }


    // Boost Mode and Max Boost
    // The query above used the default boost_mode, multiply: final score = query score * function value
    // Other modes: sum (query score + function value), min/max (smaller/larger of the two), replace (function value replaces the query score)

    // max_boost caps the function value at an upper limit

    POST /blogs/_search
    {
    "query": {
    "function_score": {
    "query": {
    "multi_match": {
    "query": "popularity",
    "fields": [
    "title",
    "content"
    ]
    }
    },
    "field_value_factor": { // 可以提升投票数
    "field": "votes",
    "modifier": "log1p",
    "factor": 0.1
    }
    ,"boost_mode": "sum" // 两分相加得出结果
    ,"max_boost": 3 // 控最大分值
    }
    }
    }
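    The weight function from the bullet list above has no example in the original notes; a minimal hedged sketch against the same blogs index, applying a fixed weight only to documents whose votes pass a filter:

    POST /blogs/_search
    {
      "query": {
        "function_score": {
          "query": {
            "multi_match": { "query": "popularity", "fields": ["title", "content"] }
          },
          "functions": [
            { "filter": { "range": { "votes": { "gte": 100 } } }, "weight": 2 } // documents with at least 100 votes get their score doubled
          ],
          "boost_mode": "multiply"
        }
      }
    }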


    // Consistently random scoring: the same seed always produces the same ordering
    // Typical scenario: a site wants to rotate ads for better exposure
    // Requirement 1: each user sees a different ad ranking
    // Requirement 2: for a given user, the ranking stays stable across visits

    PUT /_cluster/settings
    {
    "transient": {
    "indices": {
    "id_field_data": {
    "enabled": true
    }
    }
    }
    }

    POST /blogs/_search
    {
    "query": {
    "function_score": {
    "random_score": {
    "seed": 991199, // 改变种子值,可以看到结果相对于seed,保持稳定的排序
    "field": "votes"
    }
    }
    }
    }
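    The script_score bullet above also has no example in the original notes; a minimal sketch against the same blogs index, recomputing the final score from the text relevance and the votes field with a Painless expression:

    POST /blogs/_search
    {
      "query": {
        "function_score": {
          "query": {
            "multi_match": { "query": "popularity", "fields": ["title", "content"] }
          },
          "script_score": {
            "script": {
              "source": "_score * Math.log(2 + doc['votes'].value)" // custom formula: full control over the final score
            }
          }
        }
      }
    }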
  4. Suggest as you type, commonly known as spell-correction suggestions (section 34)

    • Example: searching Google for "elastosearch" is a typo, yet a correction suggestion comes back
    • ES provides the Suggest API for this: the input text is analyzed into tokens, then similar terms are looked up in the index dictionary and returned
    • ES offers four kinds of suggesters: (1-2) Term & Phrase Suggester ; (3-4) Completion & Context Suggester
    • Essentially, a suggester is a special type of search
    • Hands-on :
    // Term suggestion algorithm: Levenshtein edit distance, i.e. how many edits turn one word into another; parameters such as "max_edits" control how fuzzy the similarity computation may be (a max_edits example follows the prefix_length example below)
    DELETE articles
    PUT articles
    {
    "mappings": {
    "properties": {
    "title_completion":{
    "type": "completion"
    }
    }
    }
    }

    POST articles/_bulk
    { "index" : { } }
    { "title_completion": "lucene is very cool"}
    { "index" : { } }
    { "title_completion": "Elasticsearch builds on top of lucene"}
    { "index" : { } }
    { "title_completion": "Elasticsearch rocks"}
    { "index" : { } }
    { "title_completion": "elastic is the company behind ELK stack"}
    { "index" : { } }
    { "title_completion": "Elk stack rocks"}
    { "index" : {} }

    POST /articles/_search
    {
    "size": 0,
    "suggest": {
    "article-suggest": {
    "prefix": "elk", // 前缀补全匹配
    "completion": {
    "field": "title_completion"
    }
    }
    }
    }


    DELETE articles

    POST articles/_bulk
    { "index" : { } }
    { "body": "lucene is very cool"}
    { "index" : { } }
    { "body": "Elasticsearch builds on top of lucene"}
    { "index" : { } }
    { "body": "Elasticsearch rocks"}
    { "index" : { } }
    { "body": "elastic is the company behind ELK stack"}
    { "index" : { } }
    { "body": "Elk stack rocks"}
    { "index" : {} }
    { "body": "elasticsearch is rock solid"}


    POST /articles/_search
    {
    "size": 1,
    "query": {
    "match": {
    "title_completion": "lucen rock" // 查询不到相应文档, 可以看建议的结果
    }
    },
    "suggest": {
    "term-suggestion": {
    "text": "lucen rock", // 但是通过建议搜索,找到lucen 的相近词条的推荐,而rock是拼写正确的,则不会有所建议,
    "term": {
    "suggest_mode": "missing", // 缺失补全,如果索引中已经存在,就不提供建议; Popular(频率更高)、Always(无论是否存在,都提供建议)
    "field": "body"
    }
    }
    }
    }


    POST /articles/_search
    {
    "suggest": {
    "term-sug": {
    "text": "lucen rock",
    "term": {
    "field": "body",
    "suggest_mode": "popular" // 此时返回了rocks,因为算法认为文档中存在的词,流行程度不一样,但是依然可以找到
    }
    }
    }
    }

    POST /articles/_search
    {
    "suggest": {
    "term-sug": {
    "text": "lucen rock",
    "term": {
    "field": "body",
    "suggest_mode": "always"
    }
    }
    }
    }


    POST /articles/_search
    {
    "suggest": {
    "term-sug": {
    "text": "lucen hocks", // 两个词都拼写错误
    "term": {
    "field": "body",
    "suggest_mode": "always",
    //"prefix_length": 0, // 如果能让hocks得到推荐,则可以控制此参数
    "sort": "frequency"
    }
    }
    }
    }
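    The max_edits parameter mentioned earlier controls how many edits a candidate may differ by (valid values are 1 and 2, default 2); a small sketch against the same articles index that makes it explicit:

    POST /articles/_search
    {
      "suggest": {
        "term-sug": {
          "text": "lucen hocks",
          "term": {
            "field": "body",
            "suggest_mode": "always",
            "max_edits": 1, // only candidates within a single edit are suggested
            "prefix_length": 0
          }
        }
      }
    }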

    // Phrase Suggester: offers more error-tolerance parameters than the term suggester
    POST /articles/_search
    {
    "suggest": {
    "my-sug": {
    "text": "lucne and elasticsear rock hello world ",// lucene 和 elasticsearch拼写错误,同时添加了冗余词
    "phrase": {
    "field": "body",
    "max_errors": 2,
    "confidence": 0,
    //"confidence": 2, // 修改成2,则改变了返回的结果
    "direct_generator": [
    {
    "field": "body",
    "suggest_mode": "always"
    }
    ],
    "highlight": {
    "pre_tag": "<em>",
    "post_tag": "</em>"
    }
    }
    }
    }
    }
  5. Autocomplete: every keystroke triggers a request that looks up matching candidates

    • ES implements this with the completion suggester
    • Performance requirements are high, so ES does not use the inverted index here; the analyzed data is encoded as an FST and stored alongside the index. The whole FST is loaded into memory, which makes lookups very fast
    • An FST only supports prefix lookups
    • Usage : declare the field type in the mapping, index the data, then query
    DELETE /articles
    PUT articles
    {
    "mappings": {
    "properties": {
    "title_completion": {
    "type": "completion" // 定义补全类型
    }
    }
    }
    }

    POST articles/_bulk
    { "index" : { } }
    { "title_completion": "lucene is very cool"}
    { "index" : { } }
    { "title_completion": "Elasticsearch builds on top of lucene"}
    { "index" : { } }
    { "title_completion": "Elasticsearch rocks"}
    { "index" : { } }
    { "title_completion": "elastic is the company behind ELK stack"}
    { "index" : { } }
    { "title_completion": "Elk stack rocks"}
    { "index" : {} }


    POST articles/_search?pretty
    {
    "size": 0,
    "suggest": {
    "article-sug": {
    "prefix": "el", // 进行前缀查询
    "completion": {
    "field": "title_completion"
    }
    }
    }
    }


    // Context Suggester
    // Suggestions can be made context-aware; e.g. for the prefix "star", context tells us whether the user means a coffee shop or a movie
    // ES supports two context types: category (arbitrary strings) and geo (geo locations)

    DELETE comments
    PUT comments
    {
    "mappings": {
    "properties": {
    "comment_autocomplete": { // 在索引中建立一个上下文补全存储结构,用于存储将来前缀自动补全的词库
    "type": "completion",
    "contexts": [
    {
    "type": "category",
    "name": "comment_category"
    }
    ]
    }
    }
    }
    }


    POST comments/_doc
    {
    "comment":"I love the star war movies",
    "comment_autocomplete":{
    "input":["star wars"],
    "contexts":{
    "comment_category":"movies"
    }
    }
    }

    POST comments/_doc
    {
    "comment":"Where can I find a Starbucks",
    "comment_autocomplete":{
    "input":["starbucks"],
    "contexts":{
    "comment_category":"coffee"
    }
    }
    }

    POST comments/_search
    {
    "suggest": {
    "YOUR_SUGGESTION": {
    "prefix": "star",
    "completion": {
    "field": "comment_autocomplete",
    "contexts": {
    // "comment_category": "movies" // 进行上下文中的匹配
    "comment_category": "coffee" // 进行上下文中的匹配, 配出咖啡店县关
    }
    }
    }
    }
    }



    // Summary:
    Precision : Completion > Phrase > Term
    Recall : Term > Phrase > Completion
    Performance : Completion > Phrase > Term
  6. Cross-cluster search

    • A single cluster has one active master; as cluster metadata keeps growing, the whole cluster can stop working properly, so scaling out to multiple clusters becomes necessary
    • The earlier Tribe Node approach required a client node to join every cluster, and every master's cluster-state change had to be acknowledged by the tribe node
    • A tribe node does not persist cluster state, so initialization after a restart is slow; and when index names collide across clusters, only a single prefer rule can be applied
    • Since ES 5.3, Cross Cluster Search lets any node act as a federated node that proxies search requests, without joining the other clusters as a client node
    • Usage:
    # First, start three clusters locally for the experiment
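    A minimal sketch of the usual three-cluster demo; the port numbers, cluster names, and the users index below are assumptions, not from the original notes:

    # start three local single-node clusters, e.g.
    bin/elasticsearch -E node.name=node0 -E cluster.name=cluster0 -E path.data=cluster0_data -E discovery.type=single-node -E http.port=9200 -E transport.port=9300
    bin/elasticsearch -E node.name=node1 -E cluster.name=cluster1 -E path.data=cluster1_data -E discovery.type=single-node -E http.port=9201 -E transport.port=9301
    bin/elasticsearch -E node.name=node2 -E cluster.name=cluster2 -E path.data=cluster2_data -E discovery.type=single-node -E http.port=9202 -E transport.port=9302

    // register the other two clusters as remotes on the cluster that will coordinate the search
    PUT _cluster/settings
    {
      "persistent": {
        "cluster": {
          "remote": {
            "cluster1": { "seeds": ["127.0.0.1:9301"] },
            "cluster2": { "seeds": ["127.0.0.1:9302"] }
          }
        }
      }
    }

    // query a local index together with the remote ones using the cluster:index syntax
    GET /users,cluster1:users,cluster2:users/_search
    {
      "query": { "match_all": {} }
    }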
