ElasticStack 官方文档阅读(数据查询搜索篇)

ES 索引管理文档源

  1. 2022-02-23 - Elasticsearch Guide [8.0] » Index modules (未实践)

2022-03-24

section_11 - 文档的简单的CRUD操作的端点使用

// section_11 
//# 文档的简单CRUD操作端点 : index / create / read / update / delete , 五项操作
//## index : PUT my_index/_doc/1 (不存在则创建文档,已存在则删除旧文档、写入新文档,并且版本号递增)
//## create : PUT my_index/_create/1 (指定文档ID,若已存在则版本冲突、创建失败) OR POST my_index/_doc (自动生成文档ID)
//## read : GET my_index/_doc/1
//## update : POST my_index/_update/1
//## delete : DELETE my_index/_doc/1

//-> index 方式创建索引,并指定文档ID创建文档
PUT test-user-01/_doc/1
{
"name": "daqiang2"
}

//-> 获取文档,看到版本号递增
GET test-user-01/_doc/1

//-> create 方式(可以端点也可以参数指定)创建文档,并指定文档ID, 如果重复调用,则版本号冲突而失败
PUT test-user-02/_doc/1?op_type=create
{
"name": "daqiang",
"age": 22
}
PUT test-user-02/_create/1
{
"name": "daqiang"
}

GET /test-user-02/_doc/1

// 端点直接使用, 可以指定id,如果不指定,则自动生成ID
POST test-user-03/_doc/1
{
"name": "daqiang"
}

GET /test-user-03/_search

//-> update 更新文档, 指定单个存在的文档,更新只能对字段做增量修改,如果数据没有变化,则版本号不会变化
POST /test-user-02/_update/1
{
"doc": {
"name": "miaoa",
"comment": "I want to good at English."
}
}

GET /test-user-02/_doc/1

//-> delete 删除单个文档 : DELETE my_index/_doc/1 ; 下面直接删除测试索引做清理
DELETE /test-user-01
DELETE /test-user-02
DELETE /test-user-03
DELETE /test-user-04

section_11 - 文档的Bulk操作

//# bulk api : 集中批量调用各种操作,并且可以针对不同的索引进行操作;index/create/update 操作占两行(第一行指定索引及文档元数据,第二行为文档内容),delete 操作只占一行;某条操作出错不会中断后续操作,响应中会逐条返回每个操作对应的结果
POST _bulk
{"index":{"_index":"test-01","_id":1}}
{"f1":"v1"}
{"delete":{"_index":"test-01","_id":2}}
{"create":{"_index":"test-02","_id":"3"}}
{"f1":"v2"}
{"update":{"_id":"1","_index":"test"}}
{"doc":{"f1":"v2"}}
{"udpate": {"_id":"1", "_index": "test-01"}}
{"doc": {"f1": "ff"}}

section_11 - 批量读取

// - mget : 批量读取
GET /_mget
{
"docs": [
{
"_index": "test",
"_id": "1"
},
{
"_index": "test-01",
"_id": "1"
}
]
}

GET /test-01/_mget
{
"ids": [
"1",
"1"
]
}

// - msearch : 批量搜索
POST /kibana_sample_data_ecommerce/_msearch
{}
{"query":{"match_all":{}},"size":1}
{"index":"kibana_sample_data_flights"}
{"query":{"match_all":{}},"size":1}

section_13 - 分词器处理

  1. 分词器包括:

    • Standard Analyzer(默认内置),按词切分,小写处理
    • Simple Analyzer 按照非字母切分(符号被过滤掉),小写处理
    • Stop Analyzer 小写处理,停用词过滤
    • Whitespace Analyzer 按照空格切分,不转小写
    • Keyword Analyzer 不做分词,直接将输入作为输出
    • Pattern Analyzer 正则表达式切分,默认以 \W+ (非字母数字字符)分隔,小写处理
    • Language - 提供30多种常见语言的分词器
      • english : 只是保留词根
      • 中文 : 需要安装分词器插件,下载后放入 plugins 目录,或执行 elasticsearch-plugin install xxx 安装;使用 elasticsearch-plugin list 查看插件列表
    • Custom Analyzer 自定义分词器 : 以文本 Mastering Elasticsearch &amp; Elasticsearch in Action 为例
      • 定义 Character Filters (预处理原始文本,如去掉 &amp; 等杂项内容)
      • 定义 Tokenizer (根据规则切分单词)
      • 定义 Token Filter (对切分出的单词做过滤、转换,如去掉停用词 in)
  2. 分词使用时机 : 索引过程 和 搜索过程均可使用

  3. 目的: 把全文本转换为一系列单词的过程(term/token)

  4. 可以使用 /_analyze api , 如

    GET /_analyze
    {
    "analyzer": "standard",
    "text": "Mastering Elasticsearch , Elasticsearch in Action"
    }

    // 指定索引中的字段进行分词测试,索引已经设置分词器
    GET /test-index/_analyze
    {
    "field": "title",
    "text": "Mastering Elasticsearch , Elasticsearch in Action"
    }

    // 自定义分词器
    GET /_analyze
    {
    "tokenizer": "standard",
    "filter": [],
    "text": "Mastering Elasticsearch"
    }

    // 测试 1: standard
    GET /_analyze
    {
    "analyzer": "standard",
    "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }


    GET /_analyze
    {
    "analyzer": "simple",
    "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }


    GET /_analyze
    {
    "analyzer": "whitespace",
    "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }


    GET /_analyze
    {
    "analyzer": "keyword",
    "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

    GET /_analyze
    {
    "analyzer": "pattern",
    "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

    GET /_analyze
    {
    "analyzer": "stop",
    "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

    // english 分词器 : 保留词根
    GET /_analyze
    {
    "analyzer": "english",
    "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }


    // ###### 中文 分词 #######
    GET /_analyze
    {
    "analyzer": "icu_analyzer",
    "text": "他说的确实在理"
    }

    GET /_analyze
    {
    "analyzer": "standard",
    "text": "他说的确实在理"
    }

    // ik 支持自定义词库,支持热更新分词词典?
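    // ik 分词器测试示意(需要先安装 analysis-ik 插件,且版本与 ES 匹配): ik_max_word 与 ik_smart 的切分粒度不同
    POST /_analyze
    {
    "analyzer": "ik_max_word",
    "text": "他说的确实在理"
    }

    POST /_analyze
    {
    "analyzer": "ik_smart",
    "text": "他说的确实在理"
    }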

2022-03-25

  1. 数据导入问题

    • Kibana 的样例数据, 直接Web操作即可

    • GroupLens 数据集导入(先安装 Logstash),并配置以下数据转换过程;要注意的是需要认证(先到 Kibana 中建立角色和用户并授予写入权限),并且考虑移除多余的字段

      // logstash -f logstash.conf
      input {
      file {
      path => "/home/es/ml-latest-small/movies.csv"
      start_position => "beginning"
      sincedb_path => "/dev/null"
      }
      }

      filter {
      csv {
      separator => ","
      columns => ["id","content","genre"]
      }

      mutate {
      split => { "genre" => "|" }
      remove_field => ["path", "host","@timestamp","message"]
      }

      mutate {
      split => ["content", "("]
      add_field => { "title" => "%{[content][0]}"}
      add_field => { "year" => "%{[content][1]}"}
      }

      mutate {
      convert => {
      "year" => "integer"
      }
      strip => ["title"]
      remove_field => ["path", "host","@timestamp","message","content","log","@version","event"]
      }
      }
      output {
      elasticsearch {
      cacert => '/home/es/certs/http_ca.crt'
      hosts => ["https://172.16.10.131:9200"]
      index => "movies"
      document_id => "%{id}"
      user => logstash_internal
      password => "dev123"
      }
      stdout {}
      }
    • 也可以通过kibana,web页面直接上传各种格式的文件,进行数据导入

  2. Precision (查准率) 和 Recall (查全率/召回率) 作为搜索质量检测重要标准,是怎么计算的?

    • precision : 已经返回的搜索结果中,相关的文档所占比例
    • recall : 已返回的搜索结果中相关的文档,在所有的实际相关文档中的所占比例
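    • 举例(按上面的定义做简单推算): 假设一次查询返回 10 条结果,其中 7 条相关,而索引中实际相关的文档共 20 条,则 precision = 7/10 = 0.7,recall = 7/20 = 0.35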
  3. URI Search

    • URI Query String Syntax :

      // GET /movies/_search?q=2012&df=title&sort=year:desc&from=0&size=10&timeout=1s
      {
      "profile": true
      }
      // 指定默认字段(df)上进行查询(含有2012的),按年倒序分页,并设置超时时间
      GET /movies/_search?q=2012&df=title&sort=year:desc&from=0&size=20&timeout=1s
      {
      "profile": true
      }
      // 泛查询,如果不指定字段,则针对_all,在所有字段上进行搜索
      GET /movies/_search?q=2012
      {
      "profile": true
      }
      GET /movies/_search?q=title:2012
      {
      "profile": true
      }
      // 使用引号,Phrase查询
      GET /movies/_search?q=title:"Beautiful Mind"
      {
      "profile": true
      }

      // 如果这样,则Mind的部分,成为泛查询
      GET /movies/_search?q=title:Beautiful Mind
      {
      "profile": true
      }

      // 如果要做Term查询,则需要括号包裹
      GET /movies/_search?q=title:(Beautiful Mind)
      {
      "profile": true
      }

      GET /movies/_search?q=title:(Beautiful NOT Mind)
      {
      "profile": true
      }

      // AND 查询,两个单词都要包括
      GET /movies/_search?q=title:(Beautiful %2BMind)
      {
      "profile": true
      }

      // 范围查询格式
      GET /movies/_search?q=year:>=1980
      {
      "profile": true
      }

      // 通配符
      GET /movies/_search?q=title:b*
      {
      "profile": true
      }

      // 模糊匹配 : 容忍一定的编辑距离(~2 表示最多2处差异)
      GET /movies/_search?q=title:beautial~2
      {
      "profile": true
      }

      // 设定slop值(短语中词项之间允许的间隔)
      GET /movies/_search?q=title:"Lord Rings"~2
      {
      "profile": true
      }
    • _cat 端点使用:

      // GET /_cat/indices/kibana*?v&s=index
      // GET /_cat/indices?v&health=green
      // GET /_cat/indices?v&s=docs.count:desc
      // GET /_cat/indices?v
  4. 查询表达式 - Query DSL 使用

    • source filtering : 可以过滤返回结果中 _source 里存储的字段(示例见下方 script field 代码之后)
    • match all : 匹配全部文档
    • script field : 通过脚本动态计算出新的返回字段
    // 脚本字段
    GET /kibana_sample_data_ecommerce/_search
    {
    "script_fields": {
    "new_field": {
    "script": {
    "lang": "painless",
    "source": "doc['order_date'].value + '_hello'"
    }
    }
    },
    "query": {
    "match_all": {}
    }
    }
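    // 针对上面提到的 source filtering 与 match all 的补充示意(字段名取自 kibana_sample_data_ecommerce 样例数据): 只返回 _source 中指定的字段
    GET /kibana_sample_data_ecommerce/_search
    {
    "_source": ["order_date", "currency"],
    "query": {
    "match_all": {}
    },
    "size": 2
    }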
    • Match
    // DSL  - Match, 分词查询
    GET /movies/_search
    {
    "query": {
    "match": {
    "title": "Beautiful Mind"
    }
    }
    }

    GET /movies/_search
    {
    "query": {
    "match": {
    "title": {
    "query": "Beautiful Mind",
    "operator": "and"
    }
    }
    }
    }
  • Match Phrase

    // match phrase 查询,要求词项按顺序出现,可以通过 slop 设置词项之间允许的间隔
    GET /movies/_search
    {
    "query": {
    "match_phrase": {
    "title": "one love"
    }
    }
    }

    GET /movies/_search
    {
    "query": {
    "match_phrase": {
    "title": {
    "query": "one love",
    "slop": 1
    }
    }
    }
    }
  • Query String Query : 将URI格式变为 结构化查询格式

    POST movies/_search
    {
    "query": {
    "query_string": {
    "default_field": "title",
    "query": "Discovery"
    }
    }
    }

    POST movies/_search
    {
    "query": {
    "query_string": {
    "fields": ["title", "genre"],
    "query": "(These AND Drama) OR (Generation AND Action)"
    }
    }
    }
  • Simple Query String Query : 类似于上面的 Query String,但是会忽略错误的语法,只支持部分查询语法;不支持 AND OR NOT 关键字,会将它们当做普通字符串处理(可以用 default_operator 或 + | - 符号表达逻辑)

    POST movies/_search
    {
    "query": {
    "simple_query_string": {
    "query": "Adven ",
    "fields": ["title", "genre"],
    "default_operator": "AND"
    }
    }
    }
  1. Mapping 设置

    • 更改Mapping的规则: 新增字段可以直接修改,已有字段的类型变更则需要重建索引

    • 可以对mapping 的 dynamic 开关进行设置 : true / false / strict, 影响新字段的文档写入和索引行为

      • 如果是 true : 则可以新增字段,并对该字段进行索引和查询

      • 如果是false : 则新字段的数据可以正常写入文档,但该字段不会被索引、不能被查询,mapping中也不会有该字段的设置

        // 验证关闭动态映射后,索引新字段不会再生成映射,但是可以正常存入数据,但是不能查询该字段
        PUT /my-map-test/_doc/3
        {
        "firstname": "Dong",
        "lastname": "FuQiang",
        "logindate": "2029-02-01T10:30:00"
        }

        PUT /my-map-test/_mapping
        {
        "dynamic": false
        }

        PUT /my-map-test/_doc/4
        {
        "newField": "new fields"
        }

        GET /my-map-test/_mapping

        GET /my-map-test/_search
        {
        "query": {
        "term": {
        "newField": {
        "value": "new fields"
        }
        }
        }
        }
      • 如果是strict : 则不能新增字段,写入包含新字段的文档会直接报错,也不能查询该字段(见下面的示例)
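        // 一个最小示意(基于上面已创建的 my-map-test 索引): 将 dynamic 设为 strict 后,再写入带新字段的文档会返回 strict_dynamic_mapping_exception
        PUT /my-map-test/_mapping
        {
        "dynamic": "strict"
        }

        // 预期失败: newField2 未在 mapping 中定义(newField2 为演示用的假设字段名)
        PUT /my-map-test/_doc/5
        {
        "newField2": "value"
        }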

    • 设定mapping

      • 设置字段不被索引 : "index": true|false

        DELETE users

        PUT users
        {
        "mappings": {
        "properties": {
        "firstname": {
        "type": "text"
        },
        "lastname": {
        "type": "text"
        },
        "mobile": {
        "type": "text",
        // 设置字段不被索引
        "index": false
        }
        }
        }
        }


        PUT users/_doc/1
        {
        "firstname": "Run",
        "lastname": "Yiming",
        "mobile": "1234567"
        }

        // 执行异常,不能对未索引的字段进行查询
        GET users/_search
        {
        "query": {
        "match": {
        "mobile": "1234567"
        }
        }
        }
      • 索引选项配置 : index_options: docs(只记录 doc id)| freqs(再记录词频)| positions(text 类型的默认值,记录 doc id、term frequency、term position)| offsets(再多记录 character offsets);记录的越多,需要的存储空间越大
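        // index_options 的一个最小示意(索引名 my-option-test 为演示用的假设名称): 为 text 字段记录到 offsets 级别
        PUT my-option-test
        {
        "mappings": {
        "properties": {
        "comment": {
        "type": "text",
        "index_options": "offsets"
        }
        }
        }
        }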

      • 自定义空值 : null_value

      • copy_to : 将多个字段的内容拷贝到一个目标字段上进行搜索;copy_to 的目标字段不会出现在 _source 中,但是需要增加额外存储

        DELETE users
        PUT users
        {
        "mappings": {
        "properties": {
        "firstname": {
        "type": "text",
        "copy_to": "fullname"
        },
        "lastname": {
        "type": "text",
        "copy_to": "fullname"
        },
        "mobile": {
        "type": "keyword",
        // 设置空值的替代值 null_value
        "null_value": "NULL"
        }
        }
        }
        }

        PUT users/_doc/2
        {
        "firstname": "Run",
        "lastname": "Yiming",
        "mobile": null
        }

        // copy_to 将字段内容合并在一起查询,但是返回结构中并没有这个字段
        GET users/_search?q=fullname:(Run Yiming)
        {
        "query": {
        "bool": {
        "should": [
        {
        "match": {
        "mobile": "NULL"
        }
        }
        ]
        }
        }
        }

        // 尝试,使用数组结构时,如果使用了 copy_to 的效果,则只是最后一个元素起作用
        DELETE /my-index
        PUT my-index
        {
        "mappings": {
        "properties": {
        "comments": {
        "type": "nested",
        "properties": {
        "author": {
        "type": "text",
        "copy_to": "full_name"
        },
        "tags": {
        "type": "nested",
        "properties": {
        "interests" : {
        "type": "text",
        "copy_to": "full_name"
        }
        }
        }
        }
        }
        }
        }
        }

        PUT my-index/_doc/1?refresh
        {
        "comments": [
        {
        "author": "kimchy",
        "tags": [
        {
        "interests": "t1"
        }
        ]
        }
        ]
        }

        PUT my-index/_doc/2?refresh
        {
        "comments": [
        {
        "author": "kimchy",
        "tags": [
        {
        "interests": "t1"
        },
        {
        "interests": "t2"
        }
        ]
        },
        {
        "author": "nik9000",
        "tags": [
        {
        "interests": "t2"
        }
        ]
        }
        ]
        }

        PUT my-index/_doc/3?refresh
        {
        "comments": [
        {
        "author": "kimchy"
        }
        ]
        }


        POST /my-index/_search
        {
        "query": {
        "term": {
        "full_name": {
        "value": "t1"
        }
        }
        }
        }

        POST /my-index/_search
        {
        "query": {
        "match": {
        "full_name": "t1 kimchy"
        }
        }
        }
        POST my-index/_search
        {
        "query": {
        "nested": {
        "path": "comments",
        "query": {
        "bool": {
        "must_not": [
        {
        "term": {
        "comments.author": "nik9000"
        }
        }
        ]
        }
        }
        }
        }
        }

      • 数组类型 : 没有专门的数组类型,但是任何字段都可以包含多个相同的数据类型的值
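        // 数组写入的一个最小示意(索引名 array-demo 为演示用的假设名称): 同一字段直接写入多个同类型的值即可
        PUT array-demo/_doc/1
        {
        "tags": ["search", "elasticsearch"]
        }

        GET array-demo/_search
        {
        "query": {
        "match": {
        "tags": "search"
        }
        }
        }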

2022-03-26

  1. 多字段特性

    • 对某个字段设置多个子字段(fields),分别使用不同的类型或分词器,如下例中 company 增加 keyword 子字段、comment 增加 english 分词的子字段
    PUT products
    {
    "mappings": {
    "properties": {
    "company": {
    "type": "text",
    "fields": {
    "keyword": {
    "type": "keyword"
    }
    }
    },
    "comment": {
    "type": "text",
    "fields": {
    "english_comment": {
    "type": "text",
    "analyzer": "english",
    "search_analyzer": "english"
    }
    }
    }
    }
    }
    }
  2. 自定义分词器

    • Character Filter : 对原始文本做预处理,多个可以串联使用
    • Tokenizer : 分词器
    • Token Filters: 分词过滤器
    • 各个分词器使用效果验证:
    // 自定义 html 分析器
    POST _analyze
    {
    "tokenizer": "keyword",
    "char_filter": ["html_strip"],
    "text": "<b>hello word</b>"
    }

    // 完成目标字符的替换
    POST _analyze
    {
    "tokenizer": "standard", // 以字母切分文本
    "char_filter": [
    {
    "type": "mapping", // 映射器转换
    "mappings": [
    "- => _"
    ]
    }
    ],
    "text": "123-456, I-test! test-9990 650-555-1234"
    }


    POST _analyze
    {
    "tokenizer": "standard",
    "char_filter": [
    {
    "type": "mapping",
    "mappings": [
    ":) => happy",
    ":( => sad"
    ]
    }
    ],
    "text": ["I am felling :)", "Felling:( today!"]
    }

    // 正则表达式替换
    POST _analyze
    {
    "tokenizer": "standard",
    "char_filter": [
    {
    "type": "pattern_replace",
    "pattern": "http://(.*)",
    "replacement": "$0"
    }
    ],
    "text": "http://www.elastic.co"
    }

    // 路径分词器
    POST _analyze
    {
    "tokenizer": "path_hierarchy",
    "text": "/user/ymruan/a/b/c/d/e"
    }

    // whitespace 切分 + stop 停用词过滤;注意大写的 The 不会被当作停用词,需要先加 lowercase 转小写(对比下一个例子)
    POST _analyze
    {
    "tokenizer": "whitespace",
    "filter": ["stop"],
    "text": ["The rain in Spain falls mainly on the plain."]
    }

    POST _analyze
    {
    "tokenizer": "whitespace",
    "filter": [
    "lowercase",
    "stop"
    ],
    "text": [
    "The rain in Spain falls mainly on the plain."
    ]
    }

    // 在索引中自定义一个分词器
    PUT my_index
    {
    "settings": {
    "analysis": {
    "analyzer": {
    "my_custmer_analyzer": { // 声明自定义分析器,引用自定义的三大组件
    "type": "custom",
    "char_filter": [
    "emoticons"
    ],
    "tokenizer": "punctuation",
    "filter": [
    "lowercase",
    "english_stop"
    ]
    }
    },
    "tokenizer": {
    "punctuation": { // 自定义一个正则分词器,识别标点符号
    "type": "pattern",
    "pattern": "[ .,!?]"
    }
    },
    "char_filter": {
    "emoticons": {
    "type": "mapping",
    "mappings": [ // 自定义一个表情过滤转换器
    ":) => _happy_",
    ":( => _sad_"
    ]
    }
    },
    "filter": {
    "english_stop": {
    "type": "stop",
    "stopwords": "_english_" // 增加一个自定义停用词
    }
    }
    }
    }
    }

    // 测试索引的分词效果
    POST my_index/_analyze
    {
    "analyzer": "my_custmer_analyzer",
    "text": ["I'm a :) person, and you?"]
    }
  3. Index Template : 在索引自动创建时套用预设配置,保证新索引的设置一致、完善;模板可以创建多个,ES 会按优先级自动合并设置

    // 优先级 创建索引请求 > 高order index template > 低order值 index template > 默认的setting
    PUT _template/temp_default
    {
    "index_patterns": ["*"],
    "order": 0,
    "version": 1,
    "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
    }
    }

    PUT _template/temp_test
    {
    "index_patterns": ["test*"],
    "order": 1,
    "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 2
    },
    "mappings": {
    "date_detection": false, // 关闭日期自动检测
    "numeric_detection": true
    }
    }

    // 查看 template 信息
    GET /_template/temp*


    PUT ttemp/_doc/1
    {
    "somenumber": "1",
    "somedate": "2019/01/01"
    }

    GET ttemp/_mapping


    PUT /testtemplate/_doc/1
    {
    "somenumber": "1",
    "somedate": "2019/01/01"
    }

    // 发现命中索引模板的文档,字段已经被设置的template影响
    GET /testtemplate/_mapping

  4. Dynamic Template : Index Template 用来控制索引的创建设置,而动态模板,用来根据字段的名称,控制字段类型的设置(如:is开头的都是bool类型,long开头的都是值类型)

    • 需要在mapping中进行设置

      PUT my_index/_doc/1
      {
      "fistName": "Ruan",
      "isVIP": "true"
      }

      GET my_index/_mapping
      DELETE my_index
      // 使用动态模板,识别特定格式的字段
      PUT my_index
      {
      "mappings": {
      "dynamic_templates": [
      {
      "string_as_boolean": {
      "match_mapping_type": "string",
      "match": "is*",
      "mapping": {
      "type": "boolean"
      }
      }
      },
      {
      "string_as_keywords": {
      "match_mapping_type": "string",
      "mapping": {
      "type": "keyword"
      }
      }
      }
      ]
      }
      }

      PUT my_index/_doc/1
      {
      "fistName": "Ruan",
      "isVIP": "true"
      }

      // 验证动态模板: 匹配到目标类型,使用匹配规则,匹配到字段(fistName,isVIP),然后转化设置的映射类型进行字段的设置
      GET my_index/_mapping
    • 更综合性的动态模板的设置,可以对字段进行复杂的过滤和匹配,然后设置

    // 对字段进行规则匹配,成功后则进行 copy_to 的设置,将多个字段合并到一个字段进行搜索
    PUT my_index
    {
    "mappings": {
    "dynamic_templates": [
    {
    "full_name": {
    "path_match": "name.*",
    "path_unmatch": "*.middle",
    "mapping": {
    "type": "text",
    "copy_to": "full_name"
    }
    }
    }
    ]
    }
    }

    PUT my_index/_doc/1
    {
    "name": {
    "first": "John",
    "middle": "Winston",
    "last": "Lennon"
    }
    }

    GET /my_index/_mapping

    GET /my_index/_search?q=full_name:(John Le)

2022-03-27

  1. 聚合查询 : Bucket 、Metric、Pipeline、Matrix

  2. Bucket & Metric 简单使用

    GET kibana_sample_data_flights/_search
    {
    "size": 0,
    "aggs": {
    "flight_test": {
    "terms": {
    "field": "DestCountry",
    "size": 3
    }
    }
    }
    }

    GET kibana_sample_data_flights/_search
    {
    "size": 0,
    "aggs": {
    "flight_test": {
    "terms": {
    "field": "DestCountry"
    },
    "aggs": {
    "avg_price": {
    "avg": {
    "field": "AvgTicketPrice"
    }
    },
    "max_price": {
    "max": {
    "field": "AvgTicketPrice"
    }
    },
    "min_price": {
    "min": {
    "field": "AvgTicketPrice"
    }
    },
    "weather": {
    "terms": {
    "field": "DestWeather",
    "size": 5
    }
    }
    }
    }
    }
    }

2022-03-28 section_45 - 47

  1. s45 - 桶聚合和指标聚合

    • 桶聚合 : terms、range、histogram 等,按规则把文档划分到不同的桶
    • 指标聚合 : 单值分析(min, max, avg, sum, cardinality)、多值分析(stats, extended stats, percentile, percentile rank, top hits;top hits 的用法见本节代码末尾的补充示例)
    • 实例
    DELETE /employees

    PUT /employees
    {
    "mappings": {
    "properties": {
    "age": {
    "type": "integer"
    },
    "gender": {
    "type": "keyword"
    },
    "job": {
    "type": "text",
    "fields": {
    "keyword": {
    "type": "keyword",
    "ignore_above": 50
    }
    }
    },
    "name": {
    "type": "keyword"
    },
    "salary": {
    "type": "integer"
    }
    }
    }
    }

    GET /employees/_settings

    PUT /employees/_bulk
    { "index" : { "_id" : "1" } }
    { "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
    { "index" : { "_id" : "2" } }
    { "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
    { "index" : { "_id" : "3" } }
    { "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
    { "index" : { "_id" : "4" } }
    { "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
    { "index" : { "_id" : "5" } }
    { "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
    { "index" : { "_id" : "6" } }
    { "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
    { "index" : { "_id" : "7" } }
    { "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
    { "index" : { "_id" : "8" } }
    { "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
    { "index" : { "_id" : "9" } }
    { "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
    { "index" : { "_id" : "10" } }
    { "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
    { "index" : { "_id" : "11" } }
    { "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
    { "index" : { "_id" : "12" } }
    { "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
    { "index" : { "_id" : "13" } }
    { "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
    { "index" : { "_id" : "14" } }
    { "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
    { "index" : { "_id" : "15" } }
    { "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
    { "index" : { "_id" : "16" } }
    { "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
    { "index" : { "_id" : "17" } }
    { "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
    { "index" : { "_id" : "18" } }
    { "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
    { "index" : { "_id" : "19" } }
    { "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
    { "index" : { "_id" : "20" } }
    { "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}
    { "index": {"_id": "21"}}
    {"name": "Daqiang", "age": null, "job": "CTO", "gender": null, "salary": 1000}



    # // 找出最低工资 、 最高工资、平均工资
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "min_salary": {
    "min": {
    "field": "salary"
    }
    },
    "max_salary": {
    "max": {
    "field": "salary"
    }
    },
    "avg_salary": {
    "avg": {
    "field": "salary"
    }
    }
    }
    }

    # // 一次聚合,多值输出
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "stats_salary": {
    "stats": {
    "field": "salary"
    }
    },
    "ext_stats_salary": {
    "extended_stats": {
    "field": "salary"
    }
    }
    }
    }

    # term aggregation : text 字段需要打开 fielddata 才能直接做聚合;或者使用 keyword(子)字段,其默认支持 doc_values

    # // 对keyword 聚合查询
    POST employees/_search
    {
    "size": 0,
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword", // 对text字段直接聚合,是不可以的(除非修改mapping如下)
    "size": 100
    }
    }
    }
    }

    # 修改text字段的terms分词设置,打开fielddata
    PUT /employees/_mapping
    {
    "properties": {
    "job": {
    "type": "text",
    "fielddata": true
    }
    }
    }

    POST employees/_search
    {
    "size": 0,
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job", // 改成了对分词进行桶聚合
    "size": 100
    }
    }
    }
    }

    # 比较 job.keyword 和 job 的term 基数聚合结果,分桶的总数不同
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "distinct_count_job": {
    "cardinality": {
    "field": "job"
    }
    },
    "distinct_count_job_keyword": {
    "cardinality": {
    "field": "job.keyword"
    }
    }
    }
    }

    # 性别的keyword 聚合
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "gender_agg": {
    "terms": {
    "field": "gender",
    "size": 10
    }
    }
    }
    }

    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "age_agg": {
    "terms": {
    "field": "age",
    "size": 10
    }
    }
    }
    }

    // terms 聚合的预热设置: eager_global_ordinals 让字段的全局序数在 refresh 时预先构建好(此处仅作示意,索引已存在时需先删除再重建)
    PUT /employees
    {
    "mappings": {
    "properties": {
    "age": {
    "type": "keyword",
    "eager_global_ordinals": true
    }
    }
    }
    }


    // range 和 histogram 聚合
    POST employees/_search
    {
    "size": 0,
    "aggs": {
    "salary_range": {
    "range": {
    "field": "salary",
    "ranges": [
    {
    "to": 10000
    },
    {
    "from": 10000,
    "to": 20000
    },
    {
    "key": ">20000",
    "from": 20000
    }
    ]
    }
    }
    }
    }

    // 直方图 ,间隔聚合
    POST employees/_search
    {
    "size": 0,
    "aggs": {
    "salary_histogram": {
    "histogram": {
    "field": "salary",
    "interval": 5000,
    "extended_bounds": {
    "min": 0,
    "max": 100000
    }
    }
    }
    }
    }


    // 嵌套聚合
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "job_salary_stats": {
    "terms": {
    "field": "job.keyword",
    "size": 10
    },
    "aggs": {
    "salary_stats": {
    "stats": {
    "field": "salary"
    }
    }
    }
    }
    }
    }

    // 多级嵌套聚合
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "job_sex_salary": {
    "terms": {
    "field": "job.keyword",
    "size": 10
    },
    "aggs": {
    "sex_salary": {
    "terms": {
    "field": "gender",
    "size": 10
    },
    "aggs": {
    "salary_stats": {
    "stats": {
    "field": "salary"
    }
    }
    }
    }
    }
    }
    }
    }
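    // 上面提到的 top hits 多值分析的一个补充示意: 按工种分桶后,取每个桶内工资最高的前 2 名员工
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword",
    "size": 10
    },
    "aggs": {
    "top_salary_hits": {
    "top_hits": {
    "size": 2,
    "sort": [
    {
    "salary": "desc"
    }
    ]
    }
    }
    }
    }
    }
    }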
  2. s46 - pipeline 聚合分析

    • 通过对其他聚合分析的结果再做一次分析,找到目标聚合结果
    • 分为两类 :
      • Sibling (结果和现有分析结果同级): Max 、Min 、 Avg 、Sum、Stats、Percentiles
      • Parent (结果内嵌到现有的聚合分析结果中): Derivative (求导)、Cumulative Sum(累计求和,补充示例见本节代码末尾)、Moving Function(滑动窗口)
    • 实例
    GET employees/_doc/21
    // 找到平均工资最低和最高的工种
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "job_class": {
    "terms": {
    "field": "job.keyword",
    "size": 2
    },
    "aggs": {
    "salary_avg": {
    "avg": {
    "field": "salary"
    }
    }
    }
    },
    "min_salary_by_job": {
    "min_bucket": {
    "buckets_path": "job_class>salary_avg"
    }
    },
    "max_salary_by_job": {
    "max_bucket": {
    "buckets_path": "job_class>salary_avg"
    }
    }
    }
    }

    // 平均工资统计分析
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword",
    "size": 10
    },
    "aggs": {
    "avg_salary": {
    "avg": {
    "field": "salary"
    }
    }
    }
    },
    "stats_salary_by_job": {
    "stats_bucket": {
    "buckets_path": "jobs>avg_salary"
    }
    },
    "percentiles_salary_by_job": {
    "percentiles_bucket": {
    "buckets_path": "jobs>avg_salary"
    }
    }
    }
    }

    // 按照年龄对平均工资求导
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "age_histogram": {
    "histogram": {
    "field": "age",
    "min_doc_count": 0,
    "interval": 1
    },
    "aggs": {
    "salary_avg": {
    "avg": {
    "field": "salary"
    }
    },
    "salary_avg_derivative": {
    "derivative": {
    "buckets_path": "salary_avg"
    }
    }
    }
    }
    }
    }
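    // 上面提到的 Cumulative Sum(累计求和)的一个补充示意: 在年龄直方图上对平均工资做累计求和
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "age_histogram": {
    "histogram": {
    "field": "age",
    "interval": 5
    },
    "aggs": {
    "salary_avg": {
    "avg": {
    "field": "salary"
    }
    },
    "salary_cumulative_sum": {
    "cumulative_sum": {
    "buckets_path": "salary_avg"
    }
    }
    }
    }
    }
    }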
  3. s47 - 作用范围和排序

    • 聚合的默认作用范围是 query 的查询范围
    • 但是,ES 还支持对聚合的作用范围做进一步调整 : Filter 、 Post Filter 、 Global(分别见下面实例)
    • 实例:
    // query
    POST employees/_search
    {
    "size": 0,
    "query": {
    "range": {
    "age": {
    "gte": 20
    }
    }
    },
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword"
    }
    }
    }
    }
    // filter aggregation
    POST employees/_search
    {
    "size": 0,
    "aggs": {
    "older_person": {
    "filter": {
    "range": {
    "age": {
    "gte": 30
    }
    }
    },
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword",
    "size": 10
    }
    }
    }
    },
    "all_jobs": {
    "terms": {
    "field": "job.keyword"
    }
    }
    }
    }


    // post filter : 聚合计算完成后,再对返回的命中文档(hits)做过滤,不影响聚合结果
    POST employees/_search
    {
    "size": 100,
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword"
    }
    }
    },
    "post_filter": {
    "match": {
    "job.keyword": "Dev Manager"
    }
    }
    }

    // global 聚合中忽略query查询条件
    POST /employees/_search
    {
    "size": 0,
    "query": {
    "range": {
    "age": {
    "gte": 40
    }
    }
    },
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword",
    "size": 10
    }
    },
    "all": {
    "global": {},
    "aggs": {
    "salary_avg": {
    "avg": {
    "field": "salary"
    }
    }
    }
    }
    }
    }

    // agg 排序, term 排序默认以文档数排序
    POST employees/_search
    {
    "size": 0,
    "query": {
    "range": {
    "age": {
    "gte": 20
    }
    }
    },
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword",
    "size": 10,
    "order": [
    {
    "_count": "asc"
    },
    {
    "_key": "desc"
    }
    ]
    }
    }
    }
    }

    // 以子指标聚合为排序依据
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword",
    "size": 10,
    "order": [
    {
    "avg_salary": "desc"
    }
    ]
    },
    "aggs": {
    "avg_salary": {
    "avg": {
    "field": "salary"
    }
    }
    }
    }
    }
    }


    // 以子指标聚合为排序依据
    POST /employees/_search
    {
    "size": 0,
    "aggs": {
    "jobs": {
    "terms": {
    "field": "job.keyword",
    "size": 10,
    "order": [
    {
    "stats_salary.min": "desc"
    }
    ]
    },
    "aggs": {
    "stats_salary": {
    "stats": {
    "field": "salary"
    }
    }
    }
    }
    }
    }

2022-03-29 section_24

  1. 词项和全文搜索 : Term 是表达语义的最小单位;搜索和利用统计语言模型进行自然语言处理都需要处理 Term

    • 复合查询 : Constant Score 转为 Filter
    POST /products/_bulk
    { "index": { "_id": 1 }}
    { "productID" : "XHDK-A-1293-#fJ3","desc":"iPhone" }
    { "index": { "_id": 2 }}
    { "productID" : "KDKE-B-9947-#kL5","desc":"iPad" }
    { "index": { "_id": 3 }}
    { "productID" : "JODL-X-1937-#pV7","desc":"MBP" }

    POST /products/_search
    {
    "query": {
    "term": {
    "productID.keyword": {
    "value": "XHDK-A-1293-#fJ3"
    }
    }
    }
    }

    // 使用 constant_score 避免算分,并且利用filter缓存
    POST /products/_search
    {
    "query": {
    "constant_score": {
    "filter": {
    "term": {
    "productID.keyword": {
    "value": "XHDK-A-1293-#fJ3"
    }
    }
    }
    }
    }
    }

    • term 的查询需要注意索引阶段分词的影响
    POST /products/_bulk
    { "index": { "_id": 1 }}
    { "productID" : "XHDK-A-1293-#fJ3","desc":"iPhone" }
    { "index": { "_id": 2 }}
    { "productID" : "KDKE-B-9947-#kL5","desc":"iPad" }
    { "index": { "_id": 3 }}
    { "productID" : "JODL-X-1937-#pV7","desc":"MBP" }

    // 数据被分词器做了小写转化,所以大写查不到
    POST /products/_search
    {
    "query": {
    "term": {
    "desc": {
    //"value": "iPhone"
    "value": "iphone"
    }
    }
    }
    }
    POST /products/_search
    {
    "query": {
    "term": {
    "productID": {
    //"value": "xhdk-a-1293-#fJ3",
    "value": "xhdk"
    }
    }
    }
    }
    POST /products/_search
    {
    "query": {
    "term": {
    "productID.keyword": {
    "value": "XHDK-A-1293-#fJ3"
    }
    }
    }
    }

    POST /_analyze
    {
    "analyzer": "standard",
    "text": ["XHDK-A-1293-#fJ3"]
    }
  2. Term Level Query: Term Query / Range Query / Exists Query / Prefix Query / Wildcard Query

  3. 默认情况下,Term查询不对输入条件做分词,而是作为一个整体,查找倒排索引,并且使用相关度算分公式为每个包含该词项的文档进行相关度算分

  4. 可以通过 Constant Score 将查询转换成一个 Filtering , 避免相关性算分的消耗,并且可以利用缓存

  5. 基于全文的查询 : Match Query / Match Phrase Query / Query String Query

    • 索引和搜索都会进行分词,查询词会进行分词后,分别查询, 汇总各自得分
    • Match Query
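    // Match Query 的一个补充示意(基于上面的 movies 索引): 默认各分词后的词项之间是 or 关系,可以用 operator 或 minimum_should_match 收紧匹配条件
    GET /movies/_search
    {
    "query": {
    "match": {
    "title": {
    "query": "Beautiful Mind",
    "minimum_should_match": 2
    }
    }
    }
    }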

2022-03-30/31 section_25 - section_29

  1. 结构化搜索,指的是对结构化的数据进行搜索, 主要分为布尔、时间、日期、数字、文本

    • 结构化搜索,可以做精确匹配或者部分匹配, 如: Term查询 和 Prefix 查询
    // 结构化搜索,精确
    DELETE products
    POST /products/_bulk
    { "index": { "_id": 1 }}
    { "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
    { "index": { "_id": 2 }}
    { "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
    { "index": { "_id": 3 }}
    { "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
    { "index": { "_id": 4 }}
    { "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }

    GET products/_mapping
    // 对布尔值 match 查询,含有算分
    POST /products/_search
    {
    "profile": true,
    "explain": true,
    "query": {
    "term": {
    "avaliable": "true"
    }
    }
    }

    // 无算分
    POST /products/_search
    {
    "profile": true,
    "explain": true,
    "query": {
    "constant_score": {
    "filter": {
    "term": {
    "avaliable": "true"
    }
    }
    }
    }
    }

    GET products/_search
    {
    "query": {
    "constant_score": {
    "filter": {
    "range": {
    "price": {
    "gte": 20,
    "lte": 30
    }
    }
    },
    "boost": 1.2
    }
    }
    }

    // Date Math Expressions : y 年、M 月、w 周、d 天、H/h 小时、m 分钟、s 秒
    GET /products/_search
    {
    "query": {
    "constant_score": {
    "filter": {
    "range": {
    "date": {
    "gte": "now-234y"
    }
    }
    },
    "boost": 1.2
    }
    }
    }

    // Exists
    POST /products/_search
    {
    "query": {
    "constant_score": {
    "filter": {
    "exists": {
    "field": "date"
    }
    },
    "boost": 1.2
    }
    }
    }



    POST /movies/_bulk
    { "index": { "_id": 1 }}
    { "title" : "Father of the Bridge Part II","year":1995, "genre":"Comedy"}
    { "index": { "_id": 2 }}
    { "title" : "Dave","year":1993,"genre":["Comedy","Romance"] }
    // 处理多值字段,term 查询是 包含,而不是等于,如果要求对多值进行严格匹配,可以单独加一个数组计数字段,也可以runtime_field
    POST /movies/_search
    {
    "query": {
    "constant_score": {
    "filter": {
    "term": {
    "genre.keyword": "Comedy"
    }
    }
    }
    }
    }
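    // 针对上面注释中提到的数组精确匹配问题的一个示意做法(genre_count 为演示用的假设字段,需要在写入文档时额外维护): 同时约束包含 Comedy 且数组长度为 1
    POST /movies/_search
    {
    "query": {
    "bool": {
    "filter": [
    {
    "term": {
    "genre.keyword": "Comedy"
    }
    },
    {
    "term": {
    "genre_count": 1
    }
    }
    ]
    }
    }
    }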
  2. 相关性和相关性算分

    • ES5 之前默认使用 TF-IDF 进行相关性算分,目前采用 BM25 算法,目标是计算出查询语句和文档之间的匹配程度
    • TF (词频) : Term Frequency,检索词在文档中出现的频率 = 该词项出现次数/文档总词数;如果是多个检索词,则分别计算 TF 后求和作为整体 TF;停用词虽然出现很多次,但对相关度几乎没有贡献,不考虑其 TF 值
    • IDF (逆文档频率): DF 是检索词在所有文档中出现的频率,IDF = log(全部文档数量/检索词出现过的文档总数),即一个词项出现在越多的文档里,IDF 值越小
    • TF-IDF 算法把 TF 的简单求和变为加权求和 : 将各个检索词项的 TF * IDF 相加
    • BM25 算法引入全部文档的平均长度,用每个文档长度与平均长度的比值作为影响 TF 的因子进行调节(两种算法的公式整理见本节末尾)
    • 可以在索引设置中,定制相似性设置,并制定算分函数
    PUT /my-index
    {
    "settings": {
    "index": {
    "similarity": {
    "my_similarity": {
    "type": "DFR",
    "basic_model": "g",
    "after_effect": "l",
    "normalization": "h2",
    "normalization.h2.c": "3.0"
    }
    }
    }
    },
    "mappings": {
    "properties": {
    "title": {
    "type": "text",
    "similarity": "my_similarity"
    }
    }
    }
    }
    • explain 查看算分 : 可以看到算分公式中各个因子的取值、来源及对应解释,方便手动计算进行校验
    // 通过 Explain API 查看TF-IDF
    PUT testscore/_bulk
    { "index": { "_id": 1 }}
    { "content":"we use Elasticsearch to power the search" }
    { "index": { "_id": 2 }}
    { "content":"we like elasticsearch" }
    { "index": { "_id": 3 }}
    { "content":"The scoring of documents is caculated by the scoring formula" }
    { "index": { "_id": 4 }}
    { "content":"you know, for search" }

    // 每个命中的文档,都会进行一次详细的算分过程,并通过explanation进行输出
    POST /testscore/_search
    {
    "explain": true,
    "query": {
    "match": {
    "content": "you"
    // "content": "elasticsearch"
    // "content": "the"
    //"content": "the elasticsearch"
    }
    }
    }
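    // 公式整理(对应上面 TF-IDF / BM25 的描述,Lucene/ES 的实际实现在细节上可能略有出入):
    // TF-IDF: score(q,d) = \sum_{t \in q} tf(t,d) \cdot \log( N / df(t) )
    // BM25:   score(q,d) = \sum_{t \in q} IDF(t) \cdot \frac{tf(t,d) \cdot (k_1 + 1)}{tf(t,d) + k_1 \cdot (1 - b + b \cdot |d| / avgdl)}
    // 其中 N 为文档总数,df(t) 为包含词项 t 的文档数,|d| 为文档长度,avgdl 为全部文档的平均长度,k_1、b 为可调参数(ES 默认 k_1=1.2,b=0.75)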
  3. Query Context & Filter Context

    • Query 会进行相关性的算分,而Filter则只回答是否并且可以进行缓存
    • bool查询 : must 和 should有算分,must_not 和 filter 属于Filter Context,不进行算分
    • 如果结构化查询中,对数组对象进行查询,es是包含查询,而不是精确相等,所以可以在设置mapping时,增加一个计数器字段
    • 存疑???bool 查询语句的结构,会影响相关度的算分 (同一层级下的竞争字段,具有相同的权重;嵌套的bool查询可以改变算分的影响) ,如:
    PUT /animals/_doc/1
    {"content": "2 running Quick brown-foxes leap over lazy dog in the summer evening."}

    GET /animals/_search
    {"explain":true,"query":{"bool":{"should":[{"term":{"content":{"value":"brown"}}},{"term":{"content":{"value":"red"}}},{"term":{"content":{"value":"quick"}}},{"term":{"content":{"value":"dog"}}}]}}}

    GET /animals/_search
    {"explain":true,"query":{"bool":{"should":[{"term":{"content":{"value":"quick"}}},{"term":{"content":{"value":"dog"}}},{"bool":{"should":[{"term":{"content":{"value":"brown"}}},{"term":{"content":{"value":"red"}}}]}}]}}}
    • 控制字段的 Boosting : boost 大于1则提升权重,0到1之间则降低权重; 小于0,则贡献负分
    DELETE blogs
    POST /blogs/_bulk
    { "index": { "_id": 1 }}
    {"title":"Apple iPad", "content":"Apple iPad,Apple iPad" }
    { "index": { "_id": 2 }}
    {"title":"Apple iPad,Apple iPad", "content":"Apple iPad" }

    POST blogs/_search
    {
    "query": {
    "bool": {
    "should": [
    {
    "match": {
    "title": {
    "query": "apple,ipad",
    "boost": 4 // 让命中的评分增益排名往前
    }
    }
    }
    ]
    }
    }
    }
    // 要求苹果公司的产品信息优先展示
    DELETE news
    POST /news/_bulk
    { "index": { "_id": 1 }}
    { "content":"Apple Mac" }
    { "index": { "_id": 2 }}
    { "content":"Apple iPad" }
    { "index": { "_id": 3 }}
    { "content":"Apple employee like Apple Pie and Apple Juice" }

    POST /news/_search
    {
    "query": {
    "bool": {
    "must": [
    {
    "match": {
    "content": "apple"
    }
    }
    ]
    }
    }
    }

    POST /news/_search
    {
    "query": {
    "bool": {
    "must": [
    {
    "match": {
    "content": "apple"
    }
    }
    ],
    "must_not": [ // 排除了该条件的文档,不会返回,对比一下 boosting 查询
    {
    "match": {
    "content": "pie"
    }
    }
    ]
    }
    }
    }

    // 使用 boosting 查询,则降低权重往后排,但是可以返回
    POST /news/_search
    {
    "query": {
    "boosting": {
    "positive": {
    "match": {
    "content": "apple"
    }
    },
    "negative": { // 弱化
    "match": {
    "content": "pie"
    }
    },
    "negative_boost": 0.5 // 减益因子,让弱化匹配向后排
    }
    }
    }
  4. 多字段 - 单字符串查询 (dis_max): 跨多个字段查询同一个字符串时,需要考虑评分叠加的影响,对比 dis_max 和 bool should 的差别

    DELETE blogs
    PUT /blogs/_doc/1
    {
    "title": "Quick brown rabbits",
    "body": "Brown rabbits are commonly seen."
    }

    PUT /blogs/_doc/2
    {
    "title": "Keeping pets healthy",
    "body": "My quick brown fox eats rabbits on a regular basis."
    }


    // 普通 should 查询: 文档1 在 title 和 body 中都出现了 brown,两个字段的算分会叠加;文档2 只在 body 中出现,但完整包含了 brown fox,直观上与查询更相关,评分却可能更低
    POST blogs/_search
    {
    "explain": true,
    "query": {
    "bool": {
    "should": [
    {
    "match": {
    "title": "Brown fox"
    }
    },
    {
    "match": {
    "body": "Brown fox"
    }
    }
    ]
    }
    }
    }

    // title 和 body 字段的查询在 should 中会叠加计分,但我们希望突出单个最佳匹配,所以使用 Disjunction Max Query: 任一子查询匹配到的文档都会返回,评分则取其中最佳匹配子查询的评分
    POST /blogs/_search
    {
    "query": {
    "dis_max": { // 这样一来,每个文档的最佳匹配的分值会返回
    "queries": [
    {
    "match": {
    "title": "Brown fox"
    }
    },
    {
    "match": {
    "body": "Brown fox"
    }
    }
    ]
    }
    }
    }

    // 设置 tie_breaker, 打破平局,让最佳评分获得非最佳评分的增益,以区分其他文档
    POST /blogs/_search
    {
    "query": {
    "dis_max": {
    "queries": [
    {
    "match": {
    "title": "Quick pets"
    }
    },
    {
    "match": {
    "body": "Quick pets"
    }
    }
    ]
    //,"tie_breaker": 0.1 // 如果没有此设置,则本查询结果的各文档分值一样。可以做到: 获得最佳的匹配语句的评分;然后,让其他匹配项与tie_breaker值相乘;最终汇总两者评分规范化输出。这个值介于0-1之间,0代表使用最佳匹配返回;1代表每个匹配项同等重要;
    }
    }
    }
  5. 多字段 - 单字符串查询 : 使用 Multi Match 处理字段间的竞争与关联关系(竞争指文档之间比较计分,关联指文档内部多个字段匹配时计分会相互叠加影响)

    • Best Fields : 文档评分取最佳匹配字段(评分最高的子查询)的评分,并且可以指定 tie_breaker (让其他字段的评分部分参与) 和 minimum_should_match
    • Most Fields : 最多匹配,可以让多个匹配字段共同起作用,匹配的字段项越多越好,并可以突出设置一些权重,提升算分
    • Cross Fields : 整合全部字段的内容进行跨字段搜索,匹配的算分越多越好
    // 和上面的 dis_max 有相似之处
    POST /blogs/_search
    {
    "query": {
    "multi_match": {
    "type": "best_fields", // 默认best
    "query": "Quick pets", // 两个文档评分相同
    "tie_breaker": 0.2,
    "minimum_should_match": 2,
    "fields": ["title", "body"]
    }
    }
    }
    // 案例: 使用 english 分词器后时态信息丢失、精确度降低,要靠增加子字段,并通过 most_fields 让尽可能多的字段参与匹配算分,得出结果
    PUT /titles
    {
    "mappings": {
    "properties": {
    "title": {
    "type": "text",
    "analyzer": "english"
    }
    }
    }
    }

    POST /titles/_bulk
    {"index": {"_id": 1}}
    {"title": "My dog barks"}
    {"index": {"_id": 2}}
    {"title": "I see a lot of barking dogs on the road "}

    GET titles/_search
    {
    "query": {
    "match": {
    "title": "barking dogs" // 英语的时态语法丢失,导致相关性精确度降低
    }
    }
    }

    DELETE /titles
    // 使用子字段,使用不同的分词器进行设置
    PUT /titles
    {
    "mappings": {
    "properties": {
    "title": {
    "type": "text",
    "analyzer": "english",
    "fields": {
    "std": {
    "type": "text",
    "analyzer": "standard"
    }
    }
    }
    }
    }
    }

    GET /titles/_search
    {
    "query": {
    "multi_match": {
    "query": "barking dogs",
    "type": "most_fields", // 将两个字段的算分进行叠加
    "fields": ["title", "title.std"]
    }
    }
    }

    GET /titles/_search
    {
    "query": {
    "multi_match": {
    "query": "barking dogs",
    "type": "most_fields", // 将两个字段的算分进行叠加
    "fields": ["title^2", "title.std"] // 有时候,需要控制某些字段单独做一些权重提升
    }
    }
    }

    // Cross Fields: treat several fields as one combined field when matching and scoring
    PUT /address/_doc/1
    {
    "street": "5 Poland Street",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "W1V 3DG"
    }

    PUT /address/_doc/2
    {
    "street": "5 Poland Street",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "W1S 3DG"
    }

    POST /address/_search
    {
    "query": {
    "multi_match": {
    "query": "Poland Street W1V",
    "type": "cross_fields",
    "fields": ["street", "city", "country", "postcode"]
    }
    }
    }

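    One more knob worth noting, as a small hedged sketch reusing the same address documents: cross_fields can be combined with "operator": "and", which requires every term in the query string to be found in at least one of the listed fields.

    POST /address/_search
    {
      "query": {
        "multi_match": {
          "query": "Poland Street W1V",
          "type": "cross_fields",
          "operator": "and", // every term must appear in at least one of the fields
          "fields": ["street", "city", "country", "postcode"]
        }
      }
    }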

2022-04-02 section_30 - section_31

  1. Multi-language and Chinese tokenization for search

    • Natural-language text rarely matches query terms exactly, so matching needs to cover stemming to word roots, token normalization (removing diacritics), synonyms, misspellings, and homophones/homographs

    • Multi-language storage and querying : store the same content in multiple fields, one per analyzer (see the ik + pinyin sketch after the installation examples below)

      • Detect the user's context: the browser language, geolocation, or even a probabilistic guess of the language being typed, and search accordingly
      • A single text can mix languages; e.g. German words inside English text tend to score higher because they are rarer
    • English tokenization has its own questions: is You're one token or two, and should Half-baked be split or kept whole

    • Chinese tokenization must handle compound ambiguity, overlapping ambiguity and true ambiguity, e.g. 中华人民共和国 、美国会通过对台售武法案、上海仁和服装厂

      • Approaches include minimum-length segmentation, statistical language models, and statistics-based machine-learning algorithms (HMM, CRF, SVM, deep learning, etc.) that take context into account

      • HanLP analyzer (https://www.hanlp.com/), IK analyzer, and Pinyin analyzer (for searching Chinese characters by pinyin)

      • Remote dictionaries can be configured, e.g. remote extension dictionaries and stop-word dictionaries

      • IK supports hot-reloading of its dictionaries

      • Installation and configuration:

        <!-- elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/release/download/hanlp.zip -->
        <!-- ES_HOME/config/analysis-hanlp/hanlp-remote.xml -->
        <properties>
        <comment>HanLP Analyzer remote dictionary config</comment>
        <!-- remote extension dictionary -->
        <entry key="remote_ext_dict">words_location</entry>
        <!-- remote extension stop-word dictionary -->
        <entry key="remote_ext_stopwords">stop_words_location</entry>
        </properties>
        # With the ik analyzer plugin installed, map a field to use ik_max_word for indexing and ik_smart for searching
        curl -XPOST http://localhost:9200/index/_mapping -H 'Content-Type:application/json' -d'
        {
        "properties": {
        "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
        }
        }
        }
        '
        // HanLP analyzers (no 8.0 build of the plugin was found, so this part was not tried hands-on)
        // hanlp_standard
        // hanlp_index : index-oriented segmentation
        // hanlp_nlp : NLP segmentation
        // hanlp_n_short : N-shortest-path segmentation
        // hanlp_dijkstra : shortest-path segmentation
        // hanlp_crf : CRF segmentation
        // hanlp_speed : high-speed dictionary-based segmentation
        POST _analyze
        {
        "analyzer": "hanlp_standard",
        "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
        }

        POST /_analyze
        {
        "analyzer": "pinyin",
        "text": ["刘德华"]
        }

        POST _analyze
        {
        "analyzer": "ik_max_word",
        "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
        }


        // Pinyin analyzer
        DELETE artists
        PUT /artists
        {
        "settings": {
        "analysis": {
        "analyzer": {
        "user_name_analyzer": {
        "tokenizer": "whitespace",
        "filter": "pinyin_first_letter_and_full_pinyin_filter"
        }
        },
        "filter": {
        "pinyin_first_letter_and_full_pinyin_filter": {
        "type": "pinyin",
        "keep_first_leeter": true,
        "keep_full_pinyin": false,
        "keep_none_chinese": true,
        "keep_original": false,
        "limit_first_letter_length": 16,
        "lowercase": true,
        "trim_whitespace": true,
        "keep_none_chinese_in_first_letter": true
        }
        }
        }
        }
        }

        GET /artists/_analyze
        {
        "text": ["刘德华 张学友 郭富城 黎明 四大天王"],
        "analyzer": "user_name_analyzer"
        }
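        As a follow-up to the multi-field storage idea above, a minimal sketch that indexes one field with an IK analyzer and a pinyin sub-field, so the same content can be searched by Chinese text or by pinyin. Assumptions: both the analysis-ik and analysis-pinyin plugins are installed, the index name mixed-demo is made up here, and with the pinyin plugin's default settings the first-letter form ("ldh") is indexed and should match.

        PUT /mixed-demo
        {
          "mappings": {
            "properties": {
              "name": {
                "type": "text",
                "analyzer": "ik_max_word",
                "fields": {
                  "pinyin": {
                    "type": "text",
                    "analyzer": "pinyin"
                  }
                }
              }
            }
          }
        }

        PUT /mixed-demo/_doc/1
        {
          "name": "刘德华"
        }

        // search either the Chinese field or its pinyin sub-field
        POST /mixed-demo/_search
        {
          "query": {
            "multi_match": {
              "query": "ldh",
              "fields": ["name", "name.pinyin"]
            }
          }
        }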

  2. Space Jam search example - TMDB API

    • Python script: create the index with the default analyzer, import the data and query; relevance is poor
    • Python script: rebuild the index with the english analyzer and query again; relevance improves
    • Add highlighting to inspect how the query matched (see the sketch below)
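    A minimal sketch of the highlighting step, assuming the TMDB data was imported into an index named tmdb with title and overview fields (the same index the search-template example below uses):

    POST /tmdb/_search
    {
      "query": {
        "multi_match": {
          "query": "basketball with cartoon aliens",
          "fields": ["title", "overview"]
        }
      },
      "highlight": {
        "fields": {
          "title": {},
          "overview": {}
        }
      }
    }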

2022-04-09 section_32

  1. Search Template

    • Decouples the ES query from the application, so query tuning and application logic can evolve separately
    • How to use a search template:
    // Create a search template
    POST _scripts/tmdb
    {
    "script": {
    "lang": "mustache",
    "source": {
    "size": 20,
    "_source": [
    "title", "overview"
    ],
    "query": {
    "multi_match": {
    "query": "{{q}}",
    "fields": ["title", "overview"],
    // "fields": ["title^10", "overview"] // 后台升级模板,也不会影响前端使用模板进行查询
    }
    }
    }
    }
    }
    GET _scripts/tmdb
    // DELETE _scripts/tmdb // remove the stored template when it is no longer needed


    // Render the template to verify it
    POST _render/template
    {
    "id": "tmdb",
    "params": {
    "q": "basketball with catoon aliens"
    }
    }

    // Search using the template
    POST tmdb/_search/template
    {
    "id": "tmdb", // 使用search template id,进行搜索
    "params": {
    "q": "basketball with cartoon aliens"
    }
    }

    // Templates can also be used for multi-search, e.g. my-index/_msearch/template (see the sketch below)
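    A small sketch of the multi-search form, reusing the stored tmdb template; the second query string is just an illustrative example:

    POST tmdb/_msearch/template
    {}
    {"id": "tmdb", "params": {"q": "basketball with cartoon aliens"}}
    {}
    {"id": "tmdb", "params": {"q": "space jam"}}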
  2. Index Alias for zero-downtime operations and maintenance (section 32) (an atomic alias-swap sketch follows the examples below)

    GET _cat/aliases
    PUT /movies-2019/_doc/1
    {
    "name": "AAA",
    "time": "2019-01-01 00:00:00",
    "rating": 100
    }

    GET /movies-2019
    // Add aliases so the underlying indices can be maintained without interrupting search
    POST _aliases
    {
    "actions": [
    {
    "add": {
    "index": "movies-2019",
    "alias": "movies-latest"
    }
    },
    {
    "add": {
    "index": "movies",
    "alias": "movies-latest"
    }
    }
    ]
    }

    // Check that the alias now appears in the index settings
    GET /movies-2019

    // DELETE /movies-2019 // run this clean-up later; the index is still needed for the filtered-alias example below

    // Search through the alias
    POST movies-latest/_search
    {
    "size": 10,
    "query": {
    "match_all": {}
    }
    }

    // An alias can embed a filter that is applied to every query made through it
    POST _aliases
    {
    "actions": [
    {
    "add": {
    "index": "movies-2019",
    "alias": "movies-latest-highrate",
    "filter": {
    "range": {
    "rating": {
    "gte": 100
    }
    }
    }
    }
    }
    ]
    }

    // List aliases matching a pattern
    GET _alias/mov*

    // Searching through this alias applies the predefined filter before returning results
    POST movies-latest-highrate/_search
    {
    "query": {
    "match_all": {}
    }
    }
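    The zero-downtime point is that an alias can be switched from an old index to a new one atomically, so clients searching movies-latest never see an intermediate state. A minimal sketch; the target index movies-2020 is made up here to illustrate the swap:

    POST _aliases
    {
      "actions": [
        { "remove": { "index": "movies-2019", "alias": "movies-latest" } },
        { "add": { "index": "movies-2020", "alias": "movies-latest" } }
      ]
    }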
  3. Function Score Query (section 33)

    • After the query runs, each matching document is rescored to produce a new score
      • weight : apply a fixed weight that is factored into the score (see the weight sketch after the boost_mode example below)
      • Field Value Factor : use a numeric field of the document (e.g. popularity or number of likes) to modify _score
      • Random Score : give each user a different, but per-user consistent, random ordering
      • Decay functions : score by a field's distance from a target value; the closer, the higher the score
      • Script Score : a custom script takes full control of the scoring logic (see the script_score sketch after the examples below)
    • Hands-on : rank blogs with more votes higher while still respecting the text-relevance score; formula: new score = original score * number of votes

    // Example 1: boost blogs by popularity on top of the base text match

    DELETE blogs

    // The documents have identical text, so the effect of the function score is easy to observe
    PUT /blogs/_doc/1
    {
    "title": "About popularity",
    "content": "In this post we will talk about...",
    "votes": 0
    }

    PUT /blogs/_doc/2
    {
    "title": "About popularity",
    "content": "In this post we will talk about...",
    "votes": 100
    }

    PUT /blogs/_doc/3
    {
    "title": "About popularity",
    "content": "In this post we will talk about...",
    "votes": 1000000
    }

    POST /blogs/_search
    {
    "query": {
    "multi_match": { // 由于文档都一样,所以,基础匹配评分都一样
    "query": "popularity",
    "fields": [
    "title",
    "content"
    ]
    }
    }
    }

    // Scoring here: new score = original score * votes
    // Two problems follow: 1> what if votes is 0? 2> what if votes is very large?
    // The general idea is to keep the gaps between results from blowing up, i.e. to smooth the function
    POST /blogs/_search
    {
    "query": {
    "function_score": {
    "query": {
    "multi_match": {
    "query": "popularity",
    "fields": [
    "title",
    "content"
    ]
    }
    },
    "field_value_factor": { // 可以提升投票数
    "field": "votes"
    // ,"modifier": "log1p" // 因为纯粹的求积运算,对分数的影响放大效果超出合理范围,所以通过更改作用函数,达到影响平滑,支持: none、log、log1p、log2p、ln、ln1p、ln2p、square、sqrt、reciprocal
    //,"factor": 0.5 // 增加因子, 如 new_score = old_score + log(1 + factor * vote_count)
    }
    }
    }
    }


    // Boost Mode and Max Boost
    // The query above used the default boost_mode, multiply: final score = query score * function value
    // Other modes: sum (query score + function value), min/max (smaller/larger of the two), replace (function value replaces the query score)

    // max_boost caps the function value at an upper limit

    POST /blogs/_search
    {
    "query": {
    "function_score": {
    "query": {
    "multi_match": {
    "query": "popularity",
    "fields": [
    "title",
    "content"
    ]
    }
    },
    "field_value_factor": { // 可以提升投票数
    "field": "votes",
    "modifier": "log1p",
    "factor": 0.1
    }
    ,"boost_mode": "sum" // 两分相加得出结果
    ,"max_boost": 3 // 控最大分值
    }
    }
    }
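    The weight function from the bullet list above has no example in the original notes; a minimal hedged sketch against the same blogs index, applying a fixed weight only to documents whose votes pass a filter:

    POST /blogs/_search
    {
      "query": {
        "function_score": {
          "query": {
            "multi_match": { "query": "popularity", "fields": ["title", "content"] }
          },
          "functions": [
            { "filter": { "range": { "votes": { "gte": 100 } } }, "weight": 2 } // documents with at least 100 votes get their score doubled
          ],
          "boost_mode": "multiply"
        }
      }
    }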


    // Consistently random scoring: the same seed always produces the same ordering
    // Typical scenario: a site wants to rotate ads for better exposure
    // Requirement 1: each user sees a different ad ranking
    // Requirement 2: for a given user, the ranking stays stable across visits

    PUT /_cluster/settings
    {
    "transient": {
    "indices": {
    "id_field_data": {
    "enabled": true
    }
    }
    }
    }

    POST /blogs/_search
    {
    "query": {
    "function_score": {
    "random_score": {
    "seed": 991199, // 改变种子值,可以看到结果相对于seed,保持稳定的排序
    "field": "votes"
    }
    }
    }
    }
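    The script_score bullet above also has no example in the original notes; a minimal sketch against the same blogs index, recomputing the final score from the text relevance and the votes field with a Painless expression:

    POST /blogs/_search
    {
      "query": {
        "function_score": {
          "query": {
            "multi_match": { "query": "popularity", "fields": ["title", "content"] }
          },
          "script_score": {
            "script": {
              "source": "_score * Math.log(2 + doc['votes'].value)" // custom formula: full control over the final score
            }
          }
        }
      }
    }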
  4. Suggest as you type, commonly known as spell-correction suggestions (section 34)

    • Example: searching Google for "elastosearch" is a typo, yet a correction suggestion comes back
    • ES provides the Suggest API for this: the input text is analyzed into tokens, then similar terms are looked up in the index dictionary and returned
    • ES offers four kinds of suggesters: (1-2) Term & Phrase Suggester ; (3-4) Completion & Context Suggester
    • Essentially, a suggester is a special type of search
    • Hands-on :
    // Term suggestion algorithm: Levenshtein edit distance, i.e. how many edits turn one word into another; parameters such as "max_edits" control how fuzzy the similarity computation may be (a max_edits example follows the prefix_length example below)
    DELETE articles
    PUT articles
    {
    "mappings": {
    "properties": {
    "title_completion":{
    "type": "completion"
    }
    }
    }
    }

    POST articles/_bulk
    { "index" : { } }
    { "title_completion": "lucene is very cool"}
    { "index" : { } }
    { "title_completion": "Elasticsearch builds on top of lucene"}
    { "index" : { } }
    { "title_completion": "Elasticsearch rocks"}
    { "index" : { } }
    { "title_completion": "elastic is the company behind ELK stack"}
    { "index" : { } }
    { "title_completion": "Elk stack rocks"}
    { "index" : {} }

    POST /articles/_search
    {
    "size": 0,
    "suggest": {
    "article-suggest": {
    "prefix": "elk", // 前缀补全匹配
    "completion": {
    "field": "title_completion"
    }
    }
    }
    }


    DELETE articles

    POST articles/_bulk
    { "index" : { } }
    { "body": "lucene is very cool"}
    { "index" : { } }
    { "body": "Elasticsearch builds on top of lucene"}
    { "index" : { } }
    { "body": "Elasticsearch rocks"}
    { "index" : { } }
    { "body": "elastic is the company behind ELK stack"}
    { "index" : { } }
    { "body": "Elk stack rocks"}
    { "index" : {} }
    { "body": "elasticsearch is rock solid"}


    POST /articles/_search
    {
    "size": 1,
    "query": {
    "match": {
    "title_completion": "lucen rock" // 查询不到相应文档, 可以看建议的结果
    }
    },
    "suggest": {
    "term-suggestion": {
    "text": "lucen rock", // 但是通过建议搜索,找到lucen 的相近词条的推荐,而rock是拼写正确的,则不会有所建议,
    "term": {
    "suggest_mode": "missing", // 缺失补全,如果索引中已经存在,就不提供建议; Popular(频率更高)、Always(无论是否存在,都提供建议)
    "field": "body"
    }
    }
    }
    }


    POST /articles/_search
    {
    "suggest": {
    "term-sug": {
    "text": "lucen rock",
    "term": {
    "field": "body",
    "suggest_mode": "popular" // 此时返回了rocks,因为算法认为文档中存在的词,流行程度不一样,但是依然可以找到
    }
    }
    }
    }

    POST /articles/_search
    {
    "suggest": {
    "term-sug": {
    "text": "lucen rock",
    "term": {
    "field": "body",
    "suggest_mode": "always"
    }
    }
    }
    }


    POST /articles/_search
    {
    "suggest": {
    "term-sug": {
    "text": "lucen hocks", // 两个词都拼写错误
    "term": {
    "field": "body",
    "suggest_mode": "always",
    //"prefix_length": 0, // 如果能让hocks得到推荐,则可以控制此参数
    "sort": "frequency"
    }
    }
    }
    }
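    The max_edits parameter mentioned earlier controls how many edits a candidate may differ by (valid values are 1 and 2, default 2); a small sketch against the same articles index that makes it explicit:

    POST /articles/_search
    {
      "suggest": {
        "term-sug": {
          "text": "lucen hocks",
          "term": {
            "field": "body",
            "suggest_mode": "always",
            "max_edits": 1, // only candidates within a single edit are suggested
            "prefix_length": 0
          }
        }
      }
    }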

    // Phrase Suggester: offers more error-tolerance parameters than the term suggester
    POST /articles/_search
    {
    "suggest": {
    "my-sug": {
    "text": "lucne and elasticsear rock hello world ",// lucene 和 elasticsearch拼写错误,同时添加了冗余词
    "phrase": {
    "field": "body",
    "max_errors": 2,
    "confidence": 0,
    //"confidence": 2, // 修改成2,则改变了返回的结果
    "direct_generator": [
    {
    "field": "body",
    "suggest_mode": "always"
    }
    ],
    "highlight": {
    "pre_tag": "<em>",
    "post_tag": "</em>"
    }
    }
    }
    }
    }
  5. Autocomplete: every keystroke triggers a request that looks up matching candidates

    • ES implements this with the completion suggester
    • Performance requirements are high, so ES does not use the inverted index here; the analyzed data is encoded as an FST and stored alongside the index. The whole FST is loaded into memory, which makes lookups very fast
    • An FST only supports prefix lookups
    • Usage : declare the field type in the mapping, index the data, then query
    DELETE /articles
    PUT articles
    {
    "mappings": {
    "properties": {
    "title_completion": {
    "type": "completion" // 定义补全类型
    }
    }
    }
    }

    POST articles/_bulk
    { "index" : { } }
    { "title_completion": "lucene is very cool"}
    { "index" : { } }
    { "title_completion": "Elasticsearch builds on top of lucene"}
    { "index" : { } }
    { "title_completion": "Elasticsearch rocks"}
    { "index" : { } }
    { "title_completion": "elastic is the company behind ELK stack"}
    { "index" : { } }
    { "title_completion": "Elk stack rocks"}
    { "index" : {} }


    POST articles/_search?pretty
    {
    "size": 0,
    "suggest": {
    "article-sug": {
    "prefix": "el", // 进行前缀查询
    "completion": {
    "field": "title_completion"
    }
    }
    }
    }


    // Context Suggester
    // Suggestions can be made context-aware; e.g. for the prefix "star", context tells us whether the user means a coffee shop or a movie
    // ES supports two context types: category (arbitrary strings) and geo (geo locations)

    DELETE comments
    PUT comments
    {
    "mappings": {
    "properties": {
    "comment_autocomplete": { // 在索引中建立一个上下文补全存储结构,用于存储将来前缀自动补全的词库
    "type": "completion",
    "contexts": [
    {
    "type": "category",
    "name": "comment_category"
    }
    ]
    }
    }
    }
    }


    POST comments/_doc
    {
    "comment":"I love the star war movies",
    "comment_autocomplete":{
    "input":["star wars"],
    "contexts":{
    "comment_category":"movies"
    }
    }
    }

    POST comments/_doc
    {
    "comment":"Where can I find a Starbucks",
    "comment_autocomplete":{
    "input":["starbucks"],
    "contexts":{
    "comment_category":"coffee"
    }
    }
    }

    POST comments/_search
    {
    "suggest": {
    "YOUR_SUGGESTION": {
    "prefix": "star",
    "completion": {
    "field": "comment_autocomplete",
    "contexts": {
    // "comment_category": "movies" // 进行上下文中的匹配
    "comment_category": "coffee" // 进行上下文中的匹配, 配出咖啡店县关
    }
    }
    }
    }
    }



    // Summary:
    Precision : Completion > Phrase > Term
    Recall : Term > Phrase > Completion
    Performance : Completion > Phrase > Term
  6. Cross-cluster search

    • A single cluster has one active master; as cluster metadata keeps growing, the whole cluster can stop working properly, so scaling out to multiple clusters becomes necessary
    • The earlier Tribe Node approach required a client node to join every cluster, and every master's cluster-state change had to be acknowledged by the tribe node
    • A tribe node does not persist cluster state, so initialization after a restart is slow; and when index names collide across clusters, only a single prefer rule can be applied
    • Since ES 5.3, Cross Cluster Search lets any node act as a federated node that proxies search requests, without joining the other clusters as a client node
    • Usage:
    # First, start three clusters locally for the experiment
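    A minimal sketch of the usual three-cluster demo; the port numbers, cluster names, and the users index below are assumptions, not from the original notes:

    # start three local single-node clusters, e.g.
    bin/elasticsearch -E node.name=node0 -E cluster.name=cluster0 -E path.data=cluster0_data -E discovery.type=single-node -E http.port=9200 -E transport.port=9300
    bin/elasticsearch -E node.name=node1 -E cluster.name=cluster1 -E path.data=cluster1_data -E discovery.type=single-node -E http.port=9201 -E transport.port=9301
    bin/elasticsearch -E node.name=node2 -E cluster.name=cluster2 -E path.data=cluster2_data -E discovery.type=single-node -E http.port=9202 -E transport.port=9302

    // register the other two clusters as remotes on the cluster that will coordinate the search
    PUT _cluster/settings
    {
      "persistent": {
        "cluster": {
          "remote": {
            "cluster1": { "seeds": ["127.0.0.1:9301"] },
            "cluster2": { "seeds": ["127.0.0.1:9302"] }
          }
        }
      }
    }

    // query a local index together with the remote ones using the cluster:index syntax
    GET /users,cluster1:users,cluster2:users/_search
    {
      "query": { "match_all": {} }
    }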
