ElasticSearch 23 种映射参数详解

3年前 (2022) 程序员胖胖胖虎阿

209 0 0

松哥原创的 Spring Boot 视频教程已经杀青，感兴趣的小伙伴戳这里-->Spring Boot+Vue+微人事视频教程

hello 各位小伙伴，Es 继续更新。从今天开始我们来看 Es 中常见的 23 种映射参数，由于这里涉及到的东西比较多，因此松哥也录制了多个视频来讲解，每次两集，估计可以分三次讲完，今天我们先来学习 analyzer、search_analyzer 以及 normalizer 三种映射参数。

本文是ElasticSearch 系列第十四篇，和大家聊一聊索引的基本操作，前十三篇传送门：

打算出一个 ElasticSearch 教程，谁赞成，谁反对？
ElasticSearch 从安装开始
ElasticSearch 第三弹，核心概念介绍
ElasticSearch 中的中文分词器该怎么玩？
ElasticSearch 索引基本操作
ElasticSearch 文档的添加、获取以及更新
ElasticSearch 文档的删除和批量操作
ElasticSearch 文档路由，你的数据到底存在哪一个分片上？
ElasticSearch 并发的处理方式：锁和版本控制
ElasticSearch 中的倒排索引到底是什么？
ElasticSearch 动态映射与静态映射
ElasticSearch 四种字段类型详解
ElasticSearch 中的地理类型和特殊类型

analyzer 与 search_analyzer 参数：

normailzer 参数：

如果大家觉得视频风格还能接受，也可以看看松哥的付费视频：Spring Boot+Vue+微人事视频教程

以下是视频笔记：

注意，笔记只是视频内容的一个简要记录，因此笔记内容比较简单，完整的内容可以查看视频。

11.1 analyzer

定义文本字段的分词器。默认对索引和查询都是有效的。

假设不用分词器，我们先来看一下索引的结果，创建一个索引并添加一个文档：

PUT blog

PUT blog/_doc/1
{
  "title":"定义文本字段的分词器。默认对索引和查询都是有效的。"
}

查看词条向量（term vectors）

GET blog/_termvectors/1
{
  "fields": ["title"]
}

查看结果如下：

{
  "_index" : "blog",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 0,
  "term_vectors" : {
    "title" : {
      "field_statistics" : {
        "sum_doc_freq" : 22,
        "doc_count" : 1,
        "sum_ttf" : 23
      },
      "terms" : {
        "义" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 1,
              "end_offset" : 2
            }
          ]
        },
        "分" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 7,
              "start_offset" : 7,
              "end_offset" : 8
            }
          ]
        },
        "和" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 15,
              "start_offset" : 16,
              "end_offset" : 17
            }
          ]
        },
        "器" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 9,
              "start_offset" : 9,
              "end_offset" : 10
            }
          ]
        },
        "字" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 4,
              "start_offset" : 4,
              "end_offset" : 5
            }
          ]
        },
        "定" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 1
            }
          ]
        },
        "对" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 12,
              "start_offset" : 13,
              "end_offset" : 14
            }
          ]
        },
        "引" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 14,
              "start_offset" : 15,
              "end_offset" : 16
            }
          ]
        },
        "效" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 21,
              "start_offset" : 22,
              "end_offset" : 23
            }
          ]
        },
        "文" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 2,
              "end_offset" : 3
            }
          ]
        },
        "是" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 19,
              "start_offset" : 20,
              "end_offset" : 21
            }
          ]
        },
        "有" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 20,
              "start_offset" : 21,
              "end_offset" : 22
            }
          ]
        },
        "本" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 3,
              "start_offset" : 3,
              "end_offset" : 4
            }
          ]
        },
        "查" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 16,
              "start_offset" : 17,
              "end_offset" : 18
            }
          ]
        },
        "段" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 5,
              "start_offset" : 5,
              "end_offset" : 6
            }
          ]
        },
        "的" : {
          "term_freq" : 2,
          "tokens" : [
            {
              "position" : 6,
              "start_offset" : 6,
              "end_offset" : 7
            },
            {
              "position" : 22,
              "start_offset" : 23,
              "end_offset" : 24
            }
          ]
        },
        "索" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 13,
              "start_offset" : 14,
              "end_offset" : 15
            }
          ]
        },
        "认" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 11,
              "start_offset" : 12,
              "end_offset" : 13
            }
          ]
        },
        "词" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 8,
              "start_offset" : 8,
              "end_offset" : 9
            }
          ]
        },
        "询" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 17,
              "start_offset" : 18,
              "end_offset" : 19
            }
          ]
        },
        "都" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 18,
              "start_offset" : 19,
              "end_offset" : 20
            }
          ]
        },
        "默" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 10,
              "start_offset" : 11,
              "end_offset" : 12
            }
          ]
        }
      }
    }
  }
}

可以看到，默认情况下，中文就是一个字一个字的分，这种分词方式没有任何意义。如果这样分词，查询就只能按照一个字一个字来查，像下面这样：

GET blog/_search
{
  "query": {
    "term": {
      "title": "定"
    }
  }
}

无意义！！！

所以，我们要根据实际情况，配置合适的分词器。

给字段设定分词器：

PUT blog
{
  "mappings": {
    "properties": {
      "title":{
        "type":"text",
        "analyzer": "ik_smart"
      }
    }
  }
}

存储文档：

PUT blog/_doc/1
{
  "title":"定义文本字段的分词器。默认对索引和查询都是有效的。"
}

查看词条向量：

GET blog/_termvectors/1
{
  "fields": ["title"]
}

查询结果如下：

{
  "_index" : "blog",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 1,
  "term_vectors" : {
    "title" : {
      "field_statistics" : {
        "sum_doc_freq" : 12,
        "doc_count" : 1,
        "sum_ttf" : 13
      },
      "terms" : {
        "分词器" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 4,
              "start_offset" : 7,
              "end_offset" : 10
            }
          ]
        },
        "和" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 8,
              "start_offset" : 16,
              "end_offset" : 17
            }
          ]
        },
        "字段" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 4,
              "end_offset" : 6
            }
          ]
        },
        "定义" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 2
            }
          ]
        },
        "对" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 6,
              "start_offset" : 13,
              "end_offset" : 14
            }
          ]
        },
        "文本" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 2,
              "end_offset" : 4
            }
          ]
        },
        "有效" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 11,
              "start_offset" : 21,
              "end_offset" : 23
            }
          ]
        },
        "查询" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 9,
              "start_offset" : 17,
              "end_offset" : 19
            }
          ]
        },
        "的" : {
          "term_freq" : 2,
          "tokens" : [
            {
              "position" : 3,
              "start_offset" : 6,
              "end_offset" : 7
            },
            {
              "position" : 12,
              "start_offset" : 23,
              "end_offset" : 24
            }
          ]
        },
        "索引" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 7,
              "start_offset" : 14,
              "end_offset" : 16
            }
          ]
        },
        "都是" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 10,
              "start_offset" : 19,
              "end_offset" : 21
            }
          ]
        },
        "默认" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 5,
              "start_offset" : 11,
              "end_offset" : 13
            }
          ]
        }
      }
    }
  }
}

然后就可以通过词去搜索了：

GET blog/_search
{
  "query": {
    "term": {
      "title": "索引"
    }
  }
}

11.2 search_analyzer

查询时候的分词器。默认情况下，如果没有配置 search_analyzer，则查询时，首先查看有没有 search_analyzer，有的话，就用 search_analyzer 来进行分词，如果没有，则看有没有 analyzer，如果有，则用 analyzer 来进行分词，否则使用 es 默认的分词器。

11.3 normalizer

normalizer 参数用于解析前（索引或者查询）的标准化配置。

比如，在 es 中，对于一些我们不想切分的字符串，我们通常会将其设置为 keyword，搜索时候也是使用整个词进行搜索。如果在索引前没有做好数据清洗，导致大小写不一致，例如 javaboy 和 JAVABOY，此时，我们就可以使用 normalizer 在索引之前以及查询之前进行文档的标准化。

先来一个反例，创建一个名为 blog 的索引，设置 author 字段类型为 keyword：

PUT blog
{
  "mappings": {
    "properties": {
      "author":{
        "type": "keyword"
      }
    }
  }
}

添加两个文档：

PUT blog/_doc/1
{
  "author":"javaboy"
}

PUT blog/_doc/2
{
  "author":"JAVABOY"
}

然后进行搜索：

GET blog/_search
{
  "query": {
    "term": {
      "author": "JAVABOY"
    }
  }
}

大写关键字可以搜到大写的文档，小写关键字可以搜到小写的文档。

如果使用了 normalizer，可以在索引和查询时，分别对文档进行预处理。

normalizer 定义方式如下：

PUT blog
{
  "settings": {
    "analysis": {
      "normalizer":{
        "my_normalizer":{
          "type":"custom",
          "filter":["lowercase"]
        }
      }
    }
  }, 
  "mappings": {
    "properties": {
      "author":{
        "type": "keyword",
        "normalizer":"my_normalizer"
      }
    }
  }
}