Importing Data into Elasticsearch · 柳溪's personal blog

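Bulk Data Import

The bulk API imports many documents in a single request. The request body is newline-delimited JSON: each operation takes two JSON objects, each on its own line, and the body must end with a newline. The first line gives the operation type and the metadata (_index, _type, _id). The index operation overwrites an existing document with the same ID; use create instead if you do not want to overwrite. The second line is the content of the document to index. For example: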

{"index":{"_index":"get-together", "_type":"group", "_id":"10"}}
{"name":"Elasticsearch Bucharest"}
{"index":{"_index":"get-together", "_type":"group", "_id":"11"}}
{"name":"Big Data Bucharest"}

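Then run the following in a terminal. The --data-binary option is needed because -d would strip the newlines that the bulk format depends on.
REQUESTS_FILE=/tmp/test_bulk
curl -XPOST localhost:9200/_bulk --data-binary @$REQUESTS_FILE
The index and type can also be put in the URL, so they do not have to be repeated in every operation:
curl -XPOST localhost:9200/get-together/group/_bulk --data-binary @$REQUESTS_FILE
The bulk file can then be shortened to: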

{ "index": {}}
{"name":"Elasticsearch Bucharest"}
{ "index": {}}
{"name":"Big Data Bucharest"}

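How do you generate data in this format quickly? My own trick is to process the file with the Linux sed command. Suppose the file looks like this:
{"BookNum":1,"BookDescription":"With","Publisher":"The Pragmatic Programmers","By":"Mike Clark","ISBN":"978-0-9787-3922-5","Year":"2008","Pages":"464"},{"BookNum":2,"BookDescription":"With ","Publisher":"The Pragmatic Programmers","By":"Mike Clark","ISBN":"978-0-9787-3922-5","Year":"2008","Pages":"464"},{"BookNum":3,"BookDescription":"Rails ","By":["Sam Ruby","Dave Thomas","David Heinemeier Hansson"],"ISBN":"978-1-93435-616-6","Year":"2009","Pages":"792"}
A single command produces the format above. The -i option edits the file in place. The first expression puts every document on its own line with an action line in front of it; the second adds the action line before the first document, which the substitution alone would miss. The \n in the replacement requires GNU sed.

sed -i -e 's/},{/}\n{ "index": {}}\n{/g' -e '1s/^/{ "index": {}}\n/' your-file.json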
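
A text substitution on },{ can misfire if that sequence ever appears inside a nested value. As a rough alternative, the Python sketch below parses the objects with a JSON parser and writes one action line per document. The names to_bulk_body and sample are made up for this illustration and are not part of Elasticsearch; the input is assumed to be comma-separated objects with no enclosing brackets, like the sample above.

```python
import json


def to_bulk_body(raw_text, action="index"):
    """Turn comma-separated JSON objects into a bulk request body.

    raw_text is assumed to look like '{...},{...},{...}' with no
    enclosing brackets (hypothetical helper, for illustration only).
    """
    # Wrapping the text in brackets makes it a valid JSON array, so the
    # parser finds the object boundaries instead of a text pattern.
    docs = json.loads("[" + raw_text.strip() + "]")
    lines = []
    for doc in docs:
        # Action line with empty metadata: index and type come from the URL.
        lines.append(json.dumps({action: {}}))
        # Source line: the document itself, serialized on a single line.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # Every line of a bulk body, including the last, must end with "\n".
    return "\n".join(lines) + "\n"


sample = '{"BookNum":1,"Pages":"464"},{"BookNum":2,"By":["Sam Ruby","Dave Thomas"]}'
print(to_bulk_body(sample), end="")
```

Writing the returned string to a file gives something that can be posted with the --data-binary command shown earlier.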

Configuring Mappings

After a bulk import like the one above, Elasticsearch detects the field types automatically. The current mapping can be inspected with:

curl 'localhost:9200/get-together/group/_mapping?pretty'

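Sometimes the detected field types are not what we want, so the mapping should be defined before any documents are imported. One way is to create a mapping.json file in the current directory and pass it in when the index is created, which sets the mapping along with other settings.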

curl -s -XPOST "localhost:9200/get-together" -d@mapping.json
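A sample mapping.json follows. Between its settings and mappings sections it touches most of the main Elasticsearch configuration areas; a detailed walk-through is left for a later post.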
{
  "settings" : {
    "number_of_shards" : 2,
    "number_of_replicas" : 1,
    "index": {
      "analysis": {
        "analyzer": {
          "myCustomAnalyzer": {
            "type": "custom",
            "tokenizer": "myCustomTokenizer",
            "filter": ["myCustomFilter1", "myCustomFilter2"],
            "char_filter": ["myCustomCharFilter"]
          }
        },
        "tokenizer": {
          "myCustomTokenizer": {
            "type": "letter"
          },
          "myCustomNGramTokenizer": {
            "type" : "ngram",
            "min_gram" : 2,
            "max_gram" : 3
          }
        },
        "filter": {
          "myCustomFilter1": {
            "type": "lowercase"
          },
          "myCustomFilter2": {
            "type": "kstem"
          }
        },
        "char_filter": {
          "myCustomCharFilter": {
            "type": "mapping",
            "mappings": ["ph=>f", " u => you ", "ES=>Elasticsearch"]
          }
        }
      }
    }
  },
  "mappings" : {
    "group" : {
      "_source" : {
        "enabled" : true
      },
      "_all" : {
        "enabled" : true
      },
      "properties" : {
        "organizer" : { "type" : "string" },
        "name" : { "type" : "string" },
        "description" : {
          "type" : "string",
          "term_vector": "with_positions_offsets"
        },
        "created_on" : {
          "type" : "date",
          "format" : "yyyy-MM-dd"
        },
        "tags" : {
          "type" : "string",
          "index" : "analyzed",
          "fields": {
            "verbatim" : {
              "type" : "string",
              "index" : "not_analyzed"
            }
          }
        },
        "members" : { "type" : "string" },
        "location_group" : { "type" : "string" }
      }
    },
    "event" : {
      "_source" : {
        "enabled" : true
      },
      "_all" : {
        "enabled" : false
      },
      "_parent" : {
        "type" : "group"
      },
      "properties" : {
        "host" : { "type" : "string" },
        "title" : { "type" : "string" },
        "description" : {
          "type" : "string",
          "term_vector": "with_positions_offsets"
        },
        "attendees" : { "type" : "string" },
        "date" : {
          "type" : "date",
          "format" : "date_hour_minute"
        },
        "reviews" : {
          "type" : "integer",
          "null_value" : 0
        },
        "location_event": {
          "type" : "object",
          "properties" : {
            "name" : { "type" : "string" },
            "geolocation" : { "type" : "geo_point" }
          }
        }
      }
    }
  }
}

A mapping can also be set directly from the terminal:

curl -XPUT 'localhost:9200/get-together/_mapping/new-events' -d '{
    "new-events" : {
        "properties" : {
            "host": {
                "type" : "string"
            }
        }
    }
}'

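If the index you are creating already exists, the request fails. Delete the existing index first, then run the index-creation command again.

curl -s -XDELETE 'localhost:9200/get-together' > /dev/null

Once the mapping is in place, the following command lists every index along with its health, shard and replica counts, and document count, which is a quick way to check that the settings took effect: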

curl 'localhost:9200/_cat/indices?v'

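Creating, Updating, and Deleting Documents

Create (index)

Documents are created and updated through the same index API. The response includes a _version field that tells the two cases apart: it is 1 for a newly created document and increases with each update.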

curl -XPUT 'localhost:9200/index/type/id' -d '{...}'

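The request above overwrites any existing document with the same ID. To create a brand-new document without the risk of overwriting one, let Elasticsearch generate the ID for you. This requires POST ("store this document under this URL") rather than PUT ("store this document at this URL"): the PUT URL is fixed, whereas with POST Elasticsearch appends an auto-generated ID after the type.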

curl -XPOST 'localhost:9200/index/type' -d '{...}'

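What if we want to create a brand-new document under our own ID? Either of the following does it; both fail instead of overwriting if the ID is already taken: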

curl -XPUT 'localhost:9200/index/type/id?op_type=create' -d '{...}'
curl -XPUT 'localhost:9200/index/type/id/_create' -d '{...}'
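
The difference between index and create can be pictured with a small in-memory model. This is only a toy sketch of the behaviour described above, not Elasticsearch code: ToyIndex and ConflictError are hypothetical names, and the real server reports the conflict as an HTTP 409 response rather than a Python exception.

```python
class ConflictError(Exception):
    """Raised by the toy model when create hits an existing ID."""


class ToyIndex:
    """In-memory stand-in for one index/type (illustration only)."""

    def __init__(self):
        self._docs = {}  # maps id -> (version, source)

    def index(self, doc_id, source):
        # Like PUT index/type/id: always writes and bumps the version,
        # so an existing document with this ID is overwritten.
        version = self._docs.get(doc_id, (0, None))[0] + 1
        self._docs[doc_id] = (version, source)
        return {"_id": doc_id, "_version": version}

    def create(self, doc_id, source):
        # Like op_type=create or /_create: refuse to touch an existing ID.
        if doc_id in self._docs:
            raise ConflictError("document %r already exists" % doc_id)
        return self.index(doc_id, source)


store = ToyIndex()
print(store.index("1", {"name": "first"}))   # new document
print(store.index("1", {"name": "second"}))  # overwrite, version goes up
```

Calling store.create("1", ...) at this point raises ConflictError, while store.create("2", ...) succeeds.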

Retrieve

curl -XGET 'localhost:9200/index/type/id?pretty'
curl -XGET 'localhost:9200/index/type/id?_source=title,text'   # return only specific fields of the document
curl -XGET 'localhost:9200/index/type/id/_source'   # return only the field contents, without metadata

Update

Besides the index API:

curl -XPUT 'localhost:9200/index/type/id' -d '{...}'

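the update API can be used to update part of a document. A single request performs every step an update needs: retrieve the document, apply the change, reindex the result, and delete the old version (which is deleted when the new one is written, at first only by being marked as deleted).
The simplest update request takes a partial document as the doc parameter and merges it into the existing document: fields that already exist are overwritten and new fields are added.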

curl -XPOST 'localhost:9200/index/type/id/_update' -d '{
    "doc":{
        "tags":["testing"],
        "views": 0
        }
}'
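
The merge rule can be sketched locally in Python. This is an illustration, not the server's implementation: merge_partial is a hypothetical helper, and it assumes that nested objects are merged recursively while arrays and scalar values are replaced as a whole.

```python
def merge_partial(existing, partial):
    """Return a new dict: `existing` with `partial` merged into it."""
    merged = dict(existing)  # shallow copy so the input is left untouched
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Assumption: objects on both sides are merged field by field.
            merged[key] = merge_partial(merged[key], value)
        else:
            # Scalars and arrays overwrite the old value; new keys are added.
            merged[key] = value
    return merged


stored = {"title": "Intro", "tags": ["draft"], "meta": {"lang": "en"}}
doc = {"tags": ["testing"], "views": 0}
print(merge_partial(stored, doc))
```

Here tags is overwritten, views is added, and title and meta are left as they were.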

Delete

curl -XDELETE 'localhost:9200/index/type/id'

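Conflict Resolution

Searching Documents

Basic Components of a Search Request

When you search an index, a few basic components determine which documents match and how many of them are returned.

query
The most important component of a search request, configured with the query DSL and the filter DSL.
size
The number of matching documents to return.
from
Used together with size, typically for pagination: the offset of the first matching document to return.
_source
Controls how the _source field is returned; by default the whole source is returned. Take care not to disable the _source field in the mappings.
sort
By default documents are sorted by relevance score, highest first; sort lets you specify your own ordering.
A complete example: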

curl 'localhost:9200/index/type/_search' -d '
{
  "query": {
    "match_all": {}
  },
  "from": 0,
  "size": 10,
  "_source": ["name","organizer","description"],
  "sort": [{"created_on":"desc"}]
}'
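
Since from and size are usually driven by a page number, here is a minimal Python sketch of how the request body might be built for a given page. The function name search_body and its parameters are assumptions made for this example; only the JSON keys come from the request above.

```python
import json


def search_body(page, page_size, fields=None, sort=None):
    """Build a match_all search body for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    body = {
        "query": {"match_all": {}},
        # Skip all documents that belong to the earlier pages.
        "from": (page - 1) * page_size,
        "size": page_size,
    }
    if fields is not None:
        body["_source"] = list(fields)  # return only these fields
    if sort is not None:
        body["sort"] = list(sort)       # replace the default score ordering
    return body


body = search_body(3, 10,
                   fields=["name", "organizer", "description"],
                   sort=[{"created_on": "desc"}])
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with -d in place of the hand-written body; page 1 gives the same from and size as the example above.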