What is Search
Process
Inverted Index
data structure which holds a mapping from each term to the documents in which it is found
Parsing Steps
step #1
- tokenize the document text into individual terms and normalize them (e.g. lowercase)
step #2
- calculate the frequency of each word in the corpus (multiple docs make a corpus)
Creating Inverted Index (Postings list - in search jargon)
docs
doc1 => {id: 1, words: "winter is coming"}
doc2 => {id: 2, words: "it snows in winter"}
doc3 => {id: 3, words: "I love hot chocolate in winter"}

word        frequency   documents
winter      3           1, 2, 3
is          1           1
coming      1           1
it          1           2
snows       1           2
in          2           2, 3
i           1           3
love        1           3
hot         1           3
chocolate   1           3
Search results (doc ids):
winter       => 1, 2, 3
hot          => 3
snows || hot => 2, 3
snows && hot => None
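The index and boolean results above can be reproduced with a short Python sketch (the tokenizer here is just lowercase + split, a simplification of what a real analyzer does):

```python
from collections import defaultdict

docs = {
    1: "winter is coming",
    2: "it snows in winter",
    3: "I love hot chocolate in winter",
}

# term -> set of doc ids containing it (the postings list)
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search_or(*terms):
    """snows || hot: union of the postings lists."""
    return sorted(set().union(*(index[t] for t in terms)))

def search_and(*terms):
    """snows && hot: intersection of the postings lists."""
    return sorted(set.intersection(*(index[t] for t in terms)))

print(search_or("snows", "hot"))   # -> [2, 3]
print(search_and("snows", "hot"))  # -> [] (no doc contains both)
```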
if you want to search for words ending with LATE, reverse every word in the index and match words starting with the reversed query
i.e. chocolate => ETALOCOHC
so LATE becomes ETAL, and a suffix search turns into a prefix search
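The reversed-term trick can be sketched as follows (the term list here is made up for illustration):

```python
terms = ["chocolate", "late", "winter", "isolate"]  # hypothetical index terms

# index the reverse of every term: chocolate -> etalocohc
reversed_terms = [t[::-1] for t in terms]

def ends_with(suffix):
    """*late becomes a prefix query for 'etal' on the reversed index."""
    rq = suffix[::-1]
    return [rt[::-1] for rt in reversed_terms if rt.startswith(rq)]

print(ends_with("late"))  # -> ['chocolate', 'late', 'isolate']
```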
search based on substrings => use ngram analysis
Ngram analysis
YOURS => yo you your ou our ours ur urs rs
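The YOURS example corresponds to generating all n-grams of sizes 2 through 4 from each starting position; a minimal sketch:

```python
def ngrams(word, min_n=2, max_n=4):
    """All substrings of length min_n..max_n, by start position."""
    word = word.lower()
    out = []
    for i in range(len(word)):
        for n in range(min_n, max_n + 1):
            if i + n <= len(word):
                out.append(word[i:i + n])
    return out

print(ngrams("yours"))
# -> ['yo', 'you', 'your', 'ou', 'our', 'ours', 'ur', 'urs', 'rs']
```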
What is Elasticsearch
Commands
./bin/elasticsearch -Ecluster.name=foo -Enode.name=node1
Schema
types => logical groupings of documents
index => made up of different document types
example
1. blog engine
type = blog post => {title, content, date}
type = comment => {user, content, date}
Sharding and Replication
sharding - process of splitting an index across multiple nodes (physical machines), i.e. every node holds only a subset of your data
replication - keeping copies (replicas) of each shard on other nodes, for fault tolerance and extra read throughput
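Routing a document to a shard can be sketched like this - ES actually hashes the routing key (the doc id by default) with murmur3 modulo the number of primary shards; zlib.crc32 stands in for that hash here:

```python
import zlib

NUM_SHARDS = 3  # fixed when the index is created

def shard_for(doc_id):
    # stand-in for ES's murmur3 hash of the routing key
    return zlib.crc32(str(doc_id).encode()) % NUM_SHARDS

# every document lands deterministically on exactly one primary shard
placement = {doc_id: shard_for(doc_id) for doc_id in range(1, 7)}
print(placement)
```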
Healthcheck
Get cluster health
http://localhost:9200/_cat/health?v&pretty
cluster status
GREEN - all primary shards and replicas are allocated
YELLOW - all primaries are allocated but some replicas are not (search still works, but redundancy is reduced)
RED - some primary shards are not allocated, so some data is unavailable
Get node health
http://localhost:9200/_cat/nodes?v&pretty
ip, heap, ram, cpu, load_1m, load_5m, load_15m, role, master, node_name
Creating Indexes and documents
create new index
curl -XPUT localhost:9200/products
get index info
http://localhost:9200/_cat/indices?v&pretty
health, status, index, pri, rep, docs.count, size
Add new product
curl -XPUT localhost:9200/products/mobiles/1 -H 'Content-Type: application/json' -d '
{
"name": "",
"storage": "",
"reviews": ["foo", "bar"]
}'

response
{
"_index": "products",
"_type": "mobiles",
"_id": 1,
"_version": 1,
"_shards": {
"total": 2
}
}

to add laptops
curl -XPOST localhost:9200/products/laptops/ -H 'Content-Type: application/json' -d '
{
"name": "",
"storage": "",
"reviews": ["foo", "bar"]
}'
here ES will autogenerate an ID since none is passed (note: auto-ID needs POST; PUT requires an explicit ID)
Retrieving documents
http://localhost:9200/products/mobiles/1
response
{
"_index": "products",
"_type": "mobiles",
"_id": 1,
"_version": 1,
"_source": {
"name": "",
"storage": "",
"reviews": ["foo", "bar"]
}
}

localhost:9200/products/mobiles/1?_source=false
{
"_index": "products",
"_type": "mobiles",
"_id": 1,
"_version": 1
}

localhost:9200/products/mobiles/1?_source=name
{
"_index": "products",
"_type": "mobiles",
"_id": 1,
"_version": 1,
"_source": {
"name": ""
}
}

Updating docs
Full update
curl -XPUT localhost:9200/products/mobiles/1 -H 'Content-Type: application/json' -d '
{
"name": "",
"storage": "",
"reviews": ["foo", "bar"]
}'

response
{
"_index": "products",
"_type": "mobiles",
"_id": 1,
"_version": 2 <<<< version is incremented on every full update
}

Delete doc
curl -XDELETE localhost:9200/products/mobiles/1
check if doc exists
curl -I localhost:9200/products/mobiles/1
response: 200 if the doc exists, 404 if not
delete index
curl -XDELETE localhost:9200/products
Bulk Operations
Method #1
curl -XPOST localhost:9200/_bulk -H 'Content-Type: application/x-ndjson' -d '
{"index": {"_index": "products", "_type": "mobiles", "_id": 10}}
{"name": "foo", "storage": "", "reviews": ""}
{"index": {"_index": "products", "_type": "mobiles", "_id": 11}}
{"name": "bar", "storage": "", "reviews": ""}
'
line 1 -> index and id info
line 2 -> data
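The alternating action/data line format can be generated programmatically; a sketch in Python using the same placeholder docs as above (note the trailing newline the bulk API requires):

```python
import json

docs = [
    {"_id": 10, "name": "foo", "storage": "", "reviews": ""},
    {"_id": 11, "name": "bar", "storage": "", "reviews": ""},
]

def bulk_body(index, doc_type, docs):
    """Build the NDJSON payload: one action line, then one data line, per doc."""
    lines = []
    for doc in docs:
        doc = dict(doc)
        doc_id = doc.pop("_id")
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # bulk requests must end with a newline

print(bulk_body("products", "mobiles", docs))
```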
Method #2
==
create a .json file with all data
data.json
{"index": {}}
{"name": "foo", "storage": "", "reviews": ""}
{"index": {}}
{"name": "bar", "storage": "", "reviews": ""}

curl -H 'Content-Type: application/x-ndjson'
-XPOST
localhost:9200/products/mobiles/_bulk
--data-binary @data.json
Search Query
there are two contexts in an ES query - query context (computes a relevance _score for each matching doc) and filter context (a yes/no match, no scoring, results can be cached)
===
search all documents in the customers index for word=foo in any field
localhost:9200/customers/_search?q=foo
===
sort on age desc
localhost:9200/customers/_search?q=foo&sort=age:desc
====
state = florida, skip the first 10 results and return the next 2
localhost:9200/customers/_search?q=state:florida&from=10&size=2
===
returns all documents; no relevance score is calculated and all docs get score=1.0
localhost:9200/customers/_search -d
{
"query": {"match_all": {}}
}
===
localhost:9200/customers/_search -d
{
"query": {"match_all": {}},
"sort": {"age": {"order": "desc"}},
"from": 5,
"size": 2
}

Term Searches
search for docs that contains foo in the name field
localhost:9200/customers/_search -d
{
"_source": false,
"query": {
"term": {
"name": "foo"
}
}
}

response:
{
"hits": {"total": 2, "max_score": 4.1, "hits": []}
}
localhost:9200/customers/_search -d
{
"_source": {
"includes": ["name"],
"excludes": ["description"]
},
"query": {"term": {
"name": "foo"
}
}
}

Full Text Queries
match with options
simple match
==
search for foo in the name field - but this is full-text search, not an exact term match. The behavior depends on how the field was analyzed, so it handles capitalization etc.
localhost:9200/customers/_search -d
{
"query": {
"match": {
"name": "foo"
}
}
}

match on the name field if either foo or bar exists
localhost:9200/customers/_search -d
{
"query": {
"match": {
"name": {
"query": "foo bar",
"operator": "or"
}
}
}
}

match with prefix
==
search for all names starting with F
localhost:9200/customers/_search -d
{
"query": {
"match_phrase_prefix": {
"name": "f"
}
}
}

Boolean Query
MUST
find docs which MUST have street address magnolia bridge (both terms must match). This normally gives us fewer results.
localhost:9200/customers/_search -d
{
"query": {
"bool": {
"must": [
{"match": { "street": "magnolia" }},
{"match": { "street": "bridge" }}
]
}
}
}

find docs which COULD have street address magnolia bridge - both terms need not be present. This gives us more results
localhost:9200/customers/_search -d
{
"query": {
"bool": {
"should": [
{"match": { "street": "magnolia" }},
{"match": { "street": "bridge" }}
]
}
}
}

Boosted Term Search
==
find docs which have state as CA or FL but boost all CA docs with a factor of 2 - so they will appear higher in search.
localhost:9200/customers/_search -d
{
"query": {
"bool": {
"should": [
{"term": { "state": {"value": "CA", "boost": 2} }},
{"term": { "state": {"value": "FL"} }}
]
}
}
}

Query with filter + bool
range query
==
localhost:9200/customers/_search -d
{
"query": {
"bool": {
"must": {"match_all": {}},
"filter": {
"range": {"age": {"gte": 20, "lte": 30}}
}
}
}
}
search query with filter
==
find female greater than age 20 in CA
localhost:9200/customers/_search -d
{
"query": {
"bool": {
"must": [
{"term": { "state": {"value": "CA"} }}
],
"filter": [
{"term": {"gender": "female"}},
{"range": {"age": {"gte": 20}}}
]
}
}
}

Aggregations in ES
Metric
- sum, average, min, max, count etc
Bucketing
- logically group docs based on search
Matrix
- operate on multiple fields at once and produce a matrix result
Pipeline
- aggregations that take the output of other aggregations as their input
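A bucketing agg like terms essentially groups docs by a field value and counts them per bucket; a rough Python analogy with made-up docs:

```python
from collections import Counter

docs = [
    {"name": "a", "state": "CA"},
    {"name": "b", "state": "FL"},
    {"name": "c", "state": "CA"},
]

# terms-aggregation style bucketing: one bucket per distinct "state" value
buckets = Counter(doc["state"] for doc in docs)
print(buckets.most_common())  # -> [('CA', 2), ('FL', 1)]
```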
Metric agg - average
localhost:9200/customers/_search -d
{
"size": 0,
"aggs": {
"average_age": {
"avg": {
"field": "age"
}
}
}
}
combine aggs with a query - find average age of all residents of CA
localhost:9200/customers/_search -d
{
"size": 0,
"query": {
"bool": {
"filter": {
"match": {
"state": "CA"
}
}
}
},
"aggs": {
"average_age": {
"avg": {
"field": "age"
}
}
}
}

Stats
localhost:9200/customers/_search -d
{
"size": 0,
"aggs": {
"age_stats": {
"stats": {
"field": "age"
}
}
}
}

this calculates all the stats for field = age
- count, min, max, avg, sum
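What the stats agg computes can be reproduced by hand; a sketch with hypothetical values for the age field:

```python
ages = [23, 45, 31, 27]  # hypothetical values of the "age" field

# same five numbers a stats aggregation returns
stats = {
    "count": len(ages),
    "min": min(ages),
    "max": max(ages),
    "avg": sum(ages) / len(ages),
    "sum": sum(ages),
}
print(stats)  # -> {'count': 4, 'min': 23, 'max': 45, 'avg': 31.5, 'sum': 126}
```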