Elasticsearch in Practice: Indexing, Searching, and Relevance Scoring

Elasticsearch is not just a database with a search box. It is a distributed relevance engine built on inverted indexes, and understanding that distinction changes how you design schemas, queries, and aggregations.

I have run Elasticsearch in production for log analytics, product search, and document retrieval pipelines. The same mistakes appear every time a team treats it like a relational database. This post covers what you actually need to know: mappings, query semantics, relevance scoring, and aggregations as they really work.

Core Concepts: Index, Document, Shard, Replica

An index is a logical namespace. A document is a JSON object stored in an index. Elasticsearch distributes documents across shards (horizontal partitions), and each shard can have one or more replicas for fault tolerance.

Warning

Do not use Elasticsearch as your primary datastore. It does not support transactions, and under its near-real-time model a document only becomes searchable after the next refresh (every ~1 second by default), so a search issued immediately after a write may not see that write. Store authoritative data in PostgreSQL or another ACID store, and sync to Elasticsearch for search.
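If a test or admin workflow genuinely needs read-your-write behavior, the refresh query parameter can force visibility at the cost of throughput. A minimal sketch, reusing the products index and the elastic:changeme credentials from the examples in this post:

```shell
# refresh=wait_for blocks the indexing call until the document is
# visible to search; refresh=true forces an immediate refresh instead.
# Both trade ingest throughput for visibility -- avoid them in bulk paths.
curl -X PUT "localhost:9200/products/_doc/1?refresh=wait_for" \
  -H "Content-Type: application/json" \
  -u "elastic:changeme" \
  -d '{"name": "Widget Pro", "sku": "WGT-001"}'
```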

A typical cluster topology for production: 3 dedicated master nodes, N data nodes sized for your shard count and data volume. Avoid co-locating masters with data nodes in production.

Index Mappings: Explicit vs Dynamic

Elasticsearch can infer field types as documents arrive (dynamic mapping), deriving each field's type from the first value it sees, but in production you should always define explicit mappings.

The most important field type decision is text vs keyword:

  • text fields are analyzed: the string is tokenized, lowercased, and stemmed. Use for full-text search.
  • keyword fields are stored verbatim. Use for exact-match filtering, sorting, and aggregations.

The analyzer pipeline for a text field is: character filters (strip HTML, normalize unicode) -> tokenizer (split on whitespace, punctuation) -> token filters (lowercase, stop words, stemming).
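You can inspect the pipeline's output directly with the _analyze API. A quick sketch using the built-in english analyzer (no index required):

```shell
# The response lists each emitted token with its position and offsets.
# "Running widgets!" comes back as the lowercased, stemmed tokens
# "run" and "widget" -- punctuation is dropped by the tokenizer.
curl -X GET "localhost:9200/_analyze" \
  -H "Content-Type: application/json" \
  -u "elastic:changeme" \
  -d '{"analyzer": "english", "text": "Running widgets!"}'
```

The same endpoint accepts an index name (`GET /products/_analyze`) so you can test a custom analyzer like english_custom against the settings it was created with.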

create-index.sh
curl -X PUT "localhost:9200/products" \
  -H "Content-Type: application/json" \
  -u "elastic:changeme" \
  -d '{
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "analysis": {
        "analyzer": {
          "english_custom": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "english_stop", "english_stemmer"]
          }
        },
        "filter": {
          "english_stop": { "type": "stop", "stopwords": "_english_" },
          "english_stemmer": { "type": "stemmer", "language": "english" }
        }
      }
    },
    "mappings": {
      "properties": {
        "name":        { "type": "text",    "analyzer": "english_custom" },
        "sku":         { "type": "keyword" },
        "price":       { "type": "double" },
        "category":    { "type": "keyword" },
        "description": { "type": "text",    "analyzer": "english_custom" },
        "created_at":  { "type": "date",    "format": "strict_date_optional_time" },
        "in_stock":    { "type": "boolean" }
      }
    }
  }'
Note

Once an index exists, you cannot change the type of an existing field. You must reindex into a new index. Plan your mappings carefully before going to production, or use index aliases to hide the reindex operation from clients.

Indexing Documents: PUT vs POST, and Bulk for Performance

Use PUT /<index>/_doc/<id> when you control the document ID (idempotent upsert). Use POST /<index>/_doc when you want Elasticsearch to generate the ID.
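A minimal sketch of both forms against the products index (document ID 42 and the payload are illustrative):

```shell
# PUT with an explicit ID: repeating the call overwrites the document
# in place and increments its version -- an idempotent upsert.
curl -X PUT "localhost:9200/products/_doc/42" \
  -H "Content-Type: application/json" \
  -u "elastic:changeme" \
  -d '{"name": "Widget Mini", "sku": "WGT-042", "price": 9.99}'

# POST without an ID: Elasticsearch generates one and returns it in the
# response's _id field. Repeating the call creates a second document.
curl -X POST "localhost:9200/products/_doc" \
  -H "Content-Type: application/json" \
  -u "elastic:changeme" \
  -d '{"name": "Widget Mini", "sku": "WGT-042", "price": 9.99}'
```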

For bulk ingest, always use the _bulk API. Sending one HTTP request per document is a significant performance bottleneck for large datasets.

bulk-index.sh
# Each action line must be followed by its document line.
# The body must end with a trailing newline.
curl -X POST "localhost:9200/_bulk" \
  -H "Content-Type: application/x-ndjson" \
  -u "elastic:changeme" \
  --data-binary '
{"index": {"_index": "products", "_id": "1"}}
{"name": "Widget Pro", "sku": "WGT-001", "price": 29.99, "category": "widgets", "in_stock": true}
{"index": {"_index": "products", "_id": "2"}}
{"name": "Gadget Max", "sku": "GDG-002", "price": 49.99, "category": "gadgets", "in_stock": false}
'

For high-throughput pipelines, raise refresh_interval to 30s (or -1 to disable refresh entirely) during bulk loads, then set it back to 1s. Fewer refreshes mean fewer small Lucene segments, so segment merging competes less with ingest.
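The setting is a dynamic index setting, so it can be flipped without closing the index. A sketch of the bracket around a bulk load:

```shell
# Disable refresh before the bulk load starts...
curl -X PUT "localhost:9200/products/_settings" \
  -H "Content-Type: application/json" \
  -u "elastic:changeme" \
  -d '{"index": {"refresh_interval": "-1"}}'

# ...run the _bulk ingest, then restore the default so new
# documents become searchable again.
curl -X PUT "localhost:9200/products/_settings" \
  -H "Content-Type: application/json" \
  -u "elastic:changeme" \
  -d '{"index": {"refresh_interval": "1s"}}'
```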

Match Query, Bool Query, and Filter vs Must

The match query performs full-text search: it analyzes the input and scores documents by relevance. The bool query composes multiple clauses:

  • must: document must match, contributes to score
  • filter: document must match, does NOT contribute to score (cached)
  • should: boosts score if matched, optional unless no must is present
  • must_not: document must not match, no scoring
bool-query.sh
curl -X GET "localhost:9200/products/_search" \
  -H "Content-Type: application/json" \
  -u "elastic:changeme" \
  -d '{
    "query": {
      "bool": {
        "must": [
          { "match": { "name": "widget" } }
        ],
        "filter": [
          { "term":  { "category": "widgets" } },
          { "term":  { "in_stock": true } },
          { "range": { "price": { "lte": 50.0 } } }
        ]
      }
    }
  }'
Tip

Use filter for structured criteria (category, price range, date range, boolean flags). Filter clauses are cached at the segment level and reused across queries, making repeated filtered searches much faster. Use must only for the free-text portion where relevance scoring matters.

Relevance Scoring: BM25, Boost, and Explain

Elasticsearch uses BM25 (Best Match 25) as its default similarity function. BM25 scores a document based on:

  • Term frequency: how often the term appears in the field (with diminishing returns)
  • Inverse document frequency: how rare the term is across the index
  • Field length normalization: a match in a shorter field scores higher than the same match in a longer one

You can influence scoring with boost:

{
  "query": {
    "bool": {
      "should": [
        { "match": { "name":        { "query": "widget", "boost": 3.0 } } },
        { "match": { "description": { "query": "widget", "boost": 1.0 } } }
      ]
    }
  }
}

To debug why a document scored the way it did, add "explain": true to your query. The response includes a tree of score contributions for each matched document.

explain-query.sh
curl -X GET "localhost:9200/products/_search" \
  -H "Content-Type: application/json" \
  -u "elastic:changeme" \
  -d '{
    "explain": true,
    "query": { "match": { "name": "widget" } }
  }'

Aggregations: Query-Time Analytics

Aggregations are computed at query time on the result set. They are part of the search request body, not part of index creation. A common mistake is thinking you can bake aggregation results into the index structure at write time. You cannot.

aggregations.sh
# terms + date_histogram nested aggregation
curl -X GET "localhost:9200/products/_search" \
  -H "Content-Type: application/json" \
  -u "elastic:changeme" \
  -d '{
    "size": 0,
    "query": { "term": { "in_stock": true } },
    "aggs": {
      "by_category": {
        "terms": { "field": "category", "size": 10 },
        "aggs": {
          "avg_price": {
            "avg": { "field": "price" }
          }
        }
      },
      "sales_over_time": {
        "date_histogram": {
          "field": "created_at",
          "calendar_interval": "month"
        }
      }
    }
  }'

Setting "size": 0 tells Elasticsearch not to return hits, only aggregation results. This is a meaningful performance optimization for dashboards that only need aggregate counts.

Index Templates for Multi-Index Patterns

When you have time-series indices (logs-2025-01, logs-2025-02, etc.), use index templates to apply consistent mappings and settings automatically.

index-template.sh
curl -X PUT "localhost:9200/_index_template/logs_template" \
  -H "Content-Type: application/json" \
  -u "elastic:changeme" \
  -d '{
    "index_patterns": ["logs-*"],
    "priority": 100,
    "template": {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      },
      "mappings": {
        "properties": {
          "timestamp": { "type": "date" },
          "level":     { "type": "keyword" },
          "service":   { "type": "keyword" },
          "message":   { "type": "text" }
        }
      }
    }
  }'

Combine with ILM (Index Lifecycle Management) to roll over indices at a size or age threshold and delete old data automatically.
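A sketch of a matching ILM policy (the policy name logs_policy and the thresholds are illustrative; tune them for your retention needs):

```shell
# Roll over the write index at 50 GB per primary shard or 30 days,
# and delete indices 90 days after rollover.
curl -X PUT "localhost:9200/_ilm/policy/logs_policy" \
  -H "Content-Type: application/json" \
  -u "elastic:changeme" \
  -d '{
    "policy": {
      "phases": {
        "hot": {
          "actions": {
            "rollover": {
              "max_primary_shard_size": "50gb",
              "max_age": "30d"
            }
          }
        },
        "delete": {
          "min_age": "90d",
          "actions": { "delete": {} }
        }
      }
    }
  }'
```

To activate it, reference the policy from the index template's settings (`index.lifecycle.name`) so every new logs-* index picks it up automatically.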

Elasticsearch vs OpenSearch

AWS forked Elasticsearch 7.10.2 in 2021 to create OpenSearch, after Elastic changed the license from Apache 2.0 to SSPL. The query APIs remain highly compatible, but the projects have diverged since the fork.

Choose Elasticsearch if:

  • You need the latest Lucene features (vector search with HNSW, ES|QL query language)
  • You are running on-premises or on a cloud that offers managed Elastic
  • You rely on Kibana or Elastic APM integrations
  • Your team has existing Elastic expertise

Choose OpenSearch if:

  • You are deploying on AWS and want a managed service without license concerns
  • You need to stay on a fully open-source (Apache 2.0) stack
  • You use AWS Cognito / IAM for access control (native OpenSearch Service integration)
  • Cost is a priority: OpenSearch Serverless scales to zero

For new projects on AWS, OpenSearch is the pragmatic default. The operational overhead of self-managing Elasticsearch clusters rarely pays off.

Common Mistakes

Using match where term is correct
match analyzes the query string with the target field's analyzer before matching, and every hit is scored. For exact values this is the wrong tool. The inverse pairing bites even harder: a term query against a text field often matches nothing, because the indexed tokens were lowercased and stemmed at index time while term looks for the literal value. For exact-value matching on keyword, boolean, or numeric fields, always use term or terms inside a filter clause.
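The correct shape for an exact-value lookup, reusing the products mapping from earlier:

```shell
# term does not analyze the input: "widgets" must match the stored
# keyword value byte-for-byte. Inside filter, the clause is unscored
# and eligible for the segment-level query cache.
curl -X GET "localhost:9200/products/_search" \
  -H "Content-Type: application/json" \
  -u "elastic:changeme" \
  -d '{
    "query": {
      "bool": {
        "filter": [
          { "term": { "category": "widgets" } }
        ]
      }
    }
  }'
```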
Dynamic mapping explosion
If you index arbitrary JSON with unknown keys, dynamic mapping creates a new field for every unique key. This can balloon your mapping to thousands of fields, causing memory pressure on the cluster and degrading query performance. Either define explicit mappings or use dynamic: "strict" to reject unknown fields.
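A sketch of the strict variant (the events index and its fields are illustrative):

```shell
# With "dynamic": "strict", indexing a document that contains a field
# not declared in the mapping is rejected with a
# strict_dynamic_mapping_exception instead of silently growing the mapping.
curl -X PUT "localhost:9200/events" \
  -H "Content-Type: application/json" \
  -u "elastic:changeme" \
  -d '{
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "event_type": { "type": "keyword" },
        "timestamp":  { "type": "date" }
      }
    }
  }'
```

The gentler setting `"dynamic": false` accepts unknown fields but neither indexes nor maps them, which suits payloads you only need to store.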
Not using filter for non-scoring criteria
Putting range and term conditions in must instead of filter forces Elasticsearch to compute BM25 scores for every matching document, then discard them. Filter clauses are not scored and are cached. This is a 2-10x performance difference for common dashboard queries.
Storing large binary data in documents
Elasticsearch is not an object store. Indexing large binary payloads (PDF content, base64-encoded images) in document fields bloats the index, slows replication, and puts needless pressure on the JVM heap. Store binaries in S3 and keep only the extracted text and metadata in Elasticsearch.
Confusing aggregations with indexed data structures
The original post on this blog placed "aggs" inside a PUT mapping body. That is invalid. Aggregations are query-time operations sent with GET /<index>/_search. You cannot pre-aggregate at write time through the mapping API. If you need pre-aggregated data for performance, use transform jobs or roll-up indices.

If you want to go deeper on any of this, I offer 1:1 coaching sessions for engineers working on AI integration, cloud architecture, and platform engineering. Book a session (50 EUR / 60 min) or reach out at manuel.fedele+website@gmail.com.
