Elasticsearch Refresh API vs Flush API

The Refresh API and the Flush API may seem to produce the same results, but they differ in performance and persistence.

As discussed in the previous post, each Elasticsearch shard is a Lucene index. To understand the Refresh API and the Flush API in Elasticsearch, we need to talk about two Lucene operations, reopen and commit.

Reopen, as its name suggests, reopens an index when it is called. By calling reopen after indexing some documents, you make those documents searchable. The operation creates a new segment in memory and writes the documents from the memory buffer into it. All of this still happens in memory, so the documents will be lost if the server fails.

Commit, on the other hand, writes the in-memory segments to disk and fsyncs them, so the documents become persistent. But these operations are resource-intensive and expensive.

The Elasticsearch Refresh API calls Lucene's reopen and makes documents searchable. A refresh call creates new segments because of the way Lucene works.

As mentioned above, a refresh call is a reopen operation and happens in memory, so data would be lost on a server failure. To prevent this, Elasticsearch writes data to the translog (one translog per shard) at the same time as it writes to the memory buffer. Translog data is fsynced to disk, so it provides persistence even if the documents in memory disappear.

At certain intervals, or when the translog file is big enough, the data in the translog is committed to the Lucene index and becomes persistent. In that case the Flush API is called implicitly and a commit is performed. You can also call the Flush API to trigger a Lucene commit explicitly.
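To make the two calls concrete, here is a minimal sketch that builds (but does not send) the HTTP requests for a refresh and a flush. `POST /<index>/_refresh` and `POST /<index>/_flush` are the REST endpoints. The base URL, the index name `my-index` and the helper names are assumptions made up for this example.

```python
from urllib.request import Request

# Assumption: a node listening locally; change this for your cluster.
BASE_URL = "http://localhost:9200"


def refresh_request(index):
    """Build a POST /<index>/_refresh request (makes documents searchable)."""
    return Request(f"{BASE_URL}/{index}/_refresh", method="POST")


def flush_request(index):
    """Build a POST /<index>/_flush request (triggers a Lucene commit)."""
    return Request(f"{BASE_URL}/{index}/_flush", method="POST")


# "my-index" is a hypothetical index name.
for req in (refresh_request("my-index"), flush_request("my-index")):
    print(req.get_method(), req.full_url)
```

To actually send one of these requests against a running cluster, you could pass it to `urllib.request.urlopen`.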

In short, the Refresh API only makes documents searchable in memory, whereas the Flush API performs a Lucene commit and makes documents persistent. The Flush API must be called with care because its operations are expensive.

Elasticsearch Basics

The last time I was working with Elasticsearch, I thought that a post about using the Refresh API in bulk operations might help newcomers. But then I realized that it would be better to explain the basics of Elasticsearch and the Refresh API first.

In this post I will talk about Elasticsearch basics.

Elasticsearch, in its own words, is a distributed, RESTful search and analytics engine. It is built on Apache Lucene. Apache Lucene creates and manages inverted indexes, and Elasticsearch gets its power from searching those indexes instead of searching the text itself. An inverted index keeps a list of all the unique words that appear in the documents and, for each word, a list of the documents that contain it.
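A tiny example may help to picture an inverted index. The sketch below is only a toy: it splits text on whitespace and lowercases it, while Lucene's real analysis and storage are far more involved. The function name and the sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical sample documents: id -> text.
docs = {1: "quick brown fox", 2: "lazy brown dog", 3: "quick dog"}
inverted = build_inverted_index(docs)

# Looking up a word is a dictionary access instead of a scan over every text.
print(inverted["brown"])
```

Searching for a word then means reading one entry of the index rather than scanning all documents.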

Elasticsearch is a near real-time search platform. After you add documents to an index, they become searchable within seconds (1 second by default).

To understand the structure of Elasticsearch, we can compare its terms with RDBMS terms.

ES          RDBMS
Index       Table
Document    Row
Field       Column

In the past, an ES Type was said to be similar to an RDBMS Table and an ES Index to an RDBMS Database. But Types have been removed in newer versions of Elasticsearch.

Personally, I add prefixes to index names, and these prefixes give me a logical database.

A new record in Elasticsearch is called a Document, and storing a new document is called Indexing. Indexes contain one or more Documents, and Documents contain one or more Fields.

When you insert a new document, it is first added to a memory buffer and to the Translog. At this stage the new document is not yet searchable and exists only in memory. The Translog (transaction log) helps Elasticsearch recover, in case of a crash, data that has not been committed yet and exists only in memory.

The reason a Lucene commit is not used instead of the Translog is that Lucene commits are heavy operations, and doing a Lucene commit after every document insert/delete/update would drastically decrease the performance of the application. To prevent this, data is first written to the Translog, which is fsynced either at some interval or after each request. This behavior can be changed through settings.
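As a rough illustration of those settings, the sketch below builds a JSON body with the index-level translog settings `index.translog.durability`, `index.translog.sync_interval` and `index.translog.flush_threshold_size`. The values are only examples, not recommendations, and whether each setting can be changed on a live index depends on your Elasticsearch version, so please check the documentation for the version you run.

```python
import json

# Example values only; defaults differ between Elasticsearch versions.
translog_settings = {
    # "request": fsync after every request; "async": fsync in the background.
    "index.translog.durability": "async",
    # How often the translog is fsynced when durability is "async".
    "index.translog.sync_interval": "5s",
    # Translog size that triggers a flush (a Lucene commit).
    "index.translog.flush_threshold_size": "512mb",
}

# JSON body that could be sent to the index settings API.
body = json.dumps(translog_settings, indent=2)
print(body)
```

With "async" durability you trade some safety for speed: operations acknowledged since the last fsync can be lost in a crash.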

The data in the Translog is committed to Lucene at intervals. This prevents translog files from growing too big, and a new translog is created after each commit. The commit to Lucene can be triggered with the Flush API, which we will talk about in the next post.

[Figure: translog]

After the first refresh request to the index (we will talk about the Refresh API in the next post too), the documents in the memory buffer are copied to a new segment, which is also in memory. From now on, the documents are searchable.

[Figure: segment]
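The write path described so far can be modelled with a small toy class. This is only a sketch of the idea (buffer, translog, refresh, flush), not how Elasticsearch is implemented. The class `ToyShard` and all of its methods are invented for this example.

```python
class ToyShard:
    """Toy model of one shard's write path; all names are hypothetical."""

    def __init__(self):
        self.buffer = []           # memory buffer: not searchable yet
        self.translog = []         # stand-in for the on-disk transaction log
        self.memory_segments = []  # searchable, but not committed
        self.disk_segments = []    # searchable and committed

    def index(self, doc):
        # A new document goes to the memory buffer and the translog together.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Copy buffered documents into a new in-memory segment.
        if self.buffer:
            self.memory_segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Commit: persist all segments, then start a fresh translog.
        self.refresh()
        self.disk_segments.extend(self.memory_segments)
        self.memory_segments.clear()
        self.translog.clear()

    def crash_and_recover(self):
        # Everything in memory is lost; uncommitted docs are replayed
        # from the translog into the buffer.
        self.memory_segments = []
        self.buffer = list(self.translog)

    def search(self, term):
        segments = self.disk_segments + self.memory_segments
        return [doc for seg in segments for doc in seg if term in doc]


shard = ToyShard()
shard.index("hello world")
print(shard.search("hello"))  # nothing yet: the document is only buffered
shard.refresh()
print(shard.search("hello"))  # now the document is searchable
```

In this model a crash before `flush` loses the in-memory segments, but the documents come back once the translog is replayed and a refresh runs.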

We will talk about the persistence problem, the Refresh API and the Flush API in the next post.

Let’s continue with Elasticsearch architecture a bit more.

Elasticsearch can be scaled horizontally by adding new nodes to a cluster. A cluster is a group of servers that keep your data. It distributes your indexing and searching operations across the nodes.

As just mentioned, a node is a server that is part of a cluster and takes part in its searching and indexing operations. You can add any number of nodes to a cluster.

Shards

An Elasticsearch index is composed of shards. A shard is a Lucene index and cannot be split. If you have only one shard in your Elasticsearch index and your data exceeds the capacity of your server, it will cause problems. Shards can be hosted on different nodes, so you can spread an Elasticsearch index across nodes. Pros of shards:

  • Split content into different nodes
  • Scale horizontally
  • Improve performance
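To give a feeling for how content can be split across shards, here is a simplified sketch of routing: a hash of a routing value (by default the document id) modulo the number of primary shards. Elasticsearch uses its own hash function and a slightly more elaborate formula, so `zlib.crc32` is only a stand-in here, and the function name `pick_shard` is made up for this example.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Pick a shard number in [0, number_of_primary_shards) for a routing value."""
    # Stand-in hash: crc32 is deterministic and in the standard library.
    # It is NOT the hash function Elasticsearch uses.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards


# Hypothetical document ids spread over 3 primary shards.
for doc_id in ("doc-1", "doc-2", "doc-3", "doc-4"):
    print(doc_id, "-> shard", pick_shard(doc_id, 3))
```

Because the result depends on the number of primary shards, the same document would land on a different shard if that number changed.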

Segments

Shards contain segments, which are the actual data structures holding the inverted indexes. Data is kept in segments. Every refresh that finds new documents creates a new segment. Segments are searched sequentially, so having too many of them decreases performance. To prevent this, Elasticsearch merges segments of similar size into bigger segments. The merge operation costs CPU and I/O, which affects the performance of bulk indexing. For that reason, it is good advice to limit merge activity before bulk operations.

Segments are immutable, so when you update a document, the old version is marked as deleted and a new document is created. Documents that are marked as deleted are removed completely by merge operations.
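The update-then-merge behaviour can be sketched with a toy model. The `Segment` class and the `update` and `merge` functions below are invented for illustration and ignore everything Lucene really does on disk. They only show that an update never rewrites an old segment and that a merge drops the documents marked as deleted.

```python
class Segment:
    """Toy segment: its documents are never modified after creation."""

    def __init__(self, docs):
        self.docs = dict(docs)  # doc_id -> source
        self.deleted = set()    # ids marked as deleted, kept beside the docs


def update(segments, doc_id, new_source):
    # Mark the old version as deleted wherever it lives...
    for seg in segments:
        if doc_id in seg.docs:
            seg.deleted.add(doc_id)
    # ...and write the new version into a brand-new segment.
    segments.append(Segment({doc_id: new_source}))


def merge(segments):
    # Copy only live documents into one bigger segment.
    live = {}
    for seg in segments:
        for doc_id, source in seg.docs.items():
            if doc_id not in seg.deleted:
                live[doc_id] = source
    return [Segment(live)]


segments = [Segment({1: "first version", 2: "another doc"})]
update(segments, 1, "second version")
print(len(segments))   # the old segment is kept, a new one is added
segments = merge(segments)
print(segments[0].docs)
```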

Replicas

Replicas are copies of primary shards, but they are still shards. They are not stored on the same node as their primary shard, so if a server failure or connection problem occurs, Elasticsearch continues to respond to requests from the replicas.

For first-time bulk indexing, setting the number of replicas to 0 will improve performance, but you may lose your data if a server failure occurs while indexing. After the bulk operation has completed, you can set the number of replicas to whatever you want.
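A minimal sketch of that replica change: the code builds (but does not send) two `PUT /<index>/_settings` requests, one that sets `number_of_replicas` to 0 before the bulk load and one that restores it afterwards. The base URL, the index name `my-index`, the helper name and the final replica count of 1 are assumptions for the example.

```python
import json
from urllib.request import Request

# Assumption: a node listening locally; change this for your cluster.
BASE_URL = "http://localhost:9200"


def replica_settings_request(index, replicas):
    """Build a PUT /<index>/_settings request that changes the replica count."""
    body = json.dumps({"index": {"number_of_replicas": replicas}})
    return Request(
        f"{BASE_URL}/{index}/_settings",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical index name and replica counts.
before_bulk = replica_settings_request("my-index", 0)  # send before indexing
after_bulk = replica_settings_request("my-index", 1)   # send when it is done

for req in (before_bulk, after_bulk):
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Either request could be sent with `urllib.request.urlopen` against a running cluster.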

In this post, I tried to give an introduction to the Elasticsearch world. If you have any comments or questions, please share them below. I will talk about the Refresh API and the Flush API in my next post.

Thanks 🙂