
Exploring Elasticsearch with Python

What’s Elasticsearch, precious? I feel like Gollum when confronted by taters. Elasticsearch has been around for a while now. Built on Lucene, it’s become a well-known name in text and semi-structured data storage, analysis, and retrieval. Even though it’s popular enough to have name recognition, I’ve rarely run across it in the wild. We’re going to dip our toes into Elasticsearch by working on a small project to store and search a book. It gives us enough simple problems to solve that by the end we should have at least a basic understanding of how to connect, store, and retrieve simple documents with Elasticsearch.

Elasticsearch comes with two Python client libraries: a low-level one and a high-level one. With the first you end up writing a lot of raw JSON; the second, elasticsearch-dsl, wraps common use cases in Python objects and is less verbose.

pip install elasticsearch
pip install elasticsearch-dsl
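
To give a taste of the difference, here’s roughly the same match query through each client. This is just a sketch; sentence_text is a field we’ll define in a moment, and the rest of this post sticks with the low-level client.

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch()

# Low-level client: you write the query JSON yourself.
result = client.search(index='books',
                       body={'query': {'match': {'sentence_text': 'faith'}}})

# elasticsearch-dsl: the same query expressed as Python objects.
response = Search(using=client, index='books') \
    .query('match', sentence_text='faith') \
    .execute()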

First things first, we need a document to work with. I always fall back on one of my favorites: Confessions by St. Augustine, an early church father. It’s out of copyright and available via Project Gutenberg. I have a little Python library for working with Gutenberg on GitHub if you’re interested in scraping a lot of free books.

All about Elasticsearch

One of the first things you will learn about Elasticsearch if you start reading around on the interwebs is that it’s good and fast at indexing and retrieving smallish documents. Meaning, we probably don’t want to dump a whole book into Elasticsearch (ES) as a single document. We could, but it wouldn’t be much help when searching or retrieving information about the book. I think it makes more sense in our case to break the book up into sentences and store each one separately.

If you’re bored, read the documentation about documents and indexing in ES. I found this one-liner very helpful in understanding how data is stored in ES.

“An index can be thought of as an optimized collection of documents and each document is a collection of fields, which are the key-value pairs that contain your data.”

Let’s get started by describing how we might break Confessions up into pieces for storage in ES. Here is an example document representing the first sentence of the first paragraph. Confessions doesn’t have chapters, so we default chapter_id to 0.

{
    "book_id": 3296,
    "author_id": 5,
    "category": 3,
    "chapter_id": 0,
    "paragraph": 1,
    "sentence_id": 1,
    "sentence_text": "Great art Thou, O Lord, and greatly to be praised; great is Thy power, and Thy wisdom infinite"
}
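
Jumping ahead for a second, storing a single document like this ends up being a one-liner with the low-level client. A minimal sketch (the full code follows below):

from elasticsearch import Elasticsearch

client = Elasticsearch()
doc = {"book_id": 3296, "author_id": 5, "category": 3, "chapter_id": 0,
       "paragraph": 1, "sentence_id": 1,
       "sentence_text": "Great art Thou, O Lord, and greatly to be praised; "
                        "great is Thy power, and Thy wisdom infinite"}

# index + id uniquely identify the document; doc_type groups documents within an index.
client.index(index='books', doc_type='sentence', id='3296_1', body=doc)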

The code to interact with Elasticsearch

The next piece of code really has nothing to do with Elasticsearch, but it helped me think about the documents and visualize how I wanted them stored in ES, and whether that would be a good idea or not. This set of code (available on GitHub) creates objects for an Author and a Book. It also handles breaking a book up into paragraphs and sentences with unique indexes, creating a final data packet similar to the one above for storage in ES.

I have this all running on my Linode ES cluster.

from elasticsearch import Elasticsearch
from random import randint


class Book:
    def __init__(self, book_id: int, title: str, author: object, sub_title=None):
        self.book_id = book_id
        self.title = title
        self.sub_title = sub_title
        self.author = author
        self.raw_text = None
        self.sentence_delimiter = '.'
        self.paragraph_delimiter = '\n\n\n'
        self.paragraphs = None
        self.indexed_paragraphs = []

    def load_raw_text(self):
        with open('downloads/{book}-mod.txt'.format(book=self.book_id)) as f:
            self.raw_text = f.read()

    def split_text_into_paragraphs(self):
        self.paragraphs = self.raw_text.split(self.paragraph_delimiter)
        self.raw_text = None

    def index_paragraphs(self):
        p_counter = 1
        for paragraph in self.paragraphs:
            self.indexed_paragraphs.append({"index": p_counter, "paragraph": paragraph})
            p_counter += 1
        self.paragraphs = None

    def split_paragraphs_into_sentences(self):
        # Generator: yield one data packet per sentence, numbered across the whole book.
        s_counter = 0
        for paragraph in self.indexed_paragraphs:
            sentences = paragraph["paragraph"].split(self.sentence_delimiter)
            for sentence in sentences:
                s_counter += 1
                yield self.create_data_packet(paragraph, s_counter, sentence)
        self.indexed_paragraphs = None

    def create_data_packet(self, paragraph, s_counter, sentence):
        return {"book_id": self.book_id,
                "author_id": self.author.author_id,
                "category": self.author.category,
                "chapter_id": 0,
                "paragraph": paragraph["index"],
                "sentence_id": s_counter,
                "sentence_text": sentence.replace('\n', '')}


class Author:
    def __init__(self, first_name: str, last_name: str, category: str, middle_name=None):
        self.first_name = first_name
        self.last_name = last_name
        self.middle_name = middle_name
        self.category = category
        self.author_id = randint(1, 10000)  # no author registry here, so fake an id


class ElasticSink:
    def __init__(self):
        try:
            self.client = Elasticsearch()
        except Exception as e:
            print('Sorry, problem trying to create Elasticsearch client.')
            exit(1)

    def index_document(self, data_packet: dict, index='books', doc_type='sentence'):
        unique_index_id = '{book_id}_{sentence_id}'.format(book_id=data_packet["book_id"],
                                                           sentence_id=data_packet["sentence_id"])
        try:
            response = self.client.index(index=index,
                                         doc_type=doc_type,
                                         body=data_packet,
                                         id=unique_index_id)
            print(response)
        except Exception as e:
            print(f'Something went wrong and I could not index.. {data_packet}')

    def search_for_word_match(self, word: str, index: str, field: str):
        result = self.client.search(index=index, body={'query': {'match': {field: word}}})
        # The actual matches live under hits.hits; the outer hits holds total and max_score.
        for hit in result["hits"]["hits"]:
            print(hit)

    def search_and_filter(self, index: str, field: str, word: str, author_id: int):
        # Match sentences containing `word`, filtered down to a single author.
        result = self.client.search(index=index,
                                    body={
                                        "query": {
                                            "bool": {
                                                "must": [{"term": {field: word}}],
                                                "filter": [{"term": {"author_id": author_id}}]
                                            }
                                        }
                                    })
        for hit in result["hits"]["hits"]:
            print(hit)


if __name__ == '__main__':
    a = Author(first_name='St.',
               last_name='Augustine',
               category='Early Church Father')
    b = Book(book_id=3296,
             title='The Confessions Of Saint Augustine',
             author=a)
    b.load_raw_text()
    b.split_text_into_paragraphs()
    b.index_paragraphs()
    packets = b.split_paragraphs_into_sentences()
    es = ElasticSink()
    for packet in packets:
        es.index_document(packet)
    es.search_for_word_match(word='faith',
                             index='books',
                             field='sentence_text')
    es.search_and_filter(word='faith',
                         index='books',
                         field='sentence_text',
                         author_id=a.author_id)  # ids are random, so use the one we generated

The results

I really could not believe how easy it was to initially interact with Elasticsearch via Python. 90% of the code I wrote was to work with my text data and prepare it for ingestion. Actually storing the data packets in ES was almost too easy. I ran the Python script locally on my main ES node, so my connection string defaulted correctly without any configuration.

self.client = Elasticsearch()
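
If the script weren’t running on the node itself, you’d pass connection details explicitly instead. A sketch, with a hypothetical hostname:

from elasticsearch import Elasticsearch

# es1.example.com is a placeholder; 9200 is the default ES HTTP port.
client = Elasticsearch(hosts=[{'host': 'es1.example.com', 'port': 9200}])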

Next I just passed my prepared document into the index method on my ES client.

response = self.client.index(index=index,
                             doc_type=doc_type,
                             body=data_packet,
                             id=unique_index_id)

So easy. I want to note a few things about the above code. There are three things you need besides the document you’re going to store, and at first glance they appear to be ways to group documents. First, the index, which is roughly analogous to a database in a relational system. Second, the doc_type; think of it like a table, maybe. Third, you have to give the document a unique id.

Response of Elasticsearch cluster to requests to index documents.

Above you can see a bunch of responses from Elasticsearch generated by the code indexing each sentence of the book. You can see the index is called books, the type is sentence, and the id of each document is the book_id_sentence_id combination, along with whether the operation was successful. I can also tell you it was amazingly fast to index the full book.
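
For reference, each of those responses looks roughly like this (values illustrative, from an ES 6.x cluster):

{'_index': 'books', '_type': 'sentence', '_id': '3296_1', '_version': 1,
 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0},
 '_seq_no': 0, '_primary_term': 1}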

Simple Elasticsearch… search!

Let’s try a simple search to see if it’s just as easy. I added this method to my ElasticSink class. It simply searches whatever index and field I specify for the word I’m looking for: {'query': {'match': {field: word}}}. Notice the word query, obviously, then match, as well as the field I’m calling out.

    def search_for_word_match(self, word: str, index: str, field: str):
        result = self.client.search(index=index, body={'query': {'match': {field: word}}})
        print(result)

When I called it in my code it looked like this.

es.search_for_word_match(word='faith',
                         index='books',
                         field='sentence_text')

It’s amazing the wealth of information you get back. I get some general information about the request, like the time it took and the number of hits, then a score with each hit, then the document itself: in this case the book_id, author_id, category, chapter_id, paragraph, and sentence_id, and of course the full text of the document that contained the word I was looking for.
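
To make that response shape concrete, here’s a small sketch of pulling the useful pieces out of a search result (field names per the documents we stored above):

result = es.client.search(index='books',
                          body={'query': {'match': {'sentence_text': 'faith'}}})
print(result['took'])            # request time in milliseconds
print(result['hits']['total'])   # match count (an integer on 6.x, an object on 7.x+)
for hit in result['hits']['hits']:   # top matches, 10 by default
    print(hit['_score'], hit['_source']['sentence_text'])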

So in this case, apparently, St. Augustine used the word “faith” in Confessions 2 times.

I want to note another easy Elasticsearch search feature that seems very powerful: filters. Let’s say I wanted to run the same search across many books and authors, but filter down to certain authors; my query might look something like this…

{
    "query": {
        "bool": {
            "must": [{"term": {field: word}}],
            "filter": [{"term": {"author_id": author_id}}]
        }
    }
}
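
One caveat worth knowing: term queries are exact token comparisons and skip analysis, so against an analyzed text field a capitalized or multi-word search can silently miss. Swapping the must clause to a match query keeps full-text behavior while the filter stays exact (and since filters don’t affect scoring, ES can cache them). A sketch:

body = {
    "query": {
        "bool": {
            "must": [{"match": {"sentence_text": "faith"}}],
            "filter": [{"term": {"author_id": 1168}}]
        }
    }
}
result = es.client.search(index='books', body=body)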

In conclusion

Overall I’m very impressed with the ease of using Elasticsearch. It took about two minutes to install on my Ubuntu servers, which already had Java. The two Python packages available for working with ES are easy to use, and the out-of-the-box capabilities for storing documents and querying them are amazing. I’m surprised I haven’t seen or heard of more people using ES; it makes me wonder why not. Knowing that I haven’t even scratched the surface of Elasticsearch, I’m looking forward to diving in and learning more.