Skip to content

Building a Tagged Index (or Replacing Elasticsearch)

Introduction

Elasticsearch is a widely used tool for comparing data across domains, but it comes at a significant cost. It is resource-intensive and requires moving away from the relational SQL world to TCP, which introduces latency due to additional network round-trips.

In short, Elasticsearch is only beneficial for very large, high-performance shared setups—most likely, not for yours. This guide explains how to achieve similar functionality much more efficiently using a relational database, reducing hardware costs and lowering electricity consumption.

Setup

First, we need a generic table that maps to all other tables. With ContentType, this is already handled.

Next, we need a tagging mechanism. There are multiple approaches:

  • Tags with separate key and value fields.
  • Tags with merged key-value fields.
  • Tags with unique key-value pairs, either separate or merged. (Note: Some databases impose a 255-character limit.)

The choice depends on whether MySQL and similar databases need to be supported.

For storing tags, we can use text fields with a key-value syntax. A simple syntax like key=value works well. However, it's important to split only on the first = character, which can be tricky in some programming languages. Using a regular expression, such as /^([^=]+)=(.*)/ in JavaScript, helps handle this correctly.

Detecting Collisions (Optional)

To detect duplicate data across different tables, we use a collision_key. A hashing method adapted from RDF techniques can be applied:

  1. Merge keys and values using a separator like = (or just use the tags) into an array.
  2. Sort the array.
  3. Hash each entry individually.
  4. Generate a final hash from all individual hashes as if they were a single byte string.
import edgy

database = edgy.Database("sqlite:///db.sqlite")
models = edgy.Registry(database=database, with_content_type=True)


class ContentTypeTag(edgy.Model):
    # this prevents the normally set ContentTypeField and replaces it with a common ForeignKey
    content_type = edgy.fields.ForeignKey("ContentType", related_name="tags")
    tag = edgy.fields.TextField()

    class Meta:
        registry = models


class Person(edgy.Model):
    first_name = edgy.fields.CharField(max_length=100)
    last_name = edgy.fields.CharField(max_length=100)

    class Meta:
        registry = models
        unique_together = [("first_name", "last_name")]


class Organisation(edgy.Model):
    name = edgy.fields.CharField(max_length=100, unique=True)

    class Meta:
        registry = models


class Company(edgy.Model):
    name = edgy.fields.CharField(max_length=100, unique=True)

    class Meta:
        registry = models


async def main():
    async with database:
        await models.create_all()
        person = await Person.query.create(first_name="John", last_name="Doe")
        await person.content_type.tags.add({"tag": "name=John Doe"})
        await person.content_type.tags.add({"tag": "type=natural_person"})
        org = await Organisation.query.create(name="Edgy org")
        await org.content_type.tags.add({"tag": "name=Edgy org"})
        await org.content_type.tags.add({"tag": "type=organisation"})
        comp = await Company.query.create(name="Edgy inc")
        await comp.content_type.tags.add({"tag": "name=Edgy inc"})
        await comp.content_type.tags.add({"tag": "type=organisation"})
        # now we can query via content_type
        assert await models.content_type.query.filter(tags__tag="type=organisation").count() == 2


edgy.run_sync(main())

Note

Each entry must be processed (either via hashing or another encoding method) to prevent malicious users from injecting = in value fields, which could cause collisions.

Note

The separator is up to you. I use = because it was used in the Secretgraph project, but any character is valid. You can find additional logic in that project.

Alternative Implementations

If you prefer not to use a shared field for key-value operations, you can separate keys and values into distinct fields. Additionally, more powerful databases like PostgreSQL allow enforcing uniqueness constraints on tag fields.

Operations

Searching for a Key

registry.content_type.query.filter(tags__tag__startswith='key=')

Searching for a Key with a Value Prefix

registry.content_type.query.filter(tags__tag__startswith='key=value_start')

References

Secretgraph