Building a Tagged Index (or Replacing Elasticsearch)¶
Introduction¶
Elasticsearch is a widely used tool for comparing data across domains, but it comes at a significant cost. It is resource-intensive and requires moving away from the relational SQL world to TCP, which introduces latency due to additional network round-trips.
In short, Elasticsearch is only beneficial for very large, high-performance shared setups—most likely, not for yours. This guide explains how to achieve similar functionality much more efficiently using a relational database, reducing hardware costs and lowering electricity consumption.
Setup¶
First, we need a generic table that maps to all other tables. With ContentType
, this is already handled.
Next, we need a tagging mechanism. There are multiple approaches:
- Tags with separate key and value fields.
- Tags with merged key-value fields.
- Tags with unique key-value pairs, either separate or merged. (Note: Some databases impose a 255-character limit.)
The choice depends on whether MySQL and similar databases need to be supported.
For storing tags, we can use text fields with a key-value syntax. A simple syntax like key=value
works well. However, it's important to split only on the first =
character, which can be tricky in some programming languages. Using a regular expression, such as /^([^=]+)=(.*)/
in JavaScript, helps handle this correctly.
Detecting Collisions (Optional)¶
To detect duplicate data across different tables, we use a collision_key
. A hashing method adapted from RDF techniques can be applied:
- Merge keys and values using a separator like
=
(or just use the tags) into an array. - Sort the array.
- Hash each entry individually.
- Generate a final hash from all individual hashes as if they were a single byte string.
import edgy
database = edgy.Database("sqlite:///db.sqlite")
models = edgy.Registry(database=database, with_content_type=True)
class ContentTypeTag(edgy.Model):
# this prevents the normally set ContentTypeField and replaces it with a common ForeignKey
content_type = edgy.fields.ForeignKey("ContentType", related_name="tags")
tag = edgy.fields.TextField()
class Meta:
registry = models
class Person(edgy.Model):
first_name = edgy.fields.CharField(max_length=100)
last_name = edgy.fields.CharField(max_length=100)
class Meta:
registry = models
unique_together = [("first_name", "last_name")]
class Organisation(edgy.Model):
name = edgy.fields.CharField(max_length=100, unique=True)
class Meta:
registry = models
class Company(edgy.Model):
name = edgy.fields.CharField(max_length=100, unique=True)
class Meta:
registry = models
async def main():
async with database:
await models.create_all()
person = await Person.query.create(first_name="John", last_name="Doe")
await person.content_type.tags.add({"tag": "name=John Doe"})
await person.content_type.tags.add({"tag": "type=natural_person"})
org = await Organisation.query.create(name="Edgy org")
await org.content_type.tags.add({"tag": "name=Edgy org"})
await org.content_type.tags.add({"tag": "type=organisation"})
comp = await Company.query.create(name="Edgy inc")
await comp.content_type.tags.add({"tag": "name=Edgy inc"})
await comp.content_type.tags.add({"tag": "type=organisation"})
# now we can query via content_type
assert await models.content_type.query.filter(tags__tag="type=organisation").count() == 2
edgy.run_sync(main())
Note
Each entry must be processed (either via hashing or another encoding method) to prevent malicious users from injecting =
in value fields, which could cause collisions.
Note
The separator is up to you. I use =
because it was used in the Secretgraph project, but any character is valid. You can find additional logic in that project.
Alternative Implementations¶
If you prefer not to use a shared field for key-value operations, you can separate keys and values into distinct fields. Additionally, more powerful databases like PostgreSQL allow enforcing uniqueness constraints on tag fields.
Operations¶
Searching for a Key¶
registry.content_type.query.filter(tags__tag__startswith='key=')
Searching for a Key with a Value Prefix¶
registry.content_type.query.filter(tags__tag__startswith='key=value_start')