Building a tagged index (or replace elasticsearch)¶
Introduction¶
Elasticsearch is a commonly used software for comparing data across domains. It comes at a hefty price: It is resource hungry and you have to leave the relational sql world for TCP which introduces round-trips on a by magnitudes slower lane.
In short: it is only useful for very big shared high performance setups. Not for yours, most probably. Here I explain how to do it much simpler and with way less resources in a relational db (this saves you hardware costs and shorts your electricity bill).
Setup¶
First we need a generic table which maps to all other tables. We have ContentType. Done.
Second we need a Tag. Here are many flavors possible:
- Tags with seperate key, value fields.
- Tags with merged key, value fields.
- Tags with unique key, values; seperate or merged. Note: some dbs have 255 char limit.
Depending if mysql and others shall be supported
Secondly we need tags, that are text fields with a key value syntax.
We can use TextFields for this. In my projects I use a syntax: key=value
. Stupidly simple but you have to check that you only seperate on the first
=
which is a bit hard in some programming languages (use regex, e.g. /^([^=]+)=(.*)/
for seperating in js).
Third (optionally): It would be nice to detect collisions among data of different tables --> collision_key. For building a hash for a collision key we can leverage an hash method adapted from the rdf guys.
First merge the keys with values with a seperator like =
(or just use the tags) into an array. Sort the array.
The entries are now hashed (each entry) and afterwards a hash is build from all the hashes as if they would be a long bytestring.
import edgy
database = edgy.Database("sqlite:///db.sqlite")
models = edgy.Registry(database=database, with_content_type=True)
class ContentTypeTag(edgy.Model):
# this prevents the normally set ContentTypeField and replaces it with a common ForeignKey
content_type = edgy.fields.ForeignKey("ContentType", related_name="tags")
tag = edgy.fields.TextField()
class Meta:
registry = models
class Person(edgy.Model):
first_name = edgy.fields.CharField(max_length=100)
last_name = edgy.fields.CharField(max_length=100)
class Meta:
registry = models
unique_together = [("first_name", "last_name")]
class Organisation(edgy.Model):
name = edgy.fields.CharField(max_length=100, unique=True)
class Meta:
registry = models
class Company(edgy.Model):
name = edgy.fields.CharField(max_length=100, unique=True)
class Meta:
registry = models
async def main():
async with database:
await models.create_all()
person = await Person.query.create(first_name="John", last_name="Doe")
await person.content_type.tags.add({"tag": "name=John Doe"})
await person.content_type.tags.add({"tag": "type=natural_person"})
org = await Organisation.query.create(name="Edgy org")
await org.content_type.tags.add({"tag": "name=Edgy org"})
await org.content_type.tags.add({"tag": "type=organisation"})
comp = await Company.query.create(name="Edgy inc")
await comp.content_type.tags.add({"tag": "name=Edgy inc"})
await comp.content_type.tags.add({"tag": "type=organisation"})
# now we can query via content_type
assert await models.content_type.query.filter(tags__tag="type=organisation").count() == 2
edgy.run_sync(main())
Note
It is crucial that each entry is mangled (either by hash or an other mangling method) because otherwise malicious users could inject =
in the value data and provoke
collisions.
Note
The seperator is up to you. I just =
because I used this in the secretgraph project, but all chars are elligable. More logic you can lookup there.
Alternative implementations¶
If you don't like the shared field for key value operations you may want seperate fields for both. Also it would be possible (for postgres and other more powerful dbs) to make the tag field unique.
Operations¶
Searching for a key:
use registry.content_type.query.filter(tags__tag__startswith='key=')
Searching for a key and a value starting with:
use registry.content_type.query.filter(tags__tag__startswith='key=value_start')