
building semantic and keyword search on cockroachDB


why is search interesting? #

search as a concept has dominated software since even before Google's success in monetizing it: from graph search algorithms to networking to vector databases. personally, I find the implementation details quite interesting because there are myriad optimizations, algorithms, and implementations for storing relevant information about a piece of data, such that you can identify the right time to fetch it.

it's also not the most commonly worked-on topic. the majority of software engineers can build a simple CRUD application with sign-in, but they often reach for third-party libraries or services for search. and I'm not talking about pure frontend search over arrays of objects; I'm referring to specialized indexes and databases of search documents.

I was also partly inspired by cockroach labs' post on building vector search in the first place.

why not elasticsearch #

for the simple reason of not wanting to pay for a managed search service or spend valuable memory and storage on a dedicated search container and index in my deploy. plus, there's the added benefit of showing that you don't need the fanciest tools to get a basic search implementation working.

the last thing to cover before getting into the technicals is semantic search. simple implementations of text search essentially just do "keyword searching", and elasticsearch adds features like indexing on attributes of data, but both of these approaches usually fail to capture the meaning of a piece of text before indexing it. meaning can be subjective, but typically it helps to think about the context of a word in a sentence or paragraph. it's sometimes useful to know that a phrase is "positive", or that it's more relevant to specific locations and people, rather than only matching when someone searches for an exact keyword that appears in the search entry.

implementation overview #

the basic implementation plan is to create a table of search documents that includes a raw text column, a column for keyword search values, and a column for semantic search values. the keyword search values will be computed using a stored/computed db column with to_tsvector, a postgres built-in function to convert text to a form optimized for text search. the semantic search values are computed using optimized text embedding models like nomic-embed-text and google’s models/text-embedding-004. lastly, I’ll add an index for each column: an inverted index for keyword search and a vector index for semantic search.

I’ll also need to create an association table to map the models I want to search over to their relevant search documents. for now it will be fine to assume that each model will only correspond to one document.

cockroachdb custom set-up #

referring to cockroach labs' vector search post and their full-text search docs, I knew that pgvector was supported, vector indexing was coming in v25.2.0, and tsvector was already supported.

I did have to pull the unstable v25.2.0 image (cockroachdb/cockroach-unstable:v25.2.0-rc.1) for local development and enable vector indexing with:

SET CLUSTER SETTING feature.vector_index.enabled = true;

document model #

here's a brief mermaid diagram outlining our models and relationships; I'll break this down incrementally.

  erDiagram
    HybridSearchDocument ||--o{ HybridSearchDocumentAssociation : has
    HybridSearchDocument {
        int id
        string raw_text
        tsvector text_tsv
        vector[768] text_embedding_768
    }

    HybridSearchDocumentAssociation {
        int id
        int hybrid_search_document_id FK
        string model_api_identifier
        string model_type
    }

    HybridSearchDocumentAssociation }o--|| User : references
    HybridSearchDocumentAssociation }o--|| Group : references
    HybridSearchDocumentAssociation }o--|| Letter : references
    HybridSearchDocumentAssociation }o--|| Question : references
    HybridSearchDocumentAssociation }o--|| Response : references

    User {
        string api_identifier PK
    }

    Group {
        string api_identifier PK
    }

    Letter {
        string api_identifier PK
    }

    Question {
        string api_identifier PK
    }

    Response {
        string api_identifier PK
    }

first, take a look at the document model:

from pgvector.sqlalchemy import Vector
from sqlalchemy import literal_column
from sqlalchemy.dialects.postgresql import TSVECTOR
from sqlalchemy.orm import Mapped, mapped_column

class HybridSearchDocument:
    ...
    raw_text: Mapped[str] = mapped_column(nullable=False)

    text_tsv_expr_literal = literal_column("text_tsv", type_=TSVECTOR)

    text_embedding_768: Mapped[Vector] = mapped_column(
        Vector(dim=768), nullable=False
    )

there are only 3 interesting columns:

  • raw_text: the raw text that users may be searching for. this is compiled at document creation time and filled with relevant content for the corresponding model, like name, description, creator, etc.
  • text_tsv_expr_literal: a literal column, i.e. not managed by sqlalchemy directly, since sqlalchemy seemed to have issues with computed columns. the column is created through a database migration as a STORED column that computes the tsvector value of the raw_text column directly:

ALTER TABLE hybrid_search_document ADD COLUMN IF NOT EXISTS text_tsv TSVECTOR AS (to_tsvector('english', raw_text)) STORED;

  • text_embedding_768: the vector generated by an embedding model from the raw_text. this is computed at document creation time. I use 768 as the dimension since that is the most common size produced by embedding models.

indexes #

to more efficiently execute an actual search query, I need some important indexes:

an inverted index on the computed tsvector column

CREATE INVERTED INDEX IF NOT EXISTS content_search_inverted_idx ON hybrid_search_document (text_tsv);

inverted indexes are probably the most common kind of index for search data since I'm almost always searching by the contents of a document and never for a specific document itself; I'm never going from document -> search values.

and a vector index on the embedding column

CREATE VECTOR INDEX IF NOT EXISTS embedding_768_vector_idx ON hybrid_search_document (text_embedding_768);

unfortunately, neither of these indexes could be managed by sqlalchemy either, so they had to be added to the tables manually via migration as well.
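as a rough sketch of what those migrations could look like with alembic (the migration tool is my assumption; the DDL is exactly what's shown above):

from alembic import op

def upgrade() -> None:
    # stored computed column for keyword search
    op.execute(
        "ALTER TABLE hybrid_search_document ADD COLUMN IF NOT EXISTS text_tsv "
        "TSVECTOR AS (to_tsvector('english', raw_text)) STORED"
    )
    # inverted index for keyword search, vector index for semantic search
    op.execute(
        "CREATE INVERTED INDEX IF NOT EXISTS content_search_inverted_idx "
        "ON hybrid_search_document (text_tsv)"
    )
    op.execute(
        "CREATE VECTOR INDEX IF NOT EXISTS embedding_768_vector_idx "
        "ON hybrid_search_document (text_embedding_768)"
    )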

so the search document table is now fully set up, and I can add documents and construct search queries. but how will I know what model/data a document actually represents? the simplest solution is an association table.

association table #

the association table will actually be fairly abstract in this scenario since I don’t want to add ForeignKeys every time a new searchable model is added. so I use weak references to a model’s unique identifier and hydrate the model references after each search query. this can be optimized with batching queries per model table as well.

from sqlalchemy import ForeignKey
from sqlalchemy.orm import Mapped, mapped_column, relationship

class HybridSearchDocumentAssociation:
    hybrid_search_document_id: Mapped[int] = mapped_column(
        ForeignKey(
            "hybrid_search_document.id",
            name="association_hybrid_search_document_id_fkey",
            ondelete="CASCADE",
        ),
        nullable=False,
    )
    model_api_identifier: Mapped[str] = mapped_column(nullable=False)
    model_type: Mapped[str] = mapped_column(nullable=False)

    document: Mapped["HybridSearchDocument"] = relationship(
        back_populates="associations",
    )

class HybridSearchDocument:
    ...
    associations: Mapped[list[HybridSearchDocumentAssociation]] = relationship(
        "HybridSearchDocumentAssociation", back_populates="document"
    )

the 3 interesting columns this time are:

  • hybrid_search_document_id: a foreign key column that links an association to a document. the delete behavior is set to remove associations if a document is removed
  • model_api_identifier: a weak reference to the unique identifier for the model that is related to the search query. for now, each model will be one-to-one with a search document.
  • model_type: just a quick categorization column to help batch hydration queries

I also added sqlalchemy relationships on both the document class and the association class so that I can reference a document's associations after a search query.

search queries #

I ended up writing 3 query functions: one for semantic search, one for keyword search, and one for a union of the two.

def semantic_search_hybrid_search_document(
    db: Session, query: str, limit: int = 10
) -> list[HybridSearchDocument]:
    # embed the search query, then find the closest stored embeddings
    text_embedding = _generate_text_embedding(query)
    return (
        db.query(HybridSearchDocument)
        .filter(
            HybridSearchDocument.text_embedding_768.l2_distance(text_embedding)
            < 0.5
        )
        .order_by(
            HybridSearchDocument.text_embedding_768.l2_distance(text_embedding)
        )
        .limit(limit)
        .all()
    )

for semantic search, it's necessary to first generate a text embedding from the search query and use that to find results. vector columns from pgvector have an l2_distance comparator, which computes the euclidean distance between the stored embedding and an input embedding. I arbitrarily chose 0.5 as my distance cutoff.
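the helper _generate_text_embedding isn't shown in this post; a minimal sketch, assuming nomic-embed-text served by a local Ollama instance (the hosting choice is my assumption, the REST endpoint and response shape are Ollama's):

import requests

def _generate_text_embedding(text: str) -> list[float]:
    # nomic-embed-text returns 768-dimension vectors, matching the column
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["embedding"]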

def keyword_search_hybrid_search_document(
    db: Session, query: str, limit: int = 10
) -> list[HybridSearchDocument]:
    # tokenize and normalize the query, then match and rank against text_tsv
    tsquery = func.plainto_tsquery("english", query)
    return (
        db.query(HybridSearchDocument)
        .filter(HybridSearchDocument.text_tsv_expr_literal.op("@@")(tsquery))
        .order_by(
            func.ts_rank(
                HybridSearchDocument.text_tsv_expr_literal, tsquery
            ).desc()
        )
        .limit(limit)
        .all()
    )

for keyword search, I use plainto_tsquery, which tokenizes and normalizes the search query, and then the actual comparator is @@, the postgres text search match operator.
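to see what that normalization actually produces, you can evaluate the function directly (a quick illustrative check):

from sqlalchemy import func, select

# stop words are dropped and terms are stemmed before matching:
tokens = db.execute(
    select(func.plainto_tsquery("english", "Searching for letters!"))
).scalar_one()
# -> 'search' & 'letter'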

def dual_search_hybrid_search_document(
    db: Session, query: str, limit: int = 10
) -> list[HybridSearchDocument]:
    # a document matches if either the embedding is close or the tsquery hits
    text_embedding = _generate_text_embedding(query)
    tsquery = func.plainto_tsquery("english", query)
    return (
        db.query(HybridSearchDocument)
        .filter(
            or_(
                HybridSearchDocument.text_embedding_768.l2_distance(
                    text_embedding
                )
                < 0.5,
                HybridSearchDocument.text_tsv_expr_literal.op("@@")(tsquery),
            )
        )
        .order_by(
            HybridSearchDocument.text_embedding_768.l2_distance(
                text_embedding
            ),
            func.ts_rank(
                HybridSearchDocument.text_tsv_expr_literal, tsquery
            ).desc(),
        )
        .limit(limit)
        .all()
    )

lastly, for dual search, I just OR the two filters together and order by vector distance first, then keyword rank.
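calling any of the three functions looks the same from the outside; an illustrative usage (the query string is made up):

documents = dual_search_hybrid_search_document(db, "letters about summer travel")
for document in documents:
    for association in document.associations:
        print(association.model_type, association.model_api_identifier)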

result hydration #

the search queries just return search documents, so next I have to “hydrate” my search results with the models they are associated with.

I already have model getters keyed by their reference ids, and the tables have the necessary index to make this fast:

def get_models(
    db: Session, model_cls: type[API_CLS], api_ids: list[str]
) -> Sequence[API_CLS]:
    return (
        db.query(model_cls)
        .filter(model_cls.api_identifier.in_(api_ids))
        .all()
    )

the more interesting part was linking each model to its search association rows by HybridSearchDocumentAssociation.model_type. I use a registry pattern to register searchable sqlalchemy models with their search model_type:

SEARCH_MODEL_REGISTRY: RegistrationDict[str, type[APIIdentified]] = (
    RegistrationDict("SEARCH_MODEL_REGISTRY")
)

def register_searchable_model(model_type: str):
    # register the class under its model_type and return it unchanged
    def decorator(cls: type[APIIdentified]) -> type[APIIdentified]:
        SEARCH_MODEL_REGISTRY[model_type] = cls
        return cls

    return decorator

the actual usage becomes very simple:

@register_searchable_model("letter")
class Letter:
    ...
a caveat about the registration pattern: it relies on all models being imported when my app starts up to ensure that SEARCH_MODEL_REGISTRY is fully populated. I handle this by running a function, import_all_sqla_models, that imports all sqla models at startup.
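with the registry and get_models in place, hydration can batch one query per model table; a sketch under the definitions above (hydrate_search_results is an illustrative name, not from the post):

from collections import defaultdict

def hydrate_search_results(
    db: Session, documents: list[HybridSearchDocument]
) -> dict[str, Sequence[APIIdentified]]:
    # group the weak references by model_type so each table is queried once
    ids_by_type: dict[str, list[str]] = defaultdict(list)
    for document in documents:
        for association in document.associations:
            ids_by_type[association.model_type].append(
                association.model_api_identifier
            )

    # resolve each model_type to its class via the registry, then batch-fetch
    return {
        model_type: get_models(db, SEARCH_MODEL_REGISTRY[model_type], api_ids)
        for model_type, api_ids in ids_by_type.items()
    }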

what’s left? #

document creation and backfills #

creation #

I have 5 different models that will be searchable in the initial release: two party-type models

  • User
  • Group

and three content-type models that have large text strings

  • Letter
  • Question
  • Response

I need to add logic to create a search document whenever each individual model is created, something like:

def create_user(...):
    user = User.create(...)
    # TODO: define the raw text for a user
    raw_text = f"name:{user.name};email:{user.email}"
    user_document = create_hybrid_search_document(raw_text, user.api_identifier, "user")

defining the text shape for each model will be the bulk of the work here, but there is lots of room to iterate and test things out.
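create_hybrid_search_document isn't defined in this post either; one possible shape, reusing _generate_text_embedding and letting the database compute text_tsv on write (the session handling is my assumption):

def create_hybrid_search_document(
    db: Session, raw_text: str, model_api_identifier: str, model_type: str
) -> HybridSearchDocument:
    document = HybridSearchDocument(
        raw_text=raw_text,
        text_embedding_768=_generate_text_embedding(raw_text),
    )
    document.associations.append(
        HybridSearchDocumentAssociation(
            model_api_identifier=model_api_identifier,
            model_type=model_type,
        )
    )
    db.add(document)
    db.flush()  # the database computes text_tsv from raw_text on write
    return document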

backfills #

I'll need to create scripts to backfill documents for all existing models in production. these scripts will be idempotent so they can be re-run to regenerate search documents if the logic for converting models to representative text changes.
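a sketch of what an idempotent backfill could look like, assuming a hypothetical build_raw_text helper that encodes the per-model text shape:

def backfill_documents(db: Session, model_type: str) -> None:
    model_cls = SEARCH_MODEL_REGISTRY[model_type]
    for model in db.query(model_cls).all():
        raw_text = build_raw_text(model)  # hypothetical per-model text builder
        association = (
            db.query(HybridSearchDocumentAssociation)
            .filter_by(
                model_type=model_type,
                model_api_identifier=model.api_identifier,
            )
            .one_or_none()
        )
        if association is None:
            # no document yet: create one
            create_hybrid_search_document(
                db, raw_text, model.api_identifier, model_type
            )
        elif association.document.raw_text != raw_text:
            # text shape changed: regenerate the document in place
            association.document.raw_text = raw_text
            association.document.text_embedding_768 = _generate_text_embedding(
                raw_text
            )
    db.commit()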

search filtering #

it would be useful to filter to just the model types a user wants, or even to filter on a model's attributes: for example, searching only for users, or only for users in a particular group.
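model-type filtering falls out of the association table; a sketch of keyword search restricted to certain model types (illustrative, not in the current implementation):

def filtered_keyword_search(
    db: Session, query: str, model_types: list[str], limit: int = 10
) -> list[HybridSearchDocument]:
    tsquery = func.plainto_tsquery("english", query)
    return (
        db.query(HybridSearchDocument)
        # join through the association table and keep only the wanted types
        .join(HybridSearchDocument.associations)
        .filter(HybridSearchDocumentAssociation.model_type.in_(model_types))
        .filter(HybridSearchDocument.text_tsv_expr_literal.op("@@")(tsquery))
        .order_by(
            func.ts_rank(
                HybridSearchDocument.text_tsv_expr_literal, tsquery
            ).desc()
        )
        .limit(limit)
        .all()
    )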

model relationships #

if you think of all the models as a connected, bidirectional, cyclic graph, keeping track of the edges could be useful from a search perspective. for example, if I search for a user's email, it might be good to also see all the groups they are in.

the overhead of maintaining this may be tough though, especially because we are not dealing with a DAG or a polytree.

future use-cases for embedded text vectors #

  • RAG implementations can use semantic search to find related content and improve the context given to an LLM
  • data-science use-cases like clustering and anomaly detection
  • recommendations, by finding content similar to a user's content

#blog #cockroach #cockroachlabs #database #postgres #rds #software #swe #tech #search #elasticsearch #vector #vectorsearch #semantic #semanticsearch #pgvector #embedding #semantic-search