Implement search index that works with multiple API server instances #655

nscuro · 2023-07-06T14:18:56Z

One of the challenges with making the API server horizontally scalable (#375), is the question of what to do with the local Lucene indexes.

Lucene indexes use write locks, such that only concurrent write operations by multiple processes are not possible. As a consequence, it is not possible to share an indexes across multiple application instances.

Additionally, index modifications are "requested" through DT's internal event system. For example, this is roughly what happens when a new component is created via REST API call:

sequenceDiagram
    Client ->> ComponentResource: PUT /api/v1/component
    ComponentResource ->> QueryManager: Create Component
    QueryManager ->> Database: INSERT INTO "COMPONENT" ...
    QueryManager ->> SingleThreadedEventService: IndexEvent(Action.CREATE, component)
    SingleThreadedEventService ->> IndexTask: inform(indexEvent)
    IndexTask ->> IndexManager: add(indexEvent.component)
    QueryManager ->> SingleThreadedEventService: IndexEvent(Action.COMMIT, Component.class)
    SingleThreadedEventService ->> IndexTask: inform(indexEvent)
    IndexTask ->> IndexManager: commit()

The procedure is similar for when components are updated or deleted. The usage of internal events means that an API server instance can only ever update indexes with changes it itself has made.

If we were to refactor index access such that only one instance could perform writes, and all others only reads, components created or modified by readers would never reflect in the index.

There are a few options I see for dealing with this:

Drop usage of local Lucene indexes entirely
- This involves finding a (optimally) better replacement: Using text search capabilities of Postgres, or utilizing a centralized search server like ElasticSearch
Have each instance of the API server maintain their own Lucene index
- Instead of publishing index changes to the internal event system, publish them to a Kafka topic
- There's no repercussions towards consistency guarantees, as even the current implementation is only eventually consistent

Option (2) would look roughly like this:

sequenceDiagram
    par
        loop continuously
            IndexKafkaConsumer ->> Kafka: Consume index events
            loop for each event
                IndexKafkaConsumer ->> IndexManager: add / delete / commit
            end
        end
    and
        Client ->> ComponentResource: PUT /api/v1/component
        ComponentResource ->> QueryManager: Create Component
        QueryManager ->> Database: INSERT INTO "COMPONENT" ...
        QueryManager ->> Kafka: IndexEvent(Action.CREATE, component)
        QueryManager ->> Kafka: IndexEvent(Action.COMMIT, Component.class)
    end

Note
For this to work, each API server instance must use a dedicated consumer group ID for its IndexKafkaConsumer. All instances must receive all events, if they form a consumer group this will not be the case.

Because the order of write operations on the index matter (CREATE should be processed before COMMIT), the Kafka consumer must be single-threaded. This also means that the only reason to have more than one partition for the Kafka topic would be availability, but not parallelism.

Warning
A notable implication is that the index state can be different across API server instances, depending on consumer lag, event processing failures, etc.

The text was updated successfully, but these errors were encountered:

nscuro · 2023-07-07T09:33:16Z

Few complications:

If new instances of the API server are started later, previous index events may have already been deleted from the Kafka topic; As a consequence, the instance will not be able to build a complete index
While an instance is catching up on processing index events, it may return search results that have been outdated for a long time, depending on the retention policy of the topic and how for long the instance was inactive
Indexes will eventually diverge between instances, so each instance needs a mechanism to detect drift in its own index and repair it; We have such a consistency check currently

nscuro · 2023-07-07T11:00:08Z

Idea from @VithikaS: If tables like COMPONENT and PROJECT had CREATED_AT, UPDATED_AT etc. columns, it would be possible to build indexes incrementally. This would make (re-)indexing solely via database queries a lot more viable. The big benefit being that we could avoid a lot of consistency issues we'd run into if we rely on messaging to propagate index updates.

nscuro · 2023-07-11T08:30:53Z

Leaning onto Full Text Search capabilities of the database we're already using might be preferable for multiple reasons:

Search results are always consistent with what's in the database
Search capability is available to all services connected to the database
Reduced overhead by not having to maintain and operate yet another technology

What needs to be checked is how well FTS is supported in other RDBMSes besides PostgreSQL, as we will eventually have to tackle #642 and can't rely on Postgres-exclusive features.

nscuro · 2023-07-11T08:32:18Z

For a short-term solution, we decided to drop Lucene entirely: #661

nscuro added proposal architecture component/api-server spike/research Requires more research before implementation labels Jul 6, 2023

nscuro changed the title ~~Propagate index commands to multiple API server instances via Kafka~~ Propagate Lucene index updates to multiple API server instances via Kafka Jul 6, 2023

nscuro added the size/L High effort label Jul 7, 2023

nscuro changed the title ~~Propagate Lucene index updates to multiple API server instances via Kafka~~ Propagate Lucene index updates to multiple API server instances Jul 10, 2023

nscuro mentioned this issue Jul 10, 2023

Remove Lucene indexes from API server #661

Closed

nscuro changed the title ~~Propagate Lucene index updates to multiple API server instances~~ Implement search index that works with multiple API server instances Jul 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement search index that works with multiple API server instances #655

Implement search index that works with multiple API server instances #655

nscuro commented Jul 6, 2023 •

edited

Loading

nscuro commented Jul 7, 2023

nscuro commented Jul 7, 2023

nscuro commented Jul 11, 2023

nscuro commented Jul 11, 2023

Implement search index that works with multiple API server instances #655

Implement search index that works with multiple API server instances #655

Comments

nscuro commented Jul 6, 2023 • edited Loading

nscuro commented Jul 7, 2023

nscuro commented Jul 7, 2023

nscuro commented Jul 11, 2023

nscuro commented Jul 11, 2023

nscuro commented Jul 6, 2023 •

edited

Loading