Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement search index that works with multiple API server instances #655

Open
nscuro opened this issue Jul 6, 2023 · 4 comments
Open
Labels
architecture component/api-server proposal size/L High effort spike/research Requires more research before implementation

Comments

@nscuro
Copy link
Member

nscuro commented Jul 6, 2023

One of the challenges with making the API server horizontally scalable (#375), is the question of what to do with the local Lucene indexes.

Lucene indexes use write locks, such that only concurrent write operations by multiple processes are not possible. As a consequence, it is not possible to share an indexes across multiple application instances.

Additionally, index modifications are "requested" through DT's internal event system. For example, this is roughly what happens when a new component is created via REST API call:

sequenceDiagram
    Client ->> ComponentResource: PUT /api/v1/component
    ComponentResource ->> QueryManager: Create Component
    QueryManager ->> Database: INSERT INTO "COMPONENT" ...
    QueryManager ->> SingleThreadedEventService: IndexEvent(Action.CREATE, component)
    SingleThreadedEventService ->> IndexTask: inform(indexEvent)
    IndexTask ->> IndexManager: add(indexEvent.component)
    QueryManager ->> SingleThreadedEventService: IndexEvent(Action.COMMIT, Component.class)
    SingleThreadedEventService ->> IndexTask: inform(indexEvent)
    IndexTask ->> IndexManager: commit()
Loading

The procedure is similar for when components are updated or deleted. The usage of internal events means that an API server instance can only ever update indexes with changes it itself has made.

If we were to refactor index access such that only one instance could perform writes, and all others only reads, components created or modified by readers would never reflect in the index.

There are a few options I see for dealing with this:

  1. Drop usage of local Lucene indexes entirely
    • This involves finding a (optimally) better replacement: Using text search capabilities of Postgres, or utilizing a centralized search server like ElasticSearch
  2. Have each instance of the API server maintain their own Lucene index
    • Instead of publishing index changes to the internal event system, publish them to a Kafka topic
    • There's no repercussions towards consistency guarantees, as even the current implementation is only eventually consistent

Option (2) would look roughly like this:

sequenceDiagram
    par
        loop continuously
            IndexKafkaConsumer ->> Kafka: Consume index events
            loop for each event
                IndexKafkaConsumer ->> IndexManager: add / delete / commit
            end
        end
    and
        Client ->> ComponentResource: PUT /api/v1/component
        ComponentResource ->> QueryManager: Create Component
        QueryManager ->> Database: INSERT INTO "COMPONENT" ...
        QueryManager ->> Kafka: IndexEvent(Action.CREATE, component)
        QueryManager ->> Kafka: IndexEvent(Action.COMMIT, Component.class)
    end
Loading

Note
For this to work, each API server instance must use a dedicated consumer group ID for its IndexKafkaConsumer. All instances must receive all events, if they form a consumer group this will not be the case.

Because the order of write operations on the index matter (CREATE should be processed before COMMIT), the Kafka consumer must be single-threaded. This also means that the only reason to have more than one partition for the Kafka topic would be availability, but not parallelism.

Warning
A notable implication is that the index state can be different across API server instances, depending on consumer lag, event processing failures, etc.

@nscuro nscuro added proposal architecture component/api-server spike/research Requires more research before implementation labels Jul 6, 2023
@nscuro nscuro changed the title Propagate index commands to multiple API server instances via Kafka Propagate Lucene index updates to multiple API server instances via Kafka Jul 6, 2023
@nscuro
Copy link
Member Author

nscuro commented Jul 7, 2023

Few complications:

  • If new instances of the API server are started later, previous index events may have already been deleted from the Kafka topic; As a consequence, the instance will not be able to build a complete index
  • While an instance is catching up on processing index events, it may return search results that have been outdated for a long time, depending on the retention policy of the topic and how for long the instance was inactive
  • Indexes will eventually diverge between instances, so each instance needs a mechanism to detect drift in its own index and repair it; We have such a consistency check currently

@nscuro nscuro added the size/L High effort label Jul 7, 2023
@nscuro
Copy link
Member Author

nscuro commented Jul 7, 2023

Idea from @VithikaS: If tables like COMPONENT and PROJECT had CREATED_AT, UPDATED_AT etc. columns, it would be possible to build indexes incrementally. This would make (re-)indexing solely via database queries a lot more viable. The big benefit being that we could avoid a lot of consistency issues we'd run into if we rely on messaging to propagate index updates.

@nscuro nscuro changed the title Propagate Lucene index updates to multiple API server instances via Kafka Propagate Lucene index updates to multiple API server instances Jul 10, 2023
@nscuro
Copy link
Member Author

nscuro commented Jul 11, 2023

Leaning onto Full Text Search capabilities of the database we're already using might be preferable for multiple reasons:

  • Search results are always consistent with what's in the database
  • Search capability is available to all services connected to the database
  • Reduced overhead by not having to maintain and operate yet another technology

What needs to be checked is how well FTS is supported in other RDBMSes besides PostgreSQL, as we will eventually have to tackle #642 and can't rely on Postgres-exclusive features.

@nscuro nscuro changed the title Propagate Lucene index updates to multiple API server instances Implement search index that works with multiple API server instances Jul 11, 2023
@nscuro
Copy link
Member Author

nscuro commented Jul 11, 2023

For a short-term solution, we decided to drop Lucene entirely: #661

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
architecture component/api-server proposal size/L High effort spike/research Requires more research before implementation
Projects
None yet
Development

No branches or pull requests

1 participant