---
Solution No.4 is the best IMO.
It's unclear whether this fiber should be run on all replicas or only on the master; what happens if the master is switched while the build is in progress; whether we should persist the build progress somehow, and maybe even replicate the changes; and what happens if the build fails (e.g. if the unique constraint is violated). I assume you'll describe the procedure in more detail in the next RFC.
This would be unacceptable for Vinyl. I think that in the case of Vinyl, "lazy" (I'd rather name them "disabled") indexes should remain disabled after restart and continue building after `box.cfg{}` returns.
---
Definitely not solution 1. In some cases, with a single large space, the cloning process would take up 2x the memory. Besides, solution 1 basically makes the user write all the code we already have for index build (which is rather complicated).
Solution 2 may be better memory-wise, but it still has a problem: we would have to take care of master changes, which seems rather complicated: finding the right instance to continue the process (there might be multiple writable instances), taking care of the "original" writes which should be duplicated to the new space versus the ones coming from an existing master, and so on. Solutions 3 and 4 look good to me.
If the index is already built and turned global on all instances of the replicaset, new replicas will simply receive it during the join process, like they always do. OTOH, we don't even have to make the index global if we say that all the schema is defined in the centralized configuration. In this case each instance will have the same set of indexes built locally, and everything will work as expected, no?
---
**Solution 1.** We should not go this way, IMHO.

> This won't work for vinyl, insert may yield and abort the original transaction. The only reliable option is to modify the app's logic and insert into several spaces at once, which seems like a really bad way in terms of user experience.

Agree with Sergey here: the user will have to deal with inserts of the data not yet moved to the other space, which is not trivial.

**Solution 2.** As you said, it indeed requires user intervention in the case of a master change, which I don't really like.

**Solution 3.** This solution doesn't fix the bug we're discussing here - the replication hang. It proposes an alternative way of creating indexes, which users won't use and probably won't even know about. If the locally built index doesn't exist, we fall back to the old behavior, with the same problem. Moreover, this will require introducing local rows into the global (and, moreover, synchronous) `_index` space.

**Solution 4.** I like this one, since no user intervention is required. Even if the user doesn't change their code, it'll work the new way, without blocking the replication process. I propose to consider this solution more precisely.
---
Reviewers
Tickets
Summary
When a space is large enough, building a new index on it can take a long time - minutes or hours, depending on the space size. The same applies to index alter - it might require an index rebuild, a space fullscan. That isn't a big deal locally on the instance, because the build is asynchronous - transactions can still be processed, even on the space being changed.
But it gets complicated in a cluster, for the following reasons.
Replication gets stuck in a replicated cluster. Yes, the index build is async fiber-wise, but it blocks the current fiber. The blockage happens on-replace into the `_index` space, not on-commit. Because of that, the applier's feature of committing txns asynchronously doesn't help: the longest part happens before the commit. The replica's lag will grow, and it won't receive any new data until the build is finished. But the replication is still alive, and at least it doesn't block transaction processing on the master when the replication is asynchronous. Unlike the next problem.
Master transaction processing gets stuck in a synchronously replicated cluster, because the index build transaction on the master blocks the limbo until the appliers also apply it and write it to their WALs. And that will last until a quorum of replicas has finished the index build.
Essentially, in a synchro cluster with large spaces it becomes impossible to create new indexes. It requires hacks, like creating a new space with all the needed indexes and the same format, slowly copying the data from the old space in multiple small transactions, then deleting the old space. It doesn't sound complex, really, but it requires the user to change their code to maintain this "migration" process by writing into both the old and the new spaces while the copying is in progress.
This document suggests solutions for how people could create large indexes in a replicaset without blocking the replication.
⭐️⭐️ Solution 1: do nothing
The issue in the ticket isn't really a bug. It is an inconvenience, which has a workaround explained above.
The only problem is that the user would have to support that in their code.
Let's repeat the solution here for clarity. When a user wants a new index, or to alter an existing one in a non-trivial way, they do this (see the sketch below):

1. Create a new space with the same format and all the needed indexes.
2. Set an `on_replace` trigger on the old space, which does the same work on the new space.
3. Copy the data from the old space in multiple small transactions.
4. Drop the old space and rename the new one.

Pros: don't need to do anything, already works.
Cons: the user has to change their code to run and maintain this migration while the copying is in progress.
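A minimal sketch of the procedure, assuming a memtx space `old` with an unsigned primary key in field 1 and a new secondary index over a string field 2; all the names and the batch size are illustrative:

```lua
local fiber = require('fiber')

local old = box.space.old

-- 1. A new space with the same format and the wanted extra index.
local new = box.schema.space.create('new', {format = old:format()})
new:create_index('pk', {parts = {{1, 'unsigned'}}})
new:create_index('by_name', {parts = {{2, 'string'}}})

-- 2. Duplicate all ongoing writes into the new space.
old:on_replace(function(old_tuple, new_tuple)
    if new_tuple ~= nil then
        box.space.new:replace(new_tuple)
    else
        box.space.new:delete(old_tuple[1])
    end
end)

-- 3. Copy the existing data in small batches, yielding between them.
local last_key = {}
while true do
    local batch = old:select(last_key, {iterator = 'GT', limit = 1000})
    if #batch == 0 then break end
    box.begin()
    for _, tuple in ipairs(batch) do
        box.space.new:replace(tuple)
    end
    box.commit()
    last_key = {batch[#batch][1]}
    fiber.yield()
end

-- 4. The final swap. A real migration must also handle races between
-- the trigger and the copy, e.g. a delete landing between two batches.
old:drop()
new:rename('old')
```

The small transactions are the whole point: each of them replicates quickly, so neither the appliers nor the limbo ever wait on a single huge txn.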
⭐️ Solution 2: space alter-clone
Not a bug, as said above. But the inconvenience is quite unhandy. Tarantool could wrap the solution described above into a nice API available out of the box.
That is, Tarantool would allow cloning a space with any of its indexes and metadata altered. Once the cloning is done, the user could do the final "drop + rename" themselves.
If designed carefully, this could be an interesting tool to do more than just new index creation.
Note that the problem in the ticket also concerns index alterations which can't be completed instantly and require an index scan (for duplicates, or for values having an incompatible type).
If the solution looks interesting enough, a proper API and behaviour design could be proposed. It could be something like `old_space:copy(new_space)`. An example:
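The sketch below is hypothetical: `space:copy()` does not exist, and its name and behaviour here are assumptions for illustration.

```lua
local old = box.space.users

-- Clone the metadata, altering the index set along the way.
local new = box.schema.space.create('users_new', {format = old:format()})
new:create_index('pk', {parts = {{1, 'unsigned'}}})
new:create_index('by_email', {parts = {{2, 'string'}}, unique = true})

-- Tarantool would copy the data in small internal transactions and keep
-- duplicating the ongoing writes until the copy catches up.
old:copy(new)

-- The final "drop + rename" stays with the user.
old:drop()
new:rename('users')
```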
It could enable more interesting outcomes as well.
Pros: the whole migration becomes one built-in API call - no custom migration code to write and maintain.
Cons: the space data is stored twice while the cloning is in progress, and a master change in the middle of the process still requires user intervention.
⭐️⭐️⭐️ Solution 3: replica-local index
The problem of index creation/alter is hitting the replication hard. One approach then could be to attack the replication shortcomings. That is, drop the replication from the process.
Let's imagine that the replicas and the master could build the same indexes independently, fully locally. And when finished, the master would "enable" this index in a single small DDL transaction.
The index creation would then be a 2-step process: 1 - create a local index on all replicas; 2 - turn the local index into a global one on the master.
This needs 2 features which aren't available yet, but aren't hard to add: replica-local indexes which can be created on any instance, even a read-only one, and an alter operation which turns such a local index into a global one.
Replica-local DDL is not unusual for Tarantool. There is right now a space type `temporary` (not to be confused with `data-temporary`). It can be created on read-only replicas, can have its own indexes, is visible in `_space` and its indexes in `_index`, but it is not replicated, and its data isn't stored in the WAL.

Replica-local persistent data also is not a new thing. Tarantool does have "local" spaces. They have replicaset-global meta (`_space` and `_index` rows) and their data is persisted, but not replicated. They can only be created by the master, but can take DML on any instance, and it is not replicated.

The proposal is to introduce replica-local indexes. They can be created by any replica, even a read-only one, on absolutely any space. Such an index is persisted in `_index` and is not replicated.

Creation of the index will not affect replication at all, and won't block the limbo, because replica-local transactions are not synchronous by definition.
To create a new global index, the user would then go and create a replica-local index on each instance.

Then, to make it global, the user would run `index:alter{is_global = true}` on the master instance. Locally it works instantly. When this txn comes to a replica, it tries to find a replica-local index in this space with all the same meta besides the index ID. If found, it also works instantly, by changing the index ID to the global one (the ID is the primary key, so this means moving the local index's data to the new global index with the global ID, and dropping the now-empty local index). If not found, a new index is built as usual. A sketch of the flow follows below.

The solution not only allows creating/altering indexes in the cluster bypassing the replication, but also allows the user to purposefully create replica-local indexes without ever making them global. It could be handy to reduce memory usage on the master and speed up the master's DML: the master would only store the unique indexes and handle DML, and the replicas would store the other indexes and serve DQL.
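A sketch of the proposed 2-step flow; the `is_global` index option is the one proposed above and does not exist today:

```lua
-- Step 1: on every instance, including the read-only replicas.
box.space.users:create_index('by_email', {
    parts = {{2, 'string'}},
    is_global = false, -- replica-local: persisted in _index, not replicated
})

-- Step 2: on the master only, once every instance has built its copy.
box.space.users.index.by_email:alter({is_global = true})
-- On each replica this txn finds the matching local index and promotes
-- it instantly; where none is found, the index is built as usual.
```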
The downside is that the user has to visit each replica to create the replica-local indexes on the first step.
Pros: introduces a new feature - replica-local indexes, which can be used not only for replicaset-wide index building.
Cons: needs 2 steps, one of them to be done on each instance in the replicaset, including new instances, where this index won't appear automatically.
⭐️⭐️⭐️⭐️ Solution 4: lazy index
Consider another angle: a long index build blocks replication because the transaction can't be committed until the index is built. Then let's just allow committing it before the build is done.
Lazy index creation is when the index's entry is added to `_index` instantly, launching a background building process, which would run in a special fiber - a global one, or one per lazy index.

Such an index would be visible and droppable, but couldn't be used or altered until the building is complete. Any usage attempt would return an error.
When the building is complete, the index is usable like any other. If the building fails, the index would report that in its status.
On restart it would behave like a normal index, i.e. block `box.cfg{}` until the build is finished. Except that if the build has failed, `box.cfg{}` still finishes OK, and the index status reports it as broken.

The user can later drop the `lazy` flag from the index options to turn it into a regular index.

An example:
Pros: no user intervention is required, the replication is never blocked, and even unchanged user code keeps working, just the new way.
Cons: implementation can be tricky.
Proposal
Solution 4 (lazy index) looks the most promising. It solves the problem, requires minimal action from the user, and can even be considered a feature.
If there are no other suggestions and everybody agrees, I will then describe solution 4 in more detail.
Alternatives
Solution 5: index build on replicas doesn't block the applier fiber
The idea was to not block the replica's side on the index build: apply the `_index` transactions in separate fibers. This won't work, because the replication is still stuck. The limbo on the master would be blocked anyway - it would still be waiting until the applier commits the index build transaction.