[improve][client] PIP-393: Improve performance of Negative Acknowledgement #23600

thetumbled · 2024-11-14T08:58:23Z

Motivation

The current implementation of Negative Acknowledgement in Pulsar has several issues:

the memory occupation is high.
the code execution efficiency is low.
the redelivery time is not accurate.
multiple negative acks for messages in the same entry (batch) interfere with each other.
All of these problems are severe and need to be solved.

Memory occupation is high

After the improvement in #23582, we have reduced the memory occupation of NegativeAcksTracker by more than half by replacing HashMap with ConcurrentLongLongPairHashMap. With 1 million entries, the memory occupation decreases from 178MB to 64MB. With 10 million entries, it decreases from 1132MB to 512MB.
The average memory occupation per entry decreases from 1132MB/10000000=118 bytes to 512MB/10000000=53 bytes.

But it is not enough. Assuming that we negatively acknowledge 10,000 messages per second and assign a 1h redelivery delay to each message,
the memory occupation of NegativeAcksTracker will be 3600*10000*53/1024/1024/1024=1.77GB. If the delay is 5h,
the required memory is 3600*10000*53/1024/1024/1024*5=8.88GB, which grows too fast.
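
The estimate above can be reproduced with a few lines of Java. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, and the 53 bytes per entry is the average quoted above, not something the code measures.

```java
/** Back-of-the-envelope estimate of NegativeAcksTracker memory (hypothetical helper). */
final class NackMemoryEstimate {
    // Average bytes per tracked entry after #23582, taken from the figures above.
    static final long BYTES_PER_ENTRY = 53;

    /** Entries pending at steady state = nack rate * redelivery delay. */
    static long pendingEntries(long nacksPerSecond, long delaySeconds) {
        return nacksPerSecond * delaySeconds;
    }

    /** Estimated memory in GB, using 1024^3 bytes per GB as in the text. */
    static double estimateGB(long nacksPerSecond, long delaySeconds) {
        long bytes = pendingEntries(nacksPerSecond, delaySeconds) * BYTES_PER_ENTRY;
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        System.out.println("1h delay: " + estimateGB(10_000, 3_600) + " GB");
        System.out.println("5h delay: " + estimateGB(10_000, 5 * 3_600) + " GB");
    }
}
```

The estimate is linear in both the nack rate and the delay, which is why long delays dominate the footprint.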

Code execution efficiency is low

Currently, each time the timer task is triggered, it iterates over all the entries in NegativeAcksTracker.nackedMessages,
which is unnecessary. We can sort the entries by timestamp and only iterate over the entries that need to be redelivered.

Redelivery time is not accurate

Currently, the redelivery check is driven by timerIntervalNanos, which is 1/3 of the negativeAckRedeliveryDelay.
That means, if the negativeAckRedeliveryDelay is 1h, the check only runs every 20min, so the actual redelivery time can deviate from the requested one by up to 20min, which is unacceptable.
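
To make the deviation concrete, here is a small sketch under a simplified model. It assumes the timer ticks at fixed multiples of delay/3 starting from time 0, and that a message is redelivered at the first tick strictly after its deadline. The class and method names are hypothetical, and the real tracker's scheduling may differ in detail.

```java
/** Simplified model of a periodic nack timer (hypothetical, for illustration only). */
final class NackTimerDeviation {
    /** First tick strictly after the deadline, for ticks at 0, interval, 2*interval, ... */
    static long firstTickAfter(long deadlineMillis, long intervalMillis) {
        return (deadlineMillis / intervalMillis + 1) * intervalMillis;
    }

    /** How late a message nacked at nackTimeMillis (>= 0) is redelivered in this model. */
    static long latenessMillis(long nackTimeMillis, long delayMillis) {
        long interval = delayMillis / 3; // timer interval is 1/3 of the delay, as stated above
        long deadline = nackTimeMillis + delayMillis;
        return firstTickAfter(deadline, interval) - deadline;
    }

    public static void main(String[] args) {
        long oneHour = 60L * 60L * 1000L;
        // A message nacked just after a tick waits almost a full interval past its deadline.
        System.out.println("lateness (ms): " + latenessMillis(1, oneHour));
    }
}
```

In this model the lateness depends only on where the deadline falls between two ticks, so it ranges from almost zero up to one full interval.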

Multiple negative acks for messages in the same entry (batch) interfere with each other

Currently, NegativeAcksTracker#nackedMessages maps (ledgerId, entryId) to a timestamp, which means multiple nacks from messages in the same batch share a single timestamp.
If we ask for msg1 to be redelivered 10s later and then for msg2 to be redelivered 20s later, the two messages are delivered together 20s later. msg1 will not be redelivered 10s later, because the timestamp recorded in NegativeAcksTracker#nackedMessages is overwritten by the second nack call.

We can reproduce this problem with the test code below:

Consumer<byte[]> consumer = client.newConsumer()
        .topic("persistent://public/default/testNack")
        .subscriptionName("sub2")
        .subscriptionType(SubscriptionType.Shared)
        .negativeAckRedeliveryDelay(20, TimeUnit.SECONDS) // fixed delay of 20s.
        .subscribe();
// receive the first message and nack it.
Message<byte[]> msg = consumer.receive();
MessageIdAdv batchMessageId = (MessageIdAdv) msg.getMessageId();
int batchIndex = batchMessageId.getBatchIndex();
log.info("Message received, timestamp:{}, message id:{}, batch index:{}", getTime(), batchMessageId, batchIndex);
consumer.negativeAcknowledge(msg);

// receive the second message and sleep for 10s, then nack it.
msg = consumer.receive();
batchMessageId = (MessageIdAdv) msg.getMessageId();
batchIndex = batchMessageId.getBatchIndex();
log.info("Message received, timestamp:{}, message id:{}, batch index:{}", getTime(), batchMessageId, batchIndex);
Thread.sleep(10000);
consumer.negativeAcknowledge(msg);

We expect the second message to be redelivered 10s later than the first message, as it is nacked 10s later than the first one.
However, we receive the two messages together.

You can also reproduce this problem with the test code in this PR: org.apache.pulsar.client.impl.NegativeAcksTest#testNegativeAcksWithBatch
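
The overwrite can also be illustrated without a broker. The sketch below is not the tracker's code: it uses a plain HashMap keyed by (ledgerId, entryId) as a stand-in for nackedMessages, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for a tracker keyed by (ledgerId, entryId) only (hypothetical). */
final class EntryKeyedNackDemo {
    private final Map<List<Long>, Long> redeliverAt = new HashMap<>();

    void nack(long ledgerId, long entryId, int batchIndex, long nowMillis, long delayMillis) {
        // batchIndex is deliberately ignored: every message of the batch maps to the same key,
        // so a later nack replaces the timestamp recorded by an earlier one.
        redeliverAt.put(Arrays.asList(ledgerId, entryId), nowMillis + delayMillis);
    }

    /** Returns the recorded redelivery time, or null if the entry is not tracked. */
    Long redeliverAt(long ledgerId, long entryId) {
        return redeliverAt.get(Arrays.asList(ledgerId, entryId));
    }

    public static void main(String[] args) {
        EntryKeyedNackDemo tracker = new EntryKeyedNackDemo();
        tracker.nack(1, 5, 0, 0, 20_000);      // first message of the batch, nacked at t=0s
        tracker.nack(1, 5, 1, 10_000, 20_000); // second message of the same batch, nacked at t=10s
        System.out.println("single shared timestamp: " + tracker.redeliverAt(1, 5));
    }
}
```

With a fixed 20s delay, the first nack alone would be due at t=20s, but after the second nack only one timestamp remains for the whole entry.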

Modifications

Refactor the NegativeAcksTracker to solve the above problems.

Verifying this change

Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end deployment with large payloads (10MB)
Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

Documentation

doc
doc-required
doc-not-needed
doc-complete

Matching PR in forked repository

PR in forked repository: thetumbled#64

thetumbled · 2024-11-14T09:33:47Z

I will implement a space-efficient map structure for ConcurrentTripleLong2LongHashMap later; for now it uses a HashMap to ensure logical correctness.

lhotari · 2024-11-14T09:37:49Z

@thetumbled Fastutil could be a better source of space efficient map data structures. I believe that there's a templating solution where it's possible to generate code for efficient implementations. In this case, is there a need for the data structure to be concurrent? Following a single writer principle could result in simpler and more performant designs. One way to address message passing from other threads to a single writer thread is to use message passing queues from JCTools which we already use in Pulsar. Just some food for thought.
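
A minimal sketch of the single writer idea, under stated assumptions: java.util.concurrent.ConcurrentLinkedQueue stands in for a JCTools message passing queue, a plain HashMap stands in for the tracker state, and all class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Single-writer sketch: many threads submit, one thread owns the state (hypothetical). */
final class SingleWriterNackSketch {
    static final class Nack {
        final long ledgerId;
        final long entryId;
        final long redeliverAtMillis;

        Nack(long ledgerId, long entryId, long redeliverAtMillis) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
            this.redeliverAtMillis = redeliverAtMillis;
        }
    }

    // Stand-in for a JCTools queue; any thread may offer.
    private final Queue<Nack> inbox = new ConcurrentLinkedQueue<>();
    // Plain, non-concurrent map: only the writer thread touches it.
    private final Map<String, Long> state = new HashMap<>();

    void submit(long ledgerId, long entryId, long redeliverAtMillis) {
        inbox.offer(new Nack(ledgerId, entryId, redeliverAtMillis));
    }

    /** Must be called from the single writer thread only. */
    void drain() {
        for (Nack n = inbox.poll(); n != null; n = inbox.poll()) {
            state.put(n.ledgerId + ":" + n.entryId, n.redeliverAtMillis);
        }
    }

    int size() {
        return state.size();
    }

    /** Starts several producer threads, waits for them, then drains on the calling thread. */
    static int demo(int producers, int nacksPerProducer) {
        SingleWriterNackSketch tracker = new SingleWriterNackSketch();
        Thread[] threads = new Thread[producers];
        for (int p = 0; p < producers; p++) {
            final long ledgerId = p;
            threads[p] = new Thread(() -> {
                for (long entryId = 0; entryId < nacksPerProducer; entryId++) {
                    tracker.submit(ledgerId, entryId, 0L);
                }
            });
            threads[p].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        tracker.drain();
        return tracker.size();
    }
}
```

The point of the design is that only the queue needs to be thread-safe; the state itself can be any compact, non-concurrent collection.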

...rc/main/java/org/apache/pulsar/common/util/collections/ConcurrentTripleLong2LongHashMap.java

thetumbled · 2024-11-14T09:47:28Z

If any solution from Fastutil proves to be more space efficient, I am glad to adopt it. The nack tracker can consume an enormous amount of memory, so we have to choose the best one.

lhotari · 2024-11-14T10:06:53Z

The main purpose of Fastutil is performance and space efficiency.
Instead of adding very complex keys such as triple keys, there's a chance to use a hierarchical data structure. That's the approach I took in the PendingAcksMap class:

pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/PendingAcksMap.java

Line 101 in 3d0625b

    
private final Long2ObjectSortedMap<Long2ObjectSortedMap<IntIntPair>> pendingAcks;

There's a huge benefit when the keys are primitives and you can only achieve that with a hierarchical data structure.
In the long run, I'd rather get rid of our custom collection implementations instead of adding more of them.
It's not trivial to create bug free collections. Using existing libraries for that purpose is strongly preferred.

@thetumbled Have you considered a hierarchical data structure? (such as map of maps)

thetumbled · 2024-11-14T10:09:58Z

It is a good point, I will test it.

lhotari · 2024-11-14T10:19:42Z

@thetumbled In certain cases when tracking existence (true/false), it's worth considering space-efficient bitmaps. In Pulsar, we use the RoaringBitmap library.
I think that it should be used for storing nacks.

lhotari · 2024-11-14T10:24:19Z

On second thought, I guess RoaringBitmap is not applicable for storing nacks in this case.

@thetumbled I checked the NegativeAcksTracker class and it seems that the actual key is (ledgerId, entryId).
The partitionIndex and timestamp are part of the value.
partitionIndex doesn't have to be a long value.

It's easy to implement (ledgerId, entryId) as map of maps.
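
A minimal sketch of such a map of maps, assuming java.util.HashMap as a stand-in for primitive-keyed fastutil maps (which would avoid boxing the long keys). The class name and the Value holder are hypothetical; the value carries an int partitionIndex and the timestamp, following the observation above.

```java
import java.util.HashMap;
import java.util.Map;

/** (ledgerId -> (entryId -> value)) sketch; HashMap stands in for fastutil maps. */
final class NestedNackMap {
    static final class Value {
        final int partitionIndex; // does not need to be a long
        final long timestamp;

        Value(int partitionIndex, long timestamp) {
            this.partitionIndex = partitionIndex;
            this.timestamp = timestamp;
        }
    }

    private final Map<Long, Map<Long, Value>> byLedger = new HashMap<>();

    void put(long ledgerId, long entryId, int partitionIndex, long timestamp) {
        byLedger.computeIfAbsent(ledgerId, k -> new HashMap<>())
                .put(entryId, new Value(partitionIndex, timestamp));
    }

    /** Returns the stored value, or null if (ledgerId, entryId) is not tracked. */
    Value get(long ledgerId, long entryId) {
        Map<Long, Value> entries = byLedger.get(ledgerId);
        return entries == null ? null : entries.get(entryId);
    }

    boolean remove(long ledgerId, long entryId) {
        Map<Long, Value> entries = byLedger.get(ledgerId);
        if (entries == null || entries.remove(entryId) == null) {
            return false;
        }
        if (entries.isEmpty()) {
            byLedger.remove(ledgerId); // drop empty inner maps so ledgers do not accumulate
        }
        return true;
    }

    int ledgerCount() {
        return byLedger.size();
    }
}
```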

lhotari · 2024-11-14T10:27:16Z

This is a very poor solution in the current implementation:

pulsar/pulsar-client/src/main/java/org/apache/pulsar/client/impl/NegativeAcksTracker.java

Lines 80 to 88 in 22cfa54

    
nackedMessages.forEach((ledgerId, entryId, partitionIndex, timestamp) -> {
    if (timestamp < now) {
        MessageId msgId = new MessageIdImpl(ledgerId, entryId,
                // need to covert non-partitioned topic partition index to -1
                (int) (partitionIndex == NON_PARTITIONED_TOPIC_PARTITION_INDEX ? -1 : partitionIndex));
        addChunkedMessageIdsAndRemoveFromSequenceMap(msgId, messagesToRedeliver, this.consumer);
        messagesToRedeliver.add(msgId);
    }
});

There should be a separate data structure (a list or queue) that contains the entries in timestamp order. The benefit is that iteration can stop as soon as the timestamp condition no longer holds.
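
A minimal sketch of that idea, assuming java.util.PriorityQueue as the timestamp-ordered structure. The class and its Entry holder are hypothetical and only illustrate the early stop; they are not the tracker's code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Timestamp-ordered nack queue with early stop (hypothetical sketch). */
final class TimeOrderedNacks {
    static final class Entry {
        final long ledgerId;
        final long entryId;
        final long timestamp;

        Entry(long ledgerId, long entryId, long timestamp) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
            this.timestamp = timestamp;
        }
    }

    // Min-heap on timestamp: the entry that is due first is always at the head.
    private final PriorityQueue<Entry> queue =
            new PriorityQueue<>(Comparator.comparingLong((Entry e) -> e.timestamp));

    void add(long ledgerId, long entryId, long timestamp) {
        queue.add(new Entry(ledgerId, entryId, timestamp));
    }

    /** Removes and returns the entries with timestamp < now, oldest first. */
    List<Entry> pollExpired(long now) {
        List<Entry> due = new ArrayList<>();
        // Same condition as the forEach above, but the loop ends at the first entry that is not due.
        while (!queue.isEmpty() && queue.peek().timestamp < now) {
            due.add(queue.poll());
        }
        return due;
    }

    int pending() {
        return queue.size();
    }
}
```

Each timer run then costs time proportional to the number of due entries (times a logarithmic heap factor) rather than to the total number of tracked entries.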

lhotari · 2024-11-14T10:30:33Z

@thetumbled It looks like there's no need for a map data structure in the first place. That's completely unnecessary for implementing NegativeAcksTracker

thetumbled · 2024-11-14T12:11:12Z

You are right. We need to improve the code execution efficiency too, with some kind of structure sorted by timestamp.

lhotari · 2024-11-14T17:25:50Z

Fastutil contains multiple PriorityQueue implementations: https://fastutil.di.unimi.it/docs/it/unimi/dsi/fastutil/PriorityQueue.html.
For example, this would work: https://fastutil.di.unimi.it/docs/it/unimi/dsi/fastutil/objects/ObjectArrayPriorityQueue.html
or this one: https://fastutil.di.unimi.it/docs/it/unimi/dsi/fastutil/objects/ObjectHeapPriorityQueue.html

thetumbled · 2024-11-15T08:02:00Z

I propose a PIP to fix several issues with the nack tracker, using a new data structure:

Long2ObjectSortedMap<Long2ObjectMap<Roaring64Bitmap>> nackedMessages = new Long2ObjectAVLTreeMap<>();

This PR becomes the implementation PR for PIP-393: #23601.
I will implement PIP-393 soon.
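
To show how such a nested structure could be used, here is a sketch with JDK collections only. It assumes the outer sorted key is the redelivery timestamp, the inner key is the ledgerId, and the innermost set holds entryIds; the comment above does not spell out this key layout, so treat it as an assumption. TreeMap and TreeSet stand in for Long2ObjectAVLTreeMap, Long2ObjectMap and Roaring64Bitmap, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

/** timestamp -> ledgerId -> entryIds, with JDK stand-ins for fastutil and RoaringBitmap. */
final class TimestampBucketedNacks {
    private final TreeMap<Long, TreeMap<Long, TreeSet<Long>>> buckets = new TreeMap<>();

    void add(long redeliverAtMillis, long ledgerId, long entryId) {
        buckets.computeIfAbsent(redeliverAtMillis, t -> new TreeMap<>())
                .computeIfAbsent(ledgerId, l -> new TreeSet<>())
                .add(entryId);
    }

    /** Removes all buckets with timestamp < now and returns their content as "ledgerId:entryId". */
    List<String> pollDue(long now) {
        List<String> due = new ArrayList<>();
        // headMap is a live view of the keys strictly below 'now'; later buckets are never visited.
        SortedMap<Long, TreeMap<Long, TreeSet<Long>>> expired = buckets.headMap(now);
        for (TreeMap<Long, TreeSet<Long>> byLedger : expired.values()) {
            byLedger.forEach((ledgerId, entryIds) ->
                    entryIds.forEach(entryId -> due.add(ledgerId + ":" + entryId)));
        }
        expired.clear(); // clearing the view removes those buckets from the backing map
        return due;
    }

    int bucketCount() {
        return buckets.size();
    }
}
```

Grouping by timestamp means nacks that share a redelivery time share one outer key, so the per-entry cost is mostly the innermost set, which a compressed bitmap could keep small.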

thetumbled added 3 commits November 14, 2024 14:09

add test code.

fa42e8f

test code.

e7f9ccb

fix bug.

b06cba6

github-actions bot added the doc-not-needed label ("Your PR changes do not impact docs") Nov 14, 2024

thetumbled mentioned this pull request Nov 14, 2024

[fix][client] fix multiple nack from messages in the same batch interfere each other. thetumbled/pulsar#64

Open

fix checkstyle.

f711cf1

thetumbled requested review from BewareMyPower, poorbarcode, dao-jun, lhotari, nodece, codelipenghui and Technoboy- November 14, 2024 09:31

lhotari reviewed Nov 14, 2024

...rc/main/java/org/apache/pulsar/common/util/collections/ConcurrentTripleLong2LongHashMap.java (outdated)

lhotari reviewed Nov 14, 2024

...rc/main/java/org/apache/pulsar/common/util/collections/ConcurrentTripleLong2LongHashMap.java (outdated)

thetumbled changed the title ~~[fix][client] fix multiple nack from messages in the same batch interfere each other.~~ [improve][client] PIP-393: Improve performance of Negative Acknowledgement Nov 15, 2024

thetumbled mentioned this pull request Nov 15, 2024

[improve][client] PIP-393: Improve performance of Negative Acknowledgement #23601

Open

15 tasks

Merge remote-tracking branch 'apache/master' into Fix_NackInBatch

29fc52c

thetumbled added 3 commits November 15, 2024 18:29

part one.

f04fc84

support precise time control.

bd8a230

add code.

3fe4e3c
