Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

some komodo_...() functions needs to be wrapped by cs_main lock #482

Closed
dimxy opened this issue Aug 20, 2021 · 12 comments
Closed

some komodo_...() functions needs to be wrapped by cs_main lock #482

dimxy opened this issue Aug 20, 2021 · 12 comments

Comments

@dimxy
Copy link
Collaborator

dimxy commented Aug 20, 2021

Several functions komodo_chainactive komodo_heightstamp komodo_chainactive_timestamp access the chainActive object that needs to be protected by cs_main critical section.
It looks like calls to these functions are not surrounded with LOC(cs_main) when required. At least the nspv server part implementation may call those functions without cs_main lock.
That needs to be added.
Also it is suggested to add AssertLockHeld(cs_main) into those functions to track locks (this is activated by macro DEBUG_LOCKORDER)

@jmjatlanta
Copy link

An option to consider: move the functionality of those methods to within the CChain class and do the locking there. Sometimes this can centralize the mutex to only be needed within the class, easing maintenance, and removing the need for the assertion.

@dimxy
Copy link
Collaborator Author

dimxy commented Aug 20, 2021

I thought about that, but checking the code I think it is not as simple as that: sometimes several different calls need to be wrapped by the same lock object, so it is only the caller who knows which code to lock.
At least I could see NSPV part does not make locks at all and probably may get corrupted, so as I am currently working on NSPV I am going to add missed locks

@jmjatlanta
Copy link

jmjatlanta commented Aug 27, 2021

With little research, I can (and will) do a PR to add the necessary LOCK(cs_main) to the methods mentioned. I feel we're playing a bit of "Whack-A-Mole" here.

What I've found:

  • cs_main appears in the code 396 times. Some of those times are in comments, but only a few.
  • LOCK(cs_main (locking it or locking it with something else) appears 156 times
  • With such a quantity of uses, it is difficult for programmers to know what cs_main is meant to protect.
  • Some methods have static analysis instrumentation included for use with clang that may help avoid problems, others do not.
  • Some values that need protection via cs_main are directly accessed, making static analysis instrumentation cumbersome or impossible.

Synchronization is difficult, and often implemented wrong. There are ways to ease the burden, but such strategies require understanding and compliance by all who implement and use it.

A quick search led me to this example. Granted, it seems to only be used by Verus, but it will illustrate what I believe to be an issue...

CBlockIndex *get_chainactive(int32_t height)

The method CBlockIndex *get_chainactive(int32_t height) locks cs_main, but not early enough. chainActive.LastTip() calls std::vector.size() which is not thread-safe. This is undefined behavior. I understand that on most processors (and all the ones we deal with) reading an integer is an atomic operation. But WORD alignment, cache questions, memory fencing, and the simple fact that this is considered undefined behavior means we should correct it.

Moving the lock to the beginning of the method is one option. But doing this everywhere would probably lead to the application performing worse than if it were designed to be single-threaded.

Another possible solution is for chainActive to provide a method to retrieve the requested CBlockIndex and use an internal mutex to protect any operation on the vector vChain (vChain is already private, so no external accessors). But that often requires two sets of methods. One to do a simple "get me the CBlockIndex" and another for "get me the CBlockIndex and don't let anyone else do anything with the vector until I'm done". Recursive mutexes could be used instead, but that comes with a performance hit.

So the short answer is to fix these things when we find them, I understand that. The long answer is to rework the purpose of cs_main (i.e. make it responsible for less) before performance takes a hit, or programmers who don't know better (myself included) write new code that causes this GitHub issue to repeat ad infintum.

@jmjatlanta
Copy link

jmjatlanta commented Aug 27, 2021

I made a quick template to brute-force check for other non-lock situations at runtime using AssertLockHeld. You can see it at 7d7eabb

Just swap out the CChain in main.cpp and main.h as was done in that commit. Running komodod gives me

./komodod.sh
chainparams_commandline called
call komodo_args.(./src/komodod) NOTARY_PUBKEY.()
initialized  at 1630086509
nMaxConnections 384
.........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................Assertion failed: lock mutex not held in chain.h:686; locks held:

And for those curious, here is the backtrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ff0b94e1859 in __GI_abort () at abort.c:79
#2  0x000055ac749eeaf7 in AssertLockHeldInternal (pszName=pszName@entry=0x55ac752524ba "mutex", pszFile=pszFile@entry=0x55ac7515b30a "chain.h", nLine=nLine@entry=686, cs=0x55ac7556c3a0 <cs_main>)
    at sync.cpp:188
#3  0x000055ac74572b25 in MultithreadedCChain<AnnotatedMixin<boost::recursive_mutex> >::SetTip (pindex=0x55ace5089ca0, this=0x55ac7556c320 <chainActive>) at chain.h:686
#4  LoadBlockIndexDB () at main.cpp:6267
#5  0x000055ac74573072 in LoadBlockIndex () at main.cpp:6579
#6  0x000055ac744e5501 in AppInit2 (threadGroup=..., scheduler=...) at init.cpp:1716
#7  0x000055ac74474656 in AppInit (argc=<optimized out>, argv=<optimized out>) at bitcoind.cpp:235
#8  0x000055ac74474f02 in main (argc=5, argv=0x7ffe8fcbd858) at bitcoind.cpp:263

@dimxy
Copy link
Collaborator Author

dimxy commented Aug 27, 2021

I agree with the most of above and about get_chainactive in specific.
I also wondered why chainActive access in LoadBlockIndexDB aren't protected, probably there is no multithreading yet there.
In the tokel repo I enabled AssertLockHeld in komodo_xxx functions accessing chainActive (that should be protected with cs_main as it is commented at the declaration) so I added some LOCK(cs_main) in komodo_nSPV_fullnode.h for nspv server calls.
Also I added LOCK(cs_main) in komodo_nextheight() as it fails immediately (could you please take a look):

int32_t komodo_nextheight()
{
    CBlockIndex* pindex;
    int32_t ht;
    {
        LOCK(cs_main); // dimxy added to protect chainActive concurrent use. 
        pindex = chainActive.LastTip();
    }
    if (pindex != nullptr && (ht = pindex->GetHeight()) > 0)
        return (ht + 1);
    else
        return (komodo_longestchain() + 1);
}

I will add your fix for get_chainactive as well.
But it seems there are many other cases with missed locks...

@jmjatlanta
Copy link

I added the protections in commit jmjatlanta@186661c

There are a few more TODO items for PR #485 which I hope to resolve in short order.

I agree there are surely many other areas where locks are missed. We'll create issues/PRs when we find them.

@dimxy
Copy link
Collaborator Author

dimxy commented Aug 29, 2021

The long answer is to rework the purpose of cs_main (i.e. make it responsible for less) before performance takes a hit

probably we could think of using the shared lock on reading and exclusive lock on writing. Wondering why this was not done in the original code

@jmjatlanta
Copy link

Wondering why this was not done in the original code

If we could move to C++17 (preferred IMO, just not sure clients would agree), read/write locks are part of the standard. If not, such things have been available in boost since 1.35.

As I am sure you are aware, such mechanisms carry overhead. But for lock contention issues (which can often be the case for frequent reads and occasional writes), there is sometimes a performance gain. I would imagine we do a lot more reading than writing to things like the block index, so I'd say your idea is a good one at least in some areas.

With so many references to cs_main in the code, I suspect that locking mechanism is either

  1. Attempting to protect too much, in which case we need more granular control (more mutexes). or
  2. Locking is done too high up in the software stack, and should be moved further down. or
  3. A combination of the two points above. <- my guess or
  4. What we have now is the best way to do it, and we need to document it with reasons why. <- possible, but not probable

@TheComputerGenie
Copy link

Wondering why this was not done in the original code

afaik, locks have only ever historically been used "sparingly" and added as necessity dictated rather than as a matter of style and practice

@dimxy
Copy link
Collaborator Author

dimxy commented Sep 1, 2021

yes, however komodo introduced a NSPV protocol which implies many client connections to nspv nodes doing multiple read requests (in comparison to ordinary nodes) so we should think of adjusting the locking strategy

@TheComputerGenie
Copy link

While I do think it's something to look at, my main point was to answer the "question" of "why".
That being said, I do urge caution so as to not over-do it and add locks in places that may work in situationX but cause situationY to fail or crash. The hardest part of changing calls (in komodo more specifically than any other daemon I can think of) is the complexity of situations that arise from it having no less than 5 different use function models: LabsNN, NN, "normal client", some obscure params choices, etc. (some of which get over looked in using/testing it for one or two specific things).

who-biz pushed a commit to who-biz/komodo that referenced this issue Mar 14, 2023
@dimxy
Copy link
Collaborator Author

dimxy commented Mar 29, 2023

many missing locks were added in #559

@dimxy dimxy closed this as completed Mar 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants