Populate Soroban in-memory state in a background thread #5035
base: master
Conversation
I think it's worth prototyping this, especially if it cleans up some of the weirdness in
This is definitely worth doing. I don't love the implicit state tied into
marta-lokhova left a comment:
Thanks for taking a stab at this! I agree with most concerns listed in the PR description. Added some suggestions on how to address these.
```cpp
// other methods accessing the stream while populating in memory soroban
// state
XDRInputFileStream stream;
stream.open(mBucket->getFilename().string());
```
This seems suspicious: a race accessing the stream suggests we're using the same stream for eviction scanning and state population, which doesn't sound right.
Eviction scan also opens its own stream:

stellar-core/src/bucket/BucketSnapshot.cpp, lines 282 to 285 in 9ed2f84:
```cpp
// Open new stream for eviction scan to not interfere with BucketListDB load
// streams
XDRInputFileStream stream{};
stream.open(mBucket->getFilename());
```
I believe this races with getEntryAtOffset.
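For illustration, here is a minimal sketch of the per-operation stream pattern being discussed, assuming the stellar-core types shown in the snippets above (XDRInputFileStream, Bucket, BucketEntry); the populateFromBucket helper and its visitor are hypothetical, not the PR's code:

```cpp
#include <functional>
#include <memory>

// Hypothetical helper: the populate pass opens its own dedicated stream, so
// eviction scans, BucketListDB load streams, and state population never
// share a stream (or its file offset) across threads.
void
populateFromBucket(std::shared_ptr<Bucket const> bucket,
                   std::function<void(BucketEntry const&)> const& visit)
{
    XDRInputFileStream stream; // private to this pass
    stream.open(bucket->getFilename().string());

    BucketEntry entry;
    while (stream.readOne(entry)) // sequential scan, no shared offset
    {
        visit(entry);
    }
}
```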
```diff
 // more and let the node gracefully go into catchup.
 releaseAssert(mLastQueuedToApply >= lcl);
-if (nextToApply - lcl >= MAX_EXTERNALIZE_LEDGER_APPLY_DRIFT)
+if (nextToApply - lcl >= MAX_EXTERNALIZE_LEDGER_APPLY_DRIFT && false)
```
&& false disables the condition, so this should be removed.
Yeah, this was a temporary hack so that we don't fall into long form catchup after populating in-memory state.
src/catchup/ApplyLedgerWork.cpp (outdated):
```cpp
ZoneScoped;
if (mApp.getLedgerManager().isBooting())
{
    mApp.postOnBackgroundThread(
```
By the time we get to applyLedger, we should never be in rebuilding state. Applying ledger enforces that the state is complete and valid. Let's implement the wait in specific places, where we anticipate core to be in rebuilding state. Specifically, on startup (loadLastKnownLedger) and when catchup is done (setLastClosedLedger).
src/ledger/LedgerManagerImpl.h (outdated):
```cpp
Config const& config);

// ...

State mState;
bool mIsBooting{false};
```
Please introduce a new state into the State enum instead of this bool.
Probably good to define/enforce valid state machine transitions as well: booting -> sync, sync <-> catchup, boot -> catchup, boot -> rebuild, catchup -> rebuild, sync <-> rebuild.
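As a sketch only (the existing LM_BOOTING_STATE / LM_SYNCED_STATE / LM_CATCHING_UP_STATE names mirror LedgerManager's current enum, while LM_REBUILDING_STATE and isValidTransition are assumptions), the transitions listed above could be encoded in a check that setState enforces:

```cpp
enum class State
{
    LM_BOOTING_STATE,
    LM_SYNCED_STATE,
    LM_CATCHING_UP_STATE,
    LM_REBUILDING_STATE // hypothetical new state for this PR
};

// Encodes exactly the arrows above: boot -> {sync, catchup, rebuild},
// sync <-> catchup, catchup -> rebuild, sync <-> rebuild.
bool
isValidTransition(State from, State to)
{
    switch (from)
    {
    case State::LM_BOOTING_STATE:
        // Booting may go anywhere except back to booting.
        return to != State::LM_BOOTING_STATE;
    case State::LM_SYNCED_STATE:
        return to == State::LM_CATCHING_UP_STATE ||
               to == State::LM_REBUILDING_STATE;
    case State::LM_CATCHING_UP_STATE:
        return to == State::LM_SYNCED_STATE ||
               to == State::LM_REBUILDING_STATE;
    case State::LM_REBUILDING_STATE:
        // Only sync <-> rebuild is bidirectional in the list above.
        return to == State::LM_SYNCED_STATE;
    }
    return false;
}
```

A setState wrapper could then releaseAssert(isValidTransition(mState, newState)) so invalid transitions fail loudly.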
src/ledger/LedgerManagerImpl.h (outdated):
```cpp
// ...

State mState;
bool mIsBooting{false};
std::condition_variable mBootingCV;
```
A mutex and cv aren't needed here: LM posts the rebuild task to a background thread, and the background thread then posts a callback back to main. The callback can set LM state back to either "synced" or "catching up". Then LedgerApplyManager can apply all buffered ledgers whenever a new buffered ledger comes in.
Yeah, these are only here for the waitForBoot that ApplyLedgerWork was using, but I'm rethinking that path anyway.
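A minimal sketch of the lock-free flow described above, assuming Application's postOnBackgroundThread/postOnMainThread and a hypothetical rebuildInMemorySorobanState plus LM_REBUILDING_STATE / LM_SYNCED_STATE values; completion is signaled by re-posting to the main thread rather than by a condition variable:

```cpp
void
LedgerManagerImpl::startInMemoryStateRebuild()
{
    setState(State::LM_REBUILDING_STATE);
    mApp.postOnBackgroundThread(
        [this] {
            // Heavy work happens off the main thread.
            rebuildInMemorySorobanState();
            mApp.postOnMainThread(
                [this] {
                    // Back on main: flip state so LedgerApplyManager can
                    // apply buffered ledgers on its next trigger.
                    setState(State::LM_SYNCED_STATE);
                },
                "in-memory Soroban state rebuild done");
        },
        "rebuild in-memory Soroban state");
}
```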
```diff
-assertSetupPhase();
+// We don't use assertSetupPhase() because we don't expect the thread
+// invariant to hold
+releaseAssert(mPhase == Phase::SETTING_UP_STATE);
```
If this is the case, assertSetupPhase should be changed to support the new feature.
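One possible shape for that change (hypothetical, not the PR's code; it assumes the real assertSetupPhase checks both the phase and threadIsMain()): make the thread check conditional so background callers can still validate the phase.

```cpp
// Hypothetical variant: always enforce the phase invariant, but only
// enforce the main-thread invariant when the caller expects it to hold.
void
assertSetupPhase(bool enforceThreadInvariant) const
{
    releaseAssert(mPhase == Phase::SETTING_UP_STATE);
    if (enforceThreadInvariant)
    {
        releaseAssert(threadIsMain());
    }
}
```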
src/ledger/LedgerManagerImpl.cpp (outdated):
```cpp
mBootingLock.unlock();
mApp.postOnBackgroundThread(
    [=] {
        mApplyState.populateInMemorySorobanState(snapshot,
```
We should ensure that mApplyState is never touched while LM is in rebuilding state. I think markEndOfSetupPhase already does that, but wanted to confirm.
```cpp
if (mApp.getLedgerManager().isBooting())
{
    mTMP_TODO = true;
    return ProcessLedgerResult::WAIT_TO_APPLY_BUFFERED_OR_CATCHUP;
```
New state in ProcessLedgerResult is needed: WAIT_FOR_STATE_REBUILD or something like that.
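For concreteness, the extended enum might look like this (WAIT_TO_APPLY_BUFFERED_OR_CATCHUP appears in the diff above; the first value's name is an assumption):

```cpp
enum class ProcessLedgerResult
{
    PROCESSED_ALL_LEDGERS_SEQUENTIALLY,
    WAIT_TO_APPLY_BUFFERED_OR_CATCHUP,
    WAIT_FOR_STATE_REBUILD // new: LM is rebuilding in-memory state
};
```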
```cpp
mSyncingLedgers.emplace(lastReceivedLedgerSeq, ledgerData);
mLargestLedgerSeqHeard =
    std::max(mLargestLedgerSeqHeard, lastReceivedLedgerSeq);
```
mTMP_TODO is not needed: you can anchor the replay condition on LM's state. Specifically, we should never ever attempt ledger replay when LM is rebuilding. Could we add an early exit here and log something like "not attempting application: LM is rebuilding"?
Yeah, this was so we could hit the fast catchup path in the case where we're populating in-memory state. It's named poorly (because it was pretty hacky), but it measures whether we were booting in the last call to this method. This way, the first time we are able to apply ledgers, we try to: right now that path (tryApplySyncingLedgers) is only hit when we receive the ledger that's next in line (to handle holes in reception). In this version, I didn't add the early exit just to keep the same mSyncingLedgers log message.
Oh I see - yeah, let's clean this up. I think the simplest would be to make tryApplySyncingLedgers a no-op when there are gaps and always call it anyway (it might already be doing that).
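Putting the two suggestions together, a hedged sketch of the early exit in processLedger (LedgerManager::getState() exists today, while LM_REBUILDING_STATE and WAIT_FOR_STATE_REBUILD are the hypothetical additions discussed above):

```cpp
// Early exit at the top of processLedger: never buffer-replay or kick off
// catchup while LM is rebuilding in-memory Soroban state.
if (mApp.getLedgerManager().getState() ==
    LedgerManager::State::LM_REBUILDING_STATE)
{
    CLOG_INFO(Ledger, "not attempting application: LM is rebuilding");
    return ProcessLedgerResult::WAIT_FOR_STATE_REBUILD;
}
```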
src/ledger/LedgerManagerImpl.cpp (outdated):
```cpp
mBootingLock.lock();
mIsBooting = true;
mBootingLock.unlock();
mApp.postOnBackgroundThread(
```
I agree that using a low-priority thread isn't right here. How about ledgerApplyThread? We should never be applying during a rebuild anyway.
Still thinking on the state machine. In particular,
Force-pushed e839487 to cfa8c6d.
```diff
 // to apply mSyncingLedgers and possibly get back in sync
-if (!mCatchupWork && lastReceivedLedgerSeq == *mLastQueuedToApply + 1)
+if (!mCatchupWork &&
+    mSyncingLedgers.begin()->first == *mLastQueuedToApply + 1 &&
```
I don't think this condition is quite right, although the previous condition also doesn't work if we want to apply the LedgerApplyManagerImpl ledgers once we've gone from BOOTING to BOOTED.
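One way to make this robust, per the earlier suggestion to make tryApplySyncingLedgers gap-tolerant (a sketch only; the member names are taken from the snippets above, and the body's structure is assumed):

```cpp
// Sketch: callable unconditionally; it no-ops when catchup is running or
// when there is a gap between the last ledger queued to apply and the
// earliest buffered ledger.
void
LedgerApplyManagerImpl::tryApplySyncingLedgers()
{
    if (mCatchupWork || mSyncingLedgers.empty() ||
        mSyncingLedgers.begin()->first != *mLastQueuedToApply + 1)
    {
        // Nothing contiguous to apply yet: wait for the missing ledger
        // (or for catchup to finish).
        return;
    }
    // ... queue buffered ledgers for application in sequence ...
}
```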
Putting some notes on the various states. Fundamentally, while we're rebuilding/populating in-memory Soroban state, we don't want to apply. There are some considerations around what to do when certain transitions happen while we are still populating the in-memory state (e.g., what should we do if we're catching up at the time). The two previous commits take two different approaches to this. I think both are somewhat incomplete, although it is perhaps also worth considering whether this state should be "internal".

Prior to this PR we have three states: booting, catching up, and in sync. This PR adds a new one. The commit one before the current revision allows catchup to happen while we're still doing the initial rebuild, which leads to some additional complexity.
Force-pushed cfa8c6d to e7fc15c.
Notes on most recent commit:
Force-pushed 2adc5e6 to 4ace1fd.
Resolves #4902.