Conversation

@drebelsky (Contributor) commented Nov 25, 2025

Resolves #4902.

drebelsky changed the title from "Populate Soroban in-memory state in a background thread" to "DRAFT: Populate Soroban in-memory state in a background thread" on Nov 25, 2025
@bboston7 (Contributor)

> Add a form of catchup that works for this use case

I think it's worth prototyping this, especially if it cleans up some of the weirdness in LedgerApplyManager.

> Maybe add another explicit state to LedgerManager(Impl)

This is definitely worth doing. I don't love the implicit state tied into mBooting when we already have a State enum.

@marta-lokhova (Contributor) left a comment

Thanks for taking a stab at this! I agree with most concerns listed in the PR description. Added some suggestions on how to address these.

// other methods accessing the stream while populating in memory soroban
// state
XDRInputFileStream stream;
stream.open(mBucket->getFilename().string());
Contributor

this seems suspicious: race accessing the stream suggests we're using the same stream for eviction scanning and state population, which doesn't sound right.

Contributor Author

Eviction scan also opens its own stream:

// Open new stream for eviction scan to not interfere with BucketListDB load
// streams
XDRInputFileStream stream{};
stream.open(mBucket->getFilename());

I believe this races with getEntryAtOffset.
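
For illustration, a minimal standalone sketch of the per-reader-stream idea, with std::ifstream standing in for XDRInputFileStream and BucketReader/readEntryAtOffset as made-up names rather than stellar-core's actual API: a single shared stream has one file position, so interleaved seekg/read calls from two threads race, while a short-lived local stream per call cannot.

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Illustrative stand-in for a bucket file reader. A single shared stream has
// one file position, so two threads interleaving seekg()/read() on it (e.g.
// eviction scan vs. getEntryAtOffset) can end up reading the wrong offset.
class BucketReader
{
    std::string mFilename;

  public:
    explicit BucketReader(std::string filename)
        : mFilename(std::move(filename))
    {
    }

    // Each call opens its own stream, so concurrent readers (eviction scan,
    // Soroban state population, BucketListDB point loads) never share a file
    // position and cannot race on it.
    std::vector<char>
    readEntryAtOffset(std::streamoff offset, std::size_t size) const
    {
        std::ifstream in(mFilename, std::ios::binary);
        in.seekg(offset);
        std::vector<char> buf(size);
        in.read(buf.data(), static_cast<std::streamsize>(size));
        return buf;
    }
};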

// more and let the node gracefully go into catchup.
releaseAssert(mLastQueuedToApply >= lcl);
if (nextToApply - lcl >= MAX_EXTERNALIZE_LEDGER_APPLY_DRIFT)
if (nextToApply - lcl >= MAX_EXTERNALIZE_LEDGER_APPLY_DRIFT && false)
Contributor

&& false disables the condition, so this should be removed.

Contributor Author

Yeah, this was a temporary hack so that we don't fall into long form catchup after populating in-memory state.

ZoneScoped;
if (mApp.getLedgerManager().isBooting())
{
mApp.postOnBackgroundThread(
Contributor

By the time we get to applyLedger, we should never be in rebuilding state. Applying a ledger enforces that the state is complete and valid. Let's implement the wait in the specific places where we anticipate core to be in the rebuilding state: specifically, on startup (loadLastKnownLedger) and when catchup is done (setLastClosedLedger).

Config const& config);

State mState;
bool mIsBooting{false};
Contributor

Please introduce a new state to the State enum instead of this bool.

@marta-lokhova (Contributor) commented Dec 1, 2025

Probably good to define/enforce valid state machine transitions as well: booting -> sync, sync <-> catchup, boot -> catchup, boot -> rebuild, catchup -> rebuild, sync <-> rebuild.
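
As a standalone sketch of what an explicit state plus transition check could look like (state names here are illustrative, not the actual LedgerManager::State values; the table just encodes the transitions listed above):

#include <cassert>

// Illustrative states: BOOTING is initial startup, REBUILDING is the
// in-memory Soroban state population.
enum class LMState
{
    BOOTING,
    CATCHING_UP,
    SYNCED,
    REBUILDING
};

// Encodes the transitions listed above: booting -> sync, sync <-> catchup,
// boot -> catchup, boot -> rebuild, catchup -> rebuild, sync <-> rebuild.
inline bool
isValidTransition(LMState from, LMState to)
{
    switch (from)
    {
    case LMState::BOOTING:
        return to == LMState::SYNCED || to == LMState::CATCHING_UP ||
               to == LMState::REBUILDING;
    case LMState::CATCHING_UP:
        return to == LMState::SYNCED || to == LMState::REBUILDING;
    case LMState::SYNCED:
        return to == LMState::CATCHING_UP || to == LMState::REBUILDING;
    case LMState::REBUILDING:
        return to == LMState::SYNCED;
    }
    return false;
}

// setState would then reject invalid transitions instead of silently
// accepting anything.
inline void
setState(LMState& current, LMState next)
{
    assert(isValidTransition(current, next));
    current = next;
}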


State mState;
bool mIsBooting{false};
std::condition_variable mBootingCV;
Contributor

A mutex and cv aren't needed here: LM posts the rebuild task to a background thread, then the background thread posts a callback back to main. The callback can set LM state back to either "synced" or "catching up". Then LedgerApplyManager can apply all buffered ledgers whenever a new buffered ledger comes in.
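
A rough self-contained sketch of that hand-off, with a plain std::deque standing in for the main-thread scheduler (the real code would go through postOnBackgroundThread and a post back to the main thread; state values and names below are illustrative):

#include <deque>
#include <functional>
#include <thread>

enum class LMState
{
    REBUILDING,
    SYNCED
};

int
main()
{
    // Stand-in for the main-thread task queue.
    std::deque<std::function<void()>> mainThreadQueue;
    LMState state = LMState::REBUILDING;

    // LM posts the rebuild to a background thread...
    std::thread background([&] {
        // ... populateInMemorySorobanState(...) would run here ...
        // ... and on completion it posts a callback back to main. Only the
        // main thread ever writes `state`, so no mutex/cv is needed for it.
        mainThreadQueue.push_back([&] { state = LMState::SYNCED; });
    });
    background.join();

    // Main-thread loop drains the queue; once the callback runs, LM is out
    // of the rebuilding state and LedgerApplyManager can apply buffered
    // ledgers the next time one comes in.
    while (!mainThreadQueue.empty())
    {
        mainThreadQueue.front()();
        mainThreadQueue.pop_front();
    }
    return state == LMState::SYNCED ? 0 : 1;
}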

Contributor Author

Yeah, these are only here for the waitForBoot that ApplyLedgerWork was using, but I'm rethinking that path anyway.

assertSetupPhase();
// We don't use assertSetupPhase() because we don't expect the thread
// invariant to hold
releaseAssert(mPhase == Phase::SETTING_UP_STATE);
Contributor

If this is the case, assertSetupPhase should be changed to support the new feature.

mBootingLock.unlock();
mApp.postOnBackgroundThread(
[=] {
mApplyState.populateInMemorySorobanState(snapshot,
Contributor

We should ensure that mApplyState is never touched while LM is in rebuilding state. I think markEndOfSetupPhase already does that, but wanted to confirm.

if (mApp.getLedgerManager().isBooting())
{
mTMP_TODO = true;
return ProcessLedgerResult::WAIT_TO_APPLY_BUFFERED_OR_CATCHUP;
Contributor

A new state in ProcessLedgerResult is needed: WAIT_FOR_STATE_REBUILD or something like that.
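
Something along these lines, purely illustrative, with the existing values elided apart from the one visible in the diff above:

enum class ProcessLedgerResult
{
    // ... existing values ...
    WAIT_TO_APPLY_BUFFERED_OR_CATCHUP,
    // New: the LedgerManager is still rebuilding/populating in-memory
    // Soroban state, so buffered ledgers must not be applied yet.
    WAIT_FOR_STATE_REBUILD
};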

mSyncingLedgers.emplace(lastReceivedLedgerSeq, ledgerData);
mLargestLedgerSeqHeard =
std::max(mLargestLedgerSeqHeard, lastReceivedLedgerSeq);

Contributor

mTMP_TODO is not needed: you can anchor the replay condition on LM's state. Specifically, we should never ever attempt ledger replay when LM is rebuilding. Could we add an early exit here and log something like "not attempting application: LM is rebuilding"?

Contributor Author

Yeah, this was so we could hit the fast catchup path in the case where we're populating in-memory state. It's named poorly (because it was pretty hacky), but it is measuring whether we were booting in the last call to this method. This way, the first time we are able to apply ledgers, we try to: right now that path (tryApplySyncingLedgers) is only hit when we receive the ledger that's next in line (to handle holes in reception). In this version, I didn't add the early exit, just to keep the same mSyncingLedgers log message.

Contributor

Oh I see - yeah, let's clean this up. I think the simplest would be to make tryApplySyncingLedgers a no-op when there are gaps and always call it anyway (it might already be doing that).
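
Roughly the shape being suggested, sketched standalone over a std::map buffer (the real mSyncingLedgers holds LedgerCloseData keyed by sequence number; all names here are illustrative):

#include <cstdint>
#include <map>
#include <string>

// Stand-in for the buffered-ledger map keyed by ledger sequence number.
using SyncingLedgers = std::map<std::uint32_t, std::string /* LedgerCloseData */>;

// Safe to call unconditionally (e.g. from processLedger): it does nothing
// when the smallest buffered ledger isn't the one right after the last
// queued ledger, so callers don't need a separate "were we booting on the
// last call" flag like mTMP_TODO.
void
tryApplySyncingLedgers(SyncingLedgers& buffered, std::uint32_t& lastQueuedToApply)
{
    while (!buffered.empty() &&
           buffered.begin()->first == lastQueuedToApply + 1)
    {
        // ... queue buffered.begin()->second for apply ...
        lastQueuedToApply = buffered.begin()->first;
        buffered.erase(buffered.begin());
    }
    // Gap (or empty buffer): no-op; we'll try again on the next call.
}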

mBootingLock.lock();
mIsBooting = true;
mBootingLock.unlock();
mApp.postOnBackgroundThread(
Contributor

I agree that using the low-priority thread isn't right here. How about ledgerApplyThread? We should never be applying during a rebuild anyway.

@drebelsky (Contributor Author) commented Dec 5, 2025

Still thinking on the state machine. In particular, mShouldReportOnMain and the difference between LM_BOOTING_STATE and LM_BOOTING_CATCHUP_STATE feel suboptimal. The main complication comes from offline catchup.

// to apply mSyncingLedgers and possibly get back in sync
if (!mCatchupWork && lastReceivedLedgerSeq == *mLastQueuedToApply + 1)
if (!mCatchupWork &&
mSyncingLedgers.begin()->first == *mLastQueuedToApply + 1 &&
Contributor Author

I don't think this condition is quite right, although the previous condition also doesn't work if we want to apply the ledgers buffered in LedgerApplyManagerImpl once we've gone from BOOTING to BOOTED.

@drebelsky (Contributor Author)

Putting some notes on the various states. Fundamentally, while we're rebuilding/populating in-memory Soroban state, we don't want to apply. There are some considerations around what to do when certain transitions happen while we are still populating the in-memory state (e.g., what should we do if we're catching up at the time). The two previous commits take two different approaches to this. I think both are somewhat incomplete, although it is perhaps also worth considering whether some of this state is "internal" to LedgerManagerImpl and shouldn't (necessarily) be exposed via getState() (e.g., the distinction between the various phases of boot).

Places where setState is called:

  • LedgerManagerImpl::moveToSynced: outside of tests, called in
    • LedgerManagerImpl::ledgerCloseComplete (shouldn't happen if we're not applying)
    • HerderImpl::bootstrap when mConfig.FORCE_SCP is set, I'm not quite sure how this should interact with what's there
    • HerderImpl::setInSyncAndTriggerNextLedger (called in ApplicationImpl::manualClose when mConfig.FORCE_SCP and mConfig.MANUAL_CLOSE are unset)
  • LedgerManagerImpl::startCatchup (only called during offline catchup)
  • LedgerManagerImpl::loadLastKnownLedgerInternal (called in startup path)
  • LedgerManagerImpl::setLastClosedLedger
  // NB: this method is a sort of half-apply that runs on main thread and
  // updates LCL without apply having happened any txs. It's only relevant
  // when finishing the _bucket-apply_ phase of catchup (which is not
  // transaction-apply, it's like "load any bucket state into the DB").
  • LedgerManagerImpl::valueExternalized, when the call to LedgerApplyManager::processLedger() returns ProcessLedgerResult::WAIT_TO_APPLY_BUFFERED_OR_CATCHUP

Places (outside of LedgerManagerImpl) where getState is called:

  • LedgerApplyManager::processLedger
  • HerderImpl::setInSyncAndTriggerNextLedger
  • ApplicationImpl::getState
  • LedgerManager::isBooting (called in ApplyLedgerWork)
  • LedgerManager::isSynced used in
    • HerderImpl::lastClosedLedgerIncreased (under a releaseAssert)
    • HerderImpl::setupTriggerNextLedger (under a releaseAssert)
    • HerderImpl::triggerNextLedger
    • setAuthenticatedLedgerHashPair
    • Peer::recvMessage

Prior to this PR we have three states: booting, catching up, and in sync. This PR adds the BOOTED state so that we don't mark ourselves as "in sync" or "catching up", or apply ledgers, until we've finished populating state. Note also that we need to distinguish between plain booting and booting (populating state) while catching up, for LedgerApplyManagerImpl::processLedger (hence LM_BOOTING_CATCHUP_STATE).

The commit one before the current revision allows catchup to happen while we're still doing the initial rebuild, which leads to some additional complexity (mShouldReportOnMain, so we don't inadvertently switch to the wrong state when the initial rebuild finishes). The current revision avoids this by never going into catchup unless we're already booted (although this leads to some oddity in that startCatchup will still start the catch-up flow). So, I suppose one pertinent question is whether, in the case where a node is offline for a while (so that it will definitely have to do catchup), the extra delay of waiting for the state rebuild first is too long.

@drebelsky (Contributor Author) commented Dec 11, 2025

Notes on most recent commit:

  • After discussion, we chose to focus on the restart case—that is, we're only concerned about online non-bucket apply catchup for buffering ledgers (this simplifies the number of states from ~8 -> 4)
  • I left the customization of the invariant assertion in LedgerManagerImpl::ApplyState::populateInMemorySorobanState because it felt cleaner than the alternative to me (otherwise, we have to modify threadInvariant to have a reference to LedgerManager to check if the state is currently BOOTING, but mAppConnector won't let us get the ledger manager when we're not on the main thread)
  • Instead of blocking in startCatchup (offline catchup), the wait now lives in the works themselves: waitForLedgerManager is currently duplicated between ApplyBucketsWork and ApplyLedgersWork (see the sketch after this list). I couldn't think of a good/clean way to block on waiting for booted inside startCatchup/catchup, so I moved the logic there instead. I wasn't sure it made sense to refactor this into a method on LedgerManager, but I'm happy to change that if others think it's better. Also worth noting that, practically, the case shouldn't get hit in ApplyBucketsWork (if we're applying buckets, we're starting from genesis, and downloading/verifying buckets should be more expensive than populating an empty in-memory state)
  • I couldn't recall what we had decided to do for LedgerApplyManagerImpl::MAX_EXTERNALIZE_LEDGER_APPLY_DRIFT
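
For reference, a self-contained sketch of the sort of check waitForLedgerManager performs (state names and the helper are illustrative, not the actual implementation; in the real Work framework this would presumably map onto reporting WORK_WAITING rather than busy-waiting):

#include <functional>

// Illustrative states; LedgerManager has its own State enum.
enum class LMState
{
    BOOTING,
    CATCHING_UP,
    SYNCED
};

// The logic duplicated between ApplyBucketsWork and ApplyLedgersWork as one
// shared helper: a work that is about to apply checks this each time it runs
// and keeps waiting while the in-memory Soroban state rebuild is still going.
inline bool
ledgerManagerReadyToApply(std::function<LMState()> const& getState)
{
    return getState() != LMState::BOOTING;
}

int
main()
{
    LMState state = LMState::BOOTING;
    // While the rebuild is in progress, the work should report "waiting".
    bool ready = ledgerManagerReadyToApply([&] { return state; });
    // After the rebuild completion callback flips the state on main, the
    // work can proceed with bucket/ledger apply.
    state = LMState::CATCHING_UP;
    ready = ledgerManagerReadyToApply([&] { return state; });
    return ready ? 0 : 1;
}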

drebelsky marked this pull request as ready for review on December 11, 2025 at 20:43
drebelsky changed the title from "DRAFT: Populate Soroban in-memory state in a background thread" to "Populate Soroban in-memory state in a background thread" on Dec 11, 2025

Development

Successfully merging this pull request may close these issues.

Core restarts are slow and result in catchup due to core in-memory state populating on startup
