Fix thread-unsafe Dictionary in telemetry (#12867) #13317
Draft
JanProvaznik wants to merge 1 commit into dotnet:main
TelemetryForwarderProvider is a singleton per node. With /m /mt flags, multiple RequestBuilder threads concurrently write to shared Dictionary fields on WorkerNodeTelemetryData, causing corruption during Dictionary.Resize.

The fix has two parts:

1. Add lock-based thread safety to WorkerNodeTelemetryData and ProjectTelemetry dictionary access.
2. Refactor RequestBuilder.UpdateStatisticsPostBuild to accumulate telemetry into a local WorkerNodeTelemetryData per project build, then merge once via ITelemetryForwarder.MergeWorkerData. This mirrors the OOP node pattern (local accumulate, merge once) and reduces lock acquisitions from ~10k to ~32 for a typical 32-project build.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
JanProvaznik
commented
Mar 3, 2026
Member
Author
JanProvaznik
left a comment
@AR-May this is assigned to you and this looks promising. PTAL if you want to use this or take inspiration from it for how it could be fixed.
    // Telemetry for non-sealed subclasses of Microsoft-owned MSBuild tasks
    // Maps Microsoft task names to counts of their non-sealed usage
    private readonly Dictionary<string, int> _msbuildTaskSubclassUsage = new();
    private readonly object _subclassUsageLock = new();
Suggested change

    - private readonly object _subclassUsageLock = new();
    + private readonly LockType _subclassUsageLock = new();
    internal class WorkerNodeTelemetryData : IWorkerNodeTelemetryData
    {
        private readonly object _lock = new();
Suggested change

    - private readonly object _lock = new();
    + private readonly LockType _lock = new();
Summary
Fixes #12867: `Dictionary` corruption crash in telemetry when building with `/m /mt`.

Problem
In out-of-proc mode (`/m` only), each worker node runs in a separate process with its own `TelemetryForwarderProvider` singleton, so its `WorkerNodeTelemetryData` dictionaries are only accessed by one thread. At end-of-build, each node sends a single `WorkerNodeTelemetryEventArgs` message and the main node merges them one at a time. No contention, no problem.

In in-proc multithreaded mode (`/m /mt`), all in-proc nodes share a single process and therefore a single `TelemetryForwarderProvider` singleton. Multiple `RequestBuilder` instances run on dedicated threads (`DedicatedThreadsTaskScheduler`) and call `AddTask`/`AddTarget` concurrently on the same `Dictionary<>` fields. This causes `Dictionary.Resize` corruption: an `ArgumentException` or silent data loss on every build with enough parallelism.

Reproduction: 20 non-SDK .NET Framework library projects + 1 exe referencing all of them, built with `MSBuild.exe Repro.sln /m /mt`. Crashed 12/12 times before this fix.

Approach
Rather than switching to `ConcurrentDictionary` (which has per-operation stripe-lock overhead and requires `AddOrUpdate` with lambda allocations for the check-then-act pattern), this PR makes the `/mt` path match the OOP architecture:

Before: `RequestBuilder.UpdateStatisticsPostBuild` called `telemetryForwarder.AddTask()`/`AddTarget()` once per target and once per task, each call individually hitting the shared singleton dictionary. For a project with 300 targets and 15 tasks, that was 315 lock-needing mutations on the shared state per project.

After: `UpdateStatisticsPostBuild` creates a local `WorkerNodeTelemetryData`, accumulates all targets and tasks into it (zero contention: single owner, no lock needed), then calls `telemetryForwarder.MergeWorkerData(localData)` once. This does a single `Add()` to merge into the shared singleton under one lock acquisition. For 32 projects, that is 32 lock acquisitions total.

Additionally, `WorkerNodeTelemetryData.AddTask`/`AddTarget`/`Add` are now protected by `lock` to guard the merge path and any remaining direct callers. `ProjectTelemetry._msbuildTaskSubclassUsage` gets the same treatment.

Changes
| File | Change |
| --- | --- |
| `WorkerNodeTelemetryData.cs` | `lock` around `AddTask`/`AddTarget`/`Add`; extracted private `Unsafe` variants to avoid nested locking |
| `ProjectTelemetry.cs` | `lock` around `_msbuildTaskSubclassUsage` dictionary access |
| `ITelemetryForwarder.cs` | `MergeWorkerData(IWorkerNodeTelemetryData)` to interface |
| `TelemetryForwarderProvider.cs` | `MergeWorkerData` in `TelemetryForwarder` and `NullTelemetryForwarder` |
| `RequestBuilder.cs` | `UpdateStatisticsPostBuild` to local-accumulate + single merge; constructs `TaskOrTargetTelemetryKey` directly |
| `WorkerNodeTelemetryData_Tests.cs` | |

Verification
`/m /mt` incremental build × 20 runs
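
For readers skimming the thread, the local-accumulate-then-merge pattern described under Approach can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `TelemetryData`, `MergeDemo`, and `LockAcquisitions` are hypothetical names, a single name-to-count dictionary stands in for the real task and target data, and `Parallel.For` stands in for the dedicated `RequestBuilder` threads.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for WorkerNodeTelemetryData: task name -> execution count.
public sealed class TelemetryData
{
    private readonly object _lock = new();
    private readonly Dictionary<string, int> _counts = new();

    // Counts lock acquisitions made by AddTask/Add; for illustration only.
    public int LockAcquisitions { get; private set; }

    // Public entry point: takes the lock, then delegates to the unlocked core.
    public void AddTask(string name, int count = 1)
    {
        lock (_lock)
        {
            LockAcquisitions++;
            AddTaskUnsafe(name, count);
        }
    }

    // Merges another instance under one lock acquisition.
    // Assumes 'other' is no longer being mutated by any thread.
    public void Add(TelemetryData other)
    {
        lock (_lock)
        {
            LockAcquisitions++;
            foreach (KeyValuePair<string, int> entry in other._counts)
            {
                // Unlocked variant avoids re-entering the lock per entry.
                AddTaskUnsafe(entry.Key, entry.Value);
            }
        }
    }

    public int GetCount(string name)
    {
        lock (_lock)
        {
            return _counts.TryGetValue(name, out int value) ? value : 0;
        }
    }

    // Caller must hold _lock.
    private void AddTaskUnsafe(string name, int count)
    {
        _counts.TryGetValue(name, out int current);
        _counts[name] = current + count;
    }
}

public static class MergeDemo
{
    // Simulates 'projects' parallel project builds, each recording
    // 'entriesPerProject' telemetry entries.
    public static TelemetryData Run(int projects, int entriesPerProject)
    {
        var shared = new TelemetryData();
        Parallel.For(0, projects, _ =>
        {
            // Single-owner local instance: its lock is never contended.
            var local = new TelemetryData();
            for (int i = 0; i < entriesPerProject; i++)
            {
                local.AddTask("Task" + (i % 15));
            }

            // One contended lock acquisition per project build.
            shared.Add(local);
        });
        return shared;
    }
}
```

In this sketch, `MergeDemo.Run(32, 315)` takes the shared lock 32 times, once per simulated project. Calling `shared.AddTask` directly from every worker would have taken it 32 × 315 = 10,080 times, which is the ~10k figure quoted in the commit message.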