Skip to content

Fix thread-unsafe Dictionary in telemetry (#12867)#13317

Draft
JanProvaznik wants to merge 1 commit intodotnet:mainfrom
JanProvaznik:fix/telemetry-thread-safety-12867
Draft

Fix thread-unsafe Dictionary in telemetry (#12867)#13317
JanProvaznik wants to merge 1 commit intodotnet:mainfrom
JanProvaznik:fix/telemetry-thread-safety-12867

Conversation

@JanProvaznik
Copy link
Member

@JanProvaznik JanProvaznik commented Mar 3, 2026

Summary

Fixes #12867Dictionary corruption crash in telemetry when building with /m /mt.

Problem

In out-of-proc mode (/m only), each worker node runs in a separate process with its own TelemetryForwarderProvider singleton, so its WorkerNodeTelemetryData dictionaries are only accessed by one thread. At end-of-build, each node sends a single WorkerNodeTelemetryEventArgs message and the main node merges them one at a time. No contention, no problem.

In in-proc multithreaded mode (/m /mt), all in-proc nodes share a single process and therefore a single TelemetryForwarderProvider singleton. Multiple RequestBuilder instances run on dedicated threads (DedicatedThreadsTaskScheduler) and call AddTask/AddTarget concurrently on the same Dictionary<> fields. This causes Dictionary.Resize corruption — an ArgumentException or silent data loss on every build with enough parallelism.

Reproduction: 20 non-SDK .NET Framework library projects + 1 exe referencing all of them, built with MSBuild.exe Repro.sln /m /mt. Crashed 12/12 times before this fix.

Approach

Rather than switching to ConcurrentDictionary (which has per-operation stripe-lock overhead and requires AddOrUpdate with lambda allocations for the check-then-act pattern), this PR makes the /mt path match the OOP architecture:

Before: RequestBuilder.UpdateStatisticsPostBuild called telemetryForwarder.AddTask()/AddTarget() once per target and once per task — each call individually hitting the shared singleton dictionary. For a project with 300 targets and 15 tasks, that was 315 lock-needing mutations on the shared state per project.

After: UpdateStatisticsPostBuild creates a local WorkerNodeTelemetryData, accumulates all targets and tasks into it (zero contention — single owner, no lock needed), then calls telemetryForwarder.MergeWorkerData(localData) once. This does a single Add() to merge into the shared singleton under one lock acquisition. For 32 projects, that is 32 lock acquisitions total.

Additionally, WorkerNodeTelemetryData.AddTask/AddTarget/Add are now protected by lock to guard the merge path and any remaining direct callers. ProjectTelemetry._msbuildTaskSubclassUsage gets the same treatment.

Changes

File Change
WorkerNodeTelemetryData.cs Added lock around AddTask/AddTarget/Add; extracted private Unsafe variants to avoid nested locking
ProjectTelemetry.cs Added lock around _msbuildTaskSubclassUsage dictionary access
ITelemetryForwarder.cs Added MergeWorkerData(IWorkerNodeTelemetryData) to interface
TelemetryForwarderProvider.cs Implemented MergeWorkerData in TelemetryForwarder and NullTelemetryForwarder
RequestBuilder.cs Refactored UpdateStatisticsPostBuild to local-accumulate + single merge; constructs TaskOrTargetTelemetryKey directly
WorkerNodeTelemetryData_Tests.cs 3 new concurrency regression tests (AddTask, AddTarget, Add merge) using 20 threads + Barrier

Verification

Scenario Before After
E2E: 20-project /m /mt incremental build × 20 runs 12/12 crashes 20/20 pass
Unit tests: concurrent AddTask/AddTarget/Add N/A 3/3 pass

TelemetryForwarderProvider is a singleton per node. With /m /mt flags,
multiple RequestBuilder threads concurrently write to shared Dictionary
fields on WorkerNodeTelemetryData, causing corruption during
Dictionary.Resize.

The fix has two parts:

1. Add lock-based thread safety to WorkerNodeTelemetryData and
   ProjectTelemetry dictionary access.

2. Refactor RequestBuilder.UpdateStatisticsPostBuild to accumulate
   telemetry into a local WorkerNodeTelemetryData per project build,
   then merge once via ITelemetryForwarder.MergeWorkerData. This mirrors
   the OOP node pattern (local accumulate, merge once) and reduces lock
   acquisitions from ~10k to ~32 for a typical 32-project build.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Member Author

@JanProvaznik JanProvaznik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AR-May it's assigned on you and this looks promising. ptal if you want to use this or be inspired how it could be fixed

// Telemetry for non-sealed subclasses of Microsoft-owned MSBuild tasks
// Maps Microsoft task names to counts of their non-sealed usage
private readonly Dictionary<string, int> _msbuildTaskSubclassUsage = new();
private readonly object _subclassUsageLock = new();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private readonly object _subclassUsageLock = new();
private readonly LockType _subclassUsageLock = new();


internal class WorkerNodeTelemetryData : IWorkerNodeTelemetryData
{
private readonly object _lock = new();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private readonly object _lock = new();
private readonly LockType _lock = new();

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ProjectTelemetry subclass tracking is not thread safe

1 participant