Add support for masked histograms #6695

jhapradip · 2025-05-03T04:39:06Z

New contributor declaration

I am not making a trivial change, such as fixing a typo in a comment.
I have written a PR description following these rules.
I have run pre-commit run --from-ref origin/main --to-ref HEAD.
Select one of the following.
- I have added tests.
  - /test for lit tests
  - /unittest for C++ tests
  - /python/test for end-to-end tests
- This PR does not need a test because FILL THIS IN.
Select one of the following.
- I have not added any lit tests.
- The lit tests I have added follow these best practices,
  including the "tests should be minimal" section. (Usually running Python code
  and using the instructions it generates is not minimal.)

ThomasRaoux

looks cool, few comments

lib/Conversion/TritonGPUToLLVM/HistogramOpToLLVM.cpp

ThomasRaoux · 2025-05-03T14:12:36Z

lib/Conversion/TritonGPUToLLVM/HistogramOpToLLVM.cpp

+      // mask out the values for which input mask is invalid
+      binMask = b.and_(binMask, inputMaskBit);


it looks like every loop iteration will AND the same value?

inputMaskBit is loop invariant, but binMask is updated within the loop.

python/triton/language/core.py

python/src/ir.cc

include/triton/Dialect/Triton/IR/TritonOps.td

ThomasRaoux · 2025-05-04T23:31:25Z

lib/Conversion/TritonGPUToLLVM/HistogramOpToLLVM.cpp

+      // mask out the values for which input mask is invalid
+      binMask = b.and_(binMask, inputMaskBit);


right, but we are just masking out bits in the loop? Would it be equivalent to apply it to the mask once, outside the loop?
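
The question above comes down to a property of bitwise AND. As a minimal sketch, assuming the loop only ever narrows binMask by AND-ing it with per-iteration values and runs at least once, applying a loop-invariant bit in every iteration gives the same result as applying it once after the loop. The function names and the per_iteration_bits input are hypothetical stand-ins for illustration, not the lowering code from the PR.

```python
import random


def mask_inside_loop(bin_mask, per_iteration_bits, input_mask_bit):
    """AND the loop-invariant input_mask_bit into bin_mask on every iteration."""
    for bits in per_iteration_bits:
        bin_mask &= bits            # per-iteration narrowing
        bin_mask &= input_mask_bit  # loop-invariant AND, repeated each time
    return bin_mask


def mask_outside_loop(bin_mask, per_iteration_bits, input_mask_bit):
    """Run the same loop, then AND input_mask_bit in exactly once."""
    for bits in per_iteration_bits:
        bin_mask &= bits
    return bin_mask & input_mask_bit


def check_equivalence(trials=1000, width=32, seed=0):
    """Compare both variants on random inputs with at least one iteration."""
    rng = random.Random(seed)
    for _ in range(trials):
        start = rng.getrandbits(width)
        bits = [rng.getrandbits(width) for _ in range(rng.randint(1, 8))]
        invariant = rng.getrandbits(width)
        inside = mask_inside_loop(start, bits, invariant)
        outside = mask_outside_loop(start, bits, invariant)
        if inside != outside:
            return False
    return True
```

The two variants agree because AND is associative, commutative, and idempotent. They differ only when the loop runs zero times, since the in-loop version then never applies the invariant bit.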

ThomasRaoux · 2025-05-04T23:33:22Z

lib/Dialect/TritonGPU/IR/Ops.cpp

+    auto mask = op.getMask();
+    if (mask)
+      return failure();


I don't have a great solution, but not having this optimization is likely to cause poor performance whenever a masked histogram is used.
One thing we could do is create a convert_layout for the mask instead. That is still not always ideal, but I wonder if it is more likely to be better overall.

You are right about the other one. I fixed the code. Thanks for your suggestion.

I am not very familiar with how layout conversions are handled. The current code assumes that both operands have the same layout, so the layout of the mask matches the layout of src. I am not entirely sure how Triton picks the layout for these. My understanding is that the current code removes any conversions because the histogram can be computed in any order. When the mask is applied, the operation can still happen in any layout, as long as both operands have matching layouts.

I think the current code ensures this by keeping any conversions if the mask is present, and this can be inefficient, as you pointed out. If both operands are being converted, I think you are suggesting that we should only convert the mask. If so, is there a good way to match the layout of the src? Do we implement this manually, or is there a trait to indicate this? Another case I am curious about is when src and mask have the same original layout and the same conversion is then applied to each. Is this something we need to handle, or would it be handled automatically by other layout passes?

I guess I am looking for idiomatic ways to represent layout constraints in Triton. Secondarily, I would like to confirm what should be converted. I assume ideally only the mask should be converted.
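
The ordering argument above can be checked with a small experiment. This is only a sketch: it models a "layout" as a reordering of element positions, which is an assumption made for illustration and not how Triton represents layouts. The helper names masked_counts and permute are hypothetical.

```python
from collections import Counter
import random


def masked_counts(src, mask):
    """Count each value of src whose corresponding mask entry is True."""
    return Counter(value for value, keep in zip(src, mask) if keep)


def permute(values, order):
    """Reorder values so that position k holds values[order[k]]."""
    return [values[i] for i in order]


src = [0, 0, 1, 1, 2, 2, 3, 3]
mask = [True, True, True, True, False, False, False, False]

# Apply the same random reordering to both operands.
order = list(range(len(src)))
random.Random(0).shuffle(order)

reference = masked_counts(src, mask)
same_layout = masked_counts(permute(src, order), permute(mask, order))

# Reorder only src: values are now paired with the wrong mask entries.
mismatched = masked_counts(src[::-1], mask)
```

Here same_layout equals reference because every value stays paired with its own mask entry, while mismatched counts the values 3 and 2 instead of 0 and 1. This is why the mask has to follow src whenever the layout of src changes.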

@ThomasRaoux I revisited this PR and, as per your suggestion, added code to convert the layout of the mask to match the layout of src. As mentioned in the comments, I think the best we can do is a single conversion so that the mask matches the src. Please let me know if I should do something else.

ThomasRaoux

Thanks, one more comment on the pattern with convert layout. After that it should be good to land.

ThomasRaoux · 2025-06-02T06:50:21Z

lib/Dialect/TritonGPU/IR/Ops.cpp

+    // Retrieve ancestor of a value before all conversions
+    auto getAncestorBeforeConversions = [](auto value) {
+      auto numConversions = 0;
+      while (auto convert = value.template getDefiningOp<ConvertLayoutOp>()) {
+        numConversions++;
+        value = convert.getSrc();
+      }
+      return std::make_pair(value, numConversions);
+    };
+
+    auto [src, numSrcConversions] = getAncestorBeforeConversions(op.getSrc());
+
+    // If there is no mask, replace the src directly
+    if (!op.getMask()) {
+      if (!numSrcConversions)
+        return failure();
+
+      rewriter.replaceOpWithNewOp<triton::HistogramOp>(
+          op, op->getResult(0).getType(), src, op.getMask());
+      return success();
+    }
+
+    // When mask is present, we want a single conversion on mask to match src's
+    // layout. If there are more conversions, delete them.
+    auto [mask, numMaskConversions] =
+        getAncestorBeforeConversions(op.getMask());
+
+    auto sharedType = getI1SameShape(src.getType());
+    if (numSrcConversions || numMaskConversions > 1) {
+      rewriter.setInsertionPoint(op);
+      mask = rewriter.create<ConvertLayoutOp>(op.getLoc(), sharedType, mask);
+      rewriter.replaceOpWithNewOp<triton::HistogramOp>(
+          op, op->getResult(0).getType(), src, mask);
+      return success();
+    } else {
      return failure();
-    rewriter.replaceOpWithNewOp<triton::HistogramOp>(
-        op, op->getResult(0).getType(), convert.getSrc());
-    return mlir::success();
+    }


This looks overly complicated. Can we just keep the original logic and, if a mask exists, create a convert layout for it?
Also, could you add a simple lit test for this?

I simplified the convert logic for the mask and added a lit test.
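
A toy model of the simplified rewrite discussed above may help make it concrete. This is not the code from the PR: the classes Tensor, ConvertLayout, and Histogram and the function fold_src_conversion are hypothetical stand-ins for the MLIR values and the rewrite pattern, and layouts are reduced to plain strings.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Tensor:
    """A value with a layout; stands in for an MLIR value (hypothetical)."""
    name: str
    layout: str


@dataclass(frozen=True)
class ConvertLayout:
    """Stands in for a convert_layout op that produces `layout` from `src`."""
    src: "Value"
    layout: str


Value = Union[Tensor, ConvertLayout]


@dataclass(frozen=True)
class Histogram:
    """Stands in for a histogram op with an optional mask operand."""
    src: Value
    mask: Optional[Value] = None


def fold_src_conversion(op: Histogram) -> Optional[Histogram]:
    """Return the rewritten op, or None when the pattern does not apply."""
    if not isinstance(op.src, ConvertLayout):
        return None  # original logic: no conversion on src, nothing to fold
    new_src = op.src.src
    new_mask = op.mask
    if new_mask is not None:
        # The mask has to follow src, so convert it to the layout src
        # had before its conversion.
        new_mask = ConvertLayout(new_mask, new_src.layout)
    return Histogram(new_src, new_mask)


x = Tensor("x", "blocked")
m = Tensor("m", "linear")
rewritten = fold_src_conversion(Histogram(ConvertLayout(x, "linear"), m))
```

In this model, rewritten reads x directly and wraps m in a single conversion to the "blocked" layout. If the mask was itself produced by a conversion, the sketch simply stacks another one on top and makes no attempt to fold the pair.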

Currently, Triton only handles tensors whose dimensions are powers of two. This limitation makes it difficult to create histograms for irregularly sized tensors. Other operations, such as load and store, work around this limitation using a mask parameter. This commit adds similar support for histograms. Fixes triton-lang#4825.
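
As an illustration of the masking described above, here is a plain Python reference for the intended semantics. It is a sketch, not Triton code: the function masked_histogram is hypothetical, and it assumes unit-width bins starting at 0 and input values in the range [0, num_bins).

```python
def masked_histogram(values, num_bins, mask=None):
    """Count values into num_bins unit-width bins, skipping masked-out entries.

    Assumes every counted value lies in [0, num_bins).
    """
    if mask is None:
        mask = [True] * len(values)
    counts = [0] * num_bins
    for value, keep in zip(values, mask):
        if keep:
            counts[value] += 1
    return counts


n = 5        # real length, not a power of two
block = 8    # padded block size, a power of two
data = [2, 0, 2, 1, 3]

padded = data + [0] * (block - n)       # padding lands in bin 0
mask = [i < n for i in range(block)]    # True only for real elements

unmasked = masked_histogram(padded, 4)
masked = masked_histogram(padded, 4, mask)
```

Without the mask, the three padding zeros inflate bin 0 (unmasked is [4, 1, 2, 1]). With the mask, only the five real elements are counted (masked is [1, 1, 2, 1]).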

ThomasRaoux

LGTM, thanks for the great work

jhapradip requested a review from ptillet as a code owner May 3, 2025 04:39

ThomasRaoux reviewed May 3, 2025

jhapradip force-pushed the main branch from 7e0563b to eb75882 Compare May 4, 2025 07:08

ThomasRaoux reviewed May 4, 2025

jhapradip force-pushed the main branch from eb75882 to 608ab93 Compare May 5, 2025 02:07

jhapradip force-pushed the main branch from 608ab93 to 5f30745 Compare June 2, 2025 06:10

ThomasRaoux reviewed Jun 2, 2025

jhapradip added 3 commits June 3, 2025 12:04

Add support for masked histograms in interpreter

b3ba0ad

Add tests for masked histogram

b558b8c

jhapradip force-pushed the main branch from 5f30745 to b558b8c Compare June 3, 2025 19:05

ThomasRaoux approved these changes Jun 3, 2025

ThomasRaoux merged commit 2a10b48 into triton-lang:main Jun 3, 2025
8 checks passed

2 participants