@@ -709,22 +709,101 @@ pub trait ScalarUDFImpl: Debug + DynEq + DynHash + Send + Sync {
709709 Ok(ExprSimplifyResult::Original(args))
710710 }
711711
712- /// Returns the [preimage] for this function and the specified scalar value, if any.
712+ /// Returns a single contiguous preimage for this function and the specified
713+ /// scalar expression, if any.
714+ ///
715+ /// This currently applies only to `=, !=, >, >=, <, <=, is distinct from, is not distinct from` predicates.
716+ /// # Return Value
717+ ///
718+ /// Implementations should return a half-open interval: inclusive lower
719+ /// bound and exclusive upper bound. This is slightly different from normal
720+ /// [`Interval`] semantics where the upper bound is closed (inclusive).
721+ /// Typically this means the upper endpoint must be adjusted to the next
722+ /// value not included in the preimage. See the Half-Open Intervals section
723+ /// below for more details.
724+ ///
725+ /// # Background
726+ ///
727+ /// Inspired by the [ClickHouse Paper], a "preimage rewrite" transforms a
728+ /// predicate containing a function call into a predicate containing an
729+ /// equivalent set of input literal (constant) values. The resulting
730+ /// predicate can often be further optimized by other rewrites (see
731+ /// Examples).
732+ ///
733+ /// From the paper:
734+ ///
735+ /// > some functions can compute the preimage of a given function result.
736+ /// > This is used to replace comparisons of constants with function calls
737+ /// > on the key columns by comparing the key column value with the preimage.
738+ /// > For example, `toYear(k) = 2024` can be replaced by
739+ /// > `k >= 2024-01-01 && k < 2025-01-01`
740+ ///
741+ /// For example, given an expression like
742+ /// ```sql
743+ /// date_part('YEAR', k) = 2024
744+ /// ```
745+ ///
746+ /// The interval `[2024-01-01, 2024-12-31]` contains all possible input
747+ /// values (preimage values) for which the function `date_part('YEAR', k)`
748+ /// produces the output value `2024` (image value). Returning the half-open
749+ /// interval `[2024-01-01, 2025-01-01)` (note the upper bound adjusted up),
750+ /// the expression can be rewritten to
751+ ///
752+ /// ```sql
753+ /// k >= '2024-01-01' AND k < '2025-01-01'
754+ /// ```
755+ ///
756+ /// which is a simpler and more canonical form, making it easier for other
757+ /// optimizer passes to recognize and apply further transformations.
758+ ///
759+ /// # Examples
713760 ///
714- /// A preimage is a single contiguous [`Interval`] of values where the function
715- /// will always return `lit_value`
761+ /// Case 1:
716762 ///
717- /// Implementations should return intervals with an inclusive lower bound and
718- /// exclusive upper bound.
763+ /// Original:
764+ /// ```sql
765+ /// date_part('YEAR', k) = 2024 AND k >= '2024-06-01'
766+ /// ```
767+ ///
768+ /// After preimage rewrite:
769+ /// ```sql
770+ /// k >= '2024-01-01' AND k < '2025-01-01' AND k >= '2024-06-01'
771+ /// ```
719772 ///
720- /// This rewrite is described in the [ClickHouse Paper] and is particularly
721- /// useful for simplifying expressions `date_part` or equivalent functions. The
722- /// idea is that if you have an expression like `date_part(YEAR, k) = 2024` and you
723- /// can find a [preimage] for `date_part(YEAR, k)`, which is the range of dates
724- /// covering the entire year of 2024. Thus, you can rewrite the expression to `k
725- /// >= '2024-01-01' AND k < '2025-01-01' which is often more optimizable.
773+ /// Since this form is much simpler, the optimizer can combine and simplify
774+ /// sub-expressions further into:
775+ /// ```sql
776+ /// k >= '2024-06-01' AND k < '2025-01-01'
777+ /// ```
778+ ///
779+ /// Case 2:
726780 ///
781+ /// For min/max pruning, simpler predicates such as:
782+ /// ```sql
783+ /// k >= '2024-01-01' AND k < '2025-01-01'
784+ /// ```
785+ /// are much easier for the pruner to reason about. See [PruningPredicate]
786+ /// for background on predicate pruning.
787+ ///
788+ /// The trade-off with the preimage rewrite is that evaluating the rewritten
789+ /// form might be slightly more expensive than evaluating the original
790+ /// expression. In practice, this cost is usually outweighed by the more
791+ /// aggressive optimization opportunities it enables.
792+ ///
793+ /// # Half-Open Intervals
794+ ///
795+ /// The preimage API uses half-open intervals so that the rewrite can use
796+ /// the returned bounds directly, without adjusting the upper bound itself.
797+ /// For example, if a function returns its input unchanged and the desired
798+ /// output is the single value `5`, a closed interval could be represented
799+ /// as `[5, 5]`, but then the rewrite would require adjusting the upper
800+ /// bound to `6` to create a proper range predicate. With a half-open
801+ /// interval, the same range is represented as `[5, 6)`, which already
802+ /// forms a valid predicate.
803+ ///
804+ /// [PruningPredicate]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html
727805 /// [ClickHouse Paper]: https://www.vldb.org/pvldb/vol17/p3731-schulze.pdf
806+ /// [image]: https://en.wikipedia.org/wiki/Image_(mathematics)#Image_of_an_element
728807 /// [preimage]: https://en.wikipedia.org/wiki/Image_(mathematics)#Inverse_image
729808 fn preimage(
730809 &self,
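The way a half-open preimage turns each comparison operator into a plain range predicate can be sketched in isolation. The block below is a minimal, hypothetical illustration and not DataFusion code: `Op`, `decade`, `preimage`, `original`, and `rewritten` are invented names, the toy function `decade` stands in for something like `date_part('YEAR', k)`, and the sketch assumes the function is monotonically non-decreasing. NULL handling (`is distinct from`, `is not distinct from`) is left out.

```rust
// Hypothetical sketch: none of these names come from DataFusion.

/// Comparison operators a preimage rewrite might handle (NULL-aware ones omitted).
#[derive(Clone, Copy)]
enum Op {
    Eq,
    NotEq,
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// Toy stand-in for a function such as `date_part('YEAR', k)`: it is
/// monotonically non-decreasing and maps each run of ten inputs to one output.
fn decade(k: i64) -> i64 {
    k.div_euclid(10)
}

/// Half-open preimage `[lower, upper)` of `decade` for the output value `v`.
fn preimage(v: i64) -> (i64, i64) {
    (v * 10, (v + 1) * 10)
}

/// Evaluates the original predicate `decade(k) <op> v`.
fn original(op: Op, k: i64, v: i64) -> bool {
    let f = decade(k);
    match op {
        Op::Eq => f == v,
        Op::NotEq => f != v,
        Op::Lt => f < v,
        Op::LtEq => f <= v,
        Op::Gt => f > v,
        Op::GtEq => f >= v,
    }
}

/// Evaluates the rewritten predicate, which compares `k` only to the bounds.
/// Because the upper bound is exclusive, no bound needs adjusting here.
fn rewritten(op: Op, k: i64, v: i64) -> bool {
    let (lower, upper) = preimage(v);
    match op {
        Op::Eq => k >= lower && k < upper,
        Op::NotEq => k < lower || k >= upper,
        Op::Lt => k < lower,
        Op::LtEq => k < upper,
        Op::Gt => k >= upper,
        Op::GtEq => k >= lower,
    }
}

/// Brute-force check that both forms agree for every operator over a small
/// grid of inputs `k` and output values `v`.
fn rewrite_is_equivalent() -> bool {
    let ops = [Op::Eq, Op::NotEq, Op::Lt, Op::LtEq, Op::Gt, Op::GtEq];
    ops.iter().all(|&op| {
        (-5..5).all(|v| (-60..60).all(|k| original(op, k, v) == rewritten(op, k, v)))
    })
}

fn main() {
    // The preimage of output 3 is the half-open range [30, 40).
    assert_eq!(preimage(3), (30, 40));
    assert!(rewrite_is_equivalent());
}
```

With a closed interval such as `[30, 39]`, the `LtEq` and `Gt` arms would each need to compute `39 + 1` (or an equivalent "next value") before comparing; the half-open form pushes that work into `preimage` once.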