Skip to content

Commit 00c7155

Browse files
committed
Merge remote-tracking branch 'upstream/main' into json_array_support
2 parents f1ed4a8 + f997169 commit 00c7155

File tree

3 files changed

+92
-27
lines changed

3 files changed

+92
-27
lines changed

datafusion/expr/src/udf.rs

Lines changed: 90 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -709,22 +709,101 @@ pub trait ScalarUDFImpl: Debug + DynEq + DynHash + Send + Sync {
709709
Ok(ExprSimplifyResult::Original(args))
710710
}
711711

712-
/// Returns the [preimage] for this function and the specified scalar value, if any.
712+
/// Returns a single contiguous preimage for this function and the specified
713+
/// scalar expression, if any.
714+
///
715+
/// Currently only applies to `=, !=, >, >=, <, <=, is distinct from, is not distinct from` predicates
716+
/// # Return Value
717+
///
718+
/// Implementations should return a half-open interval: inclusive lower
719+
/// bound and exclusive upper bound. This is slightly different from normal
720+
/// [`Interval`] semantics where the upper bound is closed (inclusive).
721+
/// Typically this means the upper endpoint must be adjusted to the next
722+
/// value not included in the preimage. See the Half-Open Intervals section
723+
/// below for more details.
724+
///
725+
/// # Background
726+
///
727+
/// Inspired by the [ClickHouse Paper], a "preimage rewrite" transforms a
728+
/// predicate containing a function call into a predicate containing an
729+
/// equivalent set of input literal (constant) values. The resulting
730+
/// predicate can often be further optimized by other rewrites (see
731+
/// Examples).
732+
///
733+
/// From the paper:
734+
///
735+
/// > some functions can compute the preimage of a given function result.
736+
/// > This is used to replace comparisons of constants with function calls
737+
/// > on the key columns by comparing the key column value with the preimage.
738+
/// > For example, `toYear(k) = 2024` can be replaced by
739+
/// > `k >= 2024-01-01 && k < 2025-01-01`
740+
///
741+
/// For example, given an expression like
742+
/// ```sql
743+
/// date_part('YEAR', k) = 2024
744+
/// ```
745+
///
746+
/// The interval `[2024-01-01, 2025-12-31`]` contains all possible input
747+
/// values (preimage values) for which the function `date_part(YEAR, k)`
748+
/// produces the output value `2024` (image value). Returning the interval
749+
/// (note upper bound adjusted up) `[2024-01-01, 2025-01-01]` the expression
750+
/// can be rewritten to
751+
///
752+
/// ```sql
753+
/// k >= '2024-01-01' AND k < '2025-01-01'
754+
/// ```
755+
///
756+
/// which is a simpler and a more canonical form, making it easier for other
757+
/// optimizer passes to recognize and apply further transformations.
758+
///
759+
/// # Examples
713760
///
714-
/// A preimage is a single contiguous [`Interval`] of values where the function
715-
/// will always return `lit_value`
761+
/// Case 1:
716762
///
717-
/// Implementations should return intervals with an inclusive lower bound and
718-
/// exclusive upper bound.
763+
/// Original:
764+
/// ```sql
765+
/// date_part('YEAR', k) = 2024 AND k >= '2024-06-01'
766+
/// ```
767+
///
768+
/// After preimage rewrite:
769+
/// ```sql
770+
/// k >= '2024-01-01' AND k < '2025-01-01' AND k >= '2024-06-01'
771+
/// ```
719772
///
720-
/// This rewrite is described in the [ClickHouse Paper] and is particularly
721-
/// useful for simplifying expressions `date_part` or equivalent functions. The
722-
/// idea is that if you have an expression like `date_part(YEAR, k) = 2024` and you
723-
/// can find a [preimage] for `date_part(YEAR, k)`, which is the range of dates
724-
/// covering the entire year of 2024. Thus, you can rewrite the expression to `k
725-
/// >= '2024-01-01' AND k < '2025-01-01' which is often more optimizable.
773+
/// Since this form is much simpler, the optimizer can combine and simplify
774+
/// sub-expressions further into:
775+
/// ```sql
776+
/// k >= '2024-06-01' AND k < '2025-01-01'
777+
/// ```
778+
///
779+
/// Case 2:
726780
///
781+
/// For min/max pruning, simpler predicates such as:
782+
/// ```sql
783+
/// k >= '2024-01-01' AND k < '2025-01-01'
784+
/// ```
785+
/// are much easier for the pruner to reason about. See [PruningPredicate]
786+
/// for the backgrounds of predicate pruning.
787+
///
788+
/// The trade-off with the preimage rewrite is that evaluating the rewritten
789+
/// form might be slightly more expensive than evaluating the original
790+
/// expression. In practice, this cost is usually outweighed by the more
791+
/// aggressive optimization opportunities it enables.
792+
///
793+
/// # Half-Open Intervals
794+
///
795+
/// The preimage API uses half-open intervals, which makes the rewrite
796+
/// easier to implement by avoiding calculations to adjust the upper bound.
797+
/// For example, if a function returns its input unchanged and the desired
798+
/// output is the single value `5`, a closed interval could be represented
799+
/// as `[5, 5]`, but then the rewrite would require adjusting the upper
800+
/// bound to `6` to create a proper range predicate. With a half-open
801+
/// interval, the same range is represented as `[5, 6)`, which already
802+
/// forms a valid predicate.
803+
///
804+
/// [PruningPredicate]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html
727805
/// [ClickHouse Paper]: https://www.vldb.org/pvldb/vol17/p3731-schulze.pdf
806+
/// [image]: https://en.wikipedia.org/wiki/Image_(mathematics)#Image_of_an_element
728807
/// [preimage]: https://en.wikipedia.org/wiki/Image_(mathematics)#Inverse_image
729808
fn preimage(
730809
&self,

datafusion/optimizer/src/simplify_expressions/udf_preimage.rs

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -26,7 +26,6 @@ use datafusion_expr_common::interval_arithmetic::Interval;
2626
/// range for which it is valid) and `x` is not `NULL`
2727
///
2828
/// For details see [`datafusion_expr::ScalarUDFImpl::preimage`]
29-
///
3029
pub(super) fn rewrite_with_preimage(
3130
preimage_interval: Interval,
3231
op: Operator,

datafusion/physical-optimizer/src/projection_pushdown.rs

Lines changed: 2 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,7 @@ use datafusion_common::tree_node::{
3232
};
3333
use datafusion_common::{JoinSide, JoinType, Result};
3434
use datafusion_physical_expr::expressions::Column;
35-
use datafusion_physical_expr_common::physical_expr::PhysicalExpr;
35+
use datafusion_physical_expr_common::physical_expr::{PhysicalExpr, is_volatile};
3636
use datafusion_physical_plan::ExecutionPlan;
3737
use datafusion_physical_plan::joins::NestedLoopJoinExec;
3838
use datafusion_physical_plan::joins::utils::{ColumnIndex, JoinFilter};
@@ -349,8 +349,7 @@ impl<'a> JoinFilterRewriter<'a> {
349349
// Recurse if there is a dependency to both sides or if the entire expression is volatile.
350350
let depends_on_other_side =
351351
self.depends_on_join_side(&expr, self.join_side.negate())?;
352-
let is_volatile = is_volatile_expression_tree(expr.as_ref());
353-
if depends_on_other_side || is_volatile {
352+
if depends_on_other_side || is_volatile(&expr) {
354353
return expr.map_children(|expr| self.rewrite(expr));
355354
}
356355

@@ -431,18 +430,6 @@ impl<'a> JoinFilterRewriter<'a> {
431430
}
432431
}
433432

434-
fn is_volatile_expression_tree(expr: &dyn PhysicalExpr) -> bool {
435-
if expr.is_volatile_node() {
436-
return true;
437-
}
438-
439-
expr.children()
440-
.iter()
441-
.map(|expr| is_volatile_expression_tree(expr.as_ref()))
442-
.reduce(|lhs, rhs| lhs || rhs)
443-
.unwrap_or(false)
444-
}
445-
446433
#[cfg(test)]
447434
mod test {
448435
use super::*;

0 commit comments

Comments
 (0)