Skip to content

HIVE-29461: Iceberg: HIVE_METASTORE_WAREHOUSE_EXTERNAL is ignored when initializing HiveCatalog#6454

Open
1fanwang wants to merge 4 commits intoapache:masterfrom
1fanwang:HIVE-29461-iceberg-external-warehouse
Open

HIVE-29461: Iceberg: HIVE_METASTORE_WAREHOUSE_EXTERNAL is ignored when initializing HiveCatalog#6454
1fanwang wants to merge 4 commits intoapache:masterfrom
1fanwang:HIVE-29461-iceberg-external-warehouse

Conversation

@1fanwang
Copy link
Copy Markdown

What changes were proposed in this pull request?

Add HiveCatalog.EXTERNAL_WAREHOUSE_LOCATION ("external-warehouse") as a canonical catalog property and propagate it from HiveCatalog.initialize(name, properties) to hive.metastore.warehouse.external.dir, mirroring the existing handling for CatalogProperties.WAREHOUSE_LOCATION (including LocationUtil.stripTrailingSlash).

The two in-tree callers are unified onto the new constant:

  • HMSCatalogFactory#createCatalog drops its redundant configuration.set(...) workaround that was needed because the property was not being read from the properties map.
  • IcebergSummaryHandler#initialize switches from the unhyphenated "externalwarehouse" key (which was effectively dead code — HiveCatalog.initialize never read it) to the canonical name.

Why are the changes needed?

HiveCatalog.initialize(name, properties) propagated CatalogProperties.URI and CatalogProperties.WAREHOUSE_LOCATION to the Hadoop configuration but silently dropped any external-warehouse value, so getExternalWarehouseLocation() and convertToDatabase() failed with:

Warehouse location is not set: hive.metastore.warehouse.external.dir=null

whenever a caller reached HiveCatalog through the standard Iceberg Catalog.initialize(name, properties) API without separately mutating the Configuration first.

The two existing callers worked around the gap differently and used different property keys:

  • HMSCatalogFactory.java (line 86-91 before this change) put "external-warehouse" in the properties map AND re-set the value on the Configuration before initialize(), with a comment that explicitly acknowledged the API gap: // HiveCatalog reads this property directly from Configuration, not from properties map.
  • IcebergSummaryHandler.java (line 67 before this change) put "externalwarehouse" (no hyphen) in the properties map but did not perform the Configuration.set workaround — it only happened to work because the prior setConf(configuration) call carried the value through when MetastoreConf.WAREHOUSE_EXTERNAL was already set on the underlying Hadoop conf.

External integrations using the standard Iceberg Catalog.initialize(name, properties) API (e.g. Polaris) had no documented way to pass the external warehouse location and tripped the NPE. After this change the property is documented on HiveCatalog, propagated via initialize(), and the two key spellings are unified.

Does this PR introduce any user-facing change?

Yes — adds a new public catalog property external-warehouse (HiveCatalog.EXTERNAL_WAREHOUSE_LOCATION) that callers can pass into HiveCatalog.initialize(name, properties) to set the external warehouse location. Existing flows that set hive.metastore.warehouse.external.dir directly on the Configuration continue to work unchanged.

How was this patch tested?

Added testInitializeCatalogWithExternalWarehouseProperty in TestHiveCatalog, mirroring the existing testInitializeCatalogWithProperties. The test verifies that both WAREHOUSE_LOCATION and the new EXTERNAL_WAREHOUSE_LOCATION propagate to their respective Hadoop conf keys with trailing-slash stripping.

Local Maven test runs were blocked on unrelated lock contention from concurrent builds in my environment (maven-remote-resources-plugin: Could not acquire lock(s)), so I'm relying on CI for full verification. Happy to address any failures.

…n initializing HiveCatalog

HiveCatalog.initialize() propagated CatalogProperties.WAREHOUSE_LOCATION to the
Hadoop configuration but ignored the equivalent external-warehouse property, so
getExternalWarehouseLocation() and convertToDatabase() failed with an NPE
("Warehouse location is not set: hive.metastore.warehouse.external.dir=null")
whenever a caller reached HiveCatalog through the standard Iceberg
Catalog.initialize(name, properties) API without separately mutating the
Configuration. Callers worked around this either by re-setting the value on the
Configuration before initialize() (HMSCatalogFactory) or by relying on a
side-channel setConf() carrying the value through (IcebergSummaryHandler), and
in both cases used a different, undocumented property key
("external-warehouse" vs "externalwarehouse").

Add HiveCatalog.EXTERNAL_WAREHOUSE_LOCATION ("external-warehouse") as the
canonical catalog property and propagate it from initialize() to
hive.metastore.warehouse.external.dir, mirroring the existing handling for
WAREHOUSE_LOCATION (including LocationUtil.stripTrailingSlash). Update the two
in-tree callers to use the constant and drop HMSCatalogFactory's redundant
configuration.set() workaround. The unhyphenated key in IcebergSummaryHandler
was effectively dead code (HiveCatalog never read it) and is replaced rather
than retained.

Adds testInitializeCatalogWithExternalWarehouseProperty mirroring
testInitializeCatalogWithProperties to lock in the propagation behavior.
@1fanwang 1fanwang force-pushed the HIVE-29461-iceberg-external-warehouse branch from 78e0f09 to 9846386 Compare April 27, 2026 09:01
…actory

deniskuzZ pointed out (PR review on HMSCatalogFactory.java) that
HMSCatalogFactory.createCatalog already calls hiveCatalog.setConf(configuration)
right before initialize(), and HiveCatalog reads WAREHOUSE_EXTERNAL straight
from Configuration via getExternalWarehouseLocation(). The properties.put
for external-warehouse was therefore redundant on this path even after the
HiveCatalog#initialize fix; setConf already supplies the value.

Removes the redundant block here. The HiveCatalog#initialize change still
fixes the IcebergSummaryHandler path, which initializes by properties only
without a setConf call.
@1fanwang
Copy link
Copy Markdown
Author

@deniskuzZ Right — pushed c727def to drop the whole if (configExtWarehouse != null) block. setConf(configuration) here already gives HiveCatalog the value via getExternalWarehouseLocation() reading from Configuration. The HiveCatalog#initialize change still covers the IcebergSummaryHandler path, which initializes by properties only without a setConf call.

…s.WAREHOUSE_LOCATION

deniskuzZ flagged on PR review that the external-warehouse properties.put
shouldn't have been removed in the prior commit (c727def) -- the new
HiveCatalog.EXTERNAL_WAREHOUSE_LOCATION constant is the contract HiveCatalog
exposes for callers initializing by properties (IcebergSummaryHandler), and
HMSCatalogFactory should keep using it for consistency with the other
catalog properties set in the same block, even though setConf() on this
specific path also supplies the value.

Restores the if-block. Also applies their suggestion at the existing
warehouse line: switch the literal "warehouse" to CatalogProperties.WAREHOUSE_LOCATION.
@1fanwang
Copy link
Copy Markdown
Author

Sorry @deniskuzZ — misread your earlier "do we even need this?" as "remove it". Restored the external-warehouse properties.put in ce92efc, and applied your CatalogProperties.WAREHOUSE_LOCATION suggestion to the warehouse line in the same commit.

@deniskuzZ
Copy link
Copy Markdown
Member

Sorry @deniskuzZ — misread your earlier "do we even need this?" as "remove it". Restored the external-warehouse properties.put in ce92efc, and applied your CatalogProperties.WAREHOUSE_LOCATION suggestion to the warehouse line in the same commit.

@1fanwang i meant do we need to call setConf at all, since we pass everything thought the properties?

@1fanwang
Copy link
Copy Markdown
Author

Ahh sorry, second misread. I think setConf is still needed here because HMSCatalogFactory's configuration is the fully-populated HiveConf — HMS thrift auth, SSL/timeout, kerberos, the SERVLET_ID_KEY, etc. Without setConf, HiveCatalog falls back to its default Configuration and loses all of that. The new properties-only path makes sense for callers that initialize purely from a property map (IcebergSummaryHandler), but on this path setConf is providing more than the URI/warehouse/external-warehouse triple. Open to dropping it if there's a flow I'm missing — does HiveCatalog reconstruct the HMS-specific config from properties alone somewhere I haven't traced?

@1fanwang
Copy link
Copy Markdown
Author

The one wrinkle on dropping setConf is CLIENT_POOL_CACHE_KEYS = "conf:SERVLET_ID_KEY" a few lines up — that tells HiveCatalog to read SERVLET_ID_KEY from Configuration when computing the per-test cache key, and without setConf HiveCatalog falls back to its default Configuration that doesn't have SERVLET_ID_KEY (so cache-key uniqueness breaks for the embedded tests). Leaning toward a separate follow-up JIRA for the setConf cleanup so this PR stays focused on HIVE-29461 — let me know if you'd rather roll it in.

@deniskuzZ
Copy link
Copy Markdown
Member

The one wrinkle on dropping setConf is CLIENT_POOL_CACHE_KEYS = "conf:SERVLET_ID_KEY" a few lines up — that tells HiveCatalog to read SERVLET_ID_KEY from Configuration when computing the per-test cache key, and without setConf HiveCatalog falls back to its default Configuration that doesn't have SERVLET_ID_KEY (so cache-key uniqueness breaks for the embedded tests). Leaning toward a separate follow-up JIRA for the setConf cleanup so this PR stays focused on HIVE-29461 — let me know if you'd rather roll it in.

Should we introduce single constructor for that. Something similar to HadoopCatalog:
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java#L137-L140

@1fanwang
Copy link
Copy Markdown
Author

Yeah, a HiveCatalog(Configuration, Map<String, String>) constructor mirroring HadoopCatalog L137-140 is the right shape. That's an iceberg-side change though, since HiveCatalog lives in iceberg-core. I can file the iceberg issue/PR for the constructor and a Hive follow-up JIRA for adopting it (rolling in the setConf cleanup at the same time), unless you're already tracking the iceberg side somewhere.

@1fanwang
Copy link
Copy Markdown
Author

https://ci.hive.apache.org/job/hive-precommit/job/PR-6454/2/testReport/junit/org.apache.iceberg.metasummary/TestIcebergSummaryHandler/Testing___split_06___PostProcess___testInitialize/

Thanks @deniskuzZ I was trying to check the exact failure however not able to view, does this require some ACL?

 Oops!

Not Found

This page may not exist, or you may not have permission to see it.

Not Found [Jenkins].pdf

@1fanwang
Copy link
Copy Markdown
Author

Filed both:

  • iceberg #16134 — issue proposing HiveCatalog(Configuration, Map<String, String>) mirroring HadoopCatalog L137-140.
  • iceberg #16135 — implementation + test.

Hive follow-up JIRA for the HMSCatalogFactory adoption + the setConf cleanup will go up once the iceberg side lands. Keeping this PR focused on HIVE-29461.

…iterals

Address @deniskuzZ review on apache#6454: replace the literal 'uri' and
'warehouse' strings with CatalogProperties.URI and
CatalogProperties.WAREHOUSE_LOCATION so the Iceberg-side property names
stay consistent if they ever change upstream. CatalogProperties is
already imported.
@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants