HIVE-29461: Iceberg: HIVE_METASTORE_WAREHOUSE_EXTERNAL is ignored when initializing HiveCatalog #6454
…n initializing HiveCatalog
HiveCatalog.initialize() propagated CatalogProperties.WAREHOUSE_LOCATION to the
Hadoop configuration but ignored the equivalent external-warehouse property, so
getExternalWarehouseLocation() and convertToDatabase() failed with an NPE
("Warehouse location is not set: hive.metastore.warehouse.external.dir=null")
whenever a caller reached HiveCatalog through the standard Iceberg
Catalog.initialize(name, properties) API without separately mutating the
Configuration. Callers worked around this either by re-setting the value on the
Configuration before initialize() (HMSCatalogFactory) or by relying on a
side-channel setConf() carrying the value through (IcebergSummaryHandler), and
in both cases used a different, undocumented property key
("external-warehouse" vs "externalwarehouse").
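The failure mode described above can be reproduced in miniature. This is only a sketch under stated assumptions: the class and method names are hypothetical, a plain `Map` stands in for the Hadoop `Configuration`, and the pre-fix `initialize()` is reduced to the one behavior that matters here (copying `warehouse` but nothing for the external warehouse). It is not the actual HiveCatalog code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for the pre-fix behavior; not the real HiveCatalog.
public class ExternalWarehouseGapSketch {
  static final String CONF_WAREHOUSE = "hive.metastore.warehouse.dir";
  static final String CONF_WAREHOUSE_EXTERNAL = "hive.metastore.warehouse.external.dir";

  // Models the old initialize(): only the "warehouse" property reaches the conf.
  // Any external-warehouse key, hyphenated or not, is never read.
  static Map<String, String> initializeBeforeFix(Map<String, String> properties) {
    Map<String, String> conf = new HashMap<>();
    String warehouse = properties.get("warehouse");
    if (warehouse != null) {
      conf.put(CONF_WAREHOUSE, warehouse);
    }
    return conf;
  }

  // Models the lookup that fails: a null check that throws an NPE carrying
  // the message quoted in the commit description.
  static String externalWarehouseLocation(Map<String, String> conf) {
    String location = conf.get(CONF_WAREHOUSE_EXTERNAL);
    return Objects.requireNonNull(
        location,
        "Warehouse location is not set: " + CONF_WAREHOUSE_EXTERNAL + "=" + location);
  }
}
```

With a properties map that contains only `warehouse` and `externalwarehouse`, `externalWarehouseLocation(initializeBeforeFix(props))` throws, which is the situation a properties-only caller ended up in.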
Add HiveCatalog.EXTERNAL_WAREHOUSE_LOCATION ("external-warehouse") as the
canonical catalog property and propagate it from initialize() to
hive.metastore.warehouse.external.dir, mirroring the existing handling for
WAREHOUSE_LOCATION (including LocationUtil.stripTrailingSlash). Update the two
in-tree callers to use the constant and drop HMSCatalogFactory's redundant
configuration.set() workaround. The unhyphenated key in IcebergSummaryHandler
was effectively dead code (HiveCatalog never read it) and is replaced rather
than retained.
Adds testInitializeCatalogWithExternalWarehouseProperty mirroring
testInitializeCatalogWithProperties to lock in the propagation behavior.
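For readers who want the shape of the change without opening the diff, the propagation can be sketched roughly as follows. Assumptions: the class name and the `propagate`/`copy` helpers are hypothetical, a plain `Map` stands in for the Hadoop `Configuration`, and `stripTrailingSlash` here is a simplified stand-in for `LocationUtil.stripTrailingSlash` that ignores any special cases the real utility handles. Only the two property keys and the two conf keys come from this change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, self-contained sketch of the post-fix propagation.
public class ExternalWarehousePropagationSketch {
  static final String WAREHOUSE_LOCATION = "warehouse";
  static final String EXTERNAL_WAREHOUSE_LOCATION = "external-warehouse";
  static final String CONF_WAREHOUSE = "hive.metastore.warehouse.dir";
  static final String CONF_WAREHOUSE_EXTERNAL = "hive.metastore.warehouse.external.dir";

  // Simplified stand-in: drop trailing '/' characters, keeping a bare "/" intact.
  static String stripTrailingSlash(String path) {
    String result = path;
    while (result.length() > 1 && result.endsWith("/")) {
      result = result.substring(0, result.length() - 1);
    }
    return result;
  }

  // Copies one catalog property to its conf key if the property is present.
  static void copy(Map<String, String> properties, String propertyKey,
                   Map<String, String> conf, String confKey) {
    String value = properties.get(propertyKey);
    if (value != null) {
      conf.put(confKey, stripTrailingSlash(value));
    }
  }

  // Both locations are treated the same way; absent properties leave the
  // conf untouched, so values already set on the conf continue to work.
  static void propagate(Map<String, String> properties, Map<String, String> conf) {
    copy(properties, WAREHOUSE_LOCATION, conf, CONF_WAREHOUSE);
    copy(properties, EXTERNAL_WAREHOUSE_LOCATION, conf, CONF_WAREHOUSE_EXTERNAL);
  }
}
```

Leaving the conf untouched when the property is absent is what keeps the existing setConf()-based flows working.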
Force-pushed from 78e0f09 to 9846386
…actory
deniskuzZ pointed out (PR review on HMSCatalogFactory.java) that HMSCatalogFactory.createCatalog already calls hiveCatalog.setConf(configuration) right before initialize(), and HiveCatalog reads WAREHOUSE_EXTERNAL straight from Configuration via getExternalWarehouseLocation(). The properties.put for external-warehouse was therefore redundant on this path even after the HiveCatalog#initialize fix; setConf already supplies the value. Removes the redundant block here. The HiveCatalog#initialize change still fixes the IcebergSummaryHandler path, which initializes by properties only without a setConf call.
|
@deniskuzZ Right — pushed c727def to drop the whole |
…s.WAREHOUSE_LOCATION
deniskuzZ flagged on PR review that the external-warehouse properties.put shouldn't have been removed in the prior commit (c727def): the new HiveCatalog.EXTERNAL_WAREHOUSE_LOCATION constant is the contract HiveCatalog exposes for callers initializing by properties (IcebergSummaryHandler), and HMSCatalogFactory should keep using it for consistency with the other catalog properties set in the same block, even though setConf() on this specific path also supplies the value. Restores the if-block. Also applies their suggestion at the existing warehouse line: switch the literal "warehouse" to CatalogProperties.WAREHOUSE_LOCATION.
|
Sorry @deniskuzZ — misread your earlier "do we even need this?" as "remove it". Restored the external-warehouse |
@1fanwang i meant do we need to call |
|
Ahh sorry, second misread. I think |
|
The one wrinkle on dropping |
Should we introduce a single constructor for that? Something similar to HadoopCatalog: |
|
Yeah, a |
|
Thanks @deniskuzZ. I was trying to check the exact failure but am not able to view it. Does this require some ACL?
|
Filed both:
Hive follow-up JIRA for the |
…iterals
Address @deniskuzZ review on apache#6454: replace the literal 'uri' and 'warehouse' strings with CatalogProperties.URI and CatalogProperties.WAREHOUSE_LOCATION so the Iceberg-side property names stay consistent if they ever change upstream. CatalogProperties is already imported.
|



What changes were proposed in this pull request?
Add HiveCatalog.EXTERNAL_WAREHOUSE_LOCATION ("external-warehouse") as a canonical catalog property and propagate it from HiveCatalog.initialize(name, properties) to hive.metastore.warehouse.external.dir, mirroring the existing handling for CatalogProperties.WAREHOUSE_LOCATION (including LocationUtil.stripTrailingSlash).

The two in-tree callers are unified onto the new constant:

HMSCatalogFactory#createCatalog drops its redundant configuration.set(...) workaround that was needed because the property was not being read from the properties map.

IcebergSummaryHandler#initialize switches from the unhyphenated "externalwarehouse" key (which was effectively dead code, since HiveCatalog.initialize never read it) to the canonical name.

Why are the changes needed?
HiveCatalog.initialize(name, properties) propagated CatalogProperties.URI and CatalogProperties.WAREHOUSE_LOCATION to the Hadoop configuration but silently dropped any external-warehouse value, so getExternalWarehouseLocation() and convertToDatabase() failed with:

Warehouse location is not set: hive.metastore.warehouse.external.dir=null

whenever a caller reached HiveCatalog through the standard Iceberg Catalog.initialize(name, properties) API without separately mutating the Configuration first.

The two existing callers worked around the gap differently and used different property keys:

HMSCatalogFactory.java (lines 86-91 before this change) put "external-warehouse" in the properties map AND re-set the value on the Configuration before initialize(), with a comment that explicitly acknowledged the API gap: // HiveCatalog reads this property directly from Configuration, not from properties map.

IcebergSummaryHandler.java (line 67 before this change) put "externalwarehouse" (no hyphen) in the properties map but did not perform the Configuration.set workaround; it only happened to work because the prior setConf(configuration) call carried the value through when MetastoreConf.WAREHOUSE_EXTERNAL was already set on the underlying Hadoop conf.

External integrations using the standard Iceberg Catalog.initialize(name, properties) API (e.g. Polaris) had no documented way to pass the external warehouse location and tripped the NPE. After this change the property is documented on HiveCatalog, propagated via initialize(), and the two key spellings are unified.

Does this PR introduce any user-facing change?
Yes. It adds a new public catalog property external-warehouse (HiveCatalog.EXTERNAL_WAREHOUSE_LOCATION) that callers can pass into HiveCatalog.initialize(name, properties) to set the external warehouse location. Existing flows that set hive.metastore.warehouse.external.dir directly on the Configuration continue to work unchanged.

How was this patch tested?
Added testInitializeCatalogWithExternalWarehouseProperty in TestHiveCatalog, mirroring the existing testInitializeCatalogWithProperties. The test verifies that both WAREHOUSE_LOCATION and the new EXTERNAL_WAREHOUSE_LOCATION propagate to their respective Hadoop conf keys with trailing-slash stripping.

Local Maven test runs were blocked on unrelated lock contention from concurrent builds in my environment (maven-remote-resources-plugin: Could not acquire lock(s)), so I'm relying on CI for full verification. Happy to address any failures.