I see Iceberg as a nice complement to a real data warehouse, as a way to manage raw data files and the ETLs that load them. Getting file updates as transactions a few times a day seems a lot cleaner.
But I constantly underestimate the tendency to take a minor capability and exaggerate it into a replacement for 50 years of heavily studied technology.
Even if people somehow get this to work like a real database, they will quickly become frustrated with the same features that the data lake avoids: access controls, locks, consistency, centralization, a schema, you name it.
> Even if people somehow get this to work like a real database, they will quickly become frustrated with the same features that the data lake avoids: access controls, locks, consistency, centralization, a schema, you name it.
Are you saying that Apache Iceberg does not have these features? Other than access controls, most of them are there. What Iceberg most certainly does not have is efficient updates: you have to load in large batches to avoid metadata file explosion. That's actually fine for cold storage, without which analytic databases become uneconomical at scale. And Iceberg does have features like time travel, which databases like ClickHouse lack. It's also shareable; it's not just an appendage of one particular data warehouse.
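To make the batching point concrete, here is a rough sketch in plain Python. It does not use any Iceberg API: `BatchingWriter` and `commits_for` are hypothetical names, and the `commits` counter only stands in for the new snapshot and metadata files that each table commit would produce. The assumption is simply that metadata growth scales with the number of commits, not with the number of rows.

```python
from dataclasses import dataclass, field


@dataclass
class BatchingWriter:
    """Hypothetical writer that buffers rows and commits them in batches."""
    batch_size: int
    buffer: list = field(default_factory=list)
    commits: int = 0  # stand-in for snapshots/metadata files created so far

    def append(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # A real writer would write a data file and commit it to the table here.
        if self.buffer:
            self.commits += 1
            self.buffer.clear()


def commits_for(rows, batch_size):
    """Count how many commits it takes to load `rows` rows at a given batch size."""
    writer = BatchingWriter(batch_size)
    for i in range(rows):
        writer.append(i)
    writer.flush()  # commit whatever is left in the buffer
    return writer.commits
```

Under these assumptions, loading 100,000 rows one at a time takes 100,000 commits, while loading them in batches of 50,000 takes 2. That gap is what the "load in large batches" advice is about.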
One real problem with Iceberg is that a lot of people (vendors mostly) are jumping on it as a way to market their products.
Author here. I would actually second you.
My core belief is "it is possible to build true database-like functionality on top of Iceberg", but it is definitely not 'easier' than building it directly in a db (in fact, doing so while staying Iceberg-compatible is tricky; yep, that's the cost of being open and general purpose).