A database kernel engineer's take on Apache Iceberg

3 points | by moonikakiss 21 hours ago

3 comments

kwillets 20 hours ago
I see Iceberg as a nice complement to a real data warehouse, as a way to manage raw data files and the ETLs that load them. Getting file updates as transactions a few times a day seems a lot cleaner.
But I constantly underestimate the tendency to take a minor capability and exaggerate it into a replacement for 50 years of heavily studied technology.
Even if people somehow get this to work like a real database, they will quickly become frustrated with the same features that the data lake avoids: access controls, locks, consistency, centralization, a schema, you name it.
- hodgesrm 14 hours ago
  > Even if people somehow get this to work like a real database, they will quickly become frustrated with the same features that the data lake avoids: access controls, locks, consistency, centralization, a schema, you name it.
  Are you saying that Apache Iceberg does not have these features? Other than access controls, most of them are there. What Iceberg most certainly does not have is efficient updates: you have to load in large batches to avoid metadata file explosion. That's actually fine for cold storage, without which analytic databases become uneconomical at scale. And Iceberg does have features like time travel, which databases like ClickHouse lack. It's also shareable; it's not just an appendage of one particular data warehouse.
  One real problem with Iceberg is that a lot of people (vendors mostly) are jumping on it as a way to market their products.
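To put rough numbers on the batch-size point above, here is a back-of-the-envelope sketch. It assumes each commit writes one table-metadata file, one manifest list, and one manifest; that is a simplification, and real tables vary with writer settings and maintenance jobs. The function name and the constant are made up for illustration and are not part of any Iceberg API.

```python
import math

# Assumption (simplified): every commit writes one table-metadata file,
# one manifest list, and one manifest. Real tables can write more or fewer.
METADATA_FILES_PER_COMMIT = 3


def metadata_files_written(total_rows: int, rows_per_commit: int) -> int:
    """Estimate metadata files written when loading total_rows in fixed-size commits."""
    commits = math.ceil(total_rows / rows_per_commit)
    return commits * METADATA_FILES_PER_COMMIT


if __name__ == "__main__":
    total = 10_000_000
    for batch in (100, 10_000, 1_000_000):
        # Smaller batches mean more commits, hence more metadata files.
        print(f"{batch:>9} rows/commit -> {metadata_files_written(total, batch):>7} metadata files")
```

Under these assumptions, loading the same 10 million rows produces 300,000 metadata files at 100 rows per commit but only 30 at one million rows per commit, before any snapshot expiration or compaction.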
- zhousun 18 hours ago
  Author here. I would actually second you. My core belief is "it is possible to build true database-like functionalities on top of Iceberg", but it is definitely not 'easier' than building them directly in a DB (in fact, doing them while staying Iceberg-compatible is tricky; yep, that's the cost of being open and general purpose).