Lakehouse 2.0: The Open System That Lakehouse 1.0 Was Meant to Be | Part 1 by Modern Data 101
- The Contradiction of Vision & Reality
- Start with high human involvement: In the early stages, have domain experts evaluate a significant percentage of outputs.
- Study alignment patterns: Rather than automating evaluation, focus on understanding where automated evaluations align with human judgment and where they diverge. This helps you identify which types of cases need more careful human attention.
- Use strategic sampling: Rather than evaluating every output, use statistical techniques to sample outputs that provide the most information, particularly focusing on areas where alignment is weakest.
- Maintain regular calibration: Even as you scale, continue to compare automated evaluations against human judgment regularly, using these comparisons to refine your understanding of when to trust automated evaluations.
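A minimal sketch of what the calibration and strategic-sampling ideas above could look like; the record structure, categories, and review budget below are hypothetical, not from the article.

```python
import random
from collections import defaultdict

# Hypothetical records: every output was scored by the AI judge, and this
# calibration slice also has a human label to compare against.
records = [
    {"id": i,
     "category": random.choice(["billing", "refunds", "shipping"]),
     "judge_pass": random.random() > 0.3,
     "human_pass": random.random() > 0.3}
    for i in range(500)
]

# 1. Measure where the AI judge agrees with humans, per category.
agreement = defaultdict(lambda: [0, 0])  # category -> [agreements, total]
for r in records:
    agreement[r["category"]][1] += 1
    if r["judge_pass"] == r["human_pass"]:
        agreement[r["category"]][0] += 1

rates = {cat: agree / total for cat, (agree, total) in agreement.items()}
print("Judge/human agreement by category:", rates)

# 2. Strategic sampling: send more outputs from low-agreement categories to
#    human review (sampling with replacement, for simplicity).
def sample_for_review(new_outputs, rates, budget=50):
    weights = [1.0 - rates.get(o["category"], 0.5) + 0.05 for o in new_outputs]
    return random.choices(new_outputs, weights=weights, k=budget)

new_outputs = [{"id": i, "category": random.choice(["billing", "refunds", "shipping"])}
               for i in range(1000)]
review_queue = sample_for_review(new_outputs, rates)
print(f"Queued {len(review_queue)} outputs for human review")
```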
Your manager isn’t a mind reader.
You can’t expect guidance if you don’t come with a direction.
Your growth is a shared effort, but it starts with you.
Rubber duck debugging like a pro: I can often solve my problem by drafting a forum post without posting it. The effort required to articulate the salient details of the system and the problem, without looking dumb, is higher than the effort I have usually put in at the point I decide I need help. Corollary: making a forum post without sounding like I haven’t done my homework also tends to put me over my time/energy budget for solving a seemingly-trivial problem.
Behold the trail of crumbs: I find that writing and diagramming, while helpful for many troubleshooting projects, are essential for multi-session troubleshooting projects. I overestimate how much I will remember about the context, as well as how soon I will get around to continuing the project. A troubleshooting notes file, no matter how obvious or incomplete the information in it seems at the time I write it, leaves a trail of crumbs that I can follow next time. (I have often repeated, verbatim, an entire troubleshooting process, found the problem — and then remembered I troubleshot the exact system, and arrived at the same conclusion, years ago; but there was some hiccup, and I failed to order or install the new part.)
- Hardware support for low precision data types
- Design for asynchronous transfers from day 1
- Dedicated hardware for tensor-aware memory transfers
- Replace your cache hierarchy with an outsized scratchpad for AI inference
- For a single accelerator, turn the memory bandwidth up to 11
- Design for scale-out from day 1
- Dedicated communication hardware should complement compute hardware
- Hash partitioning (based on column values)
- Even partitioning (by files or row counts)
- Random shuffle partitioning
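A hedged PySpark sketch of how the three strategies roughly map onto repartitioning; the column name and partition counts are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumn("country", F.col("id") % 10)

# Hash partitioning: rows with the same key land in the same partition.
by_key = df.repartition(16, "country")

# Even partitioning: round-robin by target partition count, ignoring values,
# which gives roughly equal row counts per partition.
even = df.repartition(16)

# Random shuffle partitioning: partition on a random expression, spreading
# rows without any relationship to their content.
shuffled = df.repartition(16, F.rand())

print(by_key.rdd.getNumPartitions(),
      even.rdd.getNumPartitions(),
      shuffled.rdd.getNumPartitions())
```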
Examples of this pitfall:
- Use an agentic framework when direct API calls work.
- Agonize over what vector database to use when a simple term-based retrieval solution (that doesn't require a vectordb) works.
- Insist on finetuning when prompting works.
- Use semantic caching.
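On the "simple term-based retrieval" example, a minimal sketch using the rank_bm25 package as one possible no-vector-DB baseline; the corpus and query are invented:

```python
# pip install rank_bm25
from rank_bm25 import BM25Okapi

corpus = [
    "How do I reset my password?",
    "Refund policy for damaged items",
    "Shipping times for international orders",
    "Updating billing information on your account",
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "change my billing details".lower().split()
# Top documents by BM25 score; no embeddings or vector DB involved.
print(bm25.get_top_n(query, corpus, n=2))
```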
To automatically evaluate AI applications, many teams opt for the AI-as-a-judge (also called LLM-as-a-judge) approach — using AI models to evaluate AI outputs. A common pitfall is forgoing human evaluation to rely entirely on AI judges.
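For reference, a minimal sketch of an AI-as-a-judge call using the OpenAI Python client; the prompt, criteria, and model name are placeholders, and (per the pitfall above) a sampled share of these verdicts should still be reviewed by humans.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Reply with only PASS or FAIL based on factual correctness and relevance."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> bool:
    # Ask the judge model for a PASS/FAIL verdict on a single output.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")

# A sampled share of these verdicts should still be checked by humans.
print(judge("What is the capital of France?", "Paris"))
```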
Got my reading list down from 100+ articles to 58! 🎉
Wow, need to test this out on my side projects.
Skimmed the article, but although it's high level I find the main points very true: understanding the system, staying on top of the state of the art. That, and the tips for heads of data.
Great episode! Nice to see that CSS is getting better and better.
Wow, I'd love this one-man SaaS, but knowing how it can burn you out…
Nice episode on the long-sought Node features that have been introduced in Bun and Deno (single file, TypeScript support, and top-level await are the big ones for me).
Great to see a new version coming along! Is pdb worth using with VS Code? 🤔
As a Google device owner, it would be good to know that my working hardware won't get bricked just because I don't want to upgrade.
Although I already had my eye on Rye, I certainly didn't know it was so full-featured. Gotta try switching some of my projects to it.
Python support slowly getting good
The report raises a good point about an increase in workload: with a big productivity boost, bosses might start assigning even bigger workloads than what we gained from the boost.
Of course this isn't good, as people will feel overloaded.
Super interesting IMO. Might give it a try on my team when we start deploying solutions to our clients
Would be interesting to test this out: have an example dataset that can be queried using DuckDB, and given a question, check whether a query is correct, how to fix it, and how to improve its performance. One in SQL and another in PySpark (or Ibis/pandas).
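A minimal DuckDB setup for that kind of experiment; the table and candidate query below are made up:

```python
import duckdb

con = duckdb.connect()  # in-memory database

con.execute("""
    CREATE TABLE orders AS
    SELECT * FROM (VALUES
        (1, 'PT', 120.0),
        (2, 'ES', 80.5),
        (3, 'PT', 42.0)
    ) AS t(order_id, country, amount)
""")

# Candidate query produced from a natural-language question; the experiment
# would check whether it is correct and whether it can be rewritten to run faster.
candidate_sql = """
    SELECT country, SUM(amount) AS total
    FROM orders
    GROUP BY country
    ORDER BY total DESC
"""
print(con.sql(candidate_sql).fetchall())
```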
The English SDK looks awesome but requires an OpenAI key. Could it be replaced with Ollama?
Interesting topics to research a bit more: alternatives to CAP.
In summary, to have a good AI product we need quality data, which requires good data governance. With this data we need to define useful products whose value we can measure using data-driven metrics. We must also ensure the product follows good practices, avoiding security or bias issues.
Summarized by:
This is something I would love to implement: allowing you to define the metrics on which to evaluate a new feature and the expected hypothesis, and to automatically revert the feature (i.e., via feature flags) with a report on the experiment.
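A purely hypothetical sketch of that loop; the flag store, metric, and thresholds are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    flag_name: str
    metric_name: str
    # Hypothesis: the new feature should lift the metric at least this much vs control.
    expected_lift: float

def evaluate_and_maybe_rollback(exp: Experiment, control: float, treatment: float,
                                flag_store: dict) -> str:
    observed_lift = (treatment - control) / control if control else 0.0
    passed = observed_lift >= exp.expected_lift
    if not passed:
        flag_store[exp.flag_name] = False  # automatic revert of the feature flag
    return (f"Experiment on '{exp.flag_name}': {exp.metric_name} lift "
            f"{observed_lift:+.1%} vs expected {exp.expected_lift:+.1%} -> "
            f"{'kept' if passed else 'rolled back'}")

flags = {"new_checkout": True}
exp = Experiment("new_checkout", "conversion_rate", expected_lift=0.02)
print(evaluate_and_maybe_rollback(exp, control=0.101, treatment=0.104, flag_store=flags))
print(flags)
```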
Could I replace Athena with this? I think the main blocker for me is that I want to work with S3. And I need to check how it runs on a really large dataset…
Good tip: for any project, define code coverage goals as a metric and keep increasing coverage over time.
This article wasn't properly parsed by Omnivore, but the big takeaways:
After reading the statistics books I can see much more clearly the value of proving a null hypothesis; this is the feeling I'm getting from academia. We are seeing more research without any added value. Goodhart's law.
Sounds just like an Airflow contender, with the plus of being able to run notebooks 🤔
Good topic for a blog post on the changes in Spark 4; this is really useful for avoiding human errors (been there multiple times).
So this will be generated on the fly as views by the semantic layer? This looked neat until the moment I understood that the semantic layer requires dbt Cloud.
Meta seems to be a couple of years ahead of the industry. The article doesn't provide a lot of insight, but it gives the feeling that their model evaluation is mostly automated and that they have a good AI debugging tool.
First step on a long road before we can run Python without the GIL. Interested in seeing whether libraries like pandas will eventually be able to leverage multithreading with this.
Mixed feelings here. Great to see OpenLineage implemented at AWS. However, it again feels like AWS just created the integration and won't be driving the development of OpenLineage.
- What could be improved to help this kind of migration be done in a matter of days?
- Livy might be deprecated in favor of Spark Connect. With their migration to Spark 3 and eventually 3.5 (not clear in this article), they could be interested in moving new jobs to Connect.
- They basically solved issues by using the old behaviours. These will need to be migrated eventually; I would need to better understand these features.
- This looks like an important detail. With no explicit order, Spark can return rows in a random order? (See the sketch below.)
- Cool to see these migrations and teams using open source solutions. EMR, although expensive, can prove quite cost effective with a good engineering team.
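On the row-ordering question, a small PySpark sketch: without an explicit sort, the order of returned rows is not guaranteed (the column names here are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 7)

# After a shuffle (groupBy), output order is whatever the partitions happen to
# produce; it can change between runs or cluster layouts.
unordered = df.groupBy("bucket").count()
unordered.show()

# Only an explicit sort pins the output order.
ordered = unordered.orderBy("bucket")
ordered.show()
```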
The need to define a data platform is something I see everywhere; it really looks like we are missing a piece here. Netflix Maestro, for example, seems like a good contender to solve the issue of (yet another) custom data platform.
This article raises a question for me: can we improve dbt by using WAP (write-audit-publish)? How does the rollback mechanism work when a process fails?
Super interesting to see how we can enable data quality visibility
Good case for using open data lakes, showing the big cost and speed improvements.
Why is this so hard? 😭