This article wasn’t properly parsed by Omnivore, but here are the big takeaways:
- We can add plugins to our Spark server for extended functionality
- Using Spark Connect we can implement a client library in any language we want and send gRPC requests to the server (the Spark Connect server needs to be running; see the sketch after this list)
- Spark Connect works on 3.5+. It should be much better on v4.0
- If I truly want to be good at Spark, I’ll eventually need to relearn Scala/Java
- Glue is cool, but it’s still on Spark 3.3. All these goodies will take too long to land in Glue
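A minimal client-side sketch, assuming a Spark Connect server is already running on the default port (15002) and the PySpark 3.4+ client is installed; the host and port are placeholders for your setup:

```python
from pyspark.sql import SparkSession

# Connect to a running Spark Connect server over gRPC.
# "sc://localhost:15002" is the default endpoint; adjust to your server.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# DataFrame operations are encoded as unresolved plans and shipped to the
# server as gRPC requests; only the results come back to this client.
df = spark.range(10).filter("id % 2 = 0")
df.show()
```

Because the client is just a thin gRPC layer, the same protocol can be implemented from other languages too, which is what makes the "any language" point possible.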
After reading the statistics books I can see much more clearly the value of null-hypothesis testing; this is the feeling I am getting from academia: we are seeing more and more research without any added value. Goodhart’s law.
Sounds just like an Airflow contender, with the plus of being able to run notebooks 🤔
Good theme for a blog post on the changes in Spark 4; this is really useful for catching human errors (been there multiple times)
So this will be generated on the fly as views by the semantic layer? This looked neat until the moment I understood that the semantic layer requires dbt Cloud
Meta seems to be a couple of years ahead of the industry. The article doesn’t provide a lot of insight, but it gives the feeling that their model evaluation is mostly automated and that they have a good AI debugging tool
First step on a long road before we can run Python without the GIL. Interested to see whether libraries like pandas will eventually be able to leverage multithreading with this
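A minimal sketch of what’s at stake, using only the standard library: on a regular CPython build the GIL serializes this CPU-bound work, so four threads take roughly as long as doing the work four times sequentially; on a free-threaded build (PEP 703) the same code could run on four cores in parallel.

```python
import threading
import time

def cpu_bound(n: int) -> int:
    # Pure-Python CPU work; the GIL prevents threads from executing
    # this bytecode in parallel on a standard CPython build.
    total = 0
    for i in range(n):
        total += i * i
    return total

start = time.perf_counter()
threads = [threading.Thread(target=cpu_bound, args=(5_000_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"elapsed: {time.perf_counter() - start:.2f}s")
```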
Mixed feelings here. Great to see OpenLineage implemented at AWS. However, it again feels like AWS just created the integration and won’t be driving OpenLineage’s development
- What could be improved to help this kind of migration be done in a matter of days?
- Livy might be deprecated in favor of Spark Connect. With their migration to Spark 3 and eventually 3.5 (not clear from this article) they could be interested in moving new jobs to Connect
- They basically solved issues by falling back to the old behaviours. These will need to be migrated eventually; I would need to understand these features better
- This looks like an important detail: with no explicit ordering, can Spark return rows in a random order? (see the sketch after this list)
- Cool to see these migrations and teams using open-source solutions. EMR, although expensive, can prove quite cost-effective with a good engineering team
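A quick sketch of the ordering point above (the data and partition count are made up): without an explicit sort, which rows an action returns, and in what order, depends on partitioning and task scheduling, so it can vary between runs.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Spread the data across partitions to make the effect visible.
df = spark.range(1_000_000).repartition(8)

# Non-deterministic: no ORDER BY, so the result depends on scheduling.
print(df.limit(5).collect())

# Deterministic: sort explicitly before taking rows.
print(df.orderBy(F.col("id")).limit(5).collect())
```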
The need to define a data platform is something I see everywhere. It really looks like we are missing a piece here. Netflix’s Maestro, for example, seems like a good contender to solve this issue (yet another custom data platform)
This article raises a question for me: can we improve dbt by using WAP (Write-Audit-Publish)? How does the rollback mechanism work when a process fails?
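To make my own question concrete, here is a minimal sketch of the generic Write-Audit-Publish pattern (not dbt’s actual mechanism; the table names, input path, and audit check are all hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

STAGING = "analytics.orders_staging"  # hypothetical staging table
TARGET = "analytics.orders"           # hypothetical production table

# 1. Write: land new data in staging, never in the target directly.
new_data = spark.read.parquet("/landing/orders/latest")  # hypothetical path
new_data.write.mode("overwrite").saveAsTable(STAGING)

# 2. Audit: run quality checks against staging. A failure stops here,
#    and that is effectively the rollback: the target was never touched.
staged = spark.table(STAGING)
if staged.filter("order_id IS NULL").count() > 0:
    raise ValueError("audit failed: null order_id in staging")

# 3. Publish: only after the audit passes, replace the target.
#    (Iceberg branch fast-forwards or a view swap make this step atomic.)
staged.write.mode("overwrite").saveAsTable(TARGET)
```

Under this framing the rollback question mostly answers itself: failures before publish leave production untouched, and only the publish step needs to be atomic.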
Super interesting to see how we can enable data quality visibility
A good case for using open data lakes, showing the big cost and speed improvements
Why is this so hard? 😭