Reading Update
Data Engineering
- An Introduction to Modern Data Lake Storage Layers - A good comparison, using Apache Spark, of how to create tables and run some operations on them with Hudi, Iceberg and Delta Lake
- Apache Spark Performance Boosting - A very good article on the main performance issues you can run into with Apache Spark. As a relative beginner, this might be a good go-to resource for me
- Plumbing with Airbyte - Airbyte is becoming a really good replacement for Fivetran and I’m keeping close tabs on it
- Rebundling the Data Platform - Dagster offers a very interesting take: asset-based DAGs instead of the traditional task-based ones
- Kicking the tires on dbt Metrics - A good look into the potential of dbt metrics
- Integrated Audits: Streamlined Data Observability with Apache Iceberg - Apache Iceberg is a great tool and the time travel feature, as shown in this article, presents a great way to create a “branch”, test the changes and then apply them
- Data diffs: Algorithms for explaining what changed in a dataset - This is a proposal to add a diff operator to SQL which, after reading this article, I think would be an incredible addition
- Practical Schema Evolution with Avro - A useful guide from Elliot West to the different compatibility types in Avro and the kinds of schema change that can be made
Engineering
- shot-scraper: automated screenshots for documentation, built on Playwright - A very interesting tool that uses Playwright to automate screenshots
- One Way Smart Developers Make Bad Strategic Decisions - A good article on why a complicated system might sometimes be the way it is for a reason
Stay safe and have a nice week 🙃